Sarah Silverman and other authors are suing OpenAI and Meta for copyright infringement, alleging that the companies are training their LLMs on books obtained via Library Genesis and Z-Library.
I think this is going to raise some questions about fair use, since AI projects are absolutely derivative works that are sufficiently removed from the content they used. (There may be some argument that it's also educational use.)
This case may rekindle questions about fair use given that our current copyright-maximalist clime has been less interested in enforcing fair use and more interested in enforcing copyright regardless of fair use.
The complaint lays out in steps why the plaintiffs believe the datasets have illicit origins — in a Meta paper detailing LLaMA, the company points to sources for its training datasets, one of which is called ThePile, which was assembled by a company called EleutherAI. ThePile, the complaint points out, was described in an EleutherAI paper as being put together from “a copy of the contents of the Bibliotik private tracker.” Bibliotik and the other “shadow libraries” listed, says the lawsuit, are “flagrantly illegal.”
I used to have a Bibliotik account, and if this is true about ThePile, they very likely have at least the beginnings of a successful case.
I apologize for the inconvenience, but as an AI language model, I don't have direct access to books or copyrighted materials like "The Bedwetter" by Sarah Silverman.
Pack it up, guys!
On a serious note, corporations abusing authors' copyrighted work is on an entirely different level from civilian piracy, and I hope they get seriously shafted over it. Same thing for Bing and Bard. All of ChatGPT is built on dubious or outright illegal datasets, and there is no reason huge multinationals shouldn't at least pay and inform the authors of those works. But in reality the blame will probably be shifted to the libraries.