Report: Potential NYT lawsuit could force OpenAI to wipe ChatGPT and start over

Good

AI should not be given free reign to train on anything and everything we’ve ever created. Copyright holders should be able to decide if their works are allowed to be used for model training, especially commercial model training. We’re not going to stop a hobbyist, but google/Microsoft/openAI should be paying for materials they’re using and compensating the creators.

While that’s understandable, I think it’s important to recognize that this is something where we’re going to have to treat pretty carefully.
If a human wants to become a writer, we tell them to read. If you want to write science fiction, you should both study the craft of writing ranging from plots and storylines to character development to Stephen King’s advice on avoiding adverbs. You also have to read science fiction so you know what has been done, how the genre handles storytelling, what is allowed versus shunned, and how the genre evolved and where it’s going. The point is not to write exactly like Heinlein (god forbid), but to throw Heinlein into the mix with other classic and contemporary authors.
Likewise, if you want to study fine art, you do so by studying other artists. You learn about composition, perspective, and color by studying works of other artists. You study art history, broken down geographically and by period. You study DaVinci’s subtle use of shading and Mondrian’s bold colors and geometry. Art students will sit in museums for hours reproducing paintings or working from photographs.
Generative AI is similar. Being software (and at a fairly early stage at that), it’s both more naive and in some ways more powerful than human artists. Once trained, it can crank out a hundred paintings or short stories per hour, but some of the people will have 14 fingers and the stories might be formulaic and dull. AI art is always better when glanced at on your phone than when looked at in detail on a big screen.
In both the cases of human learners and generative AI, a neural network(-like) structure is being conditioned to associate weights between concepts, whether it’s how to paint a picture or how to create one by using 1000 words.
A friend of mine who was an attorney used to say “bad facts make bad law.” It means that misinterpretation, over-generalization, politicization, and a sense of urgency can make for both bad legislation and bad court decisions. That’s especially true when the legislators and courts aren’t well educated in the subjects they’re asked to judge.
In a sense, it’s a new technology that we don’t fully understand - and by “we” I’m including the researchers. It’s theoretically and in some ways mechanically grounded in old technology that we also don’t understand - biological neural networks and complex adaptive systems.
We wouldn’t object to a journalism student reading articles online to learn how to write like a reporter, and we rightfully feel anger over the situation of someone like Aaron Swartz. As a scientist, I want my papers read by as many people as possible. I’ve paid thousands of dollars per paper to make sure they’re freely available and not stuck behind a paywall. On the other hand, I was paid while writing those papers. I am not paid for the paper, but writing the paper was part of my job.
I realize that is a case of the copyright holder (me) opening up my work to whoever wants a copy. On the other other hand, we would find it strange if an author forbade their work being read by someone who wants to learn from it, even if they want to learn how to write. We live in a time where technology makes things like DRM possible, which attempts to make it difficult or impossible to create a copy of that work. We live in societies that will send people to prison for copying literal bits of information without a license to do so. You can play a game, and you can make a similar game. You can play a thousand games, and make one that blends different elements of all of them. But if you violate IP, you can be sued.
I think that’s what it comes down to. We need to figure out what constitutes intellectual property and what rights go with it. What constitutes cultural property, and what rights do people have to works made available for reading or viewing? It’s easy to say that a company shouldn’t be able to hack open a paywall to get at WSJ content, but does that also go for people posting open access to Medium?
I don’t have the answers, and I do want people treated fairly. I recognize the tremendous potential for abuse of LLMs in generating viral propaganda, and I recognize that in another generation they may start making a real impact on the economy in terms of dislocating people. I’m not against legislation. I don’t expect the industry to regulate itself, because that’s not how the world works. I’d just like for it to be done deliberately and realistically and with the understanding that we’re not going to get it right and will have to keep tuning the laws as the technology and our understanding continue to evolve.
- Sorry this is a bit too level-headed for me, can you please repeat with a bullhorn, and use 4-letter words instead? I need to know who to blame here.
- This is an astonishingly well written, nuanced, and level headed response. Really on a level I'm not used to seeing on this platform.
- Well written sir.
- Both an AI and an art student are a complex web of weights that take inputs and returns an output. Agreed.
  But the inputs are vastly different. An art student has all the inputs of every moment leading up to the point of putting paint to canvas. Emotion, hunger, pain, and every moment that life has thrown at them. All of them lead to very different results. Every art piece affects the subsequent ones.
  The AI on the other hand is purely derivative. It’s only ever told about pre-existing art and a brief interpretation of it. It does not feel emotion. It does not worry about paying its bills or falling in love. It builds a map of weights once and that is that. Every input repeated however many times will yield exactly the same output.
  And yes, you have the artists who are professional plagiarists, making hand-painted Picasso imitations of someone’s chihuahua for $20 over the internet. But they’re not mass producing derivative work by the thousands.
  I fully agree with the shit-in, shit-out sentiment, and researchers should be free to train their models of whatever data they need.
  But monetising their models, that by definition are generating derivative works is another matter.
Not trying to argue or troll, but I really don't get this take, maybe I'm just naive though.
Like yea, fuck Big Data, but...
Humans do this naturally, we consume data, we copy data, sometimes for profit. When a program does it, people freak out?
edit well fuck me for taking 10 minutes to write my comment, seems this was already said and covered as I was typing mine lol
- It's just a natural extension of the concept that entities have some kind of ownership of their creation and thus some say over how it's used. We already do this for humans and human-based organizations, so why would a program not need to follow the same rules?
- It might be nice if we reserve some things just for humans 🤷🏻‍♂️
No.
A pen manufacturer should not be able to decide what people can and can't write with their pens.
A computer manufacturer should not be able to limit how people use their computers (I know they do - especially on phones and consoles - and seem to want to do this to PCs too now - but they shouldn't).
In that exact same vein, writers should not be able to tell people what they can use the books they purchased for.
.
We 100% need to ensure that automation and AI benefits everyone, not a few select companies. But copyright is totally the wrong mechanism for that.
- A pen is not a creative work. A creative work is much different than something that’s mass produced.
  Nobody is limiting how people can use their pc. This would be regulations targeted at commercial use and monetization.
  Writers can already do that. Commercial licensing is a thing.
- You made two arguments for why they shouldn't be able to train on the work for free and then said that they can with the third?
  Did openai pay for the material? If not, then it's illegal.
  Additionally, copywrite and trademarks and patents are about reproduction, not use.
  If you bought a pen that was patented, then made a copy of the pen and sold it as yours, that's illegal. This is the analogy of what openai is going with books.
  Plagiarism and reproduction of text is the part that is illegal. If you take the "ai" part out, what openai is doing is blatantly illegal.
- All of the examples you listed have nothing to do with how OpenAI was created and set up. It was trained on copyrighted work, how is that remotely comparable to purchasing a pen?
- Computer manufacturers aren't making AI software. If someone uses an HP copier to make illegal copies of a book and then distributes those pages to other people for free, the person that used the copier is breaking the law, not the company that made the copier.
- A pen manufacturer isn't repurposing other peoples' work to make their pens.
  A computer manufacturer has to license the intellectual property that they use to make their computers.
- They didn't pay the writers though, that's the whole point
With that mindset, only the powerful will have access to these models.
Places like Reddit, Google, Facebook, etc, places that can rope you into giving away rights to your data with TOS stipulations.
Locking down everything available on the Internet by piling more bullshit onto already draconian copyright rules isn't the answer and it surprises the shit out of me how quickly fellow artists, writers, and creatives piled onto the side with Disney, the RIAA, and other former enemies the second they started perceiving ML as a threat to their livelihood.
I do believe restrictions should be looked into when it comes to large organizations and industries replacing creators with ML, but attacking open ML models directly is going to result in the common folk losing access to the tools and corporations continuing to work exactly as they are right now by paying for access to locked-down ML based on content from companies who trade in huge amounts of data.
Not to mention it's going to give the giants who have been leveraging their copyright powers against just about everyone on the internet more power to do just that. That's the last thing we need.
What's the basis for this? Why can a human read a thing and base their knowledge on it, but not a machine?
- Because a human understands and transforms the work. The machine runs statistical analysis and regurgitates a mix of what it was given. There’s no understanding or transformation, it’s just what is statistically the 3rd most correct word that comes next. Humans add to the work, LLMs don’t.
  Machines do not learn. LLMs do not “know” anything. They make guesses based on their inputs. The reason they appear to be so right is the scale of data they’re trained on.
  This is going to become a crazy copyright battle that will likely lead to the entirety of copyright law being rewritten.
- That machine is a commercial product. Quite unlike a human being, in essence, purpose and function. So I do not think the comparison is valid here unless it were perhaps a sentient artificial being, free to act of its own accord. But that is not what we’re talking about here. We must not be carried away by our imaginations, these language models are (often proprietary and for profit) products.
I disagree. I think that there should be zero regulation of the datasets as long as the produced content is noticeably derivative, in the same way that humans can produce derivative works using other tools.
- Good in theory, Problem is if your bot is given too mutch exposure to a specific piece of media and when the "creativity" value that adds random noise (and for some setups forces it to improvise) is too low, you get whatever impression the content made on the AI, like an imperfect photocopy (non expert, explained "memorization"). Too high and you get random noise.
- LLM are not human, the process to train LLM is not human-like, LLM don't have human needs or desires, or rights for that matter.
  comparing it to humans has been a flawed analogy since day 1.
Bullshit. If I learn engineering from a textbook, or a website, and then go on to design a cool new widget that makes millions, the copyright holder of the textbook or website should get zero dollars from me.
It should be no different for an AI.
- Agreed. Royalties are a capitalist invention
- Yes, but what about you going into teaching engineering, and writing a text book for it that is awfully close to the ones you have used? Current AI is at a stage where it just "remixes" content it gobbled in, and not (yet) advanced enough to actually learn and derive from it.
- Last time I looked, textbooks were fucking expensive. You might be able to borrow one from the library, of course. But most people who study something pay up front for the information they're studying on
- Every time I see this argument it reminds me of how little people understand how copyright works.
  When you buy that book the monetary amount is fair compensation for the contents inside. What you do afterwards is your own business so long as it does not violate the terms within the fine print of the book (no unauthorized reproductions, etc.)
  When someone is contracted for an ad campaign there will be usage rights in the contract detailing the time frame and scope for fair compensation (the creative fee + expenses). If the campaign does well, they can negotiate residuals (if not already included) because the scope now exceeds the initial offer of fair compensation.
  When you watch a movie on TV, the copyright holder(s) of that movie are given fair compensation for the number of times played. From the copyright holders, every artist is paid a royalty. Jackie Chan and Chris Tucker still get royalty checks whenever Rush Hour 2 airs or is streamed, as do all the other obscure actors and contributing artists.
  Deviant Art and ArtStation provide free hosting for artists in exchange for a license that lets them distribute images to visitors. The artists have agreed to fair compensation in the form of free hosting and potential promotion should their work start trending, reaching all front page visitors of the site. Similarly, when the artists use the printing services of these sites they provide a license to reproduce and ship their works, as fair compensation the sites receive a portion of the artists' asking price.
  The crux is fair compensation. The rights holder has to agree to the usage, with clear terms and conditions for their creative works, in exchange for a monetary sum (single or reoccurring) and/or a service of similar or equal value with a designated party. That's why AI continues to be in hot water. Just because you can suck up the data does not mean the data is public domain. Nor does it mean the license used between interested parties transfers to an AI company during collection. If AI companies want to monetize their services, they're going to have to provide fair compensation for the non-public domain works used.
- Human experience considers context, experience, and relation to previous works
  'AI' has the words verbatim in it's database and will occasionally spit them out verbatim
I think any LLM should be required to be free to use. They can pay for extra bells and whistles like document upload but the core model must be free. They're free to make their billions, but it shouldn't be on a model built by scraping all the information of humanity for free.
- I think this is an even better solution than making them scrap it or pay everyone some token amount.
I understand the sentiment (and agree on moral grounds) but I hink this would put us at an extreme disadvantage in the development of this technology compared to competing nations. Unless you can get all countries to agree and somehow enforce this I think it dramatically hinders our ability to push forward in this space.
- They pay for it, simple.
  Think about a code that an expert Samsung developer wrote and understanding and executing that flawlessly took 20 years of his/her experience. That person is the only one skilled enough to write it but an LLM model stole it and suggesting it every dev around the world.
  That's a good thing if the dev gets paid to teach the model and then we pay to subscribe to it. Right now it's breaking the economy. Organisations and startups are abusing the knowledge and laying off skilled occupation.
- Open sourcing the models does absolutely nothing. The fact of the matter is that the people who create these models aren’t able to quantifiably show how they work, because those levels have been abstracted so far into code that there’s no way to understand them.
You sound like an old man who’s scared of changing times.
- Or a creative who hates to see the entire soul of the human race boiled down to a computer doing a whole lot of math.
  AI isn’t going to put office workers out of a job, not just yet, but it’s sure going to end the careers of a whole lot of artists who won’t get entry level opportunities anymore because an AI is able to do 90% of the job and all they need is someone to sort the outputs.
Yeah! Let’s burn fair use to the ground! Technology is scary! Destroy it all!
- I don't think AI is criticising or parodying that content. Also ChatGPT is a glorified chatbot that can just make it's answers seem human, it's not some world saving technology.

When OpenAI commits copyright infringement no one bats an eye, but when I do it everyone downvotes me

Yeah I don't get it. ChatGPT is not "Fair use" and there is no credit given to anyone, it's a solid case against them
- I just wonder if they'll get out of it because LLMs do reword the information instead of spitting it back out verbatim. It's the same reason I think the image generators are safe from copyright law - it's just different enough that they could plausibly convince a judge with a fair use argument.
  What bothers me even more is all the text they had to scrape to create ChatGPT... That seems like a novel problem for the legal system because you know there's no way they paid for all of it.
- I'm not 100% sure where I stand but, for arguments sake; Are you sure about that? it sure is transformative!
- It’s wishful thinking on your part. Every AI model in existence, from computer vision to the photo adjustments in your phone camera was trained this way.
  The only reason there’s a stink now is that certain lobbies suddenly lose their job as opposed to blue collar workers.
  But there’s more than a decade of precedent now to fall back on and not one legal case to show that it’s not fair use.
  So would you kindly cite the case decisions that back up your assertion? Or are you just hallucinating like an LLM because you want a certain outcome to be true? Geez, I wonder where the technology learned that.
Classic joke, something like: if you owe the bank $100, it's your problem; if you owe them a million, it's their problem.
Only on lemmy.world
Society moment.

I'll take things that won't happen for $200

Could this headline possibly have any more weasel word qualifiers? Lots of things “could” happen.

Modern media is full of it so they can’t be sued for anything. And also because there’s probably little fact in at least a few parts of the article so they’re posting it as “speculative”
Also, “start over” is a bit dramatic. I think you mean hit up-enter to rerun the training.
- "Hit up-enter" is such a hilarious way of describing the unbelievably vast cost of retraining GPT-4

Too many people have copies of the full database at this point for such a thing to mean much. And there's too many other versions that have been produced since to matter either.

Oh God, I better not learn anything from a book or I'm fucked.

Fuck man I’ve watched sooooo many movies… the MPAA is gonna be on my ass…

can't they just move to a country where copyright doesn't exist?

They could, but presumably they want to make business and sell their products in countries that do have those copyright protections or to other companies from there.
- sounds like some bullshit to me
Make another "russian bot" perhaps. /s Honestly, russia is good for getting away with piriting things, Because Corruption.

This is the best summary I could come up with:

The result, experts speculate, could be devastating to OpenAI, including the destruction of ChatGPT's dataset and fines up to $150,000 per infringing piece of content.

If the Times were to follow through and sue ChatGPT-maker OpenAI, NPR suggested that the lawsuit could become "the most high-profile" legal battle yet over copyright protection since ChatGPT's explosively popular launch.

This speculation comes a month after Sarah Silverman joined other popular authors suing OpenAI over similar concerns, seeking to protect the copyright of their books.

As of this month, the Times' TOS prohibits any use of its content for "the development of any software program, including, but not limited to, training a machine learning or artificial intelligence (AI) system."

In the memo, the Times' chief product officer, Alex Hardiman, and deputy managing editor Sam Dolnick said a top "fear" for the company was "protecting our rights" against generative AI tools.

the memo asked, echoing a question being raised in newsrooms that are beginning to weigh the benefits and risks of generative AI.

I'm a bot and I'm open source!

Ironic…

Kill the earth, and start over.

Isn't that... weird? Like someone would have reinvented the wheel (literally), to end up being much better than the "original" wheel... and pissing off mechanics worldwide because that'd make em lose their jobs because of some random, nonsensical reason. Idk, its like folks forgot that the AI can be EASILY manipulated. But nooooooooo "whining is easier, lads! Let's whine instead of making things better!".... heh. Humanity is so fucked.

Surprising that there are so many copyright bootlickers on lemmy

My bad, I’ll just let big corporations train on every piece of my personal creative output as “Fair Use” so they can sell it for profit. That’ll show ‘em!
Satire aside, you don’t seem to understand what copyright is or are confusing it with laissez-faire capitalism. You can’t bootlick rights.
Brb, going to go bootlick the European Convention on Human Rights.
You're either a copyright bootlicker or an AI bootlicker in this thread, there's no middle ground.

Start over with what? The internet is now polluted with AI gibberish.

AI training AI can't end well.

i guess that bot's sweatin' tonight.. maybe the stress is what's causing the errors..

If ChatGPT is removed, it will really have a big impact on users getting used to its convenience, when there are currently too many ChatGPT Online causing it to explode.

If ChatGPT gets erased and rescanned with new stuff, I can see different "flavors" popping up to replace it. For example: Conservative GPT, no content related to trans, evolution, or climate change. Vetted Wikipedia entries only.

That probably already exists, why would one of the many LLM systems out there having to re-train impact anything but that one program?

LAME. Butthurt people is all i see. People like that are what cause huge roadblocks in advancing humans. I wish someone made a law that made it impossible to force someone to stop doing something simply because they are using ai to train and it hurts their pockets. Let AI learn what we've created. The possibilities are endless. So many good things can come from it but all everyone wants to do is look at the negatives. Of which a massive majority are easily solvable.

One tiny example is think of what ai can do in a few years time with training on medicine alone. For all we know the ai can figure out cancer for us. But with everyone clamping down on this shit we'll never know.

Leave it to humans to ruin great things. If aliens ever visit this planet I'm ratting out every last human against technology 🤣

OpenAI has a valuation of $29B. They could easily afford to license the entirety of the NTY content but did not do so.