Yeah! I can't make money running my restaurant if I have to pay for the ingredients, so I should be allowed to steal them. How else can I make money??
Alternatively:
OpenAI is no different from pirate streaming sites in this regard (if anything, streaming sites are way more useful to humanity). If OpenAI gets a pass, so should every site that's ever been shut down for piracy.
"Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today's citizens."
An idea for others: if OpenAI wins, cite this case when you get busted for selling bootleg Blu-rays (DVDs being long obsolete) from your truck.
I stand by my opinion that learning systems training on copyrighted material isn't the problem; the problem is companies super eager to replace human workers with automation (or to replace skilled workers with cheaper, unskilled ones). Every worker not working is another adult (and maybe some kids) not eating and not paying rent.
(And for the soulless capitalists out there: people without food and shelter are bad. That's a thing we won't tolerate, and we start looking at you lean-and-hungry-like when it happens. That's what gets us thinking about guillotines hungry for aristocrats.)
In my ideal world, everyone would have food, shelter, clothes, entertainment, and a generally middle-class lifestyle whether they worked or not; intellectual-property monopolies would be short-lived, and we'd have a huge public domain. I think the UN wants to be on the same page as me, but the United States and its billionaires do not.
All we'd have to worry about is the power demands of AI and cryptomining, which might motivate us to get pure-hydrogen fusion working. Or just keep developing solar, wind, geothermal and tidal power until everyone can run their AC and supercomputer.
So... then be a non-profit (as they initially were) or a public research lab. It would be perfectly fine to say that the path they chose, which happens to make them unbelievably rich, is simply not viable.
If they can't legally make a profit, they don't have a business; it's not that hard. I'm sure people pursuing superhuman intelligence can figure that much out, and if not, they can ask their "AI" for help understanding it.
I don’t mind him using copyrighted materials as long as it leads to OpenAI becoming truly open source. Humans can replicate anything found in the wild with minor variations, so AI should have the same access; that's how human creativity builds upon itself. Why limit AI? We already know all the jobs people have will eventually be replaced anyway.
Cool, so if OpenAI can do it, does that mean piracy is legal?
How about we just drastically limit copyright length to something much more reasonable, like the original 14-year duration with an optional one-time renewal for another 14 years. That should give AI companies a large corpus to train on, while also protecting recent works from abuse. Perhaps we could even round down to 10 years, which should still be more than enough for copyright holders to establish their brand on the market.
I think copyright has value, but I don't think it has as much value as we're giving it.
On point 4, they are specifically responding to an inquiry about the feasibility of training models on public-domain data only, and they are basically saying that an LLM trained on only that dataset would be shit. But their argument isn't "you should allow it because we couldn't make money otherwise"; their actual argument is more "training LLMs on copyrighted material doesn't violate current copyright law", and further, that changing the law to forbid it would cripple all LLMs.
On the one hand, I think most would agree the current copyright laws are a bit OP anyway - more stuff should probably become public domain much earlier, for instance - but most of the world probably also doesn't think training LLMs should be completely free from copyright restrictions without being open source, etc. Either way, this article's title was absolute shit.
For years, Microsoft and Google were happy to acquiesce to copyright claims from the music and movie industries. Now, all of a sudden, when it benefits them to break those same laws, they do so without hesitation. And the industries that served small creators with copyright claims are now up against someone with a bigger legal budget.
It's more evident than ever how broken our copyright system is. I'm hoping this blows up in both parties' faces and we finally get some reform, but I'm not holding my breath.
This is an assumption, but I bet all the data fed into Content ID on YouTube was used to train Bard/Gemini...
In a way, this thread is heart-warming. There are so many different people here - liberals, socialists, anarchists, communists, progressives, ... - and yet they can all agree on one fundamental ethical principle: the absolute sanctity of intellectual property.
"because it's supposedly "impossible" for the company to train its artificial intelligence models — and continue growing its multi-billion-dollar-business — without them."
O no! Poor richs cant get more rich fast enough :(
Copyright is a pain in the ass, but Sam Altman is a bigger pain in the ass. Send him to prison and let him rot. Then put his tears in a cup and I'll drink them
The gall of these motherfuckers is truly astonishing. To be either so incredibly out of touch, or so absolutely shameless, makes me wanna call up every single school bully I ever endured to get their very best bullying tips
What irks me most about this claim from OpenAI and others in the AI industry is that it's not based on any real evidence. Nobody has tested the counterfactual approach he claims wouldn't work, yet the experiments that came closest (the first StarCoder LLM and the CommonCanvas text-to-image model) suggest that it would in fact have been possible to produce something very nearly as useful, and in some ways better, with a more restrained training-data curation approach than scraping outbound Reddit links.
All that aside, copyright clearly isn't the right framework for understanding why what OpenAI does bothers people so much. It's really about "data dignity", a relatively new moral principle not yet protected by any single law. Most people feel they should have control over what data is gathered about their activities online, and over what is done with that data after it's been collected; even if they publish or post something under a Creative Commons license that permits derived uses of their work, they'll still get upset if it's used as an input to machine learning. This is true even when the resulting generative models are not created for commercial reasons but only for personal or educational purposes that clearly constitute fair use. I'm not saying that OpenAI's use of copyrighted work is fair; I'm just saying that even in cases where the use is clearly fair, there's still a perceived moral injury, so I don't think it's wise to lean too heavily on copyright law if we want to find a path forward that feels just.
Honestly, copyright is shit. It was created on the basis of an old way of doing things, where big publishers and big studios made mass runs of physical copies of a given 'product'. George R. R. Martin, Warner Studios & co. are rich. Maybe they have everything to lose without their copy'right', but that isn't the population's problem. We live in an era where everything is digital and easily copiable, and we might as well start acting like it.
I don't care if Sam Altman is evil, this discussion is fundamental.
Sorry, not sorry. Found another company that doesn't need to rob people and other companies to make money. Also: breaking the law should bring this kind of people grim consequences. But nothing will happen.
The internet has been primarily derivative content for a long time, as much as some haven't wanted to admit it. These fancy algorithms just take it to another order of magnitude.
Original content had already become a rare sight as monetization ramped up. And then this generation of AI algorithms arrived.
For several years before LLMs became a thing, the internet was basically just regurgitating data from API calls, or scraping someone else's content and re-presenting it in your own way.
Unregulated areas lead to this type of business practice, where people squeeze every drop of juice out of the opportunity. The cost of these activities will be passed on to taxpayers.
As written, the headline is pretty bad, but it seems their argument is that they should be able to train on publicly available copyrighted information, like blog posts and social media, and not on private copyrighted information like movies or books.
You can certainly argue that "downloading public copyrighted information for the purposes of model training" should be treated differently from "downloading public copyrighted information for the intended use of the copyright holder", but it feels disingenuous to put this comment itself, to which someone has a copyright, into the same category as something not shared publicly, like a paid article or a book.
Personally, I think it's a lot like search engines. If you make something public, someone can analyze it, link to it, or take other derivative actions, but they can't copy it and share the copy with others.
The more important point is that social media companies can claim to OWN all the content needed to train AI. Same for image sites. That means they get to own the AI models. That means the models will never be free. Which means they control the "means of generation". That means that forever and ever and ever most human labour will be worth nothing while we can't even legally use this power. Double fucked.
YOU, the user/product, will gain nothing from this copyright strong-arming.
And to the argument itself: just because AI is better at learning from existing works (faster, more complete, better memory) doesn't mean it's fundamentally different from humans learning from artwork. Almost EVERY artist arguing for this is stealing by the same standard, since they learned from and were inspired by existing works.
But I guess the worst possible outcome is inevitable now.
Right now, you can draw the line easily. But there will come a time, not too far in the future, when machines reading and summarizing copyrighted data will be the norm.
It doesn't have to change yet, but eventually this will have to be handled properly.
We're all just horse owners bitching about how cars will just have to be stopped.
No, they can make money without stealing. They just choose to steal and lie about it either way. It's the worst kind of justification.
The investors are predominantly members of the Rationalist Society. It doesn't matter whether or not AI "makes money". What matters is that development is steered as quickly as possible toward an end product that can produce as much propaganda as possible.
The bottom line barely even matters in the bigger picture. If you're paying someone to make propaganda, and the best way to do that is to steal from the masses, then they'll do it regardless of whether the business model is "profitable".
The lines drawn for AI are drawn by people who want to use it for misinformation and control. The justifications make it seem like the lines were drawn around a monetary system. No, that's wrong.
Who cares about profitability when people are paying you under the table to run a mass crime ring.
This is the main issue with AI, the one that should have been handled and ultimately regulated before any AI tool got to its current state. It's also a reason why we really cannot remove the A from STEAM.
Wow, I just chatted with a coworker about AI and told them it was crazy how it uses copyrighted content to create something supposedly “new,” and they said, “Well, how would we train the AI without it?” I don’t think we should sacrifice copyright law and originality for the sake of improving profits, while they tell us it’s only to “improve the experience.”
Isn't copyright about the right to make and distribute or sell copies, or the lack thereof? As long as they can prevent jailbreaking of the AI, reading copyrighted material and learning from it to produce something else is not a copyright violation.
These people are supposedly the smart ones in our society, the leaders of industry, yet they whine and complain when they're told not to cheat or break the law.
If y'all are so smart, then figure out a different way of creating an A.I. Maybe the large language model, or whatever, isn't the approach you should use. 🤦‍♂️
Yeah, but because our government views technological dominance as a National Security issue we can be sure that this will come to nothing bc China Bad™.
I can already tell this is going to be an unpopular opinion judging by the comments, but this is my take on it:
It's totally true. I'm mostly indifferent; if the data was acquired from a public-facing source, I don't really care, but I'm definitely against using data dumps or data that wasn't available to the public in the first place. The whole uproar over AI is ridiculous. It's the same as someone going to a website and making a mirror, or a reporter writing an article about what's on it; the last three web-search-based AIs I used even gave sources for where they got the info. I don't get the argument.
If it's image-based AI, well, it's the equivalent of an artist going to an art museum and deciding they want to replicate the art style seen in a painting. Maybe they shouldn't be in a publishing field if they don't want their work seen/used. That's my take on it. It's not like the AI is taking a one-to-one copy and selling the artwork as its own, which in my opinion would be a much more harmful practice and one that already happens commonly in today's art world; it's analyzing existing artwork that was available through the same means everyone else had: going online, loading up images, and scraping the data. By that logic, artists should not be allowed to enter any art websites, museums, or galleries, since by looking at others' art they are able to adjust their own, which would be "stealing" the author's work. I'm not for or against it, but the logic is insane to me.
Those claiming AI training on copyrighted works is "theft" are misunderstanding key aspects of copyright law and AI technology. Copyright protects specific expressions of ideas, not the ideas themselves. When AI systems ingest copyrighted works, they're extracting general patterns and concepts - the "Bob Dylan-ness" or "Hemingway-ness" - not copying specific text or images.
This process is more akin to how humans learn by reading widely and absorbing styles and techniques, rather than memorizing and reproducing exact passages. The AI discards the original text, keeping only abstract representations in "vector space". When generating new content, the AI isn't recreating copyrighted works, but producing new expressions inspired by the concepts it's learned.
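To make the "vector space" point concrete, here's a deliberately crude sketch (a hashed bag-of-words in Python, my own toy illustration, nothing like what OpenAI actually runs): text goes in, a short list of numbers comes out, and the original wording can't be recovered from them.

```python
# Toy sketch: reduce text to a small fixed-size vector, then discard the text.
# The vector is a lossy statistical summary, not a copy of the words.
import hashlib

import numpy as np

DIM = 8  # real models use hundreds or thousands of dimensions


def toy_embed(text: str) -> np.ndarray:
    """Map text to a DIM-sized vector via hashed word counts."""
    vec = np.zeros(DIM)
    words = text.lower().split()
    for word in words:
        # hash each word into one of DIM buckets and bump that coordinate
        bucket = int(hashlib.md5(word.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    return vec / max(len(words), 1)  # normalize by word count


print(toy_embed("The Times They Are a-Changin'"))
# -> e.g. [0.2 0.  0.4 0.2 0.  0.2 0.  0. ]: you can compare such vectors
#    for similarity, but you cannot read the original sentence back out.
```

Real training is vastly more sophisticated, of course, but the one-way, lossy nature of the representation is the same basic idea.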
This is fundamentally different from copying a book or song. It's more like the long-standing artistic tradition of being influenced by others' work. The law has always recognized that ideas themselves can't be owned - only particular expressions of them.
Moreover, there's precedent for this kind of use being considered "transformative" and thus fair use. The Google Books project, which scanned millions of books to create a searchable index, was found to be legal despite protests from authors and publishers. AI training is arguably even more transformative.
While it's understandable that creators feel uneasy about this new technology, labeling it "theft" is both legally and technically inaccurate. We may need new ways to support and compensate creators in the AI age, but that doesn't make the current use of copyrighted works for AI training illegal or unethical.