"Copyright today covers virtually every sort of human expression" and cannot be avoided.
Apparently, stealing other people's work to create a product for money is now "fair use" according to OpenAI, because they are "innovating" (stealing). Yeah. Move fast and break things, huh?
"Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials," wrote OpenAI in the House of Lords submission.
OpenAI claimed that the authors in that lawsuit "misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence."
Try to train a human comedian to make jokes without ever letting him hear another comedian's jokes, watch a movie, read a book or magazine, or watch a TV show. I expect the jokes would be pretty weak.
If OpenAI states that it is impossible to train leading AI models without using copyrighted material, then, unpopular as it may be, the preemptive, pragmatic solution should be pretty obvious: enter into commercial arrangements for access to said copyrighted material.
Claiming fair use in circumstances where the resulting commercial product directly competes in the same market as the source material seems disingenuous at best, given that the point of copyright is to set the terms under which public-facing material can be used. Particularly when copyrighted material gets regurgitated verbatim by products inadequately developed to prevent such a simple and foreseeable situation.
Yes, I am aware of the US concept of fair use, but the test of that should be manifestly reciprocal. For example, would Meta tolerate what it did to MySpace (scraping to enable easy user transfer), or would Google tolerate having YouTube scraped?
To me it seems Big Tech wants to have its cake and eat it too: investor dollars are used to corrupt open markets, undermine fundamental democratic institutions, manipulate legal processes, and erode basic consumer rights.
The absolute hubris required for OpenAI to come right out and say, 'Yeah, we have no choice but to build our product off the exploitation of work others have already performed' is stunning. It's about as perfect a representation of the tech-bro mindset as there can ever be. They didn't even try to approach content creators; they just took what they needed because they wanted to. I really don't think it's hyperbolic to compare this to modern-day colonization, or worker exploitation. 'You've been working pretty hard for a very long time to create and host content, pay for the development of that content, and build your business off of it, but we need it to make money for this thing we're building, so we're just going to fucking take it and do what we need to do.'
The entitlement is just...it's incredible.
20 years ago, high school kids were sued for millions and threatened with years in jail for downloading a single Metallica album (if I remember correctly, statutory damages in the US were something like $500k per song).
All of a sudden, just because they are the dominant ones doing the infringement, they should be allowed to scrape the entirety of (digital) human knowledge? Funny (or not) how the law always benefits the rich.
What's stopping AI companies from paying royalties to artists they ripped off?
Also, lol at accounts created within a few hours just to reply in this thread.
The moment their own work is the one stolen by big companies and they're driven out of business, watch their tune change.
Edit: I remember when Reddit pulled that shitshow and all of a sudden a lot of sock/bot accounts appeared. I wasn't expecting it to happen here, but I guess the election cycle is near.
This is not REALLY about copyright - this is an attack on free and open AI models, which would be IMPOSSIBLE if copyright were extended to cover the use of works for training.
It's not stealing. There is literally no resemblance between the training works and the model. IP rights have been continuously strengthened due to lobbying over the last century and are already absurdly strong, I don't understand why people on here want so much to strengthen them ever further.
Any reasonable person can reach the conclusion that something is wrong here.
What I'm not seeing a lot of acknowledgement of is who really gets hurt by copyright infringement under the current U.S. scheme. (The quote is obviously directed toward the UK, but I'm reasonably certain a similar situation exists there.)
Hint: It's rarely the creators, who usually get paid once while their work continues to make money for others.
Let's say the New York Times wins its lawsuit. Do you really think the reporters who wrote the infringed-upon material will be getting royalty checks to be made whole?
This is not OpenAI vs creatives. OK, on a basic level it is, but expecting no one to scrape blogs and forum posts rather goes against the idea of the open internet in the first place. We've all learned by now that what goes on the internet stays there, with attribution totally optional unless you have a legal department. What's novel here is the scale of scraping, but I see some merit to the "transformational" fair-use defense given that the ingested content is not being reposted verbatim.
This is corporations vs corporations. Framing it as millions of people missing out on what they'd have otherwise rightfully gotten is disingenuous.
Well, in that case maybe ChatGPT should just fuck off; it doesn't seem to be doing anything particularly useful, and now its creator has admitted it doesn't work without stealing things to feed it. Un-fucking-believable. Hacks gonna hack, I guess.
It's also "impossible" to have multiple terabytes of media on my homeserver without copyright infringement, so piracy is ok, right!?
Oh no, wait, it actually is possible; it's just more expensive and more work to do it legally (and leaves a lot of plastic trash in the form of Blu-rays and DVDs), just like with AI. But laws are just for poor people, I guess.
Alas, AI critics jumped to conclusions this one time. Read this:
Further, OpenAI writes that limiting training data to public domain books and drawings "created more than a century ago" would not provide AI systems that "meet the needs of today's citizens."
It's a plain fact. It does not say we have to train AI without paying.
To give you some context: virtually everything on the web is copyrighted, from Reddit comments to blog articles to open-source software. Even open data usually comes with a copyright notice. Open research articles do too.
If misled politicians write a law banning the use of copyrighted materials, that'll kill all AI developments in the democratic countries. What will happen is that AI development will be led by dictatorships, and that's absolutely a disaster even for the critics. Think about it. Do we really want Xi, Putin, Netanyahu and Bin Salman to control all the next-gen AIs powering their cyber warfare while the West has to fight them with Siri and Alexa?
So, I agree that, at the end of the day, we'd have to ask how much rule-abiding AI companies should pay for copyrighted materials, and that'd be less than the copyright holders would want. (And I think it's sad.)
However, you can't equate these particular statements in this article to a declaration of fuck-copyright. Tbh Ars Technica disappointed me this time.
I think viral outrage aside, there is a very open question about what constitutes fair use in this application. And I think the viral outrage misunderstands the consequences of enforcing the notion that you can't use openly scrapable online data to build ML models.
Effectively, what the copyright argument does here is ensure that ML models can only legally be made by Meta, Google, Microsoft, and maybe a couple of other companies. OpenAI can say whatever; I'm not concerned about them, but I am concerned about open-source alternatives getting priced out of that market. I am also concerned about what it does to previously available APIs, as we've seen with Twitter and Reddit.
I get that it's fashionable to hate on these things, and fashionable to repeat the bit of misinformation about models being a copy or a collage of their training data, but there are ramifications here people aren't talking about, and I fear we're headed toward the worst possible future on this: one where AI models are effectively ubiquitous but legally limited to major data brokers who added clauses claiming AI training rights over their billions of users' content.
The real issue is money. How much and how (un)distributed.
Why is it fair/OK that one company can use all this material and make a lot of money off it without paying for, or even acknowledging, others' work?
On the flip side, AI models could be useful. Maybe the models/weights should be made free, just like the content they are trained on. Instead of paying for the model, we would pay for hosting of the inference (i.e., the API).
All the AI race has done is surface the long-standing issue of how broken copyright is in the internet era. Artists should be compensated, but trying to do that with the traditional model, which was originally designed with physical, non-infinitely-copyable goods in mind, is just asinine.
One such model could automatically assign copyright ownership on first upload to any platform that supports the API, an API provided and enforced by the US Copyright Office. A percentage of end-use revenue could then be paid back as royalties. I haven't really thought this model out much further than that.
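For what it's worth, the first-upload/royalty idea could be sketched in a few lines. This is purely hypothetical: no such Copyright Office API exists, and the class names, the hash-based ownership ledger, and the 5% royalty rate are all made up for illustration.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Registry:
    """Hypothetical first-upload-wins ledger: the earliest upload of a
    work's content registers its owner; identical later uploads are
    rejected, and end-use revenue routes a royalty cut to the owner."""
    owners: dict = field(default_factory=dict)     # content hash -> owner id
    royalties: dict = field(default_factory=dict)  # owner id -> accrued payout

    def register(self, content: bytes, owner: str) -> bool:
        """Claim ownership of a work; fails if it was already claimed."""
        digest = hashlib.sha256(content).hexdigest()
        if digest in self.owners:
            return False  # already claimed by an earlier upload
        self.owners[digest] = owner
        return True

    def pay_royalty(self, content: bytes, revenue: float,
                    rate: float = 0.05) -> float:
        """Credit a percentage of end-use revenue to the registered owner.
        Returns the payout, or 0.0 for unregistered content."""
        digest = hashlib.sha256(content).hexdigest()
        owner = self.owners.get(digest)
        if owner is None:
            return 0.0
        payout = revenue * rate
        self.royalties[owner] = self.royalties.get(owner, 0.0) + payout
        return payout
```

Usage would look like `reg.register(b"my blog post", "alice")` at upload time, then `reg.pay_royalty(b"my blog post", revenue)` whenever a downstream product monetizes the work. Obviously real identity, exact-hash brittleness (one changed byte evades the ledger), and rate-setting are the hard parts this sketch ignores.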
Machine learning is here to stay, and it's a useful tool that can be used for good and evil alike.
Wait, so if the way I make money is illegal now, it's the system's fault, isn't it? That means I can keep going because I believe I'm justified, right? Right?
Of course it is. About 50 years ago we moved to a regime where everything is copyrighted by default, rather than just works that were marked and registered. Not sure where I stand on that. One could argue we are in a crazy over-copyright era now anyway.
There isn't a musician who hasn't heard other songs. Not a painter who hasn't seen other paintings. No comedian who hasn't heard jokes. No writer who hasn't read books.
AI haters are not applying the same standards to humans that they apply to generative AI. Obviously this is not to say that AI can't plagiarize. If it's spitting out sentences that are direct quotes from an article someone wrote and doesn't disclose the source, then yeah, that is an issue. There's, however, a limit after which the output differs enough from the input that you can't claim it's stealing, even if it perfectly mimics someone else's style.
Just because DALL-E creates pictures with the Getty Images watermark on them, it doesn't mean the picture itself is a direct copy from their database. If anything, it's the use of the logo that's the issue, not the picture.
OpenAI now needs to go to court and argue fair use forever. That's the burden of our system. Private ownership is valued higher than anything else so ... Good luck we're all counting on you (unfortunately).
🤖 I'm a bot that provides automatic summaries for articles:
OpenAI responded to the lawsuit on its website on Monday, claiming that the suit lacks merit and affirming its support for journalism and partnerships with news organizations.
OpenAI's defense largely rests on the legal principle of fair use, which permits limited use of copyrighted content without the owner's permission under specific circumstances.
"Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents," OpenAI wrote in its Monday blog post.
In August, we reported on a similar situation in which OpenAI defended its use of publicly available materials as fair use in response to a copyright lawsuit involving comedian Sarah Silverman.
The amount of secondhand content an LLM needs to consume to train inevitably includes copyrighted material. If they used this thread, the quotes the OP included would end up in the training set.
The sheer number of fan forums and wikis about copyrighted material provides copious information about the stories and facilitates their retelling. They're right that it is impossible for a general-purpose LLM.
My personal experience so far, though, has been that general-purpose and multimodal LLMs are less consistently useful to me than GPT-4 was at launch. I think small, purpose-built LLMs with trusted content providers have a better chance of success for most users, but we'll see if anyone can make that work, given the challenge of bringing users to the right one for the right task.
I would just like to say, with open curiosity, that I think a nice solution would be for OpenAI to become a nonprofit with clear guidelines to follow.
What does that make me? Other than an idiot.
Of that at least, I’m self aware.
I feel like we're disregarding the significance of artificial intelligence's existence in our future, because the only thing anybody who cares is trying to do is get control back so they can DO something about it. But news is becoming a feeding tube for the masses, and they've masked that with hatred of all of us.