AI companies are hoping for a ruling that content generated by a model trained on copyrighted works is not a derivative work. So far, the Sarah Silverman lawsuit seems to be heading that way, at least; the claimants suffered a setback when they were asked to prove the connection between the AI’s output and their specific works.
If this becomes jurisprudence or law in one or more countries, licenses won’t mean jack. You could put the AGPL on your work, an AI company could suck it up into their model and use it for whatever they want, and you couldn’t do anything about it.
The training sets for all common models contain copyrighted works: entire books, movies, and websites. Don’t forget that most websites don’t even carry a license, and that such unlicensed work, internet comments included, is just as illegal to replicate as any book or movie. If AI data sets have to comply with copyright, all current models will need to be retrained (except maybe that image AI from the stock photo company, which is trained exclusively on licensed work).
the claimants were set back because they’ve been asked to prove the connection between AI output and their specific inputs
I mean, how do you do that for a closed-source model with secret training data? As far as I know, OpenAI has admitted to using large amounts of copyrighted content, countless books and newspaper material, all on the basis of fair use claims. I guess it would take a government entity actively going after them at this point.
The training data set isn’t the problem. The data sets for many open models are actually not hard to find, and it’s quite obvious that the artists’ works were included. In this case, the lawsuit was about the Stable Diffusion data set, and I believe that’s freely available (though you may need to download the linked images yourself).
For research purposes, this was never a problem: scientific research is exempt from many restrictions of copyright. That led to an interesting problem with OpenAI and the other AI companies: they took their research models, the output of research, and turned them into a business.
The way things are going, I expect the law to land like this: data sets may contain copyrighted work as long as they’re distributed only for research purposes; AI models are derivative works; but the output of AI models is not a derivative work, and therefore the output AI companies generate is exempt from copyright. That’s definitely not what I want to happen, but the legal arguments I thought would kill this interpretation don’t seem to hold water in court.
Of course, courts only apply the law as it is written right now. At any point, governments can alter their copyright laws to kill or clear AI models. On the one hand, copyright lobbyists have a huge impact on governance, as much as big oil it seems; on the other hand, banning AI will just hand an economic advantage to countries that don’t care about copyright. The EU has set up AI rules, which I appreciate as an EU citizen, but I can’t deny that this will inevitably make it a worse environment to do business in compared to places like the USA and China.
Thank you for sharing. Your perspective broadens mine, but I feel a lot more negative about the whole “must benefit business” side of things. It is fruitless to hold any entity whatsoever accountable when a whole worldwide economy is in a free-for-all nuke-waving doom-embracing realpolitik vibe.
Frankly, I’m not sure which would be worse: economic collapse and the consequences for the people, or economic prosperity and… the consequences to the people. Long term, and coming from a country that is not exactly thriving in the grand scheme of things, I guess I’d take the former.
It’s a tough balance, for sure. I don’t want AI companies to exist in their current form, but we’re not getting the genie back into the bottle. Whether the economic hit is worth the freedom and creative rights that I think citizens deserve is a matter of democratic choice. It’s impossible to ignore that in China or Russia, where citizens don’t have much of a choice, artistic rights and people’s wellbeing probably aren’t even part of the equation. Other countries will need a response when companies from those places start doing the work more efficiently. I’ve been using Bing AI more and more myself, as AI bullcrap floods every page of every search engine; fighting AI with AI, so to speak.
I saw this whole ordeal coming the moment ChatGPT came out, and I had the foolish hope that legislators would have done something by now. The EU’s AI Act will apply from March next year, but it doesn’t seem to solve the copyright problem at all. Or rather, it seems to accept the current copyright problem, as the EU’s summary puts it:
Generative AI, like ChatGPT, will not be classified as high-risk, but will have to comply with transparency requirements and EU copyright law:
Disclosing that the content was generated by AI
Designing the model to prevent it from generating illegal content
Publishing summaries of copyrighted data used for training
The EU seems to have chosen to focus on combating the immediate threat of AI abuse, but to be very tolerant of AI copyright infringement. I can only presume this is to make sure “innovation” doesn’t get impeded too much.
I’ll take this into account in the upcoming EU election, but I’m afraid it’s too late. I wish we could go back and stop AI before it started, but this stuff has happened, and now the world is a little bit better and a little bit worse.
Yep. Can’t wait to overfit an LLM on a pile of copyrighted work and release it into the public domain. Let’s see if OpenAI gets pushback from copyright owners down the road.
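For what it’s worth, the memorization point is easy to demonstrate with a toy. Below is a minimal sketch, with a character-level lookup model standing in for an LLM and a made-up sentence standing in for the copyrighted text: fit to a single text with no variation in the data, every context has exactly one continuation, so generation reproduces the training text verbatim.

```python
from collections import defaultdict

# Toy stand-in for an "overfit LLM": an order-3 character model
# trained on a single text. Because there is only one training
# example, each 3-character context maps to exactly one next
# character, and generation just replays the training data.
text = "All your base are belong to us."
order = 3

model = defaultdict(list)
for i in range(len(text) - order):
    context = text[i:i + order]
    model[context].append(text[i + order])

def generate(seed, length):
    out = seed
    while len(out) < length:
        nxt = model.get(out[-order:])
        if not nxt:
            break  # unseen context: the model only knows one text
        out += nxt[0]  # deterministic: one memorized continuation
    return out

# Prompt with the first 3 characters; the "model" emits the rest.
print(generate(text[:order], len(text)))
```

The same effect at LLM scale is exactly what the lawsuits hinge on: whether a model that can regurgitate its training data verbatim is meaningfully different from a copy.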