Game Over for Generative AI? It Has a Copyright-Infringing Output Problem.

Recent developments show that generative AIs (“GenAIs”) have bigger copyright-infringement problems than initially thought. These AIs have been spitting out copies of Super Mario, RoboCop, Captain America, and New York Times stories. What does this mean for the future of GenAI and for potential liability for businesses and people using it?

NYU professor Gary Marcus has been writing extensively about how GenAIs will output near-verbatim copies of the well-known copyright property of others. For example, he showed that a prompt to OpenAI’s DALL-E 3 image generator to “create an original image of an Italian video game character” produced four different images of Super Mario.

The problem is not limited to images. In late December 2023, the New York Times filed a lawsuit against OpenAI (maker of ChatGPT), contending ChatGPT produces “near-verbatim copies of significant portions” of the newspaper’s articles, including ones behind its paywall.

Beyond the New York Times lawsuit, GenAI makers face class-action lawsuits asserting two types of copyright-infringement claims.

One kind of claim concerns how GenAIs are trained. To train a GenAI’s neural network, the developer feeds it a large volume of data, and the model learns the statistical connections between the parts of that data.
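To make that concrete, here is a minimal sketch, assuming a toy word-statistics model standing in for a real neural network: it counts which word tends to follow which in invented training text, then generates new text from those learned connections. The corpus and everything in it are illustrative, and real GenAIs learn vastly richer patterns, but the principle of learning connections rather than storing files is the same.

```python
from collections import defaultdict, Counter
import random

# Toy "training data" -- invented text standing in for a web-scale corpus.
corpus = "the plumber jumped over the turtle and the plumber grabbed the coin"

# "Training": count which word follows which -- the connections between
# the parts of the data. A real GenAI encodes far richer statistics in a
# neural network, but it likewise stores patterns, not the files themselves.
follows = defaultdict(Counter)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follows[current_word][next_word] += 1

# "Generation": repeatedly sample a likely next word from the learned statistics.
word = "the"
output = [word]
for _ in range(8):
    candidates = follows.get(word)
    if not candidates:
        break
    word = random.choices(list(candidates), weights=candidates.values())[0]
    output.append(word)
print(" ".join(output))
```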

It is widely understood that GenAIs were trained, at least in part, on information copied from the Internet without buying licenses from the content owners. Various content creators, such as artists and stock photo agencies, have sued GenAI makers, claiming this copying constitutes copyright infringement.

So far, no court has ruled on whether such copying for training constitutes copyright infringement. Due to the nature of this training, I believe courts will probably hold it is fair use and, thus, permitted.

The other kind of copyright-infringement claim against GenAIs is stronger when the conduct occurs: the GenAI outputting something copyright-infringing in response to a prompt. To constitute copyright infringement, the output must be a near-verbatim copy of a substantial part of a copyrighted work, such as a story, novel, picture, or software code.

Early on, due to the nature of GenAI, it was thought that a GenAI would rarely produce a perfect or near-verbatim copy of a single piece of training data. A GenAI merely studies the connections in its training data and encodes information about those connections in its neural network. It’s not a copy machine. But it now appears that, in certain situations, a GenAI will output copies of training-data material.
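The same toy model suggests how that can happen. This is only an illustration of memorization in general, not a claim about how any particular GenAI stores data: when a passage appears in the training data exactly once, every learned connection has a single possible continuation, so generation walks the chain and reproduces the training text verbatim.

```python
from collections import defaultdict

# A unique passage seen exactly once during "training" -- invented for illustration.
corpus = "generative models sometimes memorize rare passages from their training data"

# Learn the same word-following connections as in the earlier sketch.
follows = defaultdict(list)
words = corpus.split()
for current_word, next_word in zip(words, words[1:]):
    follows[current_word].append(next_word)

# Because each word here has exactly one observed successor, generation
# has no choice: it walks the chain and emits the training text verbatim.
word = "generative"
output = [word]
while follows.get(word):
    word = follows[word][0]
    output.append(word)
print(" ".join(output))  # -> the original sentence, copied exactly
```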

How significant is this legal problem for GenAI companies and their customers? Under copyright law, a copyright owner can sometimes recover substantial damages. This liability could fall on both the GenAI maker and the GenAI user. For now, the lawsuits are primarily, if not entirely, against GenAI makers. They are the source of the issue and have money. But users of AI output could also be liable.

In practice, how big of a threat is this? GenAI makers claim that much of this infringing output results from people deliberately prompting the GenAI to generate content they know would be copyright infringing. Intentionally seeking copyright-infringing output might increase the liability of the GenAI user and reduce (but not eliminate) the liability of the GenAI maker.

GenAI makers have tried implementing guardrails to prevent their GenAIs from producing copyright-infringing material. While they don’t disclose how they do this, it appears they attempt to block certain kinds of prompt wording that call for well-known copyright property, such as popular comic book, video game, and movie characters. But that becomes a game of spy versus spy: GenAI users find ways to recraft their prompts to get around the guardrails. Also, guardrails likely are written to stop the output of copyright-infringing material only after it’s known that people have been seeking it. You can’t guard everything. Ultimately, there is a trade-off: the more sensitive the guardrails, the less useful content the GenAI will produce overall.
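Since GenAI makers don’t disclose their guardrail designs, the sketch below is only a guess at the simplest possible approach: a keyword blocklist applied to prompts, with illustrative blocked terms. It shows why this spy-versus-spy game favors the user: a rephrased prompt never triggers the filter.

```python
# A deliberately naive guardrail: refuse prompts that name well-known
# copyrighted characters. Real systems are surely more sophisticated,
# but the cat-and-mouse dynamic is the same.
BLOCKED_TERMS = {"super mario", "robocop", "captain america"}

def guardrail_allows(prompt: str) -> bool:
    """Allow the prompt only if it mentions no blocked term."""
    lowered = prompt.lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

print(guardrail_allows("Draw Super Mario"))                      # False: blocked
print(guardrail_allows("Draw an Italian video game character"))  # True: slips through
# ...yet that second wording is the very prompt that reportedly
# produced images of Super Mario.
```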

GenAI makers could solve this problem by training only on public-domain information or information for which they purchase use licenses. That probably won’t happen. GenAIs need a lot of training data to do their magic well. Some academics claim that even all the information on the Internet may not be enough to train a GenAI optimally, and the subset of public-domain information is much smaller. Also, it’s difficult to discern what is in the public domain and what is not: most material on the Internet is someone’s copyright property, even if it doesn’t contain a copyright notice and was never registered for copyright. As for licensing, it could be highly expensive. Some content owners, such as stock photo agencies, might be unwilling to license at anything less than exorbitant rates for fear that GenAI will put them out of business.

Overall, we don’t have any information on the frequency of these copyright-infringing outputs. If they are frequent and can’t be largely eliminated by guardrails, it could mean game over for GenAI. We’ll see.

Written on February 21, 2024

By John B. Farmer

© 2024 Leading-Edge Law Group, PLC. All rights reserved.