Response to "Copyright Liability On LLMs Should Mostly Fall On The Prompter, Not The Service"
- Lewis Sorokin
- Jan 12, 2024
- 5 min read
I recently came across an interesting new copyright framework proposal for artificial intelligence large language models (LLMs) by esteemed technology attorney Ira Rothken. He likens an LLM to a VCR in that both can be led by a user to infringe prior works, calling for a refreshed Sony Doctrine (the rule from the 1984 Betamax case shielding makers of technologies capable of substantial non-infringing uses from liability for their users' infringement) for the age of AI: a TAO ("Training and Output") Doctrine.
This TAO Doctrine posits that:
... if a good faith AI LLM engine is trained using copyrighted works, where the (1) original work is not replicated but rather used to develop an understanding, and (2) the outputs generated are based on user prompts, the responsibility for any potential copyright infringement should lie with the user, not the AI system.
These factors touch on the right issues. However, I argue that neither this approach nor the analogy between LLMs and VCRs is entirely realistic, given the current direction of artificial intelligence offerings on the market and the application of copyright law. Moreover, I raise the possibility that generative AI as a whole is falling victim to the same flawed legal risk analysis that led to the fall of Napster.
In setting the stage, Mr. Rothken correctly notes that there is a dual-use nature to both LLMs and VCRs in that:
They are capable of transient and modest infringing activities when prompted or used inappropriately by users, but more significantly, they possess a vast potential for beneficial, non-infringing uses such as educational enrichment, idea enhancements, and advances in language processing.
This gives rise to the premise that an LLM is an "idea engine" when trained on copyrighted works. This is clarified by the explanation that training an LLM "is akin to a person learning a language through various sources but then using that language independently to create new sentences." Taking this a step further, the TAO Doctrine presumes that an LLM processes ideas, which become meaningful (and copyrightable) expressions only when the LLM produces an output.
I disagree; instead, I argue that an LLM functions more like an "expression engine" when specific expressions exist in the training data as inputs, regardless of the extent to which they influence an output. This fundamentally cuts against the first factor of the TAO Doctrine, that "original work is not replicated but rather used to develop an understanding." There must be a middle ground balancing the benefits of innovation and the needs of rightsholders with respect to use of their IP and compensation for it.
That is not to say, however, that there should be no risk allocated to end users; I believe just the contrary. To call back to the comparison between LLMs and VCRs: just about every VHS tape included a copyright warning message. If a viewer chose to run that red light, that viewer would rightfully be liable for the infringement and related penalties. More to the point, the analogy between VCRs and LLMs might be more apt if every VCR came with millions of unlicensed television episodes and movies built in, and all a user had to do to access them was press the right buttons. But then again, that would just be The Pirate Bay.
That said, an LLM is like a VCR in that both can be used to infringe copyright (hence the dual-use nature). Take, for example, the tendency of Anthropic's Claude to output well-known song lyrics (over which Universal Music Publishing Group and two smaller publishers filed suit in October 2023). Suppose a user prompts Claude to write song lyrics about a party in the USA. If Claude outputs the chorus to Miley Cyrus' "Party in the U.S.A." along with some newly written verses, certainly this was a result of the user's prompt. Even so, Anthropic should be primarily responsible for this infringement, given that the unlicensed copyrighted work substantially existed in its dataset and was easily extracted from it by the end user. Perhaps this example runs afoul of the first factor of the TAO Doctrine in that the lyrics of the chorus were, in fact, directly replicated in the dataset and not merely "understood" by it, but that would seem to be simply the nature of LLMs: with any "understanding" comes some extent of copying.
A similar example comes from the recently filed New York Times v. OpenAI lawsuit. The NYT cited several examples where ChatGPT could be prompted by a user to output entire NYT articles nearly verbatim. Without proper controls in ChatGPT's instructions to prevent outputs that line up so directly with inputs, it is difficult to argue that OpenAI has no liability. Perhaps OpenAI could build stronger copyright protections into the public-facing ChatGPT app so as to prevent any output from too closely matching anything in the underlying GPT-4 foundation model's training data. Or, better yet, perhaps GPT-5 and beyond could be trained only on fully licensed data sets. More fundamentally, it is vital for continued innovation in the arts, culture, and creative industries that creators and other rightsholders are duly compensated. If Getty Images can do it, the rest of the AI industry can follow suit.
And there's the rub. OpenAI made its position – that licensing IP is bad for business – crystal clear in a December 5, 2023 statement to the United Kingdom's House of Lords, saying:
Because copyright today covers virtually every sort of human expression–including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials.
The Machiavellian argument that the ends justify the means might be persuasive in a classroom, but I struggle to see it as a rationale for taking sticks away from the copyright bundle. Rather, I see this approach as upending the fundamental notion that copyright is an automatic right granted to authors, regardless of any contravening public policy arguments, unless the author waives this right with a license like CC0. Then again, licensing would have been bad for Napster's business too.
This highlights a serious problem facing the current AI product market. The assumption that using copyrighted works as training data is sufficiently transformative to constitute fair use has not been tested in the courts, but if Warhol v. Goldsmith is any indication, the Supreme Court is not itching to carve out new fair use exceptions. In fact, Justice Kagan's dissent in Warhol made an argument similar to many in the "training data = fair use" camp. Kagan argued that a ruling that even the works of Andy Warhol are not sufficiently transformative to constitute fair use would do real harm:
It will stifle creativity of every sort. It will impede new art and music and literature. It will thwart the expression of new ideas and the attainment of new knowledge. It will make our world poorer.
Indiscriminate mass copying in the name of innovation, with IP rights neglected along the way, will be to the detriment of creativity. The fears Justice Kagan expressed will be realized.
With all that said, it is true that copyright at its strictest can stifle creativity and innovation. I have made similar arguments. But there is a fine line between infringement and fair use, and even Warhol's "Prince" series did not pass muster under a fair use analysis. Nor did Napster's peer-to-peer file-sharing service. Just because the TAO Doctrine "could safeguard AI development and deter the floodgates of litigation" (as Mr. Rothken puts it) does not make it the right tool for the job. Generative AI stands to be a generational technology, one which could revolutionize productivity. Certainly, overly stringent IP policy should not stifle America's potential for advancements. But those noble ends must be balanced with the equally noble, albeit less glamorous, ends of ensuring that copyright protects those whom it is intended to protect.
The only way for generative artificial intelligence to mature as a technology is for the companies building foundation models to accept the cost of doing business. In short, AI must move past its Napster moment.