Mass Law Blog

Copyright And The Challenge of Large Language Models

Jul 1, 2024

“AI models are what’s known in computer science as black boxes: You can see what goes in and what comes out; what happens in between is a mystery.”

Trust but Verify: Peeking Inside the “Black Box” of Machine Learning

In December 2023, The New York Times filed a landmark lawsuit against OpenAI and Microsoft, alleging copyright infringement. This case, along with a number of similar cases filed against AI companies, brings to the forefront a fundamental challenge in applying traditional copyright law to a revolutionary technology: Large Language Models (LLMs). More than any copyright case that precedes them, these cases grapple with a form of alleged infringement that defies conventional legal analysis.

This article, the first in a three-part series, aims to explain how AI companies acquire vast amounts of training data and what happens to this data as it is processed into “artificial intelligence.” By understanding these technical aspects of LLMs, we can better appreciate the nuanced legal questions that arise when this technology intersects with copyright law, and why traditional methods of proving copyright infringement may be inadequate in this new technological landscape.

Disclaimer: I’m not a computer or AI scientist. However, neither are the judges and juries that will be asked to apply copyright law to this technology. It’s unlikely that they will go much beyond the level of detail I’ve used here.

What are Large Language Models (LLMs) and How Do They Work?

Large Language Models, or LLMs, are gargantuan AI systems that use a vast corpus of training data and billions of parameters. They are designed to understand, generate, and manipulate human language. They learn patterns from the data, allowing them to perform a wide range of language tasks with remarkable fluency. Their inner workings are fundamentally different from any previous technology that has been the subject of copyright litigation, including traditional computer software.

At their core, LLMs operate on a principle of prediction. Given a sequence of words, an LLM attempts to predict the most likely next word. This process is repeated, with each prediction informing the next, allowing the model to generate coherent text, answer questions, or perform other language-related tasks.
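To make that loop concrete, here is a toy sketch in Python. It uses a simple word-pair (bigram) count model built from a few invented sentences rather than a neural network; a real LLM learns vastly richer statistical patterns, but the generate-one-word-at-a-time loop is the same basic idea.

```python
from collections import Counter, defaultdict

# Toy "training data" -- invented sentences for illustration only.
corpus = (
    "the court ruled that the copying was fair use . "
    "the court found that the use was transformative . "
    "the model predicts the next word in the sequence ."
).split()

# "Training": count how often each word follows each other word (a bigram model).
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word: str) -> str:
    """Return the word most often seen after `word` in the training data."""
    candidates = following.get(word)
    return candidates.most_common(1)[0][0] if candidates else "."

# Generation: repeatedly predict the next word, feeding each prediction back in.
word = "the"
output = [word]
for _ in range(8):
    word = predict_next(word)
    output.append(word)

print(" ".join(output))
```

Even this toy model shows why generated text can echo its training data: the output is stitched together, one prediction at a time, from patterns in whatever text the model was exposed to.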

The “large” in Large Language Models refers to the enormous number of parameters these models can contain – sometimes in the hundreds of billions or even trillions. These parameters are essentially the model’s learned patterns and relationships, adjusted through exposure to massive amounts of text data. As a general rule, the larger and more diverse the training dataset, the more capable the resulting model.

AI companies employ various methods to acquire this data – 

– Web scraping and crawling. One of the primary methods of data acquisition is web scraping – the automated process of extracting data from websites. AI companies deploy sophisticated crawlers that systematically browse the internet, copying text from millions of web pages. This method allows for the collection of diverse, up-to-date information but raises questions about the use of copyrighted material without explicit permission. (A simplified sketch of this step appears after this list.)

– Partnerships and licensing agreements. Some companies enter into partnerships or licensing agreements to access high-quality, curated datasets. For instance, OpenAI has partnered with organizations like the Associated Press to use its news archives for training purposes.

– Public datasets and academic corpora. Many LLMs are trained, at least in part, on publicly available datasets and academic text collections. These might include Project Gutenberg’s collection of public domain books, scientific paper repositories, or large web-crawl datasets like the Common Crawl corpus.

– User-generated content. Platforms that interact directly with users, such as ChatGPT, can potentially use the conversations and inputs from users to further train and refine their models. This practice raises privacy concerns and questions about the ownership of user-contributed data.
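To give a sense of what the web-scraping step looks like in practice, here is a minimal sketch using the widely available Python libraries requests and BeautifulSoup. The URL is a placeholder; a production crawler would add politeness controls (robots.txt checks, rate limiting) and operate across millions of pages rather than one.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL for illustration -- a real crawler works from a huge frontier of links.
url = "https://example.com/some-article"

# Fetch the page; real crawlers also check robots.txt and throttle their requests.
response = requests.get(url, timeout=10)
response.raise_for_status()

# Parse the HTML and discard markup and scripts, keeping only the visible text.
soup = BeautifulSoup(response.text, "html.parser")
for tag in soup(["script", "style"]):
    tag.decompose()
text = soup.get_text(separator=" ", strip=True)

# In a real pipeline this text would be appended to the training corpus.
print(text[:500])
```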

In the context of the New York Times lawsuit, it’s worth noting that OpenAI, like many AI companies, has not publicly disclosed the full extent of its training data sources. However, it’s widely believed that the company uses a combination of publicly available web content, licensed datasets, and partnerships to build its training corpus. The lawsuit alleges that this corpus includes copyrighted New York Times articles, obtained without permission or compensation.

Once acquired, the raw data undergoes several processing steps before it can be used to train an LLM – 

– Data preprocessing and cleaning. The first step is cleaning the raw data: removing irrelevant information, correcting errors, and standardizing formats. In practice this may mean stripping away HTML tags, removing advertisements, or filtering out low-quality content.

– Tokenization and encoding. Next, the text is broken down into smaller units called tokens. These might be words, parts of words, or even individual characters. Each token is then converted into a numerical representation that the AI can process. This step is crucial because it determines how the model will interpret and generate language. A simplified sketch of both the cleaning and tokenization steps follows below.
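Here is a minimal sketch of these two steps, assuming the BeautifulSoup library for HTML stripping and OpenAI’s open-source tiktoken library for tokenization. The HTML snippet is invented for illustration; real pipelines apply far more elaborate cleaning and filtering before any tokenization takes place.

```python
from bs4 import BeautifulSoup
import tiktoken  # OpenAI's open-source tokenizer library

# Step 1: cleaning -- strip HTML markup from a (made-up) scraped page.
raw_html = "<html><body><h1>Sample Headline</h1><p>The court issued its ruling today.</p></body></html>"
clean_text = BeautifulSoup(raw_html, "html.parser").get_text(separator=" ", strip=True)
print(clean_text)  # -> "Sample Headline The court issued its ruling today."

# Step 2: tokenization and encoding -- convert the text into numeric token IDs.
encoding = tiktoken.get_encoding("cl100k_base")  # an encoding used by several OpenAI models
token_ids = encoding.encode(clean_text)
print(token_ids)                    # a list of integers, one per token
print(encoding.decode(token_ids))   # decoding the IDs recovers the original text
```

From this point on, the model never sees words at all – only these numeric token IDs.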

The Training Process: How Machines “Learn” From Data

During training, the LLM is exposed to this preprocessed data, learning to predict patterns and relationships between tokens. This is an iterative process: the model makes predictions, compares them to the actual data, and adjusts its internal parameters to improve accuracy. The adjustments are calculated using a technique known as “backpropagation,” and the cycle is repeated billions of times across the entire dataset. For a large LLM this can take months, running 24/7 on massive clusters of graphics processing units (GPUs).
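For readers who want to see what “adjusting parameters” looks like in code, here is a heavily simplified sketch using the PyTorch library. The model (one embedding layer and one linear layer) and the random token IDs are stand-ins for illustration only; production LLMs use deep transformer architectures and real tokenized corpora, but the predict, compare, backpropagate, update cycle is the same.

```python
import torch
import torch.nn as nn

# A toy next-token model: an embedding layer followed by a linear projection
# back onto the vocabulary. Real LLMs stack many transformer layers instead.
vocab_size, embed_dim = 1000, 64
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random token IDs standing in for a tokenized training corpus.
tokens = torch.randint(0, vocab_size, (10_000,))
inputs, targets = tokens[:-1], tokens[1:]    # predict each token from the one before it

for step in range(100):                      # real training runs for vastly more steps
    logits = model(inputs)                   # predictions: a score for every vocabulary token
    loss = loss_fn(logits, targets)          # compare predictions against the actual next tokens
    optimizer.zero_grad()
    loss.backward()                          # backpropagation: compute how to adjust each parameter
    optimizer.step()                         # nudge the parameters to reduce the error
```

Nothing in this loop stores the training text itself; each pass only nudges millions (or billions) of numeric parameters a tiny amount.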

The Invisible Transformation: From Text to Numbers

For purposes of copyright law, here’s the crux of the matter – after this process, the original text no longer exists in any recognizable form within the LLM. The model becomes a vast sea of numbers, with no direct correspondence to the original text. This transformation creates a fundamental challenge for copyright law – 

– No Side-by-Side Comparison: In traditional copyright cases, courts rely heavily on comparing the original work side-by-side with the allegedly infringing material. With LLMs, this is impossible. You can’t “read” an LLM or print it out for comparison.

– Black Box Nature: The internal workings of LLMs are often referred to as a “black box.” Even the developers may not fully understand how the model arrives at its outputs.

– Dynamic Generation: LLMs don’t store and retrieve text; they generate it dynamically based on learned patterns. This means that any similarity to copyrighted material in the output is generally the result of statistical prediction rather than retrieval of a stored copy.

– Distributed Information: Information from any single source is distributed across countless parameters in the model, making it impossible to isolate the influence of any particular work. (The short sketch below illustrates what these parameters actually look like.)
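To underscore the “sea of numbers” point, here is a short sketch (again using PyTorch, with a small stand-in model) that inspects what a model actually stores: named tensors of floating-point numbers, and nothing resembling the text it was trained on.

```python
import torch
import torch.nn as nn

# A small stand-in model; a production LLM has billions of such parameters.
model = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 1000))

# Everything the model "knows" is stored as tensors of floating-point numbers.
for name, tensor in model.state_dict().items():
    print(name, tuple(tensor.shape), tensor.dtype)

# Prints:
#   0.weight (1000, 64) torch.float32
#   1.weight (1000, 64) torch.float32
#   1.bias (1000,) torch.float32
# None of these entries contains any text from the training data --
# only numbers whose collective pattern encodes what the model learned.
```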

Legal Implications

This technological reality creates several unprecedented challenges for copyright law – 

– Proving Infringement: How can a plaintiff prove infringement when the allegedly infringing material can’t be directly observed or compared?

– Fair Use Analysis: Traditional fair use factors, such as the amount and substantiality of the portion used, become difficult to apply when the “portion used” is transformed beyond recognition.

– Substantial Similarity: The legal test of “substantial similarity” between works becomes almost meaningless in the context of LLMs.

– Expert Testimony: Courts will likely have to rely heavily on expert testimony to understand the technology, but even experts may struggle to definitively prove or disprove infringement.

For all of these reasons, to prove copyright infringement, plaintiffs such as the New York Times may have to rely on the “intermediate” copies that are used to start the training process, rather than the “end-product” LLMs that are accessed by users. And this, as we’ll see in my next post in this series, may favor OpenAI and other AI companies.

Conclusion

The NYT v. OpenAI case highlights a fundamental mismatch between traditional copyright law and the reality of LLM technology. As I move forward in this series, I’ll explore the legal theories and potential new frameworks that might be necessary to address these challenges. The outcome of this case could reshape our understanding of copyright in the digital age, potentially requiring new legal tests and standards that can account for the invisible, transformed nature of information within AI systems.

Part 2 in this series will focus on the legal issues around the “input problem” of using copyrighted material for training. Part 3 will look at the “output problem” of AI-generated content that may copy or resemble copyrighted works, including what the AI industry calls “memorization.” As we’ll see, each of these issues presents its own unique challenges in the context of a technology that defies traditional legal analysis.