by Lee Gesmer | Dec 30, 2024 | General
I’ve been belatedly reading Chris Miller’s Chip War, so I’m particularly attuned to U.S.-China relations around technology. Of course, the topic of Miller’s excellent book is advanced semiconductor chips, not social media apps. Nevertheless, the question now occupying the attention of the Supreme Court and the president-elect is the national security threat presented by a social media app used by an estimated 170 million U.S. users.
With Miller’s book as background I was interested when, on December 6, 2024, the D.C. Circuit Court of Appeals denied TikTok’s petitions challenging the constitutionality of the Protecting Americans from Foreign Adversary Controlled Applications Act. This statute, which was signed into law on April 24, 2024, mandated that TikTok’s parent company, ByteDance Ltd., divest its ownership of TikTok within 270 days or face a nationwide ban in the United States. The law reflected Congress’s concern that ByteDance – and, by extension, the Chinese government – posed a national security threat through the app’s data collection practices and its potential for content manipulation.
The effect of the D.C. Circuit’s decision is that ByteDance must divest itself of TikTok by January 19, 2025, the day before the Presidential inauguration.
It took TikTok and ByteDance only ten days – until December 16, 2024 – to file with the Supreme Court an emergency motion for an injunction, pending full review by the Court. And it then took the Supreme Court only two days to treat this motion as a petition for a writ of certiorari, grant the petition and put the case on the Supreme Court’s version of a “rocket docket” – briefing must be completed by January 3, 2025, and the Court will hear oral argument on January 10th, giving it plenty of time to decide the issue in the nine days left until January 19th.
Enter Donald Trump. In a surprising twist, the former president – who initially tried to ban TikTok in 2020 – has filed an amicus brief opposing an immediate ban. He contends that the January 19th deadline improperly constrains his incoming administration’s foreign policy powers, and he wants time to negotiate a solution balancing security and speech rights:
President Trump is one of the most powerful, prolific, and influential users of social media in history. Consistent with his commanding presence in this area, President Trump currently has 14.7 million followers on TikTok with whom he actively communicates, allowing him to evaluate TikTok’s importance as a unique medium for freedom of expression, including core political speech. Indeed, President Trump and his rival both used TikTok to connect with voters during the recent Presidential election campaign, with President Trump doing so much more effectively. . . .
[Staying the statutory deadline] would . . . allow President Trump’s Administration the opportunity to pursue a negotiated resolution that, if successful, would obviate the need for this Court to decide these questions.
Trump Amicus Brief, pp. 2, 9
The legal issues are novel and significant. The D.C. Circuit applied strict scrutiny but gave heavy deference to national security concerns while spending little time on users’ speech interests. Trump raises additional separation of powers questions about Congress dictating national security decisions and mandating specific executive branch procedures.
This case isn’t just about one app. It reflects deeper tensions over Chinese technological influence, data privacy, and government control of social media. The Court’s decision will likely shape how we regulate foreign-owned platforms while protecting constitutional rights in an interconnected world.
The January 10th arguments – if indeed they go forward on that date, given that president-elect Trump prefers they not – should be fascinating. At stake is not just TikTok’s fate, but precedent for how courts balance national security claims against free speech in the digital age.
________
Addendum:
The highly expedited schedule kept a lot of lawyers busy over the holidays. You can access the docket here. I count 22 amicus briefs, most filed on December 27. Reply briefs are due January 3.
by Lee Gesmer | Dec 11, 2024 | Copyright, DMCA/CDA
After reading my 3-part series on copyright and LLMs (start with Part 1, here) a couple of colleagues have asked me whether content owners could use the Digital Millennium Copyright Act (DMCA) to challenge the use of their copyright-protected content.
I’ll provide a short summary of the law on this issue, but the first thing to note is that the DMCA offers two potential avenues for content owners: Section 512(c)’s widely used ‘notice and takedown’ system and the lesser-known Section 1202(b)(1), which addresses the removal of copyright management information (CMI), such as author names, titles, copyright notices, and terms and conditions.
Section 1202(b)(1) – Removal or Alteration of CMI
First, let’s talk about the lesser-known DMCA provision. Several plaintiffs have tried an innovative approach under this provision, arguing that AI companies violated Section 1202(b)(1) by stripping away CMI in the training process.
In November, two federal judges in New York reached opposite conclusions on these claims. In Raw Story Media, Inc. v. OpenAI the plaintiff alleged that OpenAI had removed CMI during the training process, in violation of Section 1202(b)(1). The court applied the standing requirement established in TransUnion v. Ramirez, a recent Supreme Court case that dramatically restricted standing to sue in federal courts to enforce federal statutes. The court held that the publisher lacked standing because it couldn’t prove that it had suffered “concrete harm” from the alleged CMI removal. The court based this conclusion on the fact that Raw Story “did not allege that a copy of its work from which the CMI had been removed had been disseminated by ChatGPT to anyone in response to any specific query.” Absent dissemination Raw Story had no claim – under TransUnion, “no concrete harm, no standing.”
But weeks later, in The Intercept Media v. OpenAI, a different judge issued a short order allowing similar claims to proceed. We are awaiting the opinion explaining his rationale.
The California federal courts have also been unwelcoming to 1202(b)(1) claims. In two cases – Andersen v. Stability AI and Doe 1 v. GitHub – the courts dismissed 1202(b)(1) claims on the ground that removal of CMI requires identicality between the original work and the copy, which the plaintiffs had failed to establish. However, the GitHub case has been certified for an interlocutory appeal to the Ninth Circuit, and that appeal is worth watching. I’ll note that the identicality requirement is not in the Copyright Act – it is an example of judge-made copyright doctrine.
Section 512(c) – Notice-and-Takedown
While you are likely familiar with the DMCA’s Section 512(c) notice-and-takedown system (think YouTube removing copyrighted videos or music), this law faces major hurdles in the AI context. A DMCA takedown notice must identify the specific location where the infringing material is hosted – typically a URL. In the case of an AI model, the training data absorbed into the model is not hosted at any accessible or identifiable location, making it effectively impossible for copyright owners to issue takedown notices.
Unsurprisingly, I can’t find any major AI case in which a plaintiff has alleged violation of Section 512(c).
Conclusion
The collision between AI technology and copyright law highlights a fundamental challenge: our existing legal framework, designed for the digital age of the late 1990s, struggles to address the unique characteristics of AI systems. The DMCA, enacted when peer-to-peer file sharing was the primary concern, now faces unprecedented questions about its applicability to AI training data.
Stay tuned.
by Lee Gesmer | Nov 27, 2024 | Copyright
In the first two parts of this series I examined how large language models (LLMs) work and analyzed whether their training process can be justified under copyright law’s fair use doctrine. (Part 1, Part 2). However, I also noted that LLMs sometimes “memorize” content during the training stage and then “regurgitate” that content verbatim or near-verbatim when the model is accessed by users.
The “memorization/regurgitation” issue is featured prominently in The New York Times Company v. Microsoft and OpenAI, pending in federal court in the Southern District of New York. (A reference to OpenAI includes Microsoft, unless the context suggests otherwise). Because the technical details of every LLM and AI model are different I’m going to focus my discussion of this issue mostly on the OpenAI case. However, the issue has implications for every generative LLM trained on copyrighted content without permission.
The Controversy
Here’s the problem faced by OpenAI.
OpenAI claims that the LLM models it creates are not copies of the training data it uses to develop its models, but uncopyrightable patterns. In other words, a ChatGPT LLM is not a conventional database or search engine that stores and retrieves content. As OpenAI’s co-defendant Microsoft explains, “A program called a ‘transformer’ evaluates massive amounts of text, converts that text into trillions of constituent parts, discerns the relationships among all of them, and yields a natural language machine that can respond to human prompts.” (Link, p. 2)
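As a very rough intuition for what Microsoft is describing, consider a toy “model” that records nothing more than which word tends to follow which. The sketch below is a deliberate oversimplification (nothing like a real transformer), but it illustrates the industry’s claim that what a model stores is statistical relationships among tokens rather than the text itself – and it also hints at why a passage the model has seen only once can nonetheless be reproduced exactly.

```python
# A deliberately simplified sketch of the "patterns, not copies" idea.
# This is nothing like a real transformer; it only records which word follows which.
from collections import Counter, defaultdict

corpus = "the court held that the use was fair use"   # stand-in for training text
tokens = corpus.split()                                # real models use subword tokenizers

# "Training": count token-to-token transitions -- the only record kept of the text
transitions = defaultdict(Counter)
for prev, nxt in zip(tokens, tokens[1:]):
    transitions[prev][nxt] += 1

# "Generation": return the most likely next token given the previous one
def next_token(prev):
    return transitions[prev].most_common(1)[0][0] if prev in transitions else None

print(next_token("fair"))   # -> "use": a learned pattern that reproduces the source phrase
```

Note that a phrase appearing only once in the training text is recoverable verbatim from these “patterns” – a toy version of the memorization problem at the heart of the Times’ case.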
However, the Times has presented compelling evidence that challenges this narrative. The Times showed that it was able to prompt OpenAI’s ChatGPT-4 and Microsoft’s Copilot to produce lengthy, near-verbatim excerpts from specific Times articles, which the Times then cited in its complaint as proof of infringement.
The Times’ First Amended Complaint includes an exhibit with over 100 examples of ChatGPT “regurgitating” Times content verbatim or near-verbatim in response to specific prompts, with the copied text highlighted in red.
This evidence poses a fundamental question: If OpenAI’s models truly transform copyright-protected content into abstract patterns rather than storing it, how can they reproduce exact or nearly exact copies of that content?
The Times argues that this evidence reveals a crucial truth: actual copyrighted expression—not just abstract patterns—is encoded within the model’s parameters. This allegation strikes at the foundation of OpenAI’s legal position and weakens its fair use defense by suggesting its use of copyrighted material is more extensive and less transformative than claimed.
Just how big a problem this is for OpenAI and the AI industry is difficult to determine. I’ve tried to replicate the Times’ results with a variety of prompts on ChatGPT and several other frontier models, without success. In fact, I can’t even get the models to give me the text of Moby Dick, Tom Sawyer or other literary works whose copyrights have long expired.
Nevertheless, the Times was able to do this one hundred times, and it’s safe to assume that it could have continued well past that number, but thought that 100 examples was enough to make the point in its lawsuit.
OpenAI: “The Times Hacked ChatGPT”
What’s OpenAI’s response to this?
To date, OpenAI and Microsoft have not filed answers to the Complaint. However, they have given an indication of how they view these allegations in partial motions to dismiss filed by both companies.
Microsoft’s motion (p. 2) argues that the NYT’s methods to demonstrate how its content could be regurgitated did not represent real-world usage of the GPT tools at issue. “The Times,” it argues, “crafted unrealistic prompts to try to coax the GPT-based tools to output snippets of text matching The Times’s content.” (Emphasis in original) To get the NYT content regurgitated, a user would need to know the “genesis of that content.” “And in any event, the outputs the Complaint cites are not copies of works at all, but mere snippets” that do not rise to the level of copyright infringement.
OpenAI’s motion (p. 12) argues that the NYT “appears to have [used] prolonged and extensive efforts to hack OpenAI’s models”:
In the real world, people do not use ChatGPT or any other OpenAI product for that purpose, … Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will. . . . The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products. It took them tens of thousands of attempts to generate the highly anomalous results that make up Exhibit J to the Complaint. They were able to do so only by targeting and exploiting a bug (which OpenAI has committed to addressing) by using deceptive prompts that blatantly violate OpenAI’s terms of use. And even then, they had to feed the tool portions of the very articles they sought to elicit verbatim passages of, virtually all of which already appear on multiple public websites. Normal people do not use OpenAI’s products in this way.
It appears that OpenAI is referring to the provisions in its Terms of Service that prohibit anyone from “Us[ing] our Services in a way that infringes, misappropriates or violates anyone’s rights” or to “extract data.” OpenAI has labeled these “adversarial attacks.”
Copyright owners don’t buy this tortured explanation. As OpenAI has admitted in a submission to the Patent and Trademark Office, “An author’s expression may be implicated . . . because of a similarity between her works and an output of an AI system.” (link, n. 71, emphasis added).
Rights holders claim that their ability to extract memorized content from these systems puts the lie to, for example, OpenAI’s assertion that “an AI system can eventually generate media that shares some commonalities with works in the corpus (in the same way that English sentences share some commonalities with each other by sharing a common grammar and vocabulary) but cannot be found in it.” (link, p. 10).
Moreover, OpenAI’s “hacking” defense would seem to inadvertently support the Times’ position. After all, you cannot hack something that isn’t there. The very fact that this content can be extracted, regardless of the method, suggests it exists in the form of an unauthorized reproduction within the model.
OpenAI: “We Are Protected by the Betamax Case”
How will OpenAI and Microsoft respond to these allegations under copyright law?
To date, OpenAI and Microsoft have yet to file formal answers to the Times’ complaint. However, they have given us a hint of their defense strategy in their motions to dismiss, and it is based in part on the Supreme Court’s 1984 decision in Sony v. Universal City Studios, a case often referred to as “the Betamax case.”
In the Betamax case a group of entertainment companies sued Sony for copyright infringement, arguing that consumers used Sony VCRs to infringe by recording programs broadcast on television. The Supreme Court held that Sony could not be held contributorily liable for infringements committed by VCR owners. “[T]he sale of copying equipment . . . does not constitute contributory infringement if the product is . . . capable of substantial noninfringing uses.”
The takeaway from this case is that, under copyright law, if a product can be put to either a legal or an illegal purpose by end-users (a “dual-use”), its maker is not liable for users’ infringement so long as the opportunity for noninfringing use is substantial.
OpenAI and Microsoft assert that the Betamax case applies because, like the VCR, ChatGPT is a “dual-use technology.” While end users may be able to use “adversarial prompts” to “coax” a model to produce a verbatim copy of training data, the system itself is a neutral, general-purpose tool. In most instances it will be put to a non-infringing use. Citing the Betamax case Microsoft argues that “copyright law is no more an obstacle to the LLM than it was to the VCR (or the player piano, copy machine, personal computer, internet, or search engine)”—all dual-use technologies.
No doubt, in support of this argument OpenAI will place strong emphasis on ChatGPT’s many important non-infringing uses. The model can create original content, analyze public domain texts, process user-provided content, educate, generate software code, and more.
However, OpenAI’s reliance on the Betamax dual-use doctrine faces a challenge central to the doctrine itself. The Betamax case was based on secondary liability—whether Sony could be held responsible for consumers using VCRs to record television programs. The alleged infringements occurred through consumer action, not through any action taken by the device’s manufacturer.
But with generative LLMs such as ChatGPT the initial copying happens during training when the model memorizes copyrighted works. This is direct infringement by the AI company itself, not secondary infringement based on user prompts. When an AI company creates a model that memorizes and can reproduce copyrighted works, the company itself is doing the copying—making this fundamentally different from Betamax.
Before leaving this topic it’s important to note that the full scope of memorization within AI models of GPT-4’s scale may be technically unverifiable. While the models’ creators can detect some instances of memorization through testing, due to the scale and complexity of the models they cannot comprehensively examine their internal representations to determine the full extent of memorized copyrighted content. While the Times’ complaint demonstrated one hundred instances of verbatim copying, this could represent just the tip of the iceberg, or conversely, the outer limit of the problem. This uncertainty itself poses a significant challenge for courts attempting to apply traditional copyright principles.
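For readers curious what testing for memorization looks like in practice, the sketch below illustrates one common style of probe – my own illustration, not a description of any party’s actual methodology: feed a model the opening of an article and measure how closely its continuation matches the original. The generate function here is a hypothetical stand-in for whatever model is being examined.

```python
# Illustrative memorization probe (an assumption-laden sketch, not any party's actual method).
# Seed the model with the opening of an article, then compare its continuation to the original.
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    # Hypothetical stand-in: a real probe would call the model under examination here.
    return ""

def memorization_score(article_text: str, prefix_words: int = 50) -> float:
    words = article_text.split()
    prompt = " ".join(words[:prefix_words])            # the portion used to seed the model
    original_rest = " ".join(words[prefix_words:])     # what verbatim regurgitation would reproduce
    completion = generate(prompt)
    # A ratio near 1.0 suggests verbatim or near-verbatim regurgitation of the source text.
    return SequenceMatcher(None, completion, original_rest).ratio()
```

Even a probe like this only samples the model’s behavior; it cannot rule memorization in or out across billions of parameters, which is the verification problem described above.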
Technical Solutions
While these legal issues work their way through the courts, AI companies aren’t standing still. They recognize that their long-term success may depend on their ability to prevent or minimize memorization, regardless of how courts ultimately rule on the legal issues.
Their approaches to this challenge vary. OpenAI has told the public that it is taking measures to prevent the types of copying illustrated in the Times’ lawsuit: “we are continually making our systems more resistant to adversarial attacks to regurgitate training data, and have already made much progress in our recent models.” (link) This includes filtering or modifying user prompts to reject certain requests before they are submitted as prompts to the model and aligning the models to refuse to produce certain types of data. Try asking ChatGPT to give you the lyrics to Arlo Guthrie’s “Alice’s Restaurant Massacree” or Taylor Swift’s “Cruel Summer.” It will tell you that copyright law prohibits it from doing so.
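To make the filtering idea concrete, here is a minimal sketch of output-side filtering – an illustration built on my own assumptions, not OpenAI’s actual safeguards: compare a candidate completion against a reference set of protected text and refuse to return it if the overlap is too high.

```python
# Minimal sketch of output-side filtering (illustrative assumptions only;
# not OpenAI's actual implementation).
from difflib import SequenceMatcher

PROTECTED_SNIPPETS = [
    "hypothetical protected lyric or article text goes here",  # hypothetical reference set
]

def filtered_response(completion: str, threshold: float = 0.8) -> str:
    for snippet in PROTECTED_SNIPPETS:
        overlap = SequenceMatcher(None, completion.lower(), snippet.lower()).ratio()
        if overlap >= threshold:                        # too close to known protected text
            return "I can't reproduce that text because it appears to be copyrighted."
    return completion
```

A real system’s reference set, similarity measure and refusal behavior would be far more sophisticated, but the basic trade-off is visible even here: the more aggressive the filter, the more legitimate requests it will refuse.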
It’s also important to note that different AI companies are taking different approaches to this problem. For example, Google (which owns Gemini) uses supervised fine-tuning (explained here). Anthropic (which owns Claude) focuses on what it calls “constitutional AI” – a training methodology that builds in constraints against certain behaviors, including the reproduction of copyrighted content. (link here). Meta (LLaMA models) has implemented what it calls “deduplication” during the training process – actively removing duplicate or near-duplicate content from training data to reduce the likelihood of memorization. Additionally, Meta has developed techniques to detect and filter out potential memorized content during the model’s response generation phase. (link here).
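As a rough illustration of the deduplication idea – a toy sketch, not Meta’s actual pipeline – the snippet below drops documents whose normalized text has already been seen. Passages that appear many times in training data are the ones most likely to be memorized verbatim, so removing duplicates reduces that risk.

```python
# Toy deduplication pass over training documents (illustrative only; not Meta's pipeline).
import hashlib

def dedupe(documents):
    seen, unique = set(), []
    for doc in documents:
        normalized = " ".join(doc.lower().split())                # crude text normalization
        digest = hashlib.sha256(normalized.encode()).hexdigest()  # fingerprint of the document
        if digest not in seen:                                    # keep only the first copy
            seen.add(digest)
            unique.append(doc)
    return unique
```

Production pipelines use fuzzier matching (for example, near-duplicate detection across overlapping text fragments), since exact hashing misses lightly edited copies.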
Conclusion
The AI industry faces a fundamental challenge that sits at the intersection of technology and law. Current research suggests that some degree of memorization may be inherent to large language models – raising a crucial question for courts: If memorization cannot be eliminated without sacrificing model performance, how should copyright law respond?
The answer could reshape both AI development and copyright doctrine. AI companies may need to accept reduced performance in exchange for legal compliance, while content creators must decide whether to license their works for AI training despite the risk of memorization. The industry’s ability to develop systems that truly learn patterns without memorizing specific expressions – or courts’ willingness to adapt copyright law to this technological reality – may determine its future.
The outcome of the Times lawsuit may establish crucial precedents for how copyright law treats AI systems that can memorize and reproduce protected content. At stake is not just the legality of current AI models, but the broader question of how to balance technological innovation with the rights of content creators in an era where the line between learning and copying has become increasingly blurred.
by Lee Gesmer | Oct 15, 2024 | Copyright
“Fair use is the great white whale of American copyright law. Enthralling, enigmatic, protean, it endlessly fascinates us even as it defeats our every attempt to subdue it.” – Paul Goldstein
__________________________
This is the second in a 3-part series of posts on Large Language Models (LLMs) and copyright. (Part 1 here)
In this post I’ll turn to a controversial and important legal question: does the use of copyrighted material in training LLMs for generative AI constitute fair use? This analysis requires a nuanced understanding of both copyright fair use and the technical aspects of LLM training (see Part 1). To examine this complex issue I’ll look at recent relevant case law and consider potential solutions to the legal challenges posed by AI technology.
Introduction
The issue is this: generative AI systems – systems that generate text, graphics, video, music – are being trained without permission on copies of millions of copyrighted books, artwork, software and music scraped from the internet. However, as I discussed in Part 1 of this series, the AI industry argues that the resulting models themselves are not infringing. Rightsholders argue that even if this is true (and they assert that it is not), the use of their content to train AI models is infringing, and that is the focus of this post.
To put this in perspective, consider where AI developers get their training data. It’s generally acknowledged that many of them have used resources such as Common Crawl, a digital archive containing 50 billion web pages, and Books3, a digital library of thousands of books. While these resources may contain works that are in the public domain, there’s no doubt that they contain a huge quantity of works that are protected by copyright.
In the AI industry, the thirst for this data is insatiable – the bigger the language models, the better they perform, and copyrighted works are an essential component of this data. In fact, the industry is already looking at a “data wall,” the point at which it runs out of new data to train on. It may hit that wall in the next few years. If copyrighted works can’t be included in training data, that day will come even sooner.
Rightsholders assert that the use of this content to train LLMs is outright, massive copyright infringement. The AI industry responds that fair use – codified in 17 U.S.C. § 107 – covers most types of model training where, as they assert, the resulting model functions differently than the input data. This is not just an academic difference – the issue is being litigated in more than a dozen lawsuits against AI companies, attracting a huge amount of attention from the copyright community.
No court has yet ruled on whether fair use protects the use of copyright-protected material as training material for LLMs. Eventually, the courts will answer this question by applying the language of the statute and the court decisions applying copyright fair use.
Legal Precedents Shaping the AI Copyright Landscape
To understand how the courts are likely to evaluate these cases we need to look at four recent cases that have shaped the fair use landscape: the two Google Books cases, Google v. Oracle, and Warhol Foundation v. Goldsmith. In addition the courts are likely to apply what is known as the “intermediate copying” line of cases.
The Google Books Cases. Let’s start with the two Google Books cases, which in many ways set the stage for the current AI copyright dilemma. The AI industry has put its greatest emphasis on these cases. (OpenAI: “Perhaps the most compelling case on point is Authors Guild v. Google”).
Authors Guild v. Google and Authors Guild v. HathiTrust. In 2015, the Second Circuit Court of Appeals decided Authors Guild v. Google, a copyright case that had been winding through the courts for a decade. Google had scanned millions of books without permission from rightsholders, creating a searchable database.
The Second Circuit held that this was fair use. The court’s decision hinged on two key points. First, the court found Google’s use highly “transformative,” a concept central to fair use. Google wasn’t reproducing books for people to read; it was creating a new tool for search and analysis. While Google allowed users to see small “snippets” of text containing their search terms, this didn’t substitute for the actual books. Second, the court found that Google Books was more likely to enhance the market for books than harm it. The court also emphasized the immense public benefit of Google Books as a research tool.
A sister case in the Google Books saga was Authors Guild v. HathiTrust, decided by the Second Circuit in 2014. HathiTrust, a partnership of academic institutions, had created a digital library from book scans provided by Google. HathiTrust allowed researchers to conduct non-consumptive research, such as text mining and computational analysis, on the corpus of digitized works. Just as in Google Books, the court found the creation of a full-text searchable database to be a fair use, even though it involved copying entire works. Importantly, the court held this use of the copyrighted books to be transformative and “nonexpressive.”
The two cases were landmark fair use decisions, especially for their treatment of mass digitization and nonexpressive use of copyrighted works – a type of use that involves copying copyrighted works but does not communicate the expressive aspects of those works.
These two cases, while important, by no means guarantee the AI industry the fair use outcome they are seeking. Reliance on Google Books falters given the scope of potential output of AI models. Unlike Google Books’ limited snippets, LLMs can generate extensive text that may mirror the style and substance of copyrighted works in their training data. This raises concerns about market harm, a critical factor in fair use analysis, and whether LLM-generated content could eventually serve as a market substitute for the original works. The New York Times argues just this in its copyright infringement case against OpenAI and Microsoft.
HathiTrust is an even weaker precedent for LLM fair use. The Second Circuit held that HathiTrust’s full-text search “posed no harm to any existing or potential traditional market for the copyrighted works.” LLMs, in contrast, have the potential to generate content that could compete with or substitute for original works, potentially impacting markets for copyrighted material. Also, HathiTrust was created by universities and non-profit institutions for educational and research purposes. Commercial LLM development may not benefit from the same favorable consideration under fair use analysis.
In sum, the significant differences in purpose, scope, and potential market impact make both Google Books and HathiTrust imperfect authorities for justifying the comprehensive use of copyrighted materials in training LLMs.
Google v. Oracle. Fast forward to 2021 for another landmark fair use case, this time involving software code. In Google v. Oracle, the Supreme Court held that Google’s copying of 11,500 lines of code from Oracle’s Java API, done to facilitate interoperability, was fair use.
The Court found Google’s “purpose and character” was transformative because it “sought to create new products” and was “consistent with that creative ‘progress’ that is the basic constitutional objective of copyright itself.” The Court also downplayed the market harm to Oracle, noting that Oracle was “poorly positioned to succeed in the mobile phone market.”
This decision seemed to open the door for tech companies to make limited use of some copyrighted works in the name of innovation. However, the case’s focus on functional code limits its applicability to LLMs, which are trained on expressive works like books, articles, and images. The Supreme Court explicitly recognized the inherent differences between functional works, which lean towards fair use, and expressive creations at the heart of copyright protection. So, again, the AI industry will have difficulty deriving much support from this decision.
And, before we could fully digest Oracle’s implications for fair use, the Supreme Court threw a curveball.
Andy Warhol Foundation v. Goldsmith. In 2023, the Court decided Andy Warhol Foundation v. Goldsmith (Warhol), a case dealing with Warhol’s repurposing of a photograph of the musician Prince. While the case focused specifically on appropriation art, its core principles resonate with the ongoing debate surrounding LLMs’ use of copyrighted materials.
The Warhol decision emphasizes a use-based approach to fair use analysis, focusing on the purpose and character of the defendant’s use, particularly its commercial nature, and whether it serves as a market substitute for the original work. This emphasis on commerciality and market substitution poses challenges for LLM companies defending the fair use of copyrighted works in training data. The decision underscores the importance of considering potential markets for derivative works. As the use of copyrighted works for AI training becomes increasingly common, a market for licensing such data is emerging. The existence of such a market, even if nascent, could weaken the argument that using copyrighted materials for LLM training is a fair use, particularly when those materials are commercially valuable and readily licensable.
The “Intermediate Copying” Cases. I also expect the AI industry to rely on the case law on “intermediate copying.” In this line of cases the users copied material to discover unprotectable information or as a minor step towards developing an entirely new product, so the final output – despite using copied material as an intermediate step – was noninfringing. In these cases the “intermediate use” was held to be fair use. See Sega v. Accolade (9th Cir. 1992) (defendant copied Sega’s copyrighted software to figure out the functional requirements for making games compatible with Sega’s gaming console); Sony v. Connectix (9th Cir. 2000) (defendant used a copy of Sony’s software to reverse engineer it and create a new gaming platform on which users could play games designed for Sony’s gaming system).
AI companies likely will argue that, just as in these cases, LLM training copies works only as an intermediate step – studying language patterns in order to build a noninfringing end product. Rightsholders likely will respond that whereas the copiers in those cases sought to study functionality or achieve compatibility, the scope and nature of LLM training and the resulting products are vastly different. I expect rightsholders will have the better argument on these cases.
Applying Legal Precedents to AI
So, where does this confusing collection of cases leave us? Here’s a summary:
The Content Industry Position – in a Nutshell: Rightsholders argue that – even assuming that the final LLM model does not contain expressive content (which they dispute) – the use of copyrighted works to train LLMs is an infringement not excused by fair use. They argue that all four fair use factors weigh against AI companies:
– Purpose and character: Many (but not all) AI applications are commercial, which cuts against the industry’s fair use argument, especially in light of Warhol’s emphasis on commercial purpose and the potential licensing market for training data. The existence of a licensing market for training datasets suggests that AI companies can obtain licenses rather than rely on fair use defenses. This last point – market effect – is particularly important in light of the Supreme Court’s holding in Andy Warhol.
– Nature of the work: Unlike the computer code in Google v. Oracle, which the Supreme Court noted receives “thin” protection, the content ingested by AI companies contains highly creative works like books, articles, and code. This distinguishes Oracle from AI training, and cuts against fair use.
– Amount used: Entire works are copied, a factor that weighs against fair use.
– Market effect: End users are able to extract verbatim content from LLMs, harming the market for original works and, as noted above, harming current and future AI training licensing markets.
The AI Industry Position – in a Nutshell. The AI industry will argue that the use of copyrighted works should be considered fair use:
– Transformative Use: The AI industry argues that AI training creates new tools with different purposes from the original works, using copyright material in a “nonexpressive” way. AI developers draw parallels to “context shifting” fair use cases dealing with search engines and digital libraries, such as the Google Books project, arguing AI use is even more transformative. I expect them to rely on Google v. Oracle to argue that, just as Google’s use of Oracle’s API code was found to be transformative because it created something new that expanded the use of the original code (the Android platform), AI training is transformative, as it creates new systems with different purposes from the original works. Just as the Supreme Court emphasized the public benefit of allowing programmers to use their acquired skills, similarly AI advocates are likely to highlight the broad societal benefits and innovation enabled by LLMs trained on diverse data.
– Intermediate Copying. AI proponents will support this argument by pointing to the “intermediate copying” line of cases, which hold that using copyrighted works for purposes incidental to a nonexpressive purpose (creating the non-infringing model itself), is permissible fair use.
– Market Impact: AI proponents will argue that AI training, and the models themselves, do not directly compete with or substitute for the original copyrighted works.
– Amount and Substantiality: Again relying on Google v. Oracle, AI proponents will note that Google copied 11,500 lines of code in their entirety, yet the Court found fair use. This will support their argument that copying entire works for AI training doesn’t preclude fair use if the purpose is sufficiently transformative.
– Public Benefit: In Google v. Oracle the Court showed a willingness to interpret fair use flexibly to accommodate technological progress. AI proponents will rely on this, and argue that applying fair use to AI training has social benefits and aligns with copyright law’s goal of promoting progress. The alternative, restricting access to training data, could significantly hinder AI research and development. (AI “doomers” are unlikely to be persuaded by this argument).
– Practical Necessity: Given the vast amount of data needed, obtaining licenses for all copyrighted material used in training is impractical, if not impossible, or would be so expensive that it would stifle AI development.
As I noted above, several of the lawsuits filed to date allege that some generative AI models have “memorized” copyrighted materials and are able to output them in a way that could substitute for the copyrighted work. If the outputs of a system can infringe, the argument that the system itself does not implicate copyright’s purposes will be significantly weakened.
While Part 3 of this series will explore these output-related issues in depth, it’s important to recognize the intrinsic link between these concerns and input-side training challenges. In assessing AI’s impact on copyright law courts may adopt a holistic approach, considering the entire content lifecycle – from data ingestion to LLMs to final output. This interconnected perspective reflects the complex nature of AI systems, where training methods directly influence both the characteristics and potential infringement risks of generated content.
Potential Solutions and Future Directions
As challenging as these issues are, we need to start thinking about practical solutions that balance the interests of AI developers, content creators, and the public. Here are some possibilities, along with their potential advantages and drawbacks.
Licensing Schemes: One proposed solution is to develop comprehensive licensing systems for AI training data, similar to those that exist for certain music uses. This could provide a mechanism for compensating creators while ensuring AI developers have access to necessary training data.
Proponents argue that this approach would respect copyright holders’ rights and provide a clear framework for legal use. However, critics rightly point out that implementing such a system would be enormously complex and impractical. The sheer volume of content used in AI training, the difficulty of tracking usage, and the potential for exorbitant costs could stifle innovation, particularly for smaller AI developers.
New Copyright Exceptions: Another approach is to create specific exemptions for AI training, perhaps limited to non-commercial or research purposes. This could be similar to existing fair use exceptions for research and could promote innovation in AI development. The advantage of this approach is that it provides clarity and could accelerate AI research. However, defining the boundaries of “non-commercial” use in the rapidly evolving AI landscape could prove challenging.
International Harmonization: Given the global nature of AI development, the industry may need to work towards a unified international approach to copyright exceptions for AI. This could involve amendments to international copyright treaties or the development of new AI-specific agreements. However, international copyright negotiations are notoriously slow and complex. Different countries have varying interests and legal traditions, which could make reaching a consensus difficult.
Technological Solutions: We should also consider technological approaches to addressing these issues. For instance, AI companies could develop more sophisticated methods to anonymize or transform training data, making it harder to reconstruct original works on the “output” side. They could also implement filtering systems to prevent the output of copyrighted material. While promising, these solutions would require significant investment and might not fully address all legal concerns. There’s also a risk that overzealous filtering could limit the capabilities of AI systems.
Hybrid Approaches: Perhaps the most promising solutions will combine elements of the above approaches. For example, we could see a tiered system where certain uses are exempt, others require licensing, and still others are prohibited. This could be coupled with technological measures such as synthetic training data, and international guidelines.
Market-Driven Solutions: As the AI industry matures, we are likely to see the emergence of new business models that naturally address some of these copyright concerns. For instance, content creators might start producing AI-training-specific datasets, or AI companies might vertically integrate to produce their own training content. X’s Grok models and Meta’s models, each trained in part on content from their companies’ own platforms, are examples of this.
As we consider these potential solutions, it’s crucial to remember that the goal of copyright law is to foster innovation while fairly compensating creators and respecting intellectual property rights. Any solution will likely require compromise from all stakeholders and will need to be flexible enough to adapt to rapidly changing technology.
Moreover, these solutions will need to be developed with input from a diverse range of voices – not just large tech companies and major content producers, but also independent creators, smaller AI startups, legal experts, and public interest advocates. The path forward will require creativity, collaboration, and a willingness to rethink traditional approaches to copyright in the artificial intelligence age.
Conclusion – The Road Ahead
The intersection of AI and copyright law presents complex challenges that resist simple solutions. The Google Books cases provide some support for mass digitization and computational use of copyrighted works. Google v. Oracle suggests courts might look favorably on uses that promote new and beneficial AI technologies. But Warhol reminds us that transformative use has limits, especially in commercial contexts.
For AI companies, the path forward involves careful consideration of training data sources and potential licensing arrangements. It may also mean being prepared for legal challenges and working proactively with policymakers to develop workable solutions.
For content creators, it’s crucial to stay informed about how your work might be used in AI training. There may be new opportunities for licensing, but also new risks to consider.
For policymakers and courts, the challenge is to strike a balance that fosters innovation while protecting the rights and incentives of creators. This may require rethinking some fundamental aspects of copyright law.
The relationship between AI and copyright is likely to be a defining issue in intellectual property law for years to come. Stay tuned, stay informed, and be prepared for a wild ride.
Continue with Part 3 of this series here.