by Lee Gesmer | Feb 17, 2025 | General
The community of copyright AI watchers has been eagerly awaiting the first case to evaluate the legality of using copyright-protected works as training data. We finally have it, and it has a lot of copyright law experts scratching their heads and wondering what it means for the AI industry.
On February 11, 2025, Third Circuit federal appeals court Judge Stephanos Bibas—sitting by designation in the U.S. District Court for the District of Delaware—issued a decision that is likely to shape the future of AI copyright litigation. By granting partial summary judgment to Thomson Reuters Enterprise Centre GmbH (“Thomson Reuters”) against Ross Intelligence Inc. (“Ross”), the court revisited and reversed its earlier 2023 opinion and rejected Ross’s fair use defense. Although this case involves a non-generative AI application, the reasoning has implications for the more than 30 AI copyright cases currently being litigated.
Case Overview
The Ross litigation centers on allegations that Ross used copyrighted material from Thomson Reuters’ Westlaw—a leading legal research platform—to train its AI-driven legal research tool. Ross wanted to use the Westlaw headnotes to train its AI model, but Thomson Reuters would not grant Ross a license. Instead, Ross commissioned “Bulk Memos” from a third-party provider. These memos, designed to simulate legal questions and answers, closely mirrored Westlaw headnotes—concise summaries that encapsulate judicial opinions. After determining that 2,243 of the Bulk Memo questions were substantially similar to Westlaw headnotes, the court held that Ross had committed direct copyright infringement and rejected its fair use defense.
Breaking Down the Fair Use Analysis
The court evaluated the four statutory fair use factors, with two—“purpose and character” and “market effect”—proving decisive:
1 – Purpose and Character of the Use: The court found that Ross’s use was commercial and aimed at developing a product that directly competes with Westlaw. Despite Ross’s argument that its copying was merely an “intermediate step” in a broader process, the judge rejected the applicability of the intermediate-copying cases (discussed below), emphasizing that “Ross was using Thomson Reuters’s headnotes as AI data to create a legal research tool to compete with Westlaw.” Importantly, the court’s analysis was informed by the framework established in the recent Supreme Court decision in Warhol v. Goldsmith, which stressed that reproduction fails to constitute a transformative use if the copying serves a market function similar to the original’s. The Warhol precedent underlines that transformation requires a “further purpose or different character” from the original work, a requirement Ross did not meet.
2 – Market Effect: The market effect factor proved even more influential. By positioning itself as a direct substitute for Westlaw, Ross both disrupted the existing market and undercut potential licensing markets for Thomson Reuters’s content (notwithstanding that Thomson refused to license to Ross). The court noted that any harm to this market—“undoubtedly the single most important element of fair use”—weighed decisively against Ross.
While the factors addressing the nature of the copyrighted work and the amount used modestly favored Ross, they were insufficient to overcome the adverse findings regarding the purpose of the use and market harm.
The Court’s 2023 Ruling vs. The Current Ruling
It’s worth noting the struggle the judge went through in deciding the fair use issue in this case. Judges rarely reverse themselves on major rulings, but that’s what happened here.
As I noted, the judge in this case had issued a 2023 decision on the fair use issue. There, he held that the question of whether Ross’s use of the West headnotes was fair use was an issue for the jury.
In the current decision he reversed himself.
Here’s what the judge said in 2023:
If Ross’s characterization of its activities is accurate, it translated human language into something understandable by a computer as a step in the process of trying to develop a “wholly new,” albeit competing, product—a search tool that would produce highly relevant quotations from judicial opinions in response to natural language questions. This also means that Ross’s final product would not contain or output infringing material. Under Sega [v. Accolade] and Sony [v. Connectix], this is transformative intermediate copying.
And here is what he said in his 2025 decision:
My prior opinion wrongly concluded that I had to send this factor to a jury. I based that conclusion on Sony and Sega. Since then, I have realized that the intermediate-copying cases [Sony, Sega] (1) are computer-programming copying cases; and (2) depend in part on the need to copy to reach the underlying ideas. Neither is true here. Because of that, this case fits more neatly into the newer framework advanced by Warhol. I thus look to the broad purpose and character of Ross’s use. Ross took the headnotes to make it easier to develop a competing legal research tool. So Ross’s use is not transformative. Because the AI landscape is changing rapidly, I note for readers that only non-generative AI is before me today.
This was a major change in direction, and it reflects the challenge the judge perceived in applying copyright fair use to artificial intelligence under the facts in this case.
Implications for Generative AI Litigation
The question on the minds of most copyright AI observers is, “What does this mean for the more than 30 copyright cases against frontier AI model developers—OpenAI, Google, Anthropic, Facebook, X/Twitter, and many others?”
My answer? In most cases, likely not much.
The 2025 Ross decision underscores that even intermediate copying can fall outside fair use when it ultimately facilitates the creation of a product that directly competes with the copyrighted work. For example, unlike Authors Guild v. Google (the Google Books case), where the copying enabled a unique search function without substituting for the original works, Ross’s use of headnotes was aimed squarely at developing an AI legal research tool that encroaches on Westlaw’s market. This market harm—central to fair use analysis—undermines the fair use defense by establishing that the copying, even if temporary or intermediate, has a direct commercial impact. The ruling aligns with recent precedents like Warhol, which require a truly transformative purpose rather than mere replication, thereby narrowing the scope of permissible intermediate copying in AI training contexts.
However, the case may not have much significance for most of the pending AI copyright cases. While the Ross decision tightens the fair use framework in situations where the end product directly competes with the original work, most current generative AI cases do not involve direct competition. Most generative AI systems produce entirely new content rather than serving as a substitute for the copyrighted materials used during training. As a result, the market harm and competitive concerns central to the Ross ruling may not be as relevant in these cases, and its impact on the broader generative AI landscape may be limited.
Conclusion
The ruling in Thomson Reuters v. Ross Intelligence sets an important precedent for how courts may evaluate the use of copyrighted works in AI training. Although fact-specific and limited to a non-generative AI context, the decision’s reliance on principles from the Warhol case—particularly the need for a transformative purpose and the critical weight of market impact—will likely influence future disputes, including those involving frontier generative AI models, particularly where the AI model competes with the owner of the training data.
Developers and content owners alike should take note: as the legal landscape adapts to the realities of AI, robust data sourcing strategies and a clear understanding of copyright limitations will be crucial. For companies working on generative AI, the challenge will be to innovate without replicating the competitive functions of existing copyrighted works—a balancing act that this decision has now brought into focus.
It’s also important to note that this ruling doesn’t end the case. There are remaining issues of fact that the judge reserved for trial. However, it appears that Ross Intelligence is bankrupt, and therefore may not have the financial resources to continue to trial. And, of course, Ross could appeal the trial judge’s rulings at the conclusion of the case, although it is questionable whether it will be able to do so for the same reason. It seems likely that this case will end here.
Thomson Reuters Enterprise Centre GmbH v. Ross Intelligence Inc. (D. Del. Feb. 11, 2025)
by Lee Gesmer | Feb 1, 2025 | DMCA/CDA
There was a period from roughly 2010 to 2016 when it seemed like I was posting on the DMCA take-down system every few months. Many of these posts focused on the Viacom v. YouTube litigation in the Second Circuit. See here, here, here and here. This massive litigation ended with a settlement in 2014. Nevertheless, before the case settled the Second Circuit issued a significant decision, establishing an important precedent on the application of the Digital Millennium Copyright Act.
The Second Circuit’s January 13, 2025 decision in Capitol Records v. Vimeo – written by Judge Pierre Leval, the Second Circuit’s widely acknowledged authority on copyright law – feels like déjà vu. Fifteen years after Capitol Records filed suit, the court has reaffirmed and expanded upon the DMCA safe harbor principles it established thirteen years ago in YouTube. Yet the Vimeo decision addresses novel issues that highlight how both technology and legal doctrine have evolved since the YouTube era.
Building on YouTube’s Foundation
In its 2012 decision in Viacom v. YouTube, the Second Circuit ruled that overcoming an internet provider’s DMCA safe harbor protection requires copyright owners to show either that a platform had actual knowledge of specific infringements or that infringement would be “obvious to a reasonable person” – so-called “red flag knowledge.” Generalized awareness that infringement has occurred on a platform wasn’t enough. This framework has served as the primary guidance during the explosive growth of user-generated content over the past decade.
Vimeo: New Technology, New Challenges
The Vimeo case presented similar issues but in a transformed technological landscape. Capitol Records asserted that Vimeo lost safe harbor protection because its employees interacted with 281 user-posted videos containing copyrighted music. While YouTube dealt with a nascent video-sharing platform, Vimeo involved a sophisticated service with established content moderation practices.
The court’s analysis of “red flag” knowledge builds on YouTube while providing important new guidance. Employee interaction with content through likes, comments, or featuring videos doesn’t create red flag knowledge. Copyright owners must now prove “specialized knowledge,” and basic copyright training or work experience isn’t enough to establish the expertise needed for this level of knowledge. Even obvious use of copyrighted music doesn’t create red flag knowledge given the complexity of fair use determinations, with the court specifically citing the recent Warhol case where copyright experts split on fair use analysis.
While YouTube focused primarily on knowledge standards, Vimeo tackles a critical question for modern platforms: how much content moderation is too much? The court held that basic curation—like featuring videos in “Staff Picks” or maintaining community standards—won’t strip safe harbor protection. It left open whether more aggressive moderation or encouraging specific types of potentially infringing content might cross the line.
See No Evil, Hear No Evil
However, the decision also creates incentives for platforms to minimize their oversight of copyrighted uploads to avoid triggering red flag liability: by limiting active monitoring or interaction with user-generated content, platforms can reduce the risk of being deemed to have actual or red flag knowledge of infringement. This has the effect of reinforcing the DMCA’s notice-and-takedown framework as the primary mechanism for addressing copyright infringement. Platforms like Vimeo are likely to choose to rely more heavily on this reactive system rather than implementing robust preemptive measures.
AI and the Future of Safe Harbor
The Vimeo decision leaves open an increasingly important question: how will courts apply these standards as platforms adopt artificial intelligence for content moderation? While the court focused on human knowledge and interaction, modern platforms increasingly rely on automated systems to identify potential infringement. Future litigation will likely need to address whether AI-powered content recognition creates the kind of “specialized knowledge” that might lead to red flag awareness, and whether algorithmic promotion of certain content categories could constitute “substantial influence.”
While Vimeo expands on YouTube’s framework, both cases highlight a fundamental flaw in the DMCA safe harbor: the time and cost of litigation effectively nullifies its protections. YouTube took three years to resolve; Vimeo took fifteen. Without legislative clarity on key terms like “red flag knowledge” and “substantial influence,” copyright owners can continue using litigation costs as a weapon against small and mid-sized platforms—exactly what the DMCA was meant to prevent.
As technology advances, particularly in AI-powered content moderation, platforms must carefully balance robust content management with safe harbor compliance. The Vimeo decision provides valuable guidance while highlighting the need for continued evolution in DMCA safe harbor doctrine.
Capitol Records, LLC v. Vimeo, Inc. (2d Cir. Jan. 13, 2025)
by Lee Gesmer | Dec 30, 2024 | General
I’ve been belatedly reading Chris Miller’s Chip War, so I’m particularly attuned to U.S.-China relations around technology. Of course, the topic of Miller’s excellent book is advanced semiconductor chips, not social media apps. Nevertheless, the topic now occupying the attention of the Supreme Court and the president-elect is the national security threat presented by a social media app used by an estimated 170 million U.S. users.
With Miller’s book as background I was interested when, on December 6, 2024, the D.C. Circuit Court of Appeals denied TikTok’s petitions challenging the constitutionality of the Protecting Americans from Foreign Adversary Controlled Applications Act. This statute, which was signed into law on April 24, 2024, mandated that TikTok’s parent company, ByteDance Ltd., divest its ownership of TikTok within 270 days or face a nationwide ban in the United States. The law reflected Congress’s concerns that ByteDance and, by extension, the Chinese government, posed a national security threat through data collection and potential content manipulation.
The effect of the D.C. Circuit’s decision is that ByteDance must divest itself of TikTok by January 19, 2025, the day before the Presidential inauguration.
It took TikTok and ByteDance only ten days – until December 16, 2024 – to file with the Supreme Court an emergency motion for an injunction pending full review by the Court. And it then took the Supreme Court only two days to treat this motion as a petition for a writ of certiorari, grant the petition, and put the case on the Supreme Court’s version of a “rocket docket” – briefing must be completed by January 3, 2025, and the Court will hear oral argument on January 10th, giving it plenty of time to decide the issue in the nine days left until January 19th.
Enter Donald Trump. In a surprising twist, the former president – who initially tried to ban TikTok in 2020 – has filed an amicus brief opposing an immediate ban. He contends that the January 19th deadline improperly constrains his incoming administration’s foreign policy powers, and he wants time to negotiate a solution balancing security and speech rights:
President Trump is one of the most powerful, prolific, and influential users of social media in history. Consistent with his commanding presence in this area, President Trump currently has 14.7 million followers on TikTok with whom he actively communicates, allowing him to evaluate TikTok’s importance as a unique medium for freedom of expression, including core political speech. Indeed, President Trump and his rival both used TikTok to connect with voters during the recent Presidential election campaign, with President Trump doing so much more effectively. . . .
[Staying the statutory deadline] would . . . allow President Trump’s Administration the opportunity to pursue a negotiated resolution that, if successful, would obviate the need for this Court to decide these questions.
Trump Amicus Brief, pp. 2, 9
The legal issues are novel and significant. The D.C. Circuit applied strict scrutiny but gave heavy deference to national security concerns while spending little time on users’ speech interests. Trump raises additional separation of powers questions about Congress dictating national security decisions and mandating specific executive branch procedures.
This case isn’t just about one app. The case reflects deeper tensions over Chinese technological influence, data privacy, and government control of social media. The Court’s decision will likely shape how we regulate foreign-owned platforms while protecting constitutional rights in an interconnected world.
The January 10th arguments – if indeed they go forward on that date, given that president-elect Trump prefers they not – should be fascinating. At stake is not just TikTok’s fate, but precedent for how courts balance national security claims against free speech in the digital age.
________
Addendum:
The highly expedited schedule kept a lot of lawyers busy over the holidays. You can access the docket here. I count 22 amicus briefs, most filed on December 27. Reply briefs are due January 3.
Update: On January 17, 2025 the Court upheld the D.C. Circuit in a per curiam decision, holding that the challenged provisions of the Protecting Americans from Foreign Adversary Controlled Applications Act do not violate petitioners’ First Amendment rights. (link to opinion) On January 20, 2025 President Trump signed an Executive Order instructing the Attorney General not to take any action on behalf of the United States to enforce the Act for 75 days from the date of the Order (link to Order).
by Lee Gesmer | Dec 11, 2024 | Copyright, DMCA/CDA
After reading my 3-part series on copyright and LLMs (start with Part 1, here) a couple of colleagues have asked me whether content owners could use the Digital Millennium Copyright Act (DMCA) to challenge the use of their copyright-protected content.
I’ll provide a short summary of the law on this issue, but the first thing to note is that the DMCA offers two potential avenues for content owners: Section 512(c)’s widely used “notice and takedown” system and the lesser-known Section 1202(b)(1), which addresses the removal of copyright management information (CMI), like author names, titles, copyright notices, and terms and conditions.
Section 1202(b)(1) – Removal or Alteration of CMI
First, let’s talk about the lesser-known DMCA provision. Several plaintiffs have tried an innovative approach under this provision, arguing that AI companies violated Section 1202(b)(1) by stripping away CMI in the training process.
In November, two federal judges in New York reached opposite conclusions on these claims. In Raw Story Media, Inc. v. OpenAI, the plaintiff alleged that OpenAI had removed CMI during the training process, in violation of 1202(b)(1). The court applied the standing requirement established in TransUnion v. Ramirez, a recent Supreme Court case that dramatically restricted standing to sue in federal courts to enforce federal statutes. The court held that the publisher lacked standing because it couldn’t prove that it had suffered “concrete harm” from the alleged removal of CMI from its works. The court based this conclusion on the fact that Raw Story “did not allege that a copy of its work from which the CMI had been removed had been disseminated by ChatGPT to anyone in response to any specific query.” Absent dissemination, Raw Story had no claim – under TransUnion, “no concrete harm, no standing.”
But weeks later, in The Intercept Media v. OpenAI, a different judge issued a short order allowing similar claims to proceed. We are awaiting the opinion explaining his rationale.
The California federal courts have also been unwelcoming to 1202(b)(1) claims. In two cases – Andersen v. Stability AI and Doe 1 v. GitHub – the courts dismissed 1202(b)(1) claims on the ground that the removal of CMI requires identicality between the original work and the copy, which the plaintiffs had failed to establish. However, the GitHub case has been certified for an interlocutory appeal to the Ninth Circuit, and that appeal is worth watching. I’ll note that the identicality requirement is not in the Copyright Act – it is an example of judge-made copyright doctrine.
Section 512(c) – Notice-and-Takedown
While you are likely familiar with the DMCA’s Section 512(c) notice-and-takedown system (think YouTube removing copyrighted videos or music), this law faces major hurdles in the AI context. A DMCA take-down notice must be specific about the location where the infringing material is hosted – typically a URL. In the case of an AI model, the challenge is that the training data absorbed into the model is not accessible or identifiable at any particular location, making it impossible for copyright owners to issue takedown notices.
Unsurprisingly, I can’t find any major AI case in which a plaintiff has alleged violation of Section 512(c).
Conclusion
The collision between AI technology and copyright law highlights a fundamental challenge: our existing legal framework, designed for the digital age of the late 1990s, struggles to address the unique characteristics of AI systems. The DMCA, enacted when peer-to-peer file sharing was the primary concern, now faces unprecedented questions about its applicability to AI training data.
Stay tuned.
by Lee Gesmer | Nov 27, 2024 | Copyright
In the first two parts of this series I examined how large language models (LLMs) work and analyzed whether their training process can be justified under copyright law’s fair use doctrine. (Part 1, Part 2). However, I also noted that LLMs sometimes “memorize” content during the training stage and then “regurgitate” that content verbatim or near-verbatim when the model is accessed by users.
The “memorization/regurgitation” issue is featured prominently in The New York Times Company v. Microsoft and OpenAI, pending in federal court in the Southern District of New York. (A reference to OpenAI includes Microsoft, unless the context suggests otherwise). Because the technical details of every LLM and AI model are different I’m going to focus my discussion of this issue mostly on the OpenAI case. However, the issue has implications for every generative LLM trained on copyrighted content without permission.
The Controversy
Here’s the problem faced by OpenAI.
OpenAI claims that the LLM models it creates are not copies of the training data it uses to develop its models, but uncopyrightable patterns. In other words, a ChatGPT LLM is not a conventional database or search engine that stores and retrieves content. As OpenAI’s co-defendant Microsoft explains, “A program called a ‘transformer’ evaluates massive amounts of text, converts that text into trillions of constituent parts, discerns the relationships among all of them, and yields a natural language machine that can respond to human prompts.” (Link, p. 2)
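To make the “patterns, not copies” claim concrete, here is a toy sketch of my own (nothing like OpenAI’s actual architecture): a bigram model that keeps only word-pair statistics learned from its training text and discards the text itself.

```python
from collections import Counter, defaultdict

# Toy illustration of "patterns, not copies" (my own example, not OpenAI's
# architecture): the "model" records only which word follows which.
training_text = "the court held that the court erred"
words = training_text.split()

counts = defaultdict(Counter)
for current_word, next_word in zip(words, words[1:]):
    counts[current_word][next_word] += 1

# The original sentence is discarded; only the statistics remain.
model = {word: dict(followers) for word, followers in counts.items()}
print(model)
# {'the': {'court': 2}, 'court': {'held': 1, 'erred': 1}, 'held': {'that': 1}, 'that': {'the': 1}}
```

The catch, of course, is that when a passage is distinctive enough, or appears often enough in the training data, even statistics like these can end up reproducing it word for word.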
However, the Times has presented compelling evidence that challenges this narrative. The Times showed that it was able to prompt OpenAI’s ChatGPT-4 and Microsoft’s Copilot to produce lengthy, near-verbatim excerpts from specific Times articles, which the Times then cited in its complaint as proof of infringement.
The Times’ First Amended Complaint includes an exhibit with over 100 examples of ChatGPT “regurgitating” Times content verbatim or near-verbatim in response to specific prompts, with the copied text highlighted in red:

[Exhibit J excerpt: side-by-side comparison of a Times article and ChatGPT output, copied text highlighted in red]
This evidence poses a fundamental question: If OpenAI’s models truly transform copyright-protected content into abstract patterns rather than storing it, how can they reproduce exact or nearly exact copies of that content?
The Times argues that this evidence reveals a crucial truth: actual copyrighted expression—not just abstract patterns—is encoded within the model’s parameters. This allegation strikes at the foundation of OpenAI’s legal position and weakens its fair use defense by suggesting its use of copyrighted material is more extensive and less transformative than claimed.
Just how big a problem this is for OpenAI and the AI industry is difficult to determine. I’ve tried to replicate it in a variety of cases on ChatGPT and several other frontier models without success. In fact I can’t get the models to give me the text of Moby Dick, Tom Sawyer or other literary works whose copyrights have long expired.
Nevertheless, the Times was able to do this one hundred times, and it’s safe to assume that it could have continued well past that number, but thought that 100 examples was enough to make the point in its lawsuit.
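For readers curious what this kind of testing might look like, here is a rough sketch of a regurgitation probe: prompt a model with the opening of an article and measure how closely its continuation tracks the rest of the original. The generate() callable is a stand-in for whatever model interface is under test, and the scoring method is my own illustration, not the Times’ actual methodology.

```python
import difflib
from typing import Callable

def regurgitation_score(generate: Callable[[str], str],
                        article_text: str,
                        prefix_words: int = 50) -> float:
    """Prompt the model with the opening of an article and measure how closely
    its continuation tracks the rest of the original (1.0 = verbatim match)."""
    words = article_text.split()
    prefix = " ".join(words[:prefix_words])
    original_rest = " ".join(words[prefix_words:])
    continuation = generate(prefix)  # `generate` wraps whatever model is being tested
    return difflib.SequenceMatcher(None, original_rest, continuation).ratio()

# Example with a stand-in "model" that has memorized its training text:
article = "It is a truth universally acknowledged that " * 20
score = regurgitation_score(lambda prompt: article[len(prompt):], article, prefix_words=10)
print(score)  # close to 1.0 for this toy example
```

A high score on prompts built from published articles is the kind of evidence Exhibit J presents; consistently low scores across many attempts are closer to my own experience described above.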
OpenAI: “The Times Hacked ChatGPT”
What’s OpenAI’s response to this?
To date, OpenAI and Microsoft have not filed answers to the Complaint. However, they have given an indication of how they view these allegations in partial motions to dismiss filed by both companies.

Microsoft’s motion (p. 2) argues that the NYT’s methods to demonstrate how its content could be regurgitated did not represent real-world usage of the GPT tools at issue. “The Times,” it argues, “crafted unrealistic prompts to try to coax the GPT-based tools to output snippets of text matching The Times’s content.” (Emphasis in original) To get the NYT content regurgitated, a user would need to know the “genesis of that content.” “And in any event, the outputs the Complaint cites are not copies of works at all, but mere snippets” that do not rise to the level of copyright infringement.
OpenAI’s motion (p. 12.) argues that the NYT “appears to have [used] prolonged and extensive efforts to hack OpenAI’s models”:
In the real world, people do not use ChatGPT or any other OpenAI product for that purpose, … Nor could they. In the ordinary course, one cannot use ChatGPT to serve up Times articles at will. . . . The truth, which will come out in the course of this case, is that the Times paid someone to hack OpenAI’s products. It took them tens of thousands of attempts to generate the highly anomalous results that make up Exhibit J to the Complaint. They were able to do so only by targeting and exploiting a bug (which OpenAI has committed to addressing) by using deceptive prompts that blatantly violate OpenAI’s terms of use. And even then, they had to feed the tool portions of the very articles they sought to elicit verbatim passages of, virtually all of which already appear on multiple public websites. Normal people do not use OpenAI’s products in this way.
It appears that OpenAI is referring to the provisions in its Terms of Service that prohibit anyone from “Us[ing] our Services in a way that infringes, misappropriates or violates anyone’s rights” or using the Services to “extract data.” OpenAI has labeled the Times’ prompts “adversarial attacks.”
Copyright owners don’t buy this tortured explanation. As OpenAI has admitted in a submission to the Patent and Trademark Office, “An author’s expression may be implicated . . . because of a similarity between her works and an output of an AI system.” (link, n. 71, emphasis added).
Rights holders claim that their ability to extract memorized content from these systems puts the lie to, for example, OpenAI’s assertion that “an AI system can eventually generate media that shares some commonalities with works in the corpus (in the same way that English sentences share some commonalities with each other by sharing a common grammar and vocabulary) but cannot be found in it.” (link, p. 10).
Moreover, OpenAI’s “hacking” defense would seem to inadvertently support the Times’ position. After all, you cannot hack something that isn’t there. The very fact that this content can be extracted, regardless of the method, suggests it exists in the form of an unauthorized reproduction within the model.
OpenAI: “We Are Protected by the Betamax Case”
How will OpenAI and Microsoft respond to these allegations under copyright law?
To date, OpenAI and Microsoft have yet to file formal answers to the Times’ complaint. However, they have given us a hint of their defense strategy in their motions to dismiss, and it is based in part on the Supreme Court’s 1984 decision in Sony v. Universal City Studios, a case often referred to as “the Betamax case.” 
In the Betamax case a group of entertainment companies sued Sony for copyright infringement, arguing that consumers used Sony VCRs to infringe by recording programs broadcast on television. The Supreme Court held that Sony could not be held contributorily liable for infringements committed by VCR owners. “[T]he sale of copying equipment . . . does not constitute contributory infringement if the product is . . . capable of substantial noninfringing uses.”
The takeaway from this case is that under copyright law, if a product can be put to either a legal or illegal purpose by end-users (a “dual-use” technology), it is not infringing so long as the opportunity for noninfringing use is substantial.
OpenAI and Microsoft assert that the Betamax case applies because, like the VCR, ChatGPT is a “dual-use technology.” While end users may be able to use “adversarial prompts” to “coax” a model to produce a verbatim copy of training data, the system itself is a neutral, general-purpose tool. In most instances it will be put to a non-infringing use. Citing the Betamax case Microsoft argues that “copyright law is no more an obstacle to the LLM than it was to the VCR (or the player piano, copy machine, personal computer, internet, or search engine)”—all dual-use technologies.
No doubt, in support of this argument OpenAI will place strong emphasis on ChatGPT’s many important non-infringing uses. The model can create original content, analyze public domain texts, process user-provided content, educate, generate software code, and more.
However, OpenAI’s reliance on the Betamax dual-use doctrine faces a challenge central to the doctrine itself. The Betamax case was based on secondary liability—whether Sony could be held responsible for consumers using VCRs to record television programs. The alleged infringements occurred through consumer action, not through any action taken by the device’s manufacturer.
But with generative LLMs such as ChatGPT the initial copying happens during training when the model memorizes copyrighted works. This is direct infringement by the AI company itself, not secondary infringement based on user prompts. When an AI company creates a model that memorizes and can reproduce copyrighted works, the company itself is doing the copying—making this fundamentally different from Betamax.
Before leaving this topic it’s important to note that the full scope of memorization within AI models of GPT-4’s scale may be technically unverifiable. While the models’ creators can detect some instances of memorization through testing, due to the scale and complexity of the models they cannot comprehensively examine their internal representations to determine the full extent of memorized copyrighted content. While the Times’ complaint demonstrated one hundred instances of verbatim copying, this could represent just the tip of the iceberg, or conversely, the outer limit of the problem. This uncertainty itself poses a significant challenge for courts attempting to apply traditional copyright principles.
Technical Solutions
While these legal issues work their way through the courts, AI companies aren’t standing still. They recognize that their long-term success may depend on their ability to prevent or minimize memorization, regardless of how courts ultimately rule on the legal issues.
Their approaches to this challenge vary. OpenAI has told the public that it is taking measures to prevent the types of copying illustrated in the Times’ lawsuit: “we are continually making our systems more resistant to adversarial attacks to regurgitate training data, and have already made much progress in our recent models.” (link) This includes filtering or modifying user prompts to reject certain requests before they reach the model and aligning the models to refuse to produce certain types of data. Try asking ChatGPT to give you the lyrics to Arlo Guthrie’s “Alice’s Restaurant Massacree” or Taylor Swift’s “Cruel Summer.” It will tell you that copyright law prohibits it from doing so.
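As a crude illustration of what pre-model prompt filtering might look like, here is a sketch of my own; the patterns and refusal message are hypothetical and are not OpenAI’s actual rules.

```python
import re

# Hypothetical pre-model filter: screen prompts before they reach the model.
# The patterns and refusal text are illustrative, not any vendor's real rules.
BLOCKED_PATTERNS = [
    r"\blyrics\b",                     # requests for song lyrics
    r"\bfull text of\b.*\barticle\b",  # requests for entire articles
    r"\bverbatim\b",                   # requests for verbatim reproduction
]

def screen_prompt(prompt: str) -> str | None:
    """Return a refusal message if the prompt matches a blocked pattern,
    otherwise None (meaning the prompt may be passed along to the model)."""
    lowered = prompt.lower()
    for pattern in BLOCKED_PATTERNS:
        if re.search(pattern, lowered):
            return "I can't provide that content; it may be protected by copyright."
    return None

print(screen_prompt("Give me the lyrics to Cruel Summer"))
# -> I can't provide that content; it may be protected by copyright.
```

Production systems are, of course, far more sophisticated than this, layering model alignment and output-stage checks on top of simple prompt screening, but the basic idea is the same: intercept the request before the model can regurgitate anything.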
And, it’s important to note that different AI companies are taking different approaches to this problem. For example, Google (which owns Gemini) uses supervised fine-tuning (explained here). Anthropic (which owns Claude) focuses on what it calls “constitutional AI” – a training methodology that builds in constraints against certain behaviors, including the reproduction of copyrighted content. (link here). Meta (LLaMA models) has implemented what it calls “deduplication” during the training process – actively removing duplicate or near-duplicate content from training data to reduce the likelihood of memorization. Additionally, Meta has developed techniques to detect and filter out potential memorized content during the model’s response generation phase. (link here).
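For a sense of what deduplication involves, here is a generic sketch that drops exact duplicates by hashing normalized text. It is a simplified stand-in for the near-duplicate detection Meta describes, not its actual pipeline.

```python
import hashlib

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return " ".join(text.lower().split())

def deduplicate(documents: list[str]) -> list[str]:
    """Keep only the first copy of each normalized document (a generic sketch;
    real pipelines also catch near-duplicates, not just exact matches)."""
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique_docs.append(doc)
    return unique_docs

corpus = ["The court held X.", "the   court held X.", "A different opinion."]
print(deduplicate(corpus))  # the near-identical second copy is dropped
```

The intuition is that a model is far more likely to memorize a passage it has seen many times than one it has seen once, which is why stripping duplicates from the training set reduces the risk of regurgitation.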
Conclusion
The AI industry faces a fundamental challenge that sits at the intersection of technology and law. Current research suggests that some degree of memorization may be inherent to large language models – raising a crucial question for courts: If memorization cannot be eliminated without sacrificing model performance, how should copyright law respond?
The answer could reshape both AI development and copyright doctrine. AI companies may need to accept reduced performance in exchange for legal compliance, while content creators must decide whether to license their works for AI training despite the risk of memorization. The industry’s ability to develop systems that truly learn patterns without memorizing specific expressions – or courts’ willingness to adapt copyright law to this technological reality – may determine its future.
The outcome of the Times lawsuit may establish crucial precedents for how copyright law treats AI systems that can memorize and reproduce protected content. At stake is not just the legality of current AI models, but the broader question of how to balance technological innovation with the rights of content creators in an era where the line between learning and copying has become increasingly blurred.