Mass Law Blog

SDNY Courts Split Over Copyright Management Information in AI Cases

Mar 17, 2025

In my recent post—Postscript to my AI Series – Why Not Use the DMCA?—I discussed early developments in two cases pending against OpenAI in the U.S. District Court for the Southern District of New York (SDNY). Both cases focus on the claim that in the process of training its AI models, OpenAI illegally removed “copyright management information.” And, as I discuss below, they reach different outcomes.

What Is Copyright Management Information?

Many people familiar with the Digital Millennium Copyright Act’s (DMCA) “notice and takedown” provisions are unaware of another part of the statute that makes it illegal to remove “copyright management information,” or “CMI.”

CMI includes copyright notices, information identifying the author, and details about the terms of use or rights associated with the work. It can appear visibly on the work itself, or as metadata in the underlying code.
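To make this concrete, here is a minimal Python sketch of what the metadata form of CMI can look like in a web article. The markup and field names are hypothetical illustrations, not drawn from any actual publisher’s pages.

```python
import re

# Illustrative only: hypothetical article markup showing CMI both as a
# visible notice in the page body and as metadata in the underlying code.
ARTICLE_HTML = """
<head>
  <meta name="author" content="Jane Reporter">
  <meta name="copyright" content="© 2025 Example News, Inc.">
  <meta name="terms-of-use" content="https://example.com/terms">
</head>
<body>
  <p>Story text here.</p>
  <footer>© 2025 Example News, Inc. All rights reserved.</footer>
</body>
"""

# Pull out the metadata form of the CMI.
cmi = dict(re.findall(r'<meta name="([\w-]+)" content="([^"]+)">', ARTICLE_HTML))
print(cmi)
# {'author': 'Jane Reporter', 'copyright': '© 2025 Example News, Inc.',
#  'terms-of-use': 'https://example.com/terms'}
```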

The CMI removal statute, Section 1202(b)(1) of the DMCA, is a “double scienter” law, requiring that a plaintiff prove (1) that CMI was intentionally removed from a copyrighted work, and (2) that the alleged infringer knew or had reasonable grounds to know that the removal of CMI would “induce, enable, facilitate, or conceal” copyright infringement.

Here is an example of how this law might work. 

Assume that I have copied a work and that I have a legitimate fair use defense. Assume further that when I duplicated the work, I removed the copyright notice and published the work without it. I have a fair use defense as to the duplication and distribution, but could I still be liable for CMI removal?

The answer is yes. A violation of the DMCA’s CMI provision is independent of my fair use defense. And the penalty is not trivial: liability for CMI removal can result in statutory damages ranging from $2,500 to $25,000 per violation, as well as attorneys’ fees and injunctive relief. Moreover, unlike infringement actions, a claim for CMI removal does not require prior registration of the copyright.
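To put numbers on that: a hypothetical news site alleging removal of CMI from, say, 1,000 articles would face a statutory range of $2.5 million to $25 million in damages alone, before attorneys’ fees.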

All of this adds up to a powerful tool for copyright plaintiffs, a fact that has not been lost on plaintiffs’ counsel in AI litigation.  

CMI – Why Don’t AI Companies Want To Include It?

AI companies’ removal of CMI during training stems from both technical necessities and strategic considerations. From a technical perspective, large language model training requires standardized data preparation processes that typically strip metadata, formatting, and peripheral information to create uniform training examples. This preprocessing is fundamental to how neural networks learn from text—they require clean, consistent inputs to identify linguistic patterns effectively.
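For readers who want a picture of what this looks like in practice, here is a simplified Python sketch of that kind of preprocessing step. It is a generic illustration using assumed field names (“author,” “copyright_notice,” “terms_url,” “body”), not OpenAI’s actual pipeline, which is not public.

```python
import re

def prepare_training_example(record: dict) -> str:
    """Reduce a scraped article record to uniform plain text for training.

    `record` is a hypothetical structure: a 'body' field plus CMI-like
    fields that this step simply never reads.
    """
    text = record["body"]
    text = re.sub(r"<[^>]+>", " ", text)      # strip residual HTML tags
    text = re.sub(r"\s+", " ", text).strip()  # normalize whitespace
    # 'author', 'copyright_notice', and 'terms_url' are discarded --
    # the uniform training example carries no attribution information.
    return text

record = {
    "author": "Jane Reporter",
    "copyright_notice": "© 2025 Example News, Inc.",
    "terms_url": "https://example.com/terms",
    "body": "<p>The committee voted 7-2 to advance the bill.</p>",
}
print(prepare_training_example(record))
# -> "The committee voted 7-2 to advance the bill."
```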

The computational overhead is also significant. Preserving and processing CMI for billions of training examples would increase storage requirements and computational costs. AI companies argue that this additional information provides minimal benefit to model performance while significantly increasing training complexity.

Content owners, however, contend that these technical justifications mask more strategic motivations. They argue that AI companies deliberately eliminate attribution information to obscure the provenance of training data, making it difficult to detect when copyrighted material has been incorporated into models. This removal, they claim, facilitates a form of “laundering” copyrighted content through AI systems, where original sources become untraceable.

More pointedly, content creators assert that CMI removal directly enables downstream infringement by making it impossible for users to identify when an AI output derives from or reproduces copyrighted works. Without embedded attribution information, neither the AI company nor end users can properly credit or license content that appears in generated outputs.

The technical reality and legal implications of this process sit at the heart of these emerging cases, with courts now being asked to determine whether standard machine learning preprocessing constitutes intentional CMI removal under the DMCA’s “double scienter” standard.

Raw Story Media v. OpenAI

In the first of the two SDNY cases, Raw Story Media v. OpenAI, Judge Colleen McMahon dismissed Raw Story’s claim that when training ChatGPT, OpenAI had illegally removed CMI.

At the heart of Judge McMahon’s decision was her observation that, although OpenAI removed CMI from Raw Story articles, Raw Story was unable to allege that the works from which CMI had been removed had ever been disseminated by ChatGPT to anyone. On these facts, Judge McMahon held that Raw Story lacked standing under the Article III standing principles established by the Supreme Court in TransUnion v. Ramirez (2021). It’s worth noting her observation that “the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs’ articles seems remote” based on “the quantity of information contained in the [AI model].”

The Intercept Media v. OpenAI

In the second case, The Intercept Media v. OpenAI, The Intercept made the same allegation. It asserted that OpenAI had intentionally removed CMI—in this case authors, copyright notices, terms of use and title information—from its AI training set.  

However, in this case Judge Jed Rakoff came to the opposite conclusion. In November 2024 he issued a bottom-line order declining to dismiss the plaintiff’s CMI claim, stating that an opinion explaining his rationale would be forthcoming.

That opinion was issued on February 20, 2025.  

At this early stage of the case (before discovery or trial), the judge found that The Intercept had adequately pleaded the “double scienter” standard. As to the first part of the test, The Intercept alleged that the algorithm OpenAI uses to build its AI training data sets captures only an article’s main text, which excludes CMI. This satisfied the intentional removal element.
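To illustrate the shape of that allegation (and only as a generic sketch; this is not a description of the actual extraction tooling at issue in the case), a “main text only” extractor in Python might keep paragraph content and nothing else, so that bylines, meta tags, and copyright footers never reach the training set:

```python
from html.parser import HTMLParser

class MainTextExtractor(HTMLParser):
    """Keeps only <p> content; bylines, <meta> tags, and footers are dropped."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data.strip())

    def main_text(self) -> str:
        return " ".join(c for c in self._chunks if c)

extractor = MainTextExtractor()
extractor.feed(
    '<meta name="copyright" content="© 2025 Example News, Inc.">'
    '<p>Story text here.</p>'
    '<footer>© 2025 Example News, Inc.</footer>'
)
print(extractor.main_text())  # "Story text here." -- the CMI is left behind
```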

As to the second component of the standard, the court was persuaded by The Intercept’s theory of “downstream infringement,” which argues that OpenAI’s model might enable users to generate content based on The Intercept’s copyrighted works without proper attribution. And importantly, unlike in Raw Story, The Intercept was able to provide examples of verbatim regurgitation of its content by ChatGPT in response to prompts from The Intercept’s data scientist.

The district court held that a copyright injury “does not require publication to a third party,” finding unpersuasive OpenAI’s argument that The Intercept failed to demonstrate a concrete injury because it had not conclusively established that users had actually accessed The Intercept’s articles via ChatGPT.

Curiously, Judge Rakoff’s decision failed to mention the earlier ruling in Raw Story Media, Inc. v. OpenAI, where Judge McMahon held, on similar facts, that the plaintiffs lacked standing to assert CMI removal claims. Both cases were decided by SDNY district court judges. However, unlike the ruling in Raw Story Media, Judge Rakoff concluded that The Intercept’s alleged injury was closely related to the property-based harms typically protected under copyright law, satisfying the Article III standing requirement.

Thus, while Raw Story’s CMI claims against OpenAI have been dismissed, The Intercept’s CMI removal case against OpenAI will proceed.