“Move fast and break things.” Mark Zuckerberg’s famous motto seems especially apt when examining how Meta developed Llama, its flagship AI model.
Like OpenAI, Google, Anthropic, and others, Meta faces copyright lawsuits for using massive amounts of copyrighted material to train its large language models (LLMs). However, the claims against Meta go further. In Kadrey v. Meta, the plaintiffs allege that Meta didn’t just scrape data — it pirated it, using BitTorrent to pull hundreds of terabytes of copyrighted books from shadow libraries like LibGen and Z-Library.
If proven, these allegations could significantly weaken Meta’s fair use defense and reshape the legal framework for AI training-data acquisition.
BitTorrent’s Double-Edged Sword
BitTorrent is a peer-to-peer file-sharing protocol that efficiently distributes large files by breaking them into small pieces and sharing them across a decentralized “swarm” of users. Once a user downloads a piece, their client can begin uploading it to others, even before the full download finishes. Continuing to share a file after the download completes is known as “seeding.”
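The piece mechanics described above can be sketched in a few lines of Python. This is a simplified illustration, not a working client: it omits trackers, peer discovery, and all networking. The piece size, the function names, and the `ToyPeer` class are assumptions made for the example. The one detail taken from the protocol itself is that version 1 of BitTorrent verifies each piece against a SHA-1 hash.

```python
import hashlib
from typing import Dict, List, Optional

# Assumed piece size for illustration; real torrents pick their own value.
PIECE_LENGTH = 256 * 1024


def split_into_pieces(data: bytes, piece_length: int = PIECE_LENGTH) -> List[bytes]:
    """Cut a payload into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + piece_length] for i in range(0, len(data), piece_length)]


def piece_hashes(pieces: List[bytes]) -> List[str]:
    """Compute the SHA-1 digest of each piece so peers can verify what they receive."""
    return [hashlib.sha1(piece).hexdigest() for piece in pieces]


class ToyPeer:
    """Hypothetical peer that stores verified pieces and can hand them to others."""

    def __init__(self, expected_hashes: List[str]) -> None:
        self.expected_hashes = expected_hashes
        self.have: Dict[int, bytes] = {}

    def receive(self, index: int, piece: bytes) -> bool:
        """Keep a piece only if its hash matches the expected one."""
        if hashlib.sha1(piece).hexdigest() != self.expected_hashes[index]:
            return False
        self.have[index] = piece
        return True

    def serve(self, index: int) -> Optional[bytes]:
        """Upload a piece this peer already holds, or None if it does not have it."""
        return self.have.get(index)


if __name__ == "__main__":
    payload = bytes(range(256)) * 4096  # 1 MiB of sample data
    pieces = split_into_pieces(payload)
    peer = ToyPeer(piece_hashes(pieces))
    peer.receive(0, pieces[0])
    # The peer can now serve piece 0 although it holds only part of the file.
    print(peer.serve(0) is not None, peer.serve(1) is not None)
```

The point of the sketch is the last step: a peer holding a single verified piece is already able to upload it, which is why downloading over BitTorrent and distributing over BitTorrent are so hard to separate in practice.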
BitTorrent is content-neutral, and it powers many legitimate projects, such as the distribution of open source software. But it is also the lifeblood of piracy networks. Courts have long treated unauthorized BitTorrent traffic as textbook copyright infringement. See, e.g., Glacier Films v. Turchin (9th Cir. 2018).
In Kadrey v. Meta, plaintiffs allege that discovery has revealed that Meta’s GenAI team pivoted from tentative licensing discussions with publishers to mass BitTorrent downloading after receiving internal approvals that allegedly escalated “all the way to MZ”—Mark Zuckerberg.
The plaintiffs allege that Meta engineers, worried that BitTorrent “doesn’t feel right for a Fortune 500 company,” nevertheless torrented 267 terabytes between April and June 2024 (roughly twenty Libraries of Congress’ worth of data), including the entire LibGen non-fiction archive, Z-Library’s cache, and massive swaths of the Internet Archive. According to the plaintiffs’ forensic analysis, Meta’s servers re-seeded the files back into the swarm. In short, on the plaintiffs’ account, the company did not just scrape the literary internet; it behaved like a super-seeder, redistributing mountains of pirated works.
Why BitTorrent Matters Legally
Meta’s alleged use of BitTorrent complicates its copyright defense in three major ways:
- Reproduction and Distribution Liability. Most LLM training involves copying copyrighted works, which defendants typically argue is protected as fair use. But BitTorrent introduces unauthorized distribution under § 106(3) of the Copyright Act. Even if the court finds Llama’s training to be fair use, unauthorized seeding could constitute a separate violation harder to defend as transformative.
- Willfulness and Statutory Damages. Internal communications allegedly showed engineers warning about the legal risks, describing the pirated sources as “dodgy,” and joking about torrenting from corporate laptops. Plaintiffs allege that Meta ran the jobs on Amazon Web Services rather than Facebook servers, in a deliberate effort to make the traffic harder to trace back to Menlo Park. If proven, these facts could support a finding of willful infringement, exposing Meta to enhanced statutory damages of up to $150,000 per infringed work.
- “Unclean Hands” and Fair Use. The plaintiffs argue that the method of acquisition matters. They point to Harper & Row v. Nation Enterprises (1985), where the Supreme Court found that bad faith acquisition (knowingly exploiting a purloined copy of Gerald Ford’s unpublished manuscript) undermined the defendant’s fair use defense. Plaintiffs argue that torrenting from pirate libraries is today’s equivalent.
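To put the statutory-damages point in perspective, the arithmetic can be sketched as follows. The per-work figures come from 17 U.S.C. § 504(c): $750 to $30,000 per work ordinarily, and up to $150,000 per work for willful infringement. The function name and the count of 1,000 works are purely hypothetical and are not drawn from the case record. The sketch also ignores the reduced floor available to innocent infringers and any registration prerequisites.

```python
from typing import Tuple

# Per-work statutory damages under 17 U.S.C. § 504(c), in dollars.
STATUTORY_MIN = 750
STATUTORY_MAX_ORDINARY = 30_000
STATUTORY_MAX_WILLFUL = 150_000


def statutory_exposure(works: int, willful: bool) -> Tuple[int, int]:
    """Return the (minimum, maximum) statutory damages for a given number of works."""
    if works < 0:
        raise ValueError("works must be non-negative")
    ceiling = STATUTORY_MAX_WILLFUL if willful else STATUTORY_MAX_ORDINARY
    return works * STATUTORY_MIN, works * ceiling


if __name__ == "__main__":
    hypothetical_works = 1_000  # illustrative count only, not a figure from the case
    print(statutory_exposure(hypothetical_works, willful=False))  # (750000, 30000000)
    print(statutory_exposure(hypothetical_works, willful=True))   # (750000, 150000000)
```

Because the award is calculated per infringed work, a willfulness finding multiplies the ceiling fivefold across every work at issue, which is why the internal communications matter so much.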
Meta’s Fair Use Defense—and Its Vulnerabilities
Meta argues that its use of the plaintiffs’ books is transformative: it extracts statistical patterns, not expressive content. Meta relies on Authors Guild v. Google (2d Cir. 2015), the Google Books case, and emphasizes that fair use focuses on how a work is used, not how it was obtained. Meta claims that its engineers took steps to minimize seeding, but the internal logs that would substantiate this are missing, leaving the assertion unverified.
Meta also frames Llama’s outputs as new, non-infringing content—asserting that bad faith, even if proven, should not defeat fair use.
However, the plaintiffs counter that Llama differs from Google Books in key respects:
- Substitution risk: Llama is a commercial product capable of producing long passages that may mimic authors’ voices, not merely displaying snippets.
- Scale: The amount of copying, terabytes of entire book databases, dwarfs that upheld in Google Books.
- Market harm: Plaintiffs argue that licensing markets for AI training datasets are emerging, and Meta’s decision to torrent pirated copies directly undermines that market.
Moreover, courts have routinely rejected defenses based on the idea that pirated material is “publicly available.” Downloading infringing content over BitTorrent has never been viewed kindly—even when defendants claimed to have good intentions.
Why Torrenting Might Sink Meta’s Defense
Even if Meta persuades the court that its training of Llama is transformative, the torrenting evidence remains a serious threat. Here’s why:
- Unauthorized distribution: The automatic seeding function of BitTorrent means Meta likely distributed copyrighted material, independent of its later transformative use.
- Bad faith optics: Jokes about piracy, euphemisms describing pirated archives as “public” datasets, and efforts to conceal traffic through AWS servers present a damaging narrative.
- Missing evidence: The absence of the torrent logs may support an adverse inference that distribution occurred.
- Judicial pragmatism: U.S. District Judge Vince Chhabria has already suggested that AI training raises unique legal challenges. Faced with such novelty, the court might prefer to decide the case on familiar grounds, namely traditional copyright infringement through unauthorized reproduction and distribution, rather than attempting to set sweeping precedent on AI fair use.
The Broader Stakes
If the court rules that unlawful acquisition taints subsequent transformative uses, the AI industry will face a paradigm shift. Companies will need to document clean sourcing for training datasets—or face massive statutory damages.
If Meta prevails, however, it may open the door for more aggressive data acquisition practices: anything “publicly available” online could become fair game for AI training, so long as the final product is sufficiently transformative.
Regardless of the outcome, the record in Kadrey v. Meta is already reshaping AI companies’ risk calculus. “Scrape now, pay later” is beginning to look less like a clever strategy and more like a legal time bomb.
Conclusion
BitTorrent itself isn’t on trial in Kadrey v. Meta, but its DNA lies at the center of the dispute. For decades, most fair use battles have focused on how a copyrighted work is exploited. This case asks a new threshold question: does how you got the work come first?
The answer could define how the next generation of AI is built.