ER 39: A Tale of Two Copyright Cases
Or, is fair use fair?
Welcome back to the Ethical Reckoner! I promised a few weeks ago that I’d talk about the two big rulings on using books to train LLMs, but I realized that it demanded more than an Extra Reckoning, so today we’ve got a whole issue dedicated to breaking down the two rulings (plus Copyright Law 101). Though copyright sounds dry, this is going to be one of the defining legal issues of the AI age, so if you’re a creative, know a creative, or want to be a creative, this one’s for you.
This edition of the ER is brought to you by… my Away suitcase. Not sponsored, I’m just delighted that I managed to fit seven days of business casual in a carry-on.

What is copyright?
Let’s start with the basics. The goal of copyright is to protect “original works of authorship.” It gives the author of a work the exclusive rights to reproduce, distribute, make derivative works from, perform, and display the work, and lasts for the life of the author plus 70 years. However, there are cases where it’s okay to use a work whose copyright you don’t hold. These are governed by fair use.
What is fair use?
While the government has an interest in protecting authors’ rights to their work, it also has an interest in making sure those works can be used and built upon without the user having to acquire the copyright.
If you’ve ever watched a YouTube reaction or compilation video, you’ve likely seen a caption saying something along the lines of “Fair Use!! No Copyright Infringing Intended!!1!!” Shockingly, that’s not enough to make something fair use. Fair use is governed by four factors laid out in Section 107 of the Copyright Act:
the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes; [nb: this is often called “transformative” use]
the nature of the copyrighted work; [nb: some works (like unpublished ones) have less copyright protection than others]
the amount and substantiality of the portion used in relation to the copyrighted work as a whole; [nb: using more of a work requires meeting a higher threshold] and
the effect of the use upon the potential market for or value of the copyrighted work. [nb: if a copy replaces a work in a given market, it’s a higher bar to fair use]
These four factors are weighed holistically: the court considers whether, and to what extent, each one favors the plaintiff or the defendant before ruling on whether something is fair use.
What happened in the Anthropic case?
The issue at hand:
Anthropic pirated millions of books and scanned millions more to create a digital library of books. It then used this library1 to train its LLMs that serve the Claude chatbot. The plaintiff authors in the case alleged that two things—piracy to build the library and reproducing the texts to train its LLMs—violated their copyrights (based mostly on factors 1-3). Anthropic argued that the piracy “was justified because all those copies were at least reasonably necessary for training LLMs” and that using copyrighted text for training LLMs is fair use.
Judge Alsup ruled that:
Anthropic using books to train LLMs is fair use, because “the technology at issue was among the most transformative many of us will see in our lifetimes.”
Digitizing physical books Anthropic purchased is fair use, because it’s essentially just a change of format, and no additional copies were made.
Anthropic pirating books to build a central library was not fair use, because even if the eventual use of those books qualifies as fair use, you aren’t entitled to a free copy.
Live issues:
How much will Anthropic owe in damages?
This could be… substantial. The ruling opened the door to a class action lawsuit. Damages depend on how many works are ruled to be in the class (not all of the books will have been published in the US or registered with the US Copyright Office), but given that it’s likely that there are at least a few million eligible books, damages could easily top $1 billion… at the lowest possible level of statutory damages. If the court rules there was “willful infringement,” damages could approach a trillion dollars at the highest possible end. Anthropic is valued at $100 billion, makes $3 billion in revenue annually, and would need to have cash on hand to pay any damages.
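To see where those headline numbers come from, here’s the back-of-the-envelope arithmetic. The per-work figures are the actual statutory ranges under 17 U.S.C. § 504(c); the class sizes are illustrative assumptions, not findings from the case:

```python
# Statutory damages under 17 U.S.C. § 504(c): $750–$30,000 per
# infringed work, up to $150,000 per work for willful infringement.
STATUTORY_MIN = 750        # minimum per infringed work
STATUTORY_MAX = 30_000     # ordinary maximum per work
WILLFUL_MAX = 150_000      # maximum per work if infringement was willful

def damages(num_works: int, per_work: int) -> int:
    """Total statutory damages for a class of registered works."""
    return num_works * per_work

# Hypothetical class sizes (assumptions, not the court's numbers):
print(f"${damages(2_000_000, STATUTORY_MIN):,}")   # $1,500,000,000
print(f"${damages(7_000_000, WILLFUL_MAX):,}")     # $1,050,000,000,000
```

So even a conservative class at the statutory minimum clears $1 billion, and a larger class at the willful maximum crosses the trillion-dollar line.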

For authors who don’t want their works used to train AI at all, this ruling wasn’t great news, although there’s the silver lining that if AI companies want to obtain works to train LLMs, they have to do so legally—note, however, that this wasn’t a ruling on web scraping. Another silver lining for creatives is that essentially every author in the US may now be eligible for a class-action settlement that might actually put Anthropic out of business. The judge seemed fairly pro-LLM, but anti-Anthropic/its piracy, so how much he’ll bring the hammer down on Anthropic is hard to predict.
What happened in the Meta case?
The issue at hand:
Meta also pirated millions of books and used them to train their LLMs, the Llama family. The authors argued that the market for their works was impacted (factor 4) because Llama can reproduce small snippets of text from their works and that by using their works without permission, Meta has precluded them from licensing their works for training LLMs.
Judge Chhabria ruled that:
The models don’t reproduce enough snippets to matter commercially.
The authors aren’t “entitled to the market for licensing their works as AI training data.”
Thus, downloading from the shadow libraries to train Llama was fine in this case, because the use is so transformative and the direct market harm (Meta not buying individual copies of the books) is limited.
However, the court notes that if the authors could argue market impacts more strongly, the balance of factors might change.
Live issues:
Meta torrented the works it obtained from the shadow libraries—torrenting is a decentralized file distribution method where files are split into chunks and served from the computers of users who download that file, making it faster to distribute large files—which means that they likely re-uploaded some of the data they downloaded,2 which opens them up to additional copyright claims.
Copyright cases arguing for market dilution, based on LLMs being able to create works that could replace those by human authors.
So at first glance, this case doesn’t seem great for authors… but that’s because the authors’ lawyers made the wrong arguments. The judge explicitly says in the opening of his ruling:
“companies have been unable to resist the temptation to feed copyright-protected materials into their models—without getting permission from the copyright holders or paying them for the right to use their works for this purpose. This case presents the question whether such conduct is illegal. Although the devil is in the details, in most cases the answer will likely be yes.” [emphasis added]
In this case, the answer was no, and the ruling lays out explicitly why: the lawyers made the wrong argument, and should have focused on market dilution:
“This ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful. It stands only for the proposition that these plaintiffs made the wrong arguments and failed to develop a record in support of the right one.”
Judge Chhabria also explicitly takes aim at Judge Alsup’s interpretation of LLM learning in his ruling, saying that there’s a huge difference between human learning and the process of downloading and curating data to train an LLM. Chhabria seems much more reluctant than Alsup to accept at face value that such “groundbreaking technology” must be allowed to develop unfettered:
“No matter how transformative LLM training may be, it’s hard to imagine that it can be fair use to use copyrighted books to develop a tool to make billions or trillions of dollars while enabling the creation of a potentially endless stream of competing works that could significantly harm the market for those books.”
This is a spicy legal ruling, and indicates that future cases in front of Chhabria might not go the way of the AI companies.
Neither of these cases is the final word on copyright and LLM training. Also, both of these cases ruled on the legality of using books for LLM training—in particular cases based on particular fact patterns—but not the morality of it. The fundamental issue at hand is still that these technologies that generate billions of dollars in revenue are trained on the works of authors and artists who have not been compensated. While it might be fair use, it might not be fair, period—as Judge Chhabria noted.
Do I think that artists deserve compensation for their contributions to these hugely profitable technologies? Generally, yes, just like I think that Henrietta Lacks’s family deserved compensation for the incredibly profitable cell line begun without her consent. But we may not be able to rely on copyright to establish a just system to ensure that authors get compensation. Will it ultimately be companies striking licensing deals? Given that both Meta and Anthropic started to try and then gave up, probably not. Will it be the US government passing a new framework to compensate authors? Given that Trump gave a speech where he said:
“You can't be expected to have a successful AI program when every single article, book, or anything else that you've read or studied, you're supposed to pay for. Gee, I read a book, I'm supposed to pay somebody.”
…probably not.
So then what are we left with? Perhaps future copyright rulings will favor the authors in a way that establishes fair compensation. Or maybe companies will start to license works, or come up with revenue-sharing schemes as a result of lawsuits or public pressure. But honestly, I’m not sure. Watch this space.
Anthropic eventually excluded the pirated copies from training, but kept the copies.
The ruling is a bit confusing here. Commonly, “seeding” is defined as uploading/distributing content. “Leeching” is downloading without uploading. The ruling says that Meta used “a script to prevent seeding, but apparently not leeching.” To prevent leeching would be to prevent downloading altogether. However, I think there’s some confusion in terminology. The ruling says “This reuploading can occur both while files are still being downloaded (which the parties refer to as ‘leeching’) and after those files have been fully downloaded (which the parties refer to as ‘seeding’).” It seems like perhaps the court added the idea of uploading to the term “leeching” instead of acknowledging that leeching and seeding can occur simultaneously—or I’m just misunderstanding and overcomplicating this. Anyway, the point seems to be that Meta may have distributed chunks of books while they were downloading, but not after they were already downloaded.
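The seeding/leeching distinction is easier to see in a toy model. The sketch below is purely illustrative (not anything from the ruling): the key mechanic is that a BitTorrent peer can serve any chunk it already holds, even while its own download is still in progress:

```python
# Toy model of BitTorrent chunk exchange. A peer can re-upload chunks
# it already holds while still downloading (what the ruling calls
# "leeching") and after the download completes ("seeding").

class Peer:
    def __init__(self, total_chunks: int):
        self.total_chunks = total_chunks
        self.have = set()          # indices of chunks this peer holds

    def receive(self, chunk: int) -> None:
        self.have.add(chunk)

    def complete(self) -> bool:
        return len(self.have) == self.total_chunks

    def can_upload(self, chunk: int) -> bool:
        # A peer can serve any chunk it holds -- even mid-download.
        return chunk in self.have

peer = Peer(total_chunks=4)
peer.receive(0)
peer.receive(1)
assert not peer.complete()    # still downloading...
assert peer.can_upload(0)     # ...yet already able to re-upload ("leeching")
peer.receive(2)
peer.receive(3)
assert peer.complete()        # now a "seeder," uploading with nothing left to fetch
```

This is why the court’s concern makes sense regardless of terminology: even a script that stops uploading after completion wouldn’t prevent distribution of chunks during the download itself.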


Author here. Like many -- make that most -- authors, I don't make a ton of money from my books. Also, I don't use AI to write them, though technically I use what is now referred to as AI to spell and grammar check them. And it burns me up that big companies -- and other authors who do use AI to create their work -- are using my work and making money off it without my permission or compensating me. And I disagree with Judge Alsup's ruling. It's one thing to purchase a single book and share or give it to a friend -- or have it at a lending library. It is quite another to digitally copy it and share it with hundreds or thousands or millions of people. Pretty sure if Judge Alsup had written a book, on his own, that he had spent months or years and thousands of dollars on, he wouldn't be pleased with some company ripping it off.