Fair Use Decision Fumbles Training Analysis but Sends Clear Piracy Message

On June 23, a district court in the Northern District of California issued an order on summary judgment in Bartz v. Anthropic, addressing fair use as it relates to both generative AI training and how works are collected for training. Finding that generative AI training is “exceedingly” transformative and thus qualifies as fair use, the court glossed over the all-important fourth fair use factor, misinterpreted the Supreme Court’s instruction in Warhol v. Goldsmith, and made numerous other legal errors in its analysis. But while its training analysis misses the mark, the order makes clear that using pirated copies of works to build a “central library” is not fair use and could result in massive damages for willful infringement. Ultimately, the decision could have a significant impact on generative AI infringement litigation, where most AI developer defendants have collected training material from the same piracy-laden datasets.

Training Analysis Mistakenly Jumps to Fair Use Conclusion

First analyzing Anthropic’s use of copyrighted works to train its Claude large language model (LLM), Judge William Alsup immediately falls into a fair use trap that the Supreme Court warned against in its seminal 2023 Warhol v. Goldsmith decision, letting his foregone conclusion about the transformative nature of the use control the rest of the fair use analysis. After explaining that the plaintiffs have not alleged that any of Anthropic’s LLM output is infringing, Judge Alsup says that Anthropic’s use of copyrighted works for training is “exceedingly” transformative. But the order provides little explanation of why training is transformative, beyond saying that the purpose of training an LLM is to “generate new text” and comparing training to “a reader aspiring to be a writer.”

There are many flaws with the court’s first factor analysis, but the most remarkable are: (1) an assumption that human reading or learning from literary works always serves the purpose of generating new text, (2) a complete disregard for the clearly commercial nature of training Anthropic’s LLM, and (3) a failure to recognize that when the ultimate purpose of a use is the same as that of the underlying work, the use cannot be found to be transformative (even absent infringing output). The last two points were clearly articulated by the Supreme Court in Warhol when it found that commerciality can offset to some extent a finding of transformativeness and that when the ultimate purpose of copying is the same as that of the underlying work, the resulting work acts as a substitute and the use therefore cannot be transformative.

The Supreme Court’s key clarification in Warhol of the limited effect a finding of transformativeness has on both factor one and a full fair use analysis was recognized by the U.S. Copyright Office, which, applying the decision to generative AI in its report on training and fair use, explained that the ultimate goal of copying must be considered and that “making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets, especially where this is accomplished through illegal access, goes beyond established fair use boundaries.” Instead of heeding Warhol’s instruction and the Copyright Office’s expertise, Judge Alsup allows his ideas about the transformativeness of generative AI to cloud any further analysis. Indeed, his proclamation that generative AI technology is “among the most transformative many of us will see in our lifetime” reveals a rose-colored view of generative AI through which a finding of fair use seems inevitable.

For evidence of such a foregone conclusion, look no further than the court’s cursory factor four analysis. Though the fourth factor is widely recognized as the most important in the fair use analysis, Judge Alsup devotes just one page (roughly 3% of the decision) to the effect Anthropic’s use for training will have on the potential market for or value of the ingested copyrighted works, concluding that no market is harmed that “the Copyright Act entitles Authors to exploit.” Judge Alsup argues that because there is no specific infringing output at issue, the plaintiffs’ claims of market harm are all but nullified. But that reasoning misses the point that a copyright owner’s right of reproduction is violated when copies are made during the training process, and the potential market for and value of the work is harmed by that violation regardless of the infringing or non-infringing nature of any output.

Without explanation, the order then claims that emerging licensing markets for the use of copyrighted works for training are not markets that copyright owners are “entitled” to exploit. Judge Alsup seems to believe that a licensing market must exist or be cognizable at the time of a work’s creation for that market to be harmed by a later use. That is not what the law says. The law speaks to harm to “potential markets.” Potential future markets not realized at the time of creation are undeniably exploitable by a copyright owner, and the harm to such markets that occurs through unauthorized uses must be considered under the fourth fair use factor. There is an indisputable licensing market for literary works. Perhaps Judge Alsup would have realized that had he not issued this summary judgment decision well before discovery in the case was complete, which is highly unusual.

Finally, Judge Alsup’s fourth factor analysis includes a confounding comparison that says the plaintiffs’ allegations of market harm are like if they “complained that training schoolchildren to write well would result in an explosion of competing works.” Without going into the myriad ways that AI ingestion and generation of output differ from the human experience, Judge Alsup should understand that the speed, scale, and super-human ability of generative AI to memorize and regurgitate works that supplant those used for training make such an analogy absurd. Indeed, Judge Alsup’s flawed comparisons to the harms that might result from teaching schoolchildren to write were called out just two days later by fellow Northern District of California Judge Vincent Chhabria in his order on fair use in the Kadrey v. Meta case (which will be the subject of a forthcoming blog), with Chhabria explaining:

[U]sing books to teach children to write is not remotely like using books to create a product that a single individual could employ to generate countless competing works with a miniscule fraction of the time and creativity it would otherwise take. This inapt analogy is not a basis for blowing off the most important factor in the fair use analysis.

Piracy Analysis Hits Closer to the Mark

The second part of the order’s fair use discussion focuses on the digitization and use of lawfully acquired literary works for training. And while that analysis has its own serious flaws, we’ll leave that discussion for another blog and move on to what has the potential to be the most impactful part of the decision: the determination that using pirated works to create a “library” is not fair use, regardless of whether works in the collection are subsequently used for generative AI training, and is likely willful infringement that could expose AI companies to massive statutory damages.

A main argument among defendant AI developers who have either admitted or been found to have sourced pirated material for training is that such use cannot be infringing if it serves a subsequent fair use. AI companies point to a few cases that have held that bad faith or “unclean hands” should have no direct impact on a subsequent fair use analysis, but even if that were true, it’s clear that judges overseeing the current AI cases are not accepting the inverse—i.e., that a use which ultimately qualifies as fair use creates a backward-facing shield immunizing AI developers for using pirated works to build a database or “central library” that is then used to train a generative AI model.

The order describes how Anthropic downloaded over 7 million copies of pirated books, some of which it would use later to train its LLM, saying that it rejects Anthropic’s assertion that the use of the copies to build a “central library” can be excused as fair use merely because some will eventually be used to train LLMs. Judge Alsup explains:

This order doubts that any accused infringer could ever meet its burden of explaining why downloading source copies from pirate sites that it could have purchased or otherwise accessed lawfully was itself reasonably necessary to any subsequent fair use. There is no decision holding or requiring that pirating a book that could have been bought at a bookstore was reasonably necessary to writing a book review, conducting research on facts in the book, or creating an LLM. Such piracy of otherwise available copies is inherently, irredeemably infringing even if the pirated copies are immediately used for the transformative use and immediately discarded. (emphasis added)

The last point about the irrelevance of immediately discarding copies is a critical one because other AI companies facing lawsuits are likely to argue that, unlike Anthropic, they did not store or otherwise retain any pirated copies. Moreover, the use of copyrighted works for training in other cases, especially those that involve evidence of infringing output or the training of music or image generators, is far less likely to be found transformative. It’s therefore difficult to see how any AI company that’s used pirated copies to train its model would be free of any infringement liability, whether that liability attaches to the act of training or the act of collecting training material.

The order goes on to distinguish Anthropic’s use of pirated copies from the many past fair use cases that Anthropic claims support a finding of fair use, making clear that all of those cases involve use of a legitimate, not pirated, copy of a work. Judge Alsup explains that no authorized copies existed from which Anthropic made its first copies and that no past fair use case “fathomed gifting an ‘artificial head start’ to a fair user.”

While there are concerning gaps in the analysis that leave open the question of whether an AI company that, without authorization, collects training material from a dataset that does not contain pirated copies of works would be liable for infringement, the bottom line is clear: drawing training material from notorious internet-based “shadow libraries” of stolen works, as many AI companies facing lawsuits have done, is unequivocally not fair use.

Conclusion

Reporting on the decision, many commentators are chalking it up as a big win for generative AI companies. The reality is far less clear. While the fair use analysis related to training is troubling for copyright owners, the reasoning is flawed on multiple levels and likely to be corrected on appeal. What should be very concerning to AI companies is the court’s conclusion that Anthropic’s collection of pirated copies of works that were subsequently used for training is not fair use. If that same determination is applied widely across similar generative AI infringement cases, AI companies could face staggering damages and maybe, just maybe, realize that legitimately acquiring or licensing works for training is the best path forward.
