Court documents reveal that Anthropic destroyed millions of books to train its AI

Anthropic's AI training involved destroying millions of books, leading to a pivotal legal ruling in their favor.

Anthropic, an AI company, destroyed millions of physical books, scanning their pages to train Claude, its competitor to ChatGPT. This controversial practice helped it win a legal battle: the court deemed the print-to-digital conversion transformative and therefore protected under fair use. However, the company still faces a copyright trial over its use of pirated materials, with potential damages of up to $150,000 per work. Legal disputes over AI's use of copyrighted content persist across the industry, highlighting the complexity of the issue.

Court documents have revealed that Anthropic spent millions of dollars purchasing and destroying physical books to train its AI models. In early 2024, the company hired Tom Turvey, formerly of Google Books, with the mission to "collect every book on Earth." Under his direction, Anthropic acquired millions of second-hand books, cut off their bindings, scanned each page into digital files, and discarded the physical copies. These scanned PDFs were then used to train Claude, Anthropic’s AI assistant.

Judge William Alsup ruled that digitizing and destroying the physical copies constituted a “transformative use,” akin to reformatting for storage efficiency, and was therefore protected under the U.S. fair use doctrine — provided the digital versions were not redistributed. Anthropic argued that destroying the originals reinforced their intent to use the books solely for internal AI model training.

However, the court also found that Anthropic had sourced over 7 million books from pirate sites such as LibGen and Pirate Library Mirror, a practice that fair use does not protect. As a result, that portion of the case will go to trial in December. If found liable, Anthropic could face damages of up to $150,000 per infringed work.

Documents further revealed that Anthropic initially attempted to negotiate licenses with publishers but ultimately turned to second-hand book markets for reasons of cost and speed. The process reportedly cost several million dollars, and the books were acquired mainly from major resellers, avoiding rare or limited-edition titles.

In summary, the court's decision split the issue: scanning and destroying legally purchased books fell under fair use, while the use of pirated content likely violated copyright law. The outcome of the December trial may set a major precedent for how printed books can be used in AI training, with potential ripple effects for OpenAI, Microsoft, Meta, and others facing similar lawsuits.

Sources: The Guardian, Washington Post, TechCrunch, Ars Technica, Reuters