Researchers propose that OpenAI trained its AI models using paywalled books from O'Reilly
OpenAI may have used O'Reilly books without authorization to train its AI models.

Researchers from the AI Disclosures Project, a nonprofit organization co-founded in 2024 by Tim O'Reilly and Ilan Strauss, claim that OpenAI’s GPT-4o model has been trained using paywalled books from O'Reilly Media, without a licensing agreement in place. The paper contrasts GPT-4o's knowledge with that of earlier models, such as GPT-3.5 Turbo, to suggest that the newer model recognizes more content from O'Reilly’s books, which are not publicly available.
The researchers used DE-COP, a type of membership inference attack that tests whether a model can reliably distinguish human-authored text from AI-generated paraphrases of the same text. If the model picks out the original more often than chance would predict, it has likely seen that text during training. On this basis, the researchers conclude that GPT-4o likely had prior exposure to a considerable amount of copyrighted material. However, they admit that the approach is not foolproof and allow for the possibility that the data was obtained indirectly.
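To make the idea concrete, here is a minimal Python sketch of a DE-COP-style quiz. It is an illustration under stated assumptions, not the researchers' code: the four-option format (one original passage plus three paraphrases), the function names `build_quiz` and `guess_rate`, and the `pick_option` callable standing in for a real model query are all hypothetical.

```python
import random
from typing import Callable, Sequence


def build_quiz(
    verbatim: str, paraphrases: Sequence[str], rng: random.Random
) -> tuple[list[str], int]:
    """Shuffle the original passage in among its paraphrases.

    Returns the shuffled options and the index of the original passage.
    """
    candidates = [verbatim, *paraphrases]
    order = list(range(len(candidates)))
    rng.shuffle(order)
    options = [candidates[i] for i in order]
    # Index 0 in `candidates` is the verbatim passage; find where it landed.
    return options, order.index(0)


def guess_rate(
    items: Sequence[tuple[str, Sequence[str]]],
    pick_option: Callable[[list[str]], int],
    seed: int = 0,
) -> float:
    """Fraction of quizzes in which the model picks the original passage.

    `pick_option` is a hypothetical stand-in for querying a model: it
    receives the shuffled options and returns the index it believes is
    the human-authored original.
    """
    rng = random.Random(seed)
    correct = 0
    for verbatim, paraphrases in items:
        options, answer = build_quiz(verbatim, paraphrases, rng)
        if pick_option(options) == answer:
            correct += 1
    return correct / len(items)
```

With three paraphrases per passage, a model that has never seen the text should score near the chance rate of 0.25. A rate well above that on paywalled passages, compared with passages published after the model's training cutoff, is the kind of signal the approach relies on.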
The findings are further limited by the fact that the researchers did not evaluate OpenAI's most recent models, such as GPT-4.5. These newer models may rely less on O'Reilly's paywalled data than GPT-4o does, or not at all. Despite these uncertainties, the paper adds to ongoing discussions about AI companies’ hunger for high-quality curated data and OpenAI’s stance on copyright restrictions.
For its part, OpenAI has been transparent about acquiring data through various legal agreements with publishers and other entities, and it offers mechanisms for copyright owners to opt out, though these are not flawless. Nonetheless, OpenAI is already embroiled in multiple lawsuits in U.S. courts challenging its data collection practices, which puts its use of copyrighted content under increased scrutiny.
The broader context of this development includes the tech industry’s trend towards employing domain experts to enrich AI models with specialized knowledge. OpenAI’s actions illustrate a prevailing tension between the need for expansive training data and adherence to copyright laws. The allegations in the AI Disclosures Project's paper add a new dimension to the ethical considerations surrounding AI model training practices.
Sources: AI Disclosures Project, Tim O'Reilly, Ilan Strauss, Sruly Rosenblat