A new study suggests that OpenAI’s models have 'memorized' copyrighted content
OpenAI models may have memorized copyrighted content.

A collaborative study led by researchers from the University of Washington, University of Copenhagen, and Stanford University examines whether OpenAI's models, notably GPT-4 and GPT-3.5, have memorized copyrighted content. The method relies on "high-surprisal" words, meaning words that are statistically unlikely given their surrounding context. The researchers obscure such words in a passage and ask the model to fill them in; when the model predicts the obscured word accurately, that is taken as a sign it memorized the passage during training. The analysis found signs that the models had memorized portions of fiction books in BookMIA, a dataset containing samples of copyrighted ebooks. Signs of memorization also appeared in New York Times articles, though at a lower rate.
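The probing idea described above can be illustrated with a minimal sketch. This is not the study's code: it assumes a toy unigram frequency model as a stand-in for whatever the researchers used to estimate surprisal, and `guess_fn` is a hypothetical callable standing in for a query to the model under test.

```python
import math
from collections import Counter
from typing import Callable, List, Tuple

MASK = "[MASK]"  # placeholder token; an assumption for this sketch


def unigram_surprisal(corpus_tokens: List[str]) -> Callable[[str], float]:
    """Build a surprisal function, -log2 p(word), from add-one-smoothed counts.

    A unigram model ignores context; it is only a simple stand-in here.
    """
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot reserved for unseen words

    def surprisal(word: str) -> float:
        return -math.log2((counts[word] + 1) / (total + vocab))

    return surprisal


def mask_most_surprising(
    tokens: List[str], surprisal: Callable[[str], float]
) -> Tuple[List[str], str]:
    """Replace the highest-surprisal token with MASK and return the hidden word."""
    idx = max(range(len(tokens)), key=lambda i: surprisal(tokens[i]))
    masked = tokens.copy()
    target = masked[idx]
    masked[idx] = MASK
    return masked, target


def memorization_hit(
    tokens: List[str],
    surprisal: Callable[[str], float],
    guess_fn: Callable[[List[str]], str],
) -> bool:
    """Return True if guess_fn recovers the masked high-surprisal word exactly."""
    masked, target = mask_most_surprising(tokens, surprisal)
    return guess_fn(masked) == target
```

In practice, `guess_fn` would wrap a call to the model being audited, and a single hit would mean little on its own; the rate of correct guesses over many passages is what would be compared against a baseline.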
Abhilasha Ravichander, a doctoral candidate at the University of Washington and a co-author of the study, underscores its importance: "In order to have large language models that are trustworthy, we need to have models that we can probe and audit and examine scientifically." Her statement points to a broader implication for ethical AI practice: the need for data transparency and closer scrutiny of how models are trained.
The study feeds into an already contentious debate over copyright and AI, as OpenAI faces multiple lawsuits. Rights-holders argue that U.S. copyright law contains no provision permitting the unapproved use of their work as training data. OpenAI invokes fair use in its defense; the plaintiffs counter that current legislation makes no specific allowance for AI training.
OpenAI favors looser restrictions on the use of copyrighted data in AI development. Although the company has signed content licensing agreements and offers opt-out procedures for copyright holders, it has also lobbied governments around the world for fair use provisions tailored to AI. These efforts illustrate the tension between technological innovation and existing intellectual property frameworks.
The study raises ethical questions for generative AI as well as legal ones. As AI models are deployed across more industries, maintaining ethical standards remains crucial. That calls for industry professionals and legislators to settle on a balanced approach, one that respects intellectual property while still allowing technological progress.
Sources: University of Washington, University of Copenhagen, Stanford University, OpenAI, TechCrunch