OpenAI is pushing back against extensive document disclosure requests in a high-profile copyright lawsuit brought by the Authors Guild, arguing that the scope of discovery demands has become unmanageable. In a letter filed Wednesday in New York federal court, OpenAI lawyer Carolyn M. Homer revealed that files requested from eight additional employees would total over 886,000 documents comprising hundreds of gigabytes of data.
The disputed custodians include some of OpenAI’s most prominent figures, notably former chief scientist and cofounder Ilya Sutskever and researcher Jan Leike, both of whom left the company in May 2024—Leike departing for rival firm Anthropic. Other named custodians include technical staff members Chelsea Voss, Shantanu Jani, Jong Wook Kim, pretraining data lead Qiming Yuan, and former employees Andrew Mayne and Cullen O’Keefe.
The core of the lawsuit centers on allegations that OpenAI trained its AI models on copyrighted books without obtaining authors’ permission. OpenAI has already agreed to produce documents from 24 custodians but is now contesting both the proposed search parameters and requests for files from eight additional custodians, citing concerns about the massive resources required for review.
According to Homer’s filing, reviewing the existing 24 custodians using OpenAI’s proposed search terms would require examining more than 460,000 documents totaling 359 gigabytes. However, using the Authors Guild’s broader proposed terms would balloon this to over 1 million documents. Adding the eight disputed custodians with OpenAI’s search terms would push the total beyond 375 gigabytes, exceeding the volume already agreed upon.
OpenAI’s legal team also highlighted a 71% duplication rate between files from the disputed custodians and the existing 24, suggesting significant overlap that could make the additional review inefficient. The company indicated it would continue negotiating with plaintiffs to reach a more manageable agreement.
This dispute represents the latest development in the class-action lawsuit brought by the Authors Guild, an organization supporting writers’ rights. Previously unsealed documents revealed that OpenAI deleted two datasets—“books1” and “books2”—used to train the older GPT-3 model. Authors Guild lawyers claim these datasets may have contained more than 100,000 published books.
OpenAI faces multiple copyright infringement cases, including a prominent lawsuit from The New York Times, as the AI industry grapples with fundamental questions about training data and intellectual property rights.
Key Quotes
“more than 460,000 documents totaling 359 gigabytes”
OpenAI lawyer Carolyn M. Homer described the volume of documents the company would need to review from the existing 24 custodians using OpenAI’s own proposed search terms, illustrating the massive scale of discovery in this copyright case.
“substantial volume of hits”
Homer cited this phrase in arguing that the discovery burden had become excessive and that OpenAI would continue negotiating with plaintiffs, particularly over files from Ilya Sutskever and other disputed custodians.
“more than 100,000 published books”
Authors Guild lawyers claimed that OpenAI’s deleted “books1” and “books2” datasets may have contained this many published books, highlighting the potential scale of alleged copyright infringement in training GPT-3.
Our Take
This legal battle exposes a fundamental tension in AI development: the need for massive, diverse training datasets versus intellectual property rights. OpenAI’s resistance to discovery requests, particularly involving Sutskever’s files, suggests the company may be concerned about what internal communications could reveal about their awareness of copyright issues. The 71% duplication rate argument appears to be a tactical move to limit discovery scope, but it also inadvertently confirms that multiple senior employees were involved in decisions about training data.

The deletion of the “books1” and “books2” datasets raises questions about evidence preservation and whether OpenAI anticipated legal challenges. As AI companies face mounting legal pressure, this case could force the industry toward more transparent data sourcing practices and potentially expensive licensing agreements with publishers and authors, fundamentally changing AI economics.
Why This Matters
This case represents a critical legal battleground that could reshape how AI companies access and use training data. The outcome will likely establish precedents for the entire AI industry regarding copyright compliance, data transparency, and the rights of content creators whose work may have been used without permission.
The involvement of high-profile figures like Ilya Sutskever—one of OpenAI’s cofounders and a leading AI researcher—signals that plaintiffs are targeting individuals with direct knowledge of training data decisions. This strategy could expose internal communications and decision-making processes that reveal how OpenAI approached copyright considerations during model development.
The sheer volume of documents at stake—potentially over 1 million files—demonstrates the complexity and scale of modern AI development, where training datasets can encompass vast amounts of copyrighted material. As AI models become more powerful and commercially valuable, the legal framework governing their training data becomes increasingly important for both AI companies and content creators. This case, along with similar lawsuits from The New York Times and other publishers, will likely influence future AI development practices, licensing agreements, and potentially new legislation governing AI training data rights.
Related Stories
- Elon Musk Drops Lawsuit Against ChatGPT Maker OpenAI, No Explanation
- OpenAI CEO Sam Altman Hints at Potential Restructuring in 2024
- OpenAI’s Valuation Soars as AI Race Heats Up
- Sam Altman’s Bold AI Predictions: AGI, Jobs, and the Future by 2025
- Photobucket is licensing your photos and images to train AI without your consent, and there’s no easy way to opt out