The New York Times Company filed a landmark copyright infringement lawsuit against OpenAI and Microsoft in the U.S. District Court for the Southern District of New York, alleging that the two companies used millions of Times articles, without permission or payment, to train the artificial intelligence models that power ChatGPT, Copilot, and a growing ecosystem of AI-driven products. The complaint, running 69 pages, seeks billions of dollars in statutory and actual damages and asks the court to order the destruction of any AI models or training datasets that incorporate Times content.

The case is the most consequential legal challenge to date in the rapidly evolving collision between generative AI and intellectual property law. While dozens of authors, artists, and smaller publishers have filed similar suits, the Times brings a unique combination of institutional heft, legal resources, and cultural authority. It is also the first major publisher to explicitly reject the licensing approach that OpenAI has pursued with other media companies, framing the case not as a negotiation tactic but as a fundamental fight over whether AI companies can build multi-billion-dollar businesses on the backs of other people's work.

"This lawsuit is about the most basic principle of copyright: that people and institutions that create original work should control how that work is used and should be compensated when others profit from it," said A.G. Sulzberger, the Times's publisher, in a statement. "If the law does not protect the work of journalists, it will not protect the work of anyone."

The complaint documents hundreds of instances in which ChatGPT and Microsoft's Copilot reproduced Times articles nearly verbatim when prompted with specific queries. In one exhibit, a user prompt asking about a 2023 Times investigation into New York City taxi medallion fraud produced a response that replicated 16 consecutive paragraphs of the original article, including direct quotes and specific data points. Another exhibit showed Copilot generating a response to a health-related query that was nearly identical to a Times Wirecutter product review, including the reviewer's subjective opinions and purchasing recommendations.

"OpenAI didn't create the knowledge in ChatGPT. It harvested that knowledge from the people and institutions that spent decades and billions of dollars producing it. That's not innovation. That's free-riding on an industrial scale."
– A.G. Sulzberger, Publisher, The New York Times

OpenAI responded to the lawsuit with a statement calling the Times's claims "without merit" and arguing that the training of AI models on publicly available text is protected by the fair use doctrine. "The New York Times is not the only source of information in the world, and our models do not memorize or reproduce their articles," the statement read. "Training AI on publicly available data is fair use, consistent with how every search engine, research tool, and text-mining system has operated for decades." Microsoft declined to comment beyond saying it "supports OpenAI's position."

The fair use defense is the central legal battleground. Under U.S. copyright law, fair use permits the unlicensed use of copyrighted material under certain conditions, typically evaluated across four factors: the purpose and character of the use, the nature of the copyrighted work, the amount used relative to the whole, and the effect on the market for the original work. OpenAI argues that training an AI model is a "transformative" use (the model doesn't reproduce the articles but learns patterns from them) and therefore falls squarely within fair use protections.

Legal scholars are deeply divided. Rebecca Tushnet, a Harvard Law School professor who specializes in copyright, said that the transformative use argument has some merit but faces significant challenges. "The question the court has to answer is whether an AI model that can reproduce near-verbatim copies of copyrighted articles is really 'transforming' those articles or just storing them in a more sophisticated way," she said. "The reproduction evidence in the Times complaint is some of the strongest I've seen in any AI copyright case."

The economic stakes are enormous, and not just for the Times. A ruling in the Times's favor could establish that AI companies must license training data from every content creator whose work they use, a requirement that would fundamentally alter the economics of the AI industry. OpenAI's training datasets are estimated to include text from tens of millions of copyrighted sources, including books, academic journals, news articles, blog posts, and social media content. Retroactive licensing at market rates could cost tens of billions of dollars.

Conversely, a ruling in OpenAI's favor would effectively establish that training AI on copyrighted material is legal without permission or payment, a precedent that publishers, authors, and artists warn would strip creators of any leverage in the AI economy. "If the court says this is fair use, every media company in the world becomes an unpaid supplier to the richest technology companies on earth," said Danielle Coffey, CEO of the News Media Alliance, which represents more than 2,000 news organizations.

The lawsuit has already reshaped the licensing landscape. Since the Times filed its complaint, OpenAI has accelerated its efforts to sign licensing deals with other publishers, reportedly paying more than $100 million in combined annual fees to The Associated Press, Axel Springer, Le Monde, and several other media organizations. Critics, including the Times, argue that these deals are designed to isolate the paper and create a narrative that most of the industry supports OpenAI's approach. Supporters counter that the deals demonstrate OpenAI's willingness to compensate content creators and that the Times's refusal to negotiate reflects a litigation strategy rather than a principled position.

The case is expected to take years to resolve, with a trial unlikely before 2027 or 2028. In the interim, its shadow looms over every company building or deploying generative AI. "This isn't just about the Times and OpenAI," said James Grimmelmann, a digital copyright scholar at Cornell Law School. "This is the case that will define whether the AI industry operates on a foundation of licensed content or unlicensed extraction. The answer to that question will determine the shape of the AI economy for the next fifty years."