What just happened? The ongoing controversy over potential copyright infringements related to large language models’ training data has taken a significant turn. The New York Times has sued OpenAI and Microsoft for using millions of its articles to train their systems without permission or compensation.
It’s no secret that LLMs use swaths of information from the internet as training data, but the NYT claims in its copyright infringement lawsuit that its content has been given “particular emphasis.” The suit, filed in Manhattan federal court, claims that the companies “seek to free-ride on the Times’s massive investment in its journalism by using it to build substitutive products without permission or payment.”
The suit states that millions of the Times’ copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more were used to train the chatbots, which now compete with the news outlet as a source of information.
The lawsuit also highlights cases in which Bing misattributed content to the publication. One example was a list of “the 15 most heart-healthy foods,” twelve of which had not been mentioned in the Times story. Another claim is that the chatbots reproduce verbatim excerpts from NYT articles, meaning the publication is losing viewers and paying customers to the likes of ChatGPT.
The suit says the defendants should be held responsible for “billions of dollars in statutory and actual damages.” It also requests that the companies destroy any chatbot models and training data that use copyrighted material from The Times. OpenAI believes its use of NYT content falls under “fair use” because it serves a new “transformative” purpose.
“The suit also spends a good bit of time showing how its content is found in public datasets, such as WebText2, and is also weighted heavily there because of its perceived quality.”
– MatthewBerman (@MatthewBerman) December 28, 2023
It was reported back in August that the Times had been in “tense negotiations” over a licensing deal with OpenAI and Microsoft that would allow the former to legally train its GPT models on material published by the Times, something the newspaper had previously decided to prohibit. But the talks broke down, leading to the current lawsuit. OpenAI already has an agreement in place with the Associated Press to use its content for training purposes.
Data scraping has made numerous headlines this year. Elon Musk threatened to sue Microsoft in April over a claim that it was illegally using data from Twitter (as it was still called then) to train AI models. In July, more than 8,000 authors, including luminaries such as James Patterson, Margaret Atwood, and Jonathan Franzen, signed an open letter asking the leaders of six top AI companies not to use their work for training models without first obtaining consent and offering compensation. OpenAI has also been sued by authors on several occasions for copyright infringement.
In a separate but similar case, artists filed a copyright lawsuit against AI art generators Stable Diffusion and Midjourney in January.