OpenAI says it’s “impossible” to create useful AI models without copyrighted material

ChatGPT developer OpenAI recently acknowledged the necessity of using copyrighted material in the development of AI tools like ChatGPT, The Telegraph reports, saying such tools would be “impossible” to create without it. The statement came as part of a submission to the UK’s House of Lords communications and digital select committee inquiry into large language models.

AI models like ChatGPT and the image generator DALL-E gain their abilities from training sessions fed, in part, by large quantities of content scraped from the public Internet without the permission of rights holders (although in OpenAI’s case, some of the training content is licensed). This sort of free-for-all scraping is part of a longstanding tradition in academic machine learning research, but because deep learning AI models have recently gone commercial, the practice has come under intense scrutiny.

“Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials,” wrote OpenAI in the House of Lords submission.

Further, OpenAI writes that limiting training data to public domain books and drawings “created more than a century ago” would not provide AI systems that “meet the needs of today’s citizens.”

This statement follows a lawsuit filed last month by The New York Times against OpenAI and Microsoft, a significant investor in OpenAI, for allegedly using the newspaper’s content unlawfully in their products. OpenAI responded to the lawsuit on its website on Monday, claiming that the suit lacks merit and affirming its support for journalism and partnerships with news organizations.

OpenAI’s defense largely rests on the legal principle of fair use, which permits limited use of copyrighted content without the owner’s permission under specific circumstances. The company asserts that copyright law does not prohibit the training of AI models with such material.

“Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents,” OpenAI wrote in its Monday blog post. “We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness.”

This is not the first time OpenAI has claimed fair use regarding its AI training data. In August, we reported on a similar situation in which OpenAI defended its use of publicly available materials as fair use in response to a copyright lawsuit involving comedian Sarah Silverman.

OpenAI claimed that the authors in that lawsuit “misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence.”
