CEO of image library speaks out amid outrage over harvesting content for AI training data
Rishi Sunak faces a choice between supporting the UK’s creative sector and risking everything on the AI boom, according to Getty Images CEO Craig Peters. Peters, who has led the image library since 2019, is concerned about the use of creative material as AI training data, and Getty Images is suing AI image generators for copyright infringement in the UK and US. Pointing to the creative industries’ significant contribution to the UK’s GDP, Peters questions the wisdom of prioritizing AI, which currently accounts for less than a quarter of a percentage point of the country’s GDP.
In 2023, the government set out its objective to “address the challenges that AI firms and users encounter” when utilizing copyrighted material. This commitment, made in response to a consultation by the Intellectual Property Office, includes supporting AI companies in accessing copyrighted work as an input to their models.
This marked a retreat from a previous suggestion of a comprehensive copyright exception for text and data mining. Responding to a Commons committee on Thursday, Viscount Camrose, the hereditary peer and parliamentary under-secretary of state for artificial intelligence and intellectual property, stated, “We will adopt a balanced and practical stance on the concerns raised, aiming to maintain the UK’s global leadership in AI while bolstering our flourishing creative sectors.”
The use of copyrighted material in AI training is facing growing scrutiny. In the United States, the New York Times has filed a lawsuit against OpenAI, the creator of ChatGPT, and Microsoft for incorporating its news stories into the training data for their AI systems. While OpenAI has not disclosed the specific data used to train GPT-4, the newspaper was able to prompt the system into reproducing exact quotes from New York Times articles.
In a legal filing, OpenAI argued that constructing AI systems without utilizing copyrighted materials is not feasible. The company stated, “Restricting training data to public domain books and drawings from over a century ago might result in an intriguing experiment, but it would not yield AI systems that meet the demands of contemporary citizens.”
Peters holds a different perspective. Getty Images, in partnership with Nvidia, has developed its own image generation AI, trained exclusively on licensed imagery. Peters rejects the argument that such technologies cannot exist under a licensing requirement, stating, “I don’t think that’s the case at all. You need to take different tacks, different approaches, but the notion that there isn’t the capability to do that, that’s just smoke.”
Even within the industry, the trend is shifting. Books3, a dataset of pirated ebooks hosted by an AI group whose copyright takedown policy was a video of a choir of clothed women simulating masturbation while singing, was quietly taken down after protests from the authors whose work it contained. By then, however, it had already been used to train various AI models, including Meta’s LLaMA.
In addition to legal actions by Getty and the New York Times, numerous other lawsuits are advancing against AI companies regarding potential infringement in their training data.
In September, John Grisham, Jodi Picoult and George RR Martin were among 17 authors who sued OpenAI, alleging “systematic theft on a mass scale.” The previous January, a group of artists filed a lawsuit against two image generators, one of the first cases of its kind to enter the US legal system.
Ultimately, whatever courts or governments decide about the regulation of copyrighted material for training AI systems may not settle the matter. Several AI models, including text-generating LLMs and image generators, have been released as “open source,” free to download, share and reuse without oversight. A ban on using copyrighted material to train new systems will not remove existing models from the internet, and is likely to do little to prevent people from using new material to retrain, improve and re-release those models in future.
Peters remains hopeful that the outcome is not predetermined. “Those generating and disseminating the code are ultimately accountable through legal entities,” he said. “The issue of what you run on your laptop or phone may be more ambiguous, but individual responsibility comes into play.”