AI, the creative industries and content scraping: battleground or marketplace?

Insight

12.12.2024

Generative AI services have surged into prominence over the past 18 months and are expected to transform society and the economy. These technologies rely on vast quantities of third-party content, so what are the copyright implications, and what kind of marketplace is developing?

This article looks briefly at the copyright background before exploring the emerging market and the key issues to consider when striking data scraping licence agreements.

Copyright implications: far from clear-cut

At first glance, the bulk scraping of content from internet sources appears to be a straightforward infringement of copyright, as may the serving up of generative AI outputs which draw on third party content. But AI businesses argue that exceptions to copyright protection apply. The legal position varies from territory to territory. The principal battleground is the US, where “fair use” rules are wider than in the UK and give AI businesses some basis to argue that their uses of website content are “transformative” and therefore permitted. In the UK, a text and data mining exception might help AI businesses to avoid copyright infringement, but for the moment at least it only applies in a non-commercial context. The EU is introducing its own rules under the EU AI Act.

Litigation or legislation?

Unauthorised content scraping has led to legal claims, primarily in the US, where The New York Times filed a high-profile lawsuit against OpenAI, the owner of ChatGPT, in December 2023. Additionally, Getty Images is suing Stability AI in both the US and the UK over the alleged scraping of 12 million images from Getty’s image collection. These cases continue. If they run their full course, they may eventually provide clearer guidance on the reach of copyright in this new area.

At the same time, both sides of the argument are lobbying governments to legislate to clarify the law, in each case in their own favour. In the UK, for example, AI companies want the Government to extend the text and data mining exception to include commercial use; the creative industries want new laws to embody transparency, consent and enforcement principles. While Prime Minister Keir Starmer told the News Media Association in October 2024 that “we recognise the basic principle that publishers should have control over and seek payment for their work, including when thinking about the role of AI”, the Government is now looking to reconcile support for the creative industries with a desire to attract AI investment and activity to the UK. It is also conscious that this is an issue which would greatly benefit from an international approach.

Or – Licence agreements offering incremental revenue streams

Meanwhile, the industry isn’t waiting for the courts or government, deepening the frustration for publishers and other content generators as AI businesses continue with the relentless scraping of content. It isn’t all bad news though: tools to block scraping by crawlers scavenging for AI training data are getting more effective and sophisticated, as are analytics allowing website owners to measure bot crawling rates. All this makes it harder for AI businesses to gather website data with impunity, and creates more favourable conditions for a negotiated value exchange.

As publishers get better at blocking content crawlers, AI businesses have sought to gain easier access, and also to hedge their legal position, by entering into commercial agreements with some major content generators. OpenAI announced deals during 2023 and 2024 with media businesses, many of them politically influential, including Axel Springer (the publisher of Politico), News Corp (reportedly worth up to $250m over five years), the Financial Times and Condé Nast.

OpenAI isn’t alone, and deals are not confined to news media. Major textbook and academic journal publishers Wiley and Taylor & Francis (part of Informa) have disclosed significant deals with unnamed AI businesses. And Google is reported to have struck a $60m deal with Reddit.

Reported deal values are not transformative, but they often promise an attractive incremental boost to the bottom line, provided the terms of the deal protect the content provider’s more traditional content-related revenue streams against cannibalisation by any permitted AI service.

Can smaller content providers expect to share in this AI content market?

Reported licence agreements tend to suggest that larger content providers, particularly those with significant political clout, are at the top of AI businesses’ shopping lists when it comes to agreeing licence deals, particularly in this uncertain period before existing copyright law is clarified by the courts or by legislation. Some of these deals are taking months to negotiate (and some, like the projected New York Times/OpenAI deal, are foundering). So what hope is there for smaller content providers, who may themselves lack the time or resources to embark on a long negotiation?

As AI businesses look to develop more bespoke, enterprise version AI models, they need access to more specialised datasets which are harder to scrape. And as enterprise version customers look to see quotations and source citations in display outputs, the AI businesses are likely to find that fair use is harder to argue. So there are opportunities for content providers of all sizes, particularly those generating specialist content which isn’t routinely published on the web. Academic publisher Wiley, which earned $44m from two deals in 2024, has reported strong “demand in specific disciplines and verticals”.

Data brokerage services are developing to meet the needs of this emerging market, with the aim of streamlining the licensing process for both AI businesses and smaller publishers. Well-funded London-based start-up Human Native is just one of several businesses now involved in protecting and pricing datasets, and pooling them. It describes itself as “bringing together rights holders and AI developers – helping rights holders get compensation for copyrighted works; enabling AI developers to responsibly acquire high quality data”.

What should content providers look out for in licence agreements?

All publishers and content owners will want these deals to be incremental, adding new revenue streams to existing business lines rather than replacing them. This inevitably means a close focus on the contract terms and conditions, which will typically be drawn up by the AI businesses (often big tech companies), and which many content providers will find represent a step into the unknown. Key questions for content providers to consider when negotiating deals with AI businesses are as follows:

What content are you licensing? Do you control the rights in question? Is this just existing, “backlist” content, or are you committing to provide new, future content generated over the contract term? And if so, for how long? Do you definitely own sufficient rights in the content to license it on to AI businesses for their intended uses? What is the likely reaction of any upstream rightsholders (most obviously authors)? Some textbook and journal publishers have attracted significant pushback and criticism from authors who were not consulted before AI deals were struck, even where it appears the publishers in question did control the rights.
How easily can you gather the data together and make it available? AI business customers are likely to seek API delivery of content, promptly after any agreement is signed. Some content providers will find this very easy; for others it can be more of a challenge.
What uses are you permitting? Most particularly, is this a deal for “training” use only, or does it extend to output display of your content, for example the right/obligation to reproduce quotations, summaries, citations, links in output results? If the latter, how tightly can you regulate this in negotiations with major AI businesses which are likely to want wide rights, not least because of the speed at which the industry is developing?
What use parameters and restrictions can you insist on? These deals are likely to take most content providers out of their comfort zone (where publishing/licensing/syndication deals typically cover very tightly defined rights, strictly for the duration of the commercial agreement). But what limits do you need to nail down to avoid the risk that the deal may cannibalise your existing licensing and revenue streams? If terms are imprecise, can you insist on an early termination right if AI products using your content cause you real problems down the line? What happens to your data on termination? Big tech AI businesses are likely to negotiate hard, but this is a new industry and norms are not yet fully established.
What price are you being offered/can you negotiate? Reported deals indicate a range of pricing approaches, from a basic flat rate per word to efforts at a revenue share arrangement where the content provider shares in the commercial upside of the AI products created with their content. Some deals also involve part of the AI business’ consideration being delivered as value in kind, such as services and access to AI products.

This publication is a general summary of the law. It should not replace legal advice tailored to your specific circumstances.