Licensing content to train AI: a new revenue-generating opportunity for UK universities?
Artificial intelligence (AI) companies are turning to research and content organisations as sources of data to train their AI tools. For universities, which are information-rich organisations, this presents a revenue-generating opportunity.
The hunger for training data
AI tools need to be trained on vast quantities of content, information and data. The general rule is that the better the input quality, the better the output quality. Some AI companies have trained their tools on data scraped or mined from the internet. This poses significant legal challenges because, on the face of it, the bulk scraping of content from internet sources looks like a clear-cut infringement of copyright, and generative AI outputs that reproduce third-party content may also infringe.
The principal battleground is in the US, whose “fair use” rules are wider than in the UK and give AI businesses some basis to argue that their use of website content is “transformative” and therefore permitted. In the UK, a text and data mining exception to copyright infringement is provided for in legislation. However, it is of limited value to AI businesses because, for the moment at least, it only applies in a non-commercial context. In the EU, rights holders can opt out of a general third party right to scrape and mine their content.
The unauthorised scraping of content has led to legal claims, primarily in the US, including a high-profile lawsuit filed by The New York Times against OpenAI, the owner of ChatGPT, in December 2023. Getty Images is suing Stability AI in both the US and the UK over the alleged scraping of 12 million images from Getty’s image collection. These cases continue. If they run their full course, they may eventually provide clearer guidance on the reach of copyright in this new area.
Current trends in data and content licensing in the AI field
Both AI companies and publishers are lobbying governments to legislate to clarify the law, in each case in their own favour. In the meantime, AI companies are starting to do deals with content generators: agreeing to pay a licence fee in return for access to and use of content, information and data for AI training purposes.
We are already seeing this play out with news media and publisher organisations and it will be interesting to see how entities in the Higher Education sector may leverage their own position here. Recent notable deals include:
- OpenAI announced deals during 2023 and 2024 with a number of media businesses, many of them politically influential, including Axel Springer (the publisher of Politico), News Corp (reportedly for a value of $250m over five years), the Financial Times and Condé Nast.
- Major textbook and academic journal publishers Wiley and Taylor & Francis (part of Informa) have disclosed significant deals with unnamed AI businesses. Similarly, Google is reported to have struck a $60m deal with Reddit.
An opportunity for universities?
Universities are the owners or custodians of vast amounts of high-quality information, content and data, whether generated by research or otherwise. This may be a less immediately obvious source of content for AI companies than (for example) news media content, but we would expect it to be of real interest to them on the basis of: (i) the quality of the content; (ii) the quantity of the content; and (iii) the fact that university content, information and data is less likely than news media content to be made fully available online in a “scrapable” format.
The challenges
Of course, the rights position for university content is often not straightforward. Any exploitation of academic publications, research output or other datasets would need to be carefully aligned with the university’s intellectual property policy, research funder terms, the expectations of faculty, researchers and staff, and the university’s principles and values (and, if relevant, data protection policies and laws).
Some universities may also be subject to the Re-use of Public Sector Information Regulations 2015 which may, in certain specific circumstances, require them to make certain content or data available to third parties for re-use (commercial or non-commercial) on transparent, non-exclusive and non-restrictive terms at marginal cost. Universities which are subject to the Freedom of Information Act 2000 will also likely need to prepare themselves for incoming requests for details about the agreements they reach with AI companies (although there are various exemptions they may be able to rely on to protect sensitive commercial information).
Things to consider include:
- Research publications: Academic custom provides that universities often waive their rights in research publications in favour of the researcher. If universities want to license research publications it will be necessary to check whether the university has all necessary rights to do that and, if it does not (which is likely, given academic custom), options might include consulting with research staff and working with them to develop an AI licensing policy in return for appropriate acknowledgement and remuneration.
- Research datasets: Research datasets may be owned by the university or, sometimes, by the research funder. Faculty may well, and with good reason, carefully control access to research datasets (and funder terms might require this, although charitable funder terms will more often than not require results of research to be made available on as open a basis as possible). Even if universities own the rights in research datasets, the licensing of these would need to be handled with care and with due regard to any faculty expectations, funder terms and confidentiality restrictions.
- Personal data: Where datasets include personal data, any restrictions on its use will also need to be respected.
- Other information, content and datasets: Universities are likely to hold vast quantities of non-research information, content and data. This may be more straightforward to license to AI companies subject to any relevant IP rights, contractual terms and data protection requirements.
What should universities look out for in licence agreements?
If the rights position is in order and there is researcher and faculty buy-in, we expect universities to start considering deals with AI companies as a way of adding new, incremental revenue streams. This inevitably means a close focus on the contract terms and conditions, which will usually be drawn up by the AI businesses, typically big tech companies.
Key questions for universities to consider when negotiating deals with AI businesses are as follows:
- What content, information and data are you licensing? Is this just existing content, or are you committing to provide new, future content, information and data generated over the contract term? If so, for how long? A commitment to provide future content may be higher risk, depending on how tightly you can define permitted uses.
- How easily can you gather the data together and make it available? AI business customers are likely to seek API delivery of content promptly after any agreement is signed.
- What uses are you permitting? Most importantly, is this a deal for “training” use only, or does it extend to output display of your content, eg the right/obligation to reproduce quotations, summaries, citations, links in output results? If the latter, how tightly can you regulate this in negotiations with major AI businesses which are likely to want wide-reaching rights, not least because of the speed at which the industry is developing?
- What use parameters and restrictions can you insist on? These deals are likely to take most universities out of their comfort zone. What limits do you need to nail down? If terms are imprecise, can you insist on an early termination right if AI products using your content cause you real problems down the line? What happens to your data on termination? Big tech AI businesses are likely to negotiate hard, but this is a new industry and norms are not yet fully established.
- What about content authors and upstream rights-holders? As mentioned above, a key question is whether the content, information and data are yours to freely license. Some textbook and journal publishers have attracted significant pushback and criticism from authors who were not consulted before AI deals were struck, even where it appears the publishers in question did control the rights.
- What price are you being offered/can you negotiate? Reported deals from other sectors indicate a range of pricing approaches, from a basic flat rate per word to efforts at a revenue share arrangement where the content provider shares in the commercial upside of the AI products created with their content.
This is a rapidly evolving area, and we will continue to provide updates and support on it to clients and contacts in the Higher Education sector.
Please contact Henry Sainty, Natalie Rimmer, David Copping or Ethan Ezra, or your usual contact at the firm, if you have any queries or would like to discuss the issues referred to in this note in more detail.
This publication is a general summary of the law. It should not replace legal advice tailored to your specific circumstances.
© Farrer & Co LLP, December 2024