Training Data and Copyright: The Scraping Legal Battle

Umma Gohil

Jan 4, 2026

robots.txt, ToS violations and the current lawsuits

The intersection of copyright law and artificial intelligence is currently one of the most volatile battlegrounds in technology. Because LLMs require massive datasets to learn, the question of how that data is acquired has triggered a wave of lawsuits that will help define the future of the internet.

For decades, the web operated on a “gentleman’s agreement” known as robots.txt. This text file acts as a signpost, telling automated bots which parts of a site they are permitted to access. However, from a legal standpoint, robots.txt is a voluntary protocol, not a binding contract, and nothing in it technically prevents access. Ignoring it is arguably rude or unethical, but doing so does not by itself constitute a crime or a breach of law.

Terms of Service (ToS), on the other hand, can form legally binding contracts where the user has actually agreed to them. The distinction is critical in current litigation:

- In hiQ Labs v LinkedIn (2019),¹ the Ninth Circuit held, at the preliminary injunction stage, that scraping publicly accessible data likely does not violate the Computer Fraud and Abuse Act (CFAA), even if the scraper ignores a cease-and-desist letter. This suggests that technical barriers (like a login screen) are required to claim “unauthorised access”.
- In X Corp v Bright Data (2024),² a court suggested that contract claims (violating ToS) might be preempted by copyright law if the platform doesn’t own the underlying content.

This creates a patchwork in which the same scraping activity is treated differently across jurisdictions. For instance, while US courts debate the enforceability of ToS against scrapers, the UK’s Information Commissioner’s Office (ICO) has taken a different angle. It emphasises that scraping personal data for AI training must satisfy a “Legitimate Interest” test under the Data Protection Act 2018 (and UK GDPR).³ Unlike the US focus on contract violation, the UK approach suggests that if you cannot show that the scraping is necessary for a legitimate purpose (passing the “Purpose, Necessity, and Balancing” test), you may be in breach of data protection law regardless of what robots.txt says.

Fair use arguments in AI training

The core defence for AI companies (OpenAI, Anthropic, Google) is Fair Use. They argue that training a model on copyrighted works is “transformative” because it doesn’t reproduce the work for its original expressive purpose, but rather analyses it to learn patterns (syntax, logic, facts).

Arguments for Fair Use:
- Transformative Nature: Just as a human student reads a textbook to learn physics (not to copy the textbook), an AI “reads” the internet to learn language.
- Format Shifting: Digitising and processing works for internal analysis has precedent in cases like Google Books.
- No Market Harm: AI models generally do not output exact copies of the training data, so they do not replace the market for the original book or article.

This defence, however, is far from universal. While the US relies on the flexible “Fair Use” doctrine, other jurisdictions like the UK rely on a narrower “Text and Data Mining” (TDM) exception found in section 29A of the Copyright, Designs and Patents Act 1988.⁴ Currently, section 29A only permits TDM for non-commercial research. This rigidity means that commercially training a model in London could be copyright infringement, while the same act in San Francisco might be “Fair Use”.

Recent cases illustrate this jurisdictional complexity. In Getty Images (US), Inc v Stability AI Ltd,⁵ the UK High Court faced a scenario where training largely occurred on US servers, so the direct copyright infringement claims over training fell away in the UK. However, the court did find potential liability for trademark infringement (due to reproduced watermarks) and, notably, ruled that the AI’s model weights themselves were not “infringing copies” of the original images. This nuanced ruling suggests that while the training process might be legally perilous depending on server location, the resulting model may not itself be treated as an infringing copy under UK law.

Technical approaches to respecting copyright in data collection

While the lawyers battle it out, a new technical layer is emerging to give creators control.

User-Agent Blocking: The modern approach to robots.txt involves blocking specific AI bots.
```
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
```
This asks compliant crawlers (OpenAI’s GPTBot, Common Crawl’s CCBot) to stay away, but it does nothing to stop “wild” scrapers that ignore the file or disguise their user agent.
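
To make the voluntary nature of this concrete, here is a minimal sketch of how a well-behaved crawler might check those rules before fetching a page, using Python's standard `urllib.robotparser`. The helper name `is_allowed`, the `ExampleBot` agent, and the example URL are hypothetical; the rules are the ones shown in the snippet above.

```python
from urllib.robotparser import RobotFileParser

# Same rules as the robots.txt snippet above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "ExampleBot" is a hypothetical crawler that the file does not mention.
    for agent in ("GPTBot", "CCBot", "ExampleBot"):
        print(agent, is_allowed(agent, "https://example.com/articles/1"))
```

The check only happens if the crawler chooses to run it, which is exactly why robots.txt works as a signal rather than a barrier.
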
NoAI Meta Tags: HTML tags like <meta name="robots" content="noai"> or <meta name="robots" content="noimageai"> are being proposed as a more granular standard than the binary “allow/disallow” of robots.txt.
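
As a rough illustration of how a scraper could honour these tags, the sketch below scans a page's `<meta name="robots">` elements for the two directives mentioned above, using Python's standard `html.parser`. The class and function names are hypothetical, and since the tags are only a proposal, the comma-separated token handling is an assumption modelled on how existing robots meta directives are usually written.

```python
from html.parser import HTMLParser

# The two opt-out directives mentioned in the text.
AI_OPT_OUT = {"noai", "noimageai"}


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        # Assumption: directives are comma-separated, e.g. "noai, noimageai".
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def ai_opt_outs(html: str) -> set[str]:
    """Return whichever AI opt-out directives the page declares."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives & AI_OPT_OUT
```

A crawler collecting training data could call `ai_opt_outs(page_html)` and skip the page, or only its images, when the result is non-empty.
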
C2PA and Watermarking: The Coalition for Content Provenance and Authenticity (C2PA) is pushing for cryptographic metadata that travels with a file, proving its origin and usage rights. While this doesn’t strictly prevent scraping, it creates a chain of custody that makes infringement easier to prove.
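
The underlying idea, binding a signed claim to the exact bytes of a file, can be sketched in a few lines. This is not the C2PA manifest format: real C2PA manifests use certificate-based signatures and a much richer structure, whereas the sketch below uses a shared-secret HMAC from Python's standard library purely for illustration. The function names and the `creator` and `usage` fields are hypothetical.

```python
import hashlib
import hmac
import json


def _payload(claim: dict) -> bytes:
    # Serialise deterministically so signer and verifier hash identical bytes.
    return json.dumps(claim, sort_keys=True).encode("utf-8")


def make_claim(content: bytes, creator: str, usage: str, key: bytes) -> dict:
    """Build a signed claim tied to the SHA-256 hash of the content."""
    claim = {
        "sha256": hashlib.sha256(content).hexdigest(),
        "creator": creator,  # hypothetical field
        "usage": usage,      # hypothetical field, e.g. "no-ai-training"
    }
    signature = hmac.new(key, _payload(claim), hashlib.sha256).hexdigest()
    return {**claim, "signature": signature}


def verify_claim(content: bytes, claim: dict, key: bytes) -> bool:
    """Check that the claim is untampered and matches this exact content."""
    unsigned = {k: v for k, v in claim.items() if k != "signature"}
    expected = hmac.new(key, _payload(unsigned), hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, str(claim.get("signature", "")))
    hash_ok = unsigned.get("sha256") == hashlib.sha256(content).hexdigest()
    return signature_ok and hash_ok
```

If either the file or the stated usage terms are altered, verification fails, which is the property that makes a provenance record useful as evidence.
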

The resolution of these issues is unlikely to come from a single court case. More likely, it will come from a mix of new “Fair Learning” legislation and a new technical standard that replaces the ageing robots.txt protocol.

¹ hiQ Labs, Inc v LinkedIn Corp, 938 F 3d 985 (9th Cir 2019).
² X Corp v Bright Data Ltd, Case No 3:23-cv-03698 (ND Cal 2024).
³ Data Protection Act 2018, s 2; Regulation (EU) 2016/679 (UK GDPR).
⁴ Copyright, Designs and Patents Act 1988, s 29A.
⁵ Getty Images (US), Inc v Stability AI Ltd [2025] EWHC 2863 (Ch).