The “Clean Data” Debate in Generative AI
In the last 30 days, two landmark copyright decisions came out of the Northern District of California and, one week later, the European Parliament released a 175‑page study on generative‑AI training. Both say the same thing in different ways: right now there is no recognised market for licensing books, images or music as AI‑training fodder. U.S. judges treat that vacuum as proof of “no market harm,” while EU policy‑makers call it a market failure that must be fixed.
Judge Chhabria (Kadrey v. Meta): “Llama is not capable of generating enough text from the plaintiffs’ books to matter, and the plaintiffs are not entitled to the market for licensing their works as AI training data.” [1]
Judge Alsup (Bartz v. Anthropic): “A market could develop … Even so, such a market … is not one the Copyright Act entitles Authors to exploit.” [2]
EU Parliament study: Current EU law “leaves creators without any enforceable mechanism to authorise, deny, or license the use of their works for AI training under negotiated terms.” [3]
June 2025 Northern District of California Decisions
Bartz v. Anthropic (Alsup J., 23 June 2025)
Three authors said Anthropic scanned millions of print and pirate‑site books and used them to train Claude. Alsup found that the training copies were fair use because the model transforms whole books into “statistical abstractions” that never substitute for the originals. On market harm (Factor 4) he accepted, for argument’s sake, that a licensing market could emerge, but held it is “not one the Copyright Act entitles Authors to exploit.” The only infringement he left standing is Anthropic’s internal “pirated library,” which will go to trial on damages. In short: training is safe (for now), while hoarding pirated PDFs is not. [4]
Kadrey v. Meta (Chhabria J., 25 June 2025)
Thirteen novelists sued Meta for scraping “shadow libraries” to build Llama. The court said Llama’s weights are “highly transformative”: they store patterns, not expressive chunks. Meta’s expert ran “adversarial prompting” experiments and could not coax any Llama model to emit more than 50 consecutive tokens (≈ 50 words) from any plaintiff’s book. The plaintiffs’ own expert agreed that Llama could not reproduce “any significant percentage” of the texts.[5]
Meta also submitted testimony that none of the 13 authors has ever licensed – or even been asked to license – a book for AI-training purposes. If no market exists, Meta’s unlicensed use cannot depress it and there is nothing for the copyright holder to lose.
On Factor 4 Chhabria called the licensing‑market theory a “clear loser,” writing that the authors “are not entitled to the market for licensing their works as AI training data.” He granted Meta summary judgment on training copies but flagged that better evidence of market dilution could swing future cases.
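The kind of verbatim‑reproduction test Meta’s expert ran can be sketched in a few lines: given a model output and a source text, measure the longest run of consecutive tokens they share. Everything below is an illustrative simplification (whitespace tokenisation, a hypothetical helper name); real audits would use the model’s own tokenizer and far larger samples.

```python
def longest_shared_run(output: str, source: str) -> int:
    """Length, in whitespace tokens, of the longest run of consecutive
    tokens from `output` that also appears verbatim in `source`."""
    toks = output.split()
    # Pad with spaces so matches respect token boundaries
    hay = " " + " ".join(source.split()) + " "
    best, n = 0, len(toks)
    for i in range(n):
        j = i + best  # runs shorter than the current best can't win
        while j < n and f" {' '.join(toks[i:j + 1])} " in hay:
            j += 1
        best = max(best, j - i)
    return best

# Under the 50-consecutive-token criterion discussed in the case,
# an output would "leak" if:
#   longest_shared_run(model_output, book_text) >= 50
```

A real leakage audit would run this over thousands of adversarially prompted generations per book and report the maximum run found.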
Both judges imposed an empirical burden on plaintiffs: show a functioning or nascent licensing market. Until such a market exists – with prices, contracts and measurable revenue – AI models enjoy a sizeable fair‑use advantage.
EU’s Generative AI and Copyright: Training, Creation, Regulation
The Parliament’s Justice Committee study, published on 30 June 2025, recommended some form of remuneration for authors whose works are used for AI training. However, the Committee, like the US courts, recognised that such economic rights are currently unenforceable, because authors have no practical way to license their work for AI training.
In both cases – US and EU – it comes down to whether a licensing market for authors’ works vis‑à‑vis AI training models exists or not. The US courts will wait for more evidence from future plaintiffs. As for the EU, the Committee proposed proactive steps to establish such a market and build a licensing channel at the legislative level.
If Europe builds a paid licensing channel, U.S. plaintiffs could soon point to that very market to show cognizable harm, erasing the defense advantage that Meta and Anthropic just enjoyed.
What Happens when the “Licensing Void” Fills
Models that can prove “clean data” will cost more
Adobe’s Firefly image tool promotes itself as “commercially safe” because it was trained only on Adobe Stock, public‑domain and openly licensed pictures.[6] On the supply side, Photobucket is asking 5 cents to $1 per photo to license its 13‑billion‑image archive – content that used to sit online for free.[7]
Public‑domain or Creative Commons Attribution licence (CC BY) material remains free, but any option that comes with a clear paper trail will carry a premium. After all, it protects both users and developers from potential lawsuits.
A two‑tier data economy will settle in
Tier I: premium, traceable datasets such as newspapers, professional photo libraries and industry research, all licensed through collective deals or pay‑per‑asset APIs.
Reddit has already priced its user posts at ≈ $60 million a year for Google. [8]
Tier II: public‑domain text, Wikipedia and government data, still free and legal to use in the US under fair‑use rules.
The US will likely keep its fair‑use baseline, but over time, prices in the premium lane will settle into global “reference rates” (think “$x per photo” or “$y per song”) for high‑value sectors such as images, music and specialist text.
Data marketplaces will feel like app stores
Cloud vendors already host one‑click catalogues of “ready‑to‑license” filesets. Amazon Web Services Data Exchange lists 3,000‑plus commercial and 1,000‑plus free datasets covering everything from news wires to medical scans.[9]
As deals scale, AI model builders will bolt on leakage‑testing dashboards and provenance logs, both to reassure investors and to comply with the EU AI Act’s requirement that providers of general‑purpose models publish a summary of their training data.[10]
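A minimal provenance record of the sort such logs might contain could look like the sketch below. Every field name here is an assumption for illustration, not something the AI Act or any vendor prescribes.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class ProvenanceRecord:
    dataset_id: str   # internal identifier (illustrative)
    source_url: str   # where the data was acquired
    license: str      # e.g. "CC-BY-4.0", "commercial", "public-domain"
    acquired: str     # ISO date the data was obtained
    sha256: str       # content hash, so auditors can verify integrity

content = b"...raw dataset bytes..."  # placeholder payload
record = ProvenanceRecord(
    dataset_id="news-corpus-001",
    source_url="https://example.com/archive",
    license="commercial",
    acquired="2025-06-30",
    sha256=hashlib.sha256(content).hexdigest(),
)
print(json.dumps(asdict(record), indent=2))
```

Aggregating such records per model run is one plausible way to generate the public training‑data summary the Act calls for.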
Practical take-aways for stakeholders
Data is becoming a tradeable commodity. Until recently, tech companies scraped whatever text or images they could find on the open web for free. Now, they are being pushed to pay for clean, permission-based datasets.
The EU is signaling that silence will no longer equal consent. Successful AI products in the next decade will be those built on traceable, fairly acquired data streams.
The era of free-for-all scraping is giving way to a regulated data-supply chain where proof of origin and proof of non-harm will decide who can train, and at what price.
Specialized data brokers are springing up. Think of them as “Spotify” for training data: platforms where you can subscribe to texts, images or medical scans. This will make buying data easier and give creators a single place to license their work. But it will also raise the price of high‑quality data collections for end users.
The AI training‑dataset market, valued at USD 2.6 billion in 2024, is projected to reach USD 8.6 billion by 2030, expanding at a CAGR of 21.9 %.[11]
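As a quick sanity check on those figures (my arithmetic, not the report’s), the implied growth rate can be recomputed from the endpoints:

```python
# Endpoints from the cited Grand View Research forecast
start, end = 2.6, 8.6            # USD billions, 2024 and 2030
years = 2030 - 2024              # 6-year horizon
cagr = (end / start) ** (1 / years) - 1
print(f"implied CAGR: {cagr:.1%}")  # roughly 22%, in line with the cited 21.9%
```

The small gap from 21.9 % is rounding in the published start and end values.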
Everyday AI tools like search assistants, email co‑pilots and simple image generators will likely remain low‑cost or free. But “niche” models – those offering medical advice, legal research or high‑end graphics – will come with a higher price tag, and they’ll be upfront about why: “built on licensed data, with $50K indemnification.” You’ll also start seeing little provenance badges everywhere, so you’ll always know which outputs are safe to reuse in your own projects.
The June 2025 US decisions didn’t green‑light endless scraping; they just pointed out that authors couldn’t show any market harm yet, since no market exists. Then the EU Parliament study proposed creating that market and putting a price on data.
Even if full court or legislative backing never materialises, the magnitude of the AI boom and its dependency on quality data create so much upside for authors that we may see siloed platforms emerge that hold the keys to quality data, acting as gatekeepers that keep AI agents and their APIs out unless a subscription is paid. If this happens, quality data may gradually drain from the public domain into pockets of pre‑licensed platforms.
[1] https://law.justia.com/cases/federal/district-courts/california/candce/3:2023cv03417/415175/598/
[2] https://cases.justia.com/federal/district-courts/california/candce/3%3A2024cv05417/434709/231/0.pdf
[3] https://www.europarl.europa.eu/RegData/etudes/STUD/2025/774095/IUST_STU%282025%29774095_EN.pdf
[4] https://cases.justia.com/federal/district-courts/california/candce/3%3A2024cv05417/434709/231/0.pdf
[5] https://media.npr.org/assets/artslife/arts/2025/order1.pdf
[6] https://helpx.adobe.com/firefly/get-set-up/learn-the-basics/adobe-firefly-faq.html
[7] https://www.diyphotography.net/photobucket-to-license-13-billion-images-for-ai-training/
[8] https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/
[9] https://registry.opendata.aws/
[10] https://www.euaiact.com/key-issue/5; https://www.europarl.europa.eu/topics/en/article/20230601STO93804/eu-ai-act-first-regulation-on-artificial-intelligence
[11] https://www.grandviewresearch.com/industry-analysis/ai-training-dataset-market