Written by PEER DATA
Last quarter, the head of a $1.8 million-a-year alternative data shop opened his Bloomberg terminal and felt his stomach drop. A new LLM-powered retail trading app—one that had never signed a license—was publishing a daily “Global Container Sentiment Index” that correlated 0.96 with his flagship product. Same turning points, same magnitude, same lead time over the Baltic Dry Index.
He called his lawyer. The lawyer called me. The punchline: the app wasn’t copying rows. It had simply distilled the economic soul of his dataset out of public text and a few clever prompts. Welcome to late 2025.
Traditional IP law is still doing heavy lifting. The U.S. Copyright Office’s 2025 report and the Northern District of California’s Bartz v. Anthropic decision drew a bright line: training on public price histories or news is almost certainly fair use, but creatively selected, cleaned, weighted, and analyzed datasets remain protected expression. Indexes, proprietary scores, and curated signals are compilations – classic copyright territory.
If someone lifts your exact ESG controversy score or your satellite-derived refinery utilization table, you still win in court tomorrow. That hasn’t changed.
Modern open-weight models (Llama-405B, DeepSeek-V3, etc.) can reproduce the decision boundary of your dataset even when trained on completely different text. Feed the model earnings-call transcripts plus macroeconomic releases and it will re-derive your earnings-surprise signal with shocking fidelity. No rows were copied; the economics were. Courts have no framework for this yet.
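The mechanics are easy to sketch. Below is a toy surrogate fit that assumes, purely for illustration, that the proprietary signal is largely a weighted blend of public proxies and that a cloner has obtained a small leaked sample of its output values. Every name and number here is invented; no real dataset or vendor is implied.

```python
import numpy as np
from numpy.linalg import lstsq

rng = np.random.default_rng(0)

# Public proxies: e.g. transcript sentiment, macro surprises, news volume.
X_public = rng.normal(size=(5000, 3))

# The licensed signal is, economically, a weighted blend of those same
# proxies plus idiosyncratic noise. The cloner never sees these weights.
true_w = np.array([0.6, -0.3, 0.8])
signal = X_public @ true_w + rng.normal(scale=0.1, size=5000)

# A cloner only needs the signal's *outputs* at a handful of dates
# (leaked samples, published index levels) to fit a surrogate.
n_leaked = 200
w_hat, *_ = lstsq(X_public[:n_leaked], signal[:n_leaked], rcond=None)

# The surrogate then tracks the original everywhere, not just on the leak.
clone = X_public @ w_hat
corr = np.corrcoef(clone, signal)[0, 1]
print(f"clone/original correlation: {corr:.3f}")
```

The point of the sketch: the surrogate only needs enough leaked outputs to pin down the weights, not the rows themselves, which is exactly why "no rows were copied" is cold comfort.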
In September 2025, a frontier model correctly predicted 18 of the 20 additions to the MSCI World ESG Leaders Index two weeks before the official announcement. It did the same for a boutique volatility index that charges $400k a seat. The inputs? Public filings, news, and Reddit sentiment. The output destroyed the index provider’s rebalancing alpha. This is now table stakes for any model with 2025-level reasoning.
At least four startups (two in Singapore, one in London, one hiding in Delaware) openly market “synthetic replicas” of expensive datasets. They train on public proxies, validate correlation against leaked samples, and sell the clone at 15% of the original price. Their defense: no copyrighted rows touched, purely new expression. The EU’s 2025 Database Directive consultation and the Copyright Office’s synthetic-data study won’t report until mid-2026. Until then, it’s open season.
The winners in 2025 aren’t fighting yesterday’s war. They’re shipping four contract clauses, two pricing innovations, and one technical stack.
(Yes, copy-paste them. Everyone else already has.)
- “No Distillation or Recreation” – Prohibits use of the data to train or fine-tune any model that replicates the material economic behavior of the dataset.
- “No Synthetic Substitute” – Bans creation or distribution of any alternative dataset whose primary commercial purpose is to substitute for the licensed data.
- “Watermark Enforcement” – Requires licensees to honor embedded statistical watermarks and poison pills (more on that below).
- “Model Audit Right” – On 30 days’ notice, you can run a short battery of held-out prompts through their system to detect leakage. (Surprisingly, most big consumers are agreeing to this.)
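A model audit right like the one above can be exercised with something as simple as a canary battery: hold back a few values that exist only inside the licensed dataset, query the licensee’s system for them, and flag answers that land too close to the truth. The sketch below is illustrative, not any provider’s actual protocol; the canary keys, the `query_model` interface, and the tolerance are all made up.

```python
# Held-out canary values: licensed but never published anywhere public.
CANARIES = {
    "refinery_util_2025w31_site9": 73.42,
    "esg_ctrl_score_ACME_2025q2": 6.87,
    "container_sent_2025_08_14": -0.312,
}
TOLERANCE = 0.01  # relative error below this counts as a reproduction


def leakage_rate(query_model) -> float:
    """Fraction of canaries the licensee's model reproduces."""
    hits = 0
    for key, truth in CANARIES.items():
        answer = query_model(key)
        if answer is not None and abs(answer - truth) <= TOLERANCE * abs(truth):
            hits += 1
    return hits / len(CANARIES)


def leaky_model(key):
    """Stand-in for a model that memorized the licensed rows."""
    return CANARIES[key]


def clean_model(key):
    """Stand-in for a model that only ever saw public data."""
    return 0.0


print(leakage_rate(leaky_model))  # 1.0 -> audit fails
print(leakage_rate(clean_model))  # 0.0 -> audit passes
```

Thirty days’ notice, a few dozen prompts, a simple threshold: that is why most big consumers are willing to agree to it.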
Smart providers now quote two prices:
- Standard license: same terms as 2023.
- “AI-Unrestricted” license: 3–7× higher, often with revenue-share on downstream alpha.
The surprise? Demand for the expensive tier is through the roof. Hedge funds would rather pay 5× than build it themselves and face your lawyers later.
- Dataset fingerprinting: embed faint, robust statistical signatures that survive distillation (2025 papers out of Stanford and Tel Aviv made this practical).
- Provenance ledgers: platforms like PEER DATA’s DBOR 2.0, Databricks Clean Rooms, and two new blockchain-ish startups now give you cryptographically provable usage logs.
- Rate-limited APIs with per-token provenance headers: you can literally see which model version consumed row 319,402 at 03:14 UTC.
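To give a feel for how a statistical fingerprint can survive redistribution, here is a deliberately simplified keyed-perturbation scheme. It is not the Stanford or Tel Aviv method referenced above, and it assumes the provider retains the unmarked originals to compute residuals; treat it as a toy, not a design.

```python
import numpy as np


def watermark(values: np.ndarray, key: int, eps: float = 1e-3) -> np.ndarray:
    """Add a faint keyed +/-eps perturbation: invisible to users of the data,
    detectable only by whoever holds the key."""
    pattern = np.random.default_rng(key).choice([-1.0, 1.0], size=len(values))
    return values + eps * pattern


def detect(suspect: np.ndarray, original: np.ndarray, key: int) -> float:
    """Correlate the residual against the keyed pattern; near 0 if unmarked
    or marked under a different key, near 1 if our mark is present."""
    pattern = np.random.default_rng(key).choice([-1.0, 1.0], size=len(suspect))
    residual = suspect - original
    denom = np.linalg.norm(residual) * np.linalg.norm(pattern)
    return float(residual @ pattern / denom) if denom else 0.0


rng = np.random.default_rng(7)
clean = rng.normal(size=10_000)
marked = watermark(clean, key=42)

print(f"right key: {detect(marked, clean, key=42):.3f}")   # ~ 1.0
print(f"wrong key: {detect(marked, clean, key=99):.3f}")   # ~ 0.0
```

The design choice that matters: the perturbation is far below the data’s natural noise floor, so it does not move the signal economically, yet the keyed correlation test concentrates all of it into one statistic a court can understand.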
The EU AI Act’s systemic-risk model obligations kick in August 2026 – every 70B+ model deployed in Europe will have to disclose “significant datasets” and submit to third-party auditing. The SEC’s pending Reg ATS amendment quietly includes new disclosure rules for alternative-data usage in automated trading systems. For once, the regulators are behind you, not the other way around.
- Do nothing → your signal becomes the open-source version of itself by March.
- Over-lawyer and ban everything → you miss every AI partnership and watch revenues flatline.
- Ship the new contracts + tech stack before Christmas → you turn the scariest technology in finance into your highest-margin customer segment.
Copyright still protects the rows. The new playbook protects the economics.
The $1.8 million signal from the opening paragraph? The provider shipped a new license with the four clauses and watermarking in October. The trading app signed it last week—at 6× the old price.
Your move.