Unreliable Data Licensing in AI
A recent Nature Machine Intelligence audit found widespread problems in dataset licensing for AI training. Over 70% of datasets lacked clear license information, and many repositories mislabeled data with more permissive terms than the originals—creating significant copyright and compliance risks.
Researchers developed the Data Provenance Explorer (DPExplorer), a tool and accompanying taxonomy for tracing data provenance across nearly 1,900 fine-tuning datasets. The audit exposed systemic problems in licensing and documentation, and the tool is meant to help legal and compliance professionals trace dataset lineage, verify license attribution, and generate "provenance cards." The researchers' reannotation work showed that two-thirds of Hugging Face dataset licenses were more permissive than the licenses of the original sources.
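For readers who want a concrete picture of what provenance documentation might capture, the sketch below shows a minimal "provenance card" record in Python. It is an illustrative assumption only: the field names, license categories, and mismatch check are invented for exposition and are not drawn from DPExplorer or the underlying paper.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: a minimal "provenance card" record. The field
# names, license categories, and the mismatch check below are hypothetical
# and are not taken from DPExplorer or the Nature Machine Intelligence paper.

# Rough ordering of license categories from most to least restrictive.
_PERMISSIVENESS = {
    "non-commercial": 0,
    "academic-only": 0,
    "unspecified": 1,
    "commercial": 2,
}


@dataclass
class ProvenanceCard:
    dataset_name: str
    repository: str                 # platform hosting the copy that was audited
    stated_license: str             # license label shown by the repository
    original_license: str           # license attached to the original source
    sources: list = field(default_factory=list)  # URLs of upstream sources

    def license_mismatch(self) -> bool:
        """True if the repository's label is more permissive than the original."""
        stated = _PERMISSIVENESS.get(self.stated_license, 1)
        original = _PERMISSIVENESS.get(self.original_license, 1)
        return stated > original


if __name__ == "__main__":
    # Hypothetical example: a copy relabeled for commercial use despite a
    # non-commercial license on the original source.
    card = ProvenanceCard(
        dataset_name="example-finetuning-set",
        repository="example-hub",
        stated_license="commercial",
        original_license="non-commercial",
        sources=["https://example.org/original-dataset"],
    )
    print("License mismatch:", card.license_mismatch())
```

A record along these lines, kept alongside each dataset a business relies on, is the kind of documentation that makes the licensing mismatches described above visible before they become compliance problems.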
Takeaway: Businesses and counsel should treat dataset licensing as a material risk. Favor clearly documented and commercially permissive data, require provenance documentation, and contractually address licensing uncertainties. Regulators are expected to demand higher standards of transparency, making proactive governance a competitive advantage.
A large-scale audit of dataset licensing and attribution in AI
Longpre, S., Mahari, R., Chen, A., Obeng-Marnu, N., Sileo, D., Brannon, W., Muennighoff, N., Khazam, N., Kabbara, J., Perisetla, K., Wu, X., Shippole, E., Bollacker, K., Wu, T., Villa, L., Pentland, S., & Hooker, S. (2024). A large-scale audit of dataset licensing and attribution in AI. Nature Machine Intelligence, 6(8), 975–987. https://doi.org/10.1038/s42256-024-00878-8
_______________________________
Disclaimer: This blog post is provided for informational purposes only and does not constitute legal advice. The linked article is the work of its respective author(s) and publication, with full attribution provided. BAYPOINT LAW is not affiliated with the author(s) or the publication; the article is shared solely as a matter of professional interest.