Sourcing Authentic and Reliable Data for AI Development is Problematic

Hundreds of millions of users interact with generative AI models every week. These models are trained, in a frenzied race, on diverse compilations of data scraped from the web or synthetically generated. The resulting massive collections of loosely structured data have come with consequences. The commentators in the article discussed below examine practices such as indiscriminately sourcing and bundling data without tracking or vetting its original sources, creator intentions, or copyright and licensing status. This lack of transparency has exposed developers and deployers to ethical and legal challenges.

Training data, prompt inputs, retrieval-augmented generation (RAG) data, and fine-tuning data are among the primary sources of data ingested by generative AI models. Although other methods of introducing data to models exist, the article referenced below focuses on these. The broader topic of where data originates on its journey through an AI system is referred to as data provenance, a discipline that focuses on the origin, authenticity, and history of data, answering "how and why" it was created. Data lineage, closely related but not the same, tracks the movement and transformation of data across systems, illustrating its entire lifecycle and answering "where and how" it flows. Provenance provides context about the data itself and serves as the starting point for documenting lineage, while lineage traces the path data takes through the broader data landscape.
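To make the provenance/lineage distinction concrete, the sketch below shows one way such records might be structured in practice. The field names and class layout are purely illustrative assumptions for this post; they do not reflect any published standard or the scheme proposed in the article.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Illustrative origin record: answers 'how and why' the data was created."""
    source_url: str           # where the data originated (assumed field)
    creator: str              # who produced it
    license: str              # licensing status at collection time
    collection_method: str    # e.g. "web scrape" or "synthetic generation"
    consent_documented: bool  # whether creator consent was recorded

@dataclass
class LineageStep:
    """Illustrative flow record: answers 'where and how' the data moves."""
    system: str          # e.g. "dedup-pipeline" or "fine-tuning-job"
    transformation: str  # what was done to the data at this step

@dataclass
class DatasetEntry:
    """Pairs origin context (provenance) with the path taken (lineage)."""
    provenance: ProvenanceRecord
    lineage: list[LineageStep] = field(default_factory=list)

# Hypothetical example: a scraped item whose consent status was never vetted.
entry = DatasetEntry(
    provenance=ProvenanceRecord(
        source_url="https://example.com/article",
        creator="Example Author",
        license="CC-BY-4.0",
        collection_method="web scrape",
        consent_documented=False,
    ),
)
entry.lineage.append(LineageStep("dedup-pipeline", "near-duplicate removal"))
entry.lineage.append(LineageStep("fine-tuning-job", "included in training set"))
```

The point of the sketch is that the provenance record is fixed at creation, while the lineage list grows as the data moves through each downstream system.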

The article presents a compelling critique of current data vetting and documentation practices in AI. It highlights cases such as the LAION-5B dataset issues and underscores the need for provenance tracking, metadata transparency, and standardized documentation to address PII leaks, licensing conflicts, and ethical risks. Data lineage is covered in other curated articles in this blog.

Data Authenticity, Consent, & Provenance for AI Are All Broken

Citation: Shayne Longpre et al., Data Authenticity, Consent, & Provenance for AI Are All Broken: What Will It Take to Fix Them?, 41 Proc. Mach. Learning Rsch. 32711 (2024), https://arxiv.org/abs/2404.12691.

________________________________

Disclaimer: This blog post is provided for informational purposes only and does not constitute legal advice. The linked article is the work of its respective author(s) and publication, with full attribution provided. BAYPOINT LAW is not affiliated with the author(s) or publication; it is shared solely as a matter of professional interest.
