A curated resource hub to help newcomers and researchers navigate Data Foundations of AI.
Survey Papers
A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI
Comprehensive overview of data attribution techniques for modern generative AI systems.
Large Language Models for Data Annotation and Synthesis: A Survey
Explores how LLMs can be leveraged for automated data annotation and synthetic data generation.
A Survey on Data Selection for Language Models
In-depth analysis of data selection strategies and their impact on language model performance.
Data-centric Artificial Intelligence: A Survey
An early survey covering the landscape of data-centric AI approaches and methodologies.
Training Data Influence Analysis and Estimation: A Survey
Systematic review of methods for analyzing and estimating the influence of training data.
Tutorials
Explain AI Models: Methods and Opportunities in Explainable AI, Data-Centric AI, and Mechanistic Interpretability
Tutorial covering the intersection of explainable AI, data-centric approaches, and mechanistic interpretability.
NeurIPS 2025
Advancing Data Selection for Foundation Models: From Heuristics to Principled Methods
Tutorial on modern approaches to data selection for training foundation models, from simple heuristics to principled optimization methods.
NeurIPS 2024
Data Attribution at Scale
Practical guide to implementing data attribution methods for large-scale machine learning systems.
ICML 2024
Foundations of Data-Efficient Learning
Tutorial offering a unifying view of theoretically rigorous approaches to data-efficient machine learning.
ICML 2024
Data Contribution Estimation for Machine Learning
Tutorial on data contribution estimation (DCE) methods for machine learning and natural language processing.
NeurIPS 2023
The Economics of Data and Machine Learning
Tutorial on the value of data from statistical and economic perspectives, covering how to price data or information effectively and how to collect data from economic agents.
AAAI 2023
Software Libraries
Ray Data
Industry-scale data processing tool for distributed machine learning pipelines.
Data Processing
NVIDIA NeMo Curator
Data curation software for building high-quality training datasets.
Data Curation
Data-Juicer
One-stop system to process text and multimodal data for and with foundation models.
Data Curation
DataTrove
HuggingFace's data curation library for processing and preparing large-scale text data.
Data Curation
RedPajama-Data
Open-source data curation pipeline used to create the RedPajama dataset.
Data Curation
Datasets and Benchmarks
OpenDataArena
Platform for evaluating and comparing data curation strategies across different domains.
DataComp
Benchmark for evaluating data curation methods in the context of language models and vision-language models.
DataPerf
Benchmark suite for measuring data quality and its impact on model performance.
DynaBench
Dynamic benchmark platform that continuously evolves as models improve.
OpenThoughts
Open dataset and data pipeline for training reasoning capabilities in language models.
OpenDataVal
Benchmark for evaluating data valuation methods across diverse machine learning tasks.
DATE-LM
Benchmark for evaluating data attribution techniques in language models.
Other Educational Materials
Reading Lists
Awesome ML Data Quality Papers
Curated collection of research papers on machine learning data quality and management.
GitHub
Large Language Models for Data Annotation and Synthesis
Curated resources on using LLMs for data annotation tasks and synthetic data generation.
GitHub
Synthetic Data of LLMs, by LLMs, for LLMs
Collection of papers and resources focused on LLM-generated synthetic data and its applications.
GitHub
Seminars
Summer of Data Seminar by Datology AI
A seminar series featuring talks on current research in data and pretraining.
Seminar Series
Reading Groups
Data Attribution Reading Group
A reading group on data attribution research, held in Summer 2024.
Reading Group
Courses
Data-centric AI course from MIT
An intro-level mini-course covering fundamental concepts and practical approaches to data-centric AI.
Mini-course
Events
DATA-FM @ ICLR 2026
Rio de Janeiro, Brazil • Apr 26th/27th, 2026
Workshop on data-centric approaches for foundation models.
Submission Due: Feb 6th, 2026
Curated Data for Efficient Learning @ ICCV 2025
Honolulu, HI, US • Oct 20th, 2025
Workshop on data curation strategies for efficient visual learning.
Incentives for Collaborative Learning and Data Sharing
Chicago, IL, US • Aug 13th–15th, 2025
TTIC summer workshop on incentivizing data sharing in collaborative learning.
DataWorld @ ICML 2025
Vancouver, Canada • Jul 19th, 2025
Workshop exploring the role of data in modern machine learning.
DATA-FM @ ICLR 2025
Singapore • Apr 28th, 2025
Workshop on data-centric methods for foundation models.
SynthData @ ICLR 2025
Singapore • Apr 27th, 2025
Workshop on synthetic data generation and its applications.
ATTRIB @ NeurIPS 2024
Vancouver, Canada • Dec 14th, 2024
Workshop on data attribution methods at scale.
DPFM @ ICLR 2024
Vienna, Austria • May 11th, 2024
Workshop on data-centric approaches to foundation models.
ATTRIB @ NeurIPS 2023
New Orleans, LA, US • Dec 15th, 2023
Workshop on attribution methods at scale.
DMLR @ ICML 2023, ICLR 2024, ICML 2024
Workshop series on data-centric machine learning research.
DataPerf @ ICML 2022
Baltimore, MD, US • Jul 22nd, 2022
Workshop introducing the DataPerf benchmark suite.