A curated resource hub to help newcomers and researchers navigate Data Foundations of AI.

📝 Survey Papers

A Survey of Data Attribution: Methods, Applications, and Evaluation in the Era of Generative AI

Comprehensive overview of data attribution techniques for modern generative AI systems.

Large Language Models for Data Annotation and Synthesis: A Survey

Explores how LLMs can be leveraged for automated data annotation and synthetic data generation.

A Survey on Data Selection for Language Models

In-depth analysis of data selection strategies and their impact on language model performance.

Data-centric Artificial Intelligence: A Survey

An early survey covering the landscape of data-centric AI approaches and methodologies.

Training Data Influence Analysis and Estimation: A Survey

Systematic review of methods for analyzing and estimating the influence of training data.

📚 Tutorials

Explain AI Models: Methods and Opportunities in Explainable AI, Data-Centric AI, and Mechanistic Interpretability

Tutorial covering the intersection of explainable AI, data-centric approaches, and mechanistic interpretability.

NeurIPS 2025

Advancing Data Selection for Foundation Models: From Heuristics to Principled Methods

Tutorial on modern approaches to data selection for training foundation models, from simple heuristics to principled optimization methods.

NeurIPS 2024

Data Attribution at Scale

Practical guide to implementing data attribution methods for large-scale machine learning systems.

ICML 2024

Foundations of Data-Efficient Learning

Tutorial offering a unifying view of theoretically rigorous approaches to data-efficient machine learning.

ICML 2024

Data Contribution Estimation for Machine Learning

Tutorial on data contribution estimation (DCE) methods for machine learning and natural language processing.

NeurIPS 2023

The Economics of Data and Machine Learning

Tutorial on the value of data from statistical and economic perspectives, covering how to price data or information effectively and how to collect data from economic agents.

AAAI 2023

🛠️ Software Libraries

Ray Data

Industry-scale data processing tool for distributed machine learning pipelines.

Data Processing

NVIDIA NeMo Curator

Data curation software for processing high-quality training datasets.

Data Curation

Data-Juicer

One-stop system to process text and multimodal data for and with foundation models.

Data Curation

DataTrove

HuggingFace's data curation library for processing and preparing large-scale text data.

Data Curation

RedPajama-Data

Open-source data curation pipeline used to create the RedPajama dataset.

Data Curation

dattri

A library for efficient training data attribution.

Data Attribution

📊 Datasets and Benchmarks

OpenDataArena

Platform for evaluating and comparing data curation strategies across different domains.

DataComp

Benchmark for evaluating data curation methods in the context of language models and vision-language models.

DataPerf

Benchmark suite for measuring data quality and its impact on model performance.

DynaBench

Dynamic benchmark platform that continuously evolves as models improve.

OpenThoughts

Open dataset and data pipeline for reasoning capabilities in language models.

OpenDataVal

Benchmark for evaluating data valuation methods across diverse machine learning tasks.

DATE-LM

Benchmark for evaluating data attribution techniques in language models.

📖 Other Educational Materials

Reading Lists

Awesome ML Data Quality Papers

Curated collection of research papers on machine learning data quality and management.

GitHub

Large Language Models for Data Annotation and Synthesis

Curated resources on using LLMs for data annotation tasks and synthetic data generation.

GitHub

Synthetic Data of LLMs, by LLMs, for LLMs

Collection of papers and resources focused on LLM-generated synthetic data and its applications.

GitHub

Seminars

Summer of Data Seminar by Datology AI

A seminar series featuring talks on interesting research in data and pretraining.

Seminar Series

Reading Groups

Data Attribution Reading Group

A reading group on data attribution research, held in Summer 2024.

Reading Group

Courses

Data-centric AI course from MIT

An intro-level mini-course covering fundamental concepts and practical approaches to data-centric AI.

Mini-course

🎓 Events

DATA-FM @ ICLR 2026

Rio de Janeiro, Brazil • Apr 26th/27th, 2026

Workshop on data-centric approaches for foundation models.

Submission Due: Feb 6th, 2026

Curated Data for Efficient Learning @ ICCV 2025

Honolulu, HI, US • Oct 20th, 2025

Workshop on data curation strategies for efficient visual learning.

Incentives for Collaborative Learning and Data Sharing

Chicago, IL, US • Aug 13th–15th, 2025

TTIC summer workshop on incentivizing data sharing in collaborative learning.

DataWorld @ ICML 2025

Vancouver, Canada • Jul 19th, 2025

Workshop exploring the role of data in modern machine learning.

DATA-FM @ ICLR 2025

Singapore • Apr 28th, 2025

Workshop on data-centric methods for foundation models.

SynthData @ ICLR 2025

Singapore • Apr 27th, 2025

Workshop on synthetic data generation and its applications.

ATTRIB @ NeurIPS 2024

Vancouver, Canada • Dec 14th, 2024

Workshop on data attribution methods at scale.

DPFM @ ICLR 2024

Vienna, Austria • May 11th, 2024

Workshop on data-centric approaches to foundation models.

ATTRIB @ NeurIPS 2023

New Orleans, LA, US • Dec 15th, 2023

Workshop on attribution methods at scale.

DMLR @ ICML 2023, ICLR 2024, ICML 2024

Workshop series on data-centric machine learning research.

DataPerf @ ICML 2022

Baltimore, MD, US • Jul 22nd, 2022

Workshop introducing the DataPerf benchmark suite.

DCAI @ NeurIPS 2021

Online • Dec 14th, 2021

Workshop on data-centric AI.