Resources
We recommend the following general resources on machine learning and data-centric AI. For additional resources on the topics covered in lectures, see the references in the individual lecture notes.
Open-Source Software Tools for Data-Centric AI
- cleanlab - automatically detect problems in a dataset to facilitate ML with messy, real-world data
- refinery - assess and maintain natural language data
- great expectations - validate, document, and profile data for quality testing
- ydata-profiling - generate summary reports of tabular datasets stored as pandas DataFrames
- cleanvision - automatically detect low-quality images in computer vision datasets
- albumentations - data augmentation for computer vision
- label-studio - interfaces to label and annotate data for many ML tasks
- llamaindex - a data framework for LLM applications such as Retrieval-Augmented Generation (RAG)
- dspy - algorithmically optimize LLM prompts and bootstrap data
Short Articles
- Unbiggen AI
- Andrew Ng Launches A Campaign For Data-Centric AI
- Tips for a Data-Centric AI Approach
- Data-Centric Approach vs Model-Centric Approach in Machine Learning
- A Linter for ML Datasets
- Handling Mislabeled Data to Improve Your Model
- Catch bad data in LLM datasets
Papers
- A Data Quality-Driven View of MLOps
- Advances in Exploratory Data Analysis, Visualisation and Quality for Data Centric AI Systems
- A Survey on Data Selection for Language Models
Books
- Human-in-the-Loop Machine Learning: Active Learning and Annotation for Human-centered AI
- Best Practices in Data Cleaning: A Complete Guide to Everything You Need to Do Before and After Collecting Your Data