
Software Engineering Daily

Data-Centric AI with Alex Ratner

Thu Jul 20 2023
Machine Learning, AI, Data Labeling, Programmatic Labeling, Labeling Functions, Model Fine-tuning, Data Scale, AI Development

Description

This episode explores the challenges of, and solutions to, data labeling for machine learning and AI. It covers why labeled data matters, the limitations of manual labeling, and the benefits of programmatic labeling with tools like Snorkel AI. The episode also discusses labeling functions, how they are denoised and combined, and the fine-tuning and specialization of models. It highlights the impact of data scale on model performance and the future of AI development with open source models and data-centric operations, concluding with thoughts on the art of programmatic approaches and the value of AI across industries.

Insights

Programmatic Labeling with Snorkel AI

Snorkel AI offers a programmatic data labeling and MLOps platform that enables faster and more accurate ML training. It has helped companies reduce manual labeling time and improve accuracy.

Challenges in Data Labeling

Data labeling is time-consuming and often needs to be redone. Programmatic solutions like Snorkel Flow provide a more efficient workflow for data development.

Labeling Functions for Efficient Data Labeling

Labeling functions raise the level of abstraction, allowing many data points to be labeled efficiently at once. They can be expressed as simple functions or heuristics, or can draw on external resources.

Denoising and Combining Labeling Functions

Labeling functions can be noisy and brittle, but they can be denoised and combined into a clean training set for machine learning models. Large language models can also be used as labeling functions.

Fine-tuning and Specialization of Models

Fine-tuning models with labeled data is crucial for achieving high accuracy. Specialist models outperform generalist models, and models can be further instructed through prompting.

Impact of Data Scale on Model Performance

Data scale has a significant impact on model performance. Closed models may continue to dominate generalist consumer use cases, while open source models customized with enterprise data may win in specialist enterprise use cases.

The Future of AI Development

Enterprises are increasingly customizing open source models with their unique data. Data-centric operations like sampling, filtering, cleaning, and curating play a crucial role in improving model performance.

Conclusion and Final Thoughts

Programmatic approaches to data labeling are still something of an art, but they make it possible to craft the right data mixture for ML training. Snorkel works with top companies across industries and focuses on proving value on specific use cases.

Chapters

  1. Introduction to Machine Learning and AI
  2. Challenges in Data Labeling
  3. Programmatic Labeling and Labeling Functions
  4. Denoising and Combining Labeling Functions
  5. Fine-tuning and Specialization of Models
  6. Impact of Data Scale on Model Performance
  7. The Future of AI Development
  8. Conclusion and Final Thoughts

Introduction to Machine Learning and AI

00:00 - 07:26

  • Machine learning and AI are used by companies to support real-time product offerings, prevent fraud, and drive innovation.
  • Training models requires labeled data in a form machines can digest.
  • As data volumes increase, the challenge of labeling all the data correctly also increases.
  • Snorkel AI is a programmatic data labeling and MLOps platform that enables faster and more accurate ML training.
  • Alex Ratner is one of the founders of Snorkel AI and shares the newest developments in this episode.
  • Data-centric AI focuses on making data operations first-class and more programmatic rather than manual.
  • Programmatic labeling makes the process of labeling data for machine learning look more like software development.
  • Snorkel AI has helped Wayfair reduce manual labeling time from four weeks to under two hours with accuracy improvements.
  • Getting training data right is crucial for successful machine learning applications in industry.

Challenges in Data Labeling

06:59 - 14:21

  • Customers expect personalized recommendations and experiences.
  • Training data is crucial for model development and production.
  • Production models are constantly being replaced.
  • Fraud detection is an example of a use case that frequently changes.
  • Updating the model requires updating the training data.
  • Data labeling is time-consuming and often needs to be redone.
  • Practicality and pragmatism are important in addressing these challenges.
  • Data labeling has been overlooked in data science due to its messiness and complexity.
  • Getting curated data for specific problems is critical for success.
  • The traditional approach to labeling involves manual input from subject matter experts.
  • Semi-automated approaches have become more common, but programmatic solutions are needed.
  • Snorkel Flow provides a broader workflow solution for data development.

Programmatic Labeling and Labeling Functions

13:53 - 22:04

  • Two basic questions drive data development: where should I label for the best model performance, and how do I get expert knowledge into labels?
  • Error analysis or guided analysis helps determine where to label for highest impact
  • Standard approach is manual annotation one data point at a time
  • Labeling functions provide a way to raise the abstraction level and label multiple data points efficiently
  • Labeling functions can be expressed as simple functions, heuristics, or using external resources
  • Large language models can also be used as labeling functions
  • Labeling functions are interpretable, modifiable, and adaptable because they are code (see the sketch below)
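
Snorkel's open-source library expresses this idea directly in Python. Below is a minimal sketch of two labeling functions for a hypothetical product-review sentiment task; the DataFrame, column names, and heuristics are illustrative assumptions, not examples from the episode.

```python
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1  # -1 means "this LF abstains"

@labeling_function()
def lf_contains_refund(x):
    # Pattern-matching heuristic: refund requests usually signal complaints.
    return NEGATIVE if "refund" in x.text.lower() else ABSTAIN

@labeling_function()
def lf_five_stars(x):
    # Metadata heuristic: a five-star rating strongly suggests a positive review.
    return POSITIVE if x.stars == 5 else ABSTAIN

# Hypothetical unlabeled data; in practice this would be thousands of rows.
df_train = pd.DataFrame({
    "text": ["Great product, works perfectly", "I want a refund immediately"],
    "stars": [5, 1],
})

# Each labeling function votes on (or abstains from) every data point,
# producing a label matrix of shape (num_examples, num_LFs).
applier = PandasLFApplier(lfs=[lf_contains_refund, lf_five_stars])
L_train = applier.apply(df=df_train)
```

Because the functions are ordinary code, they can be inspected, versioned, and edited when the task changes, which is what makes the workflow feel like software development.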

Denoising and Combining Labeling Functions

21:36 - 28:27

  • A labeling function can use pattern matching to flag certain phrases as unacceptable
  • Heuristic or rule-based systems are not perfectly accurate and can be noisy and brittle
  • Weak supervision is a more efficient but messier approach to labeling data
  • Theoretical techniques are used to estimate the qualities of labeling functions and combine them
  • Labeling functions are denoised and combined into a clean training set for machine learning models (see the sketch after this list)
  • Foundation models are large language models trained using self-supervision techniques
  • Foundation models can be used for generative or predictive use cases
  • Adaptation is often required beyond foundation models for high accuracy in specific tasks
  • Further instruction of the model can be done through prompting
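
Continuing the sketch above, Snorkel's LabelModel is one concrete instance of this denoising step: it estimates each labeling function's accuracy from their patterns of agreement and disagreement alone, with no ground truth required, and combines the noisy votes into probabilistic training labels. The hyperparameters here are illustrative.

```python
from snorkel.labeling.model import LabelModel

# Fit a generative model over the label matrix L_train from the previous
# sketch; cardinality=2 because the task has two classes.
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L_train, n_epochs=500, seed=42)

# Probabilistic ("soft") labels: one row per example, one column per class.
probs_train = label_model.predict_proba(L=L_train)

# The soft labels (or their argmax) become the clean training set for a
# discriminative end model, which can generalize beyond the LFs' coverage.
preds_train = probs_train.argmax(axis=1)
```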

Fine-tuning and Specialization of Models

27:58 - 35:04

  • To achieve high accuracy with language models, further instruction and fine-tuning are necessary for specific tasks (a fine-tuning sketch follows this list).
  • Fine-tuning requires labeled data, which is crucial for improving performance.
  • A large pharma company applied GPT-4 to classify and extract information from clinical trial documents.
  • By labeling some data using a rapid programmatic labeling process, they achieved low 90s accuracy compared to the initial 66% accuracy.
  • Models trained on web data may not work well with private data or complex domain-specific problems.
  • Specialist models outperform generalist models by about 25% across various tasks.
  • The debate around foundation models revolves around whether they should be open source or proprietary.
  • The Snorkel Flow platform remains neutral and allows users to bring their own foundation model for fine-tuning and distillation.
  • Scaling up self-supervision has led to unexpected inflection points in capabilities of language models.
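
As a rough sketch of what the fine-tuning step looks like in practice, the snippet below fine-tunes a small open checkpoint for classification with the Hugging Face transformers library. The checkpoint and the public IMDB dataset are stand-ins for the proprietary models and programmatically labeled data discussed in the episode.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"  # assumption: any small encoder works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Stand-in for a programmatically labeled, task-specific dataset.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length")

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=16),
    # A small subsample keeps the sketch cheap to run.
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
)
trainer.train()
```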

Impact of Data Scale on Model Performance

34:40 - 41:31

  • Google researchers in 2007 showed that training on a hundred times the amount of data improved algorithm performance
  • Data has always been important, but the extent to which scale impacts performance was underestimated
  • In machine learning, once algorithm correctness is achieved, data becomes the critical factor (a toy learning-curve sketch follows this list)
  • Closed models may dominate in generalist consumer use cases with abundant training data
  • Open source models may come to dominate in specialist enterprise use cases with specific objectives
  • Customizing and refining open source models with enterprise-specific data and feedback is sufficient for most enterprise use cases
  • Private enterprise data and knowledge are valuable assets for specializing models
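
The data-scale effect is easy to reproduce in miniature: train the same algorithm on growing slices of a corpus and watch test accuracy climb. The sketch below uses scikit-learn and the public 20 Newsgroups corpus purely as stand-ins; it illustrates the scaling pattern, not the Google experiment itself.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

train = fetch_20newsgroups(subset="train")
test = fetch_20newsgroups(subset="test")

vec = TfidfVectorizer(max_features=20000)
X_train, X_test = vec.fit_transform(train.data), vec.transform(test.data)

# Same algorithm, 100x more data between the first and last run.
for n in (100, 1000, 10000):
    clf = LogisticRegression(max_iter=1000).fit(X_train[:n], train.target[:n])
    acc = accuracy_score(test.target, clf.predict(X_test))
    print(f"n={n:>6}  test accuracy={acc:.3f}")
```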

The Future of AI Development

41:03 - 48:02

  • Open source models are becoming more popular and enterprises are customizing them for their specific needs.
  • Private enterprise data and knowledge are valuable assets that enterprises want to own and utilize.
  • The speaker mentions the announcement of a new Snorkel platform that supports data labeling, curation, and pre-training of models.
  • Data-centric operations such as sampling, filtering, cleaning, and curating are crucial to improving model performance and are central to modern AI development (see the sketch after this list).
  • Foundry and GenFlow are two components of the broader foundation model data platform being announced by the speaker's company.
  • The speaker mentions an open-source research project called DataComp that demonstrated the impact of data selection on model performance.
  • Enterprises will increasingly rely on open source models that can be customized with their unique data for high accuracy on specific tasks.
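
As a toy illustration of those data-centric operations, the sketch below applies deduplication, filtering, cleaning, and re-sampling to a hypothetical pandas corpus; every column name and threshold here is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical mixed corpus of web text and internal enterprise documents.
corpus = pd.DataFrame({
    "text": ["Buy now!!!", "The trial met its primary endpoint.",
             "The trial met its primary endpoint.", "ok"],
    "source": ["web", "internal", "internal", "web"],
})

curated = (
    corpus
    .drop_duplicates(subset="text")                # deduplicate exact repeats
    .loc[lambda d: d["text"].str.len() > 10]       # filter low-signal snippets
    .assign(text=lambda d: d["text"].str.strip())  # light cleaning
)

# Curate the training mixture: keep the high-value enterprise slice whole
# and down-sample generic web text.
mix = pd.concat([
    curated[curated["source"] == "internal"].sample(frac=1.0, random_state=0),
    curated[curated["source"] == "web"].sample(frac=0.5, random_state=0),
])
```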

Conclusion and Final Thoughts

47:46 - 50:19

  • Programmatic data development is more of an art than a science today.
  • Having a workbench that supports a programmatic approach to creating the right data mixture is appealing.
  • Snorkel works with top US banks, pharmaceutical companies, insurance, telecom, healthcare and life sciences, and federal agencies.
  • There is a lot of interest in AI, but also a lot of hype, and it's important to separate the hype from real value.
  • Snorkel focuses on proving out value on specific use cases.
  • The guest suggests doing a separate show on the journey of academics into Silicon Valley startup land.