Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0
AI Fundamentals: Datasets 101
Mon Jul 17 2023
Description
This episode covers datasets and benchmarks, tokens and tokenization, model sizes and training, data compression, training data sources such as Common Crawl, legal and ethical considerations around licensing and data usage, dataset imbalance, and multilingual language modeling. It explores the difference between datasets and benchmarks, why understanding datasets matters for building vertical models, how trillion-token corpora are lossily compressed into far smaller parameter counts, the legal questions surrounding dataset usage, and how benchmark contamination and language-specific tokenization affect model quality and cost.
Insights
Datasets vs Benchmarks
Datasets and benchmarks are closely related but have some differences. Understanding datasets is important for building verticalized models for specific use cases.
Tokens and Tokenization
Tokens are integer representations of text, and the same word can map to different tokens depending on its position in a sentence. Modern tokenizers are more efficient because they fold leading spaces into tokens rather than emitting a separate token for each space.
Model Sizes and Training
GPT-3 is a 175 billion parameter model trained on around 300 billion tokens. Inference time now matters as AI crosses over from research into practical applications.
Data Compression and Datasets
An LLM like GPT-3 is trained on roughly 3,000 gigabytes of data that is lossily compressed into about 350 gigabytes of weights. Stable Diffusion compresses images and visual knowledge into a few gigabytes.
Training Data and Sources
LLaMA pairs roughly 1.2 trillion training tokens with as few as 7 billion parameters. Pretrained models like GPT have already done much of the heavy lifting for enterprises, so they don't need to collect as much proprietary data.
Common Crawl and Other Datasets
Common Crawl is a nonprofit organization that releases web data, but it has limitations and biases. Google's C4 dataset is a subset of Common Crawl that filters out certain content using heuristics.
Additional Datasets and Considerations
OpenAI's WebText has been replicated in the open as OpenWebText, with EleutherAI producing the OpenWebText2 successor. The books datasets consist of plain-text books used to train models like GPT.
Legal and Ethical Considerations
Adobe Firefly, Adobe's generative image model, won't generate characters like Spider-Man, which helps customers avoid copyright disputes with rights holders such as Disney. Whisper is OpenAI's model for transcription and automatic speech recognition (ASR).
Licensing and Data Usage
AI providers are being sued over the use of data without explicit licensing. GitHub is offering to pay for lawyers to fight on behalf of users accused of using copyrighted data.
Data Set Imbalance and Language Models
Gaming benchmarks by conveniently "forgetting" to remove them from training datasets is a common practice. Dataset imbalance can make language models more expensive to run in languages with less efficient tokenization.
Chapters
- Datasets and Benchmarks
- Tokens and Tokenization
- Model Sizes and Training
- Data Compression and Datasets
- Training Data and Sources
- Common Crawl and Other Datasets
- Additional Datasets and Considerations
- Legal and Ethical Considerations
- Licensing and Data Usage
- Data Set Imbalance and Language Models
Datasets and Benchmarks
00:10 - 07:06
- Datasets and benchmarks are closely related but have some differences.
- Understanding datasets is important for building verticalized models for specific use cases.
- The size of the internet is estimated at around 5 billion gigabytes, while most datasets discussed are in the hundreds of gigabytes range.
- OpenAI reportedly transcribes YouTube videos to obtain additional tokens for training models.
- Benchmarks have largely been decoupled from the datasets models are trained on, except where training loss itself serves as the evaluation.
- Training on data that overlaps with a benchmark inflates evaluation scores on benchmarks like HellaSwag.
Tokens and Tokenization
06:46 - 13:31
- Tokens are integers that represent units of text, not just character splitting.
- Modern tokenizers are more efficient because they fold leading spaces into tokens rather than emitting a separate token for each space.
- Tokens are integer representations of words, and the same word can map to different tokens depending on its position in a sentence.
- A data set with a high number of tokens may still have a lot of repetition, which is not as helpful.
- Tokens play a crucial role in deep learning models and their representation affects model performance.
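The leading-space behavior above can be sketched with a toy greedy tokenizer; the vocabulary below is invented for illustration and is not any real model's:

```python
# Toy greedy longest-match tokenizer illustrating why "low" and " low"
# (with a leading space) map to different token IDs, as GPT-style BPE
# vocabularies do. The vocabulary is made up for this sketch.
VOCAB = {"low": 0, " low": 1, "er": 2, " er": 3,
         "l": 4, "o": 5, "w": 6, "e": 7, "r": 8, " ": 9}

def tokenize(text: str) -> list[int]:
    """Greedily match the longest vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):  # longest match first
            piece = text[i:i + length]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i += length
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("lower"))      # [0, 2]    -> "low" + "er"
print(tokenize("low lower"))  # [0, 1, 2] -> "low" + " low" + "er"
```

Real BPE vocabularies are learned from data rather than hand-written, but the effect is the same: "low" at the start of a sentence and " low" mid-sentence are distinct token IDs.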
Model Sizes and Training
13:06 - 20:15
- GPT-3 is a 175 billion parameter model trained on around 300 billion tokens.
- Chinchilla, a smaller model trained compute-optimally, was able to match or beat GPT-3 despite being a fraction of its size.
- Inference time now matters as AI crosses over from research into practical applications.
- Training compresses the dataset into the model's parameters; at 16 bits per parameter, GPT-3's 175 billion parameters occupy about 350 gigabytes.
- Models at this scale are trained on around 3,000 gigabytes of data, which is compressed into that much smaller parameter footprint.
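The arithmetic behind those figures is simple, assuming 16-bit weights (the common storage format for models of this era):

```python
# Back-of-the-envelope compression math for GPT-3-scale models.
params = 175e9            # GPT-3 parameter count
BYTES_PER_PARAM = 2       # fp16 weights: 2 bytes per parameter

model_gb = params * BYTES_PER_PARAM / 1e9
dataset_gb = 3000         # rough training-set size cited in the episode

print(f"model size:        {model_gb:.0f} GB")              # 350 GB
print(f"lossy compression: {dataset_gb / model_gb:.1f}x")   # ~8.6x
```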
Data Compression and Datasets
19:51 - 26:06
- An LLM like GPT-3 is trained on roughly 3,000 gigabytes of data that is lossily compressed into about 350 gigabytes of weights
- Stable Diffusion compresses images and visual knowledge into a few gigabytes
- Quantization allows running large models on devices with limited storage
- George Hotz suggests that compressing a person's consciousness or knowledge would require around two gigabytes
- Large language model datasets operate at the order of magnitude of one trillion tokens
- Smaller models like LLaMA achieve higher token counts with lower parameter counts, indicating an evolution in model sizing
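Chinchilla's finding is often summarized as a rule of thumb of roughly 20 training tokens per parameter; a quick sketch shows how GPT-3 trained on far fewer tokens than that, while LLaMA deliberately trained on far more:

```python
TOKENS_PER_PARAM = 20  # rough Chinchilla-optimal ratio (rule of thumb)

def chinchilla_optimal_tokens(params: float) -> float:
    """Approximate compute-optimal training tokens for a parameter count."""
    return params * TOKENS_PER_PARAM

# GPT-3: 175B parameters but only ~300B training tokens -> under-trained.
print(chinchilla_optimal_tokens(175e9) / 1e9)  # 3500.0 (billions of tokens)

# LLaMA-7B: ~1T tokens on 7B parameters -> trained well past "optimal",
# trading extra training compute for cheaper inference.
print(chinchilla_optimal_tokens(7e9) / 1e9)    # 140.0 (billions of tokens)
```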
Training Data and Sources
25:41 - 32:49
- LLaMA pairs roughly 1.2 trillion training tokens with as few as 7 billion parameters.
- Machine learning models like GPT have already done the work for enterprises, so they don't need to collect as much proprietary data.
- Models for code only require a few dozen or hundred examples to perform well.
- Karpathy's presentation at the Microsoft Build conference outlined the GPT training pipeline: pretraining on raw internet data, supervised fine-tuning on demonstration data, reward modeling from comparisons between outputs, and reinforcement learning on prompts.
- Common Crawl is a nonprofit organization that provides massive amounts of web page data for research purposes.
Common Crawl and Other Datasets
32:21 - 39:30
- Common Crawl is a nonprofit organization that releases web data, but it has limitations and biases.
- Google's C4 dataset is a subset of Common Crawl that filters out certain content using heuristics.
- Both Common Crawl and C4 are commonly used datasets for training language models.
- Wikipedia is another dataset used for language models, but it has its own biases due to skewed editor representation.
- WebText, the main dataset behind GPT-2, consists of text scraped from pages linked on Reddit (with at least 3 karma) up to December 2017.
Additional Datasets and Considerations
39:11 - 45:40
- OpenAI's WebText has been replicated in the open as OpenWebText, with EleutherAI producing the OpenWebText2 successor
- Reddit shutting down their APIs is a current hot topic
- The Books3 dataset consists of roughly 196,000 plain-text books used for models like GPT
- Some books in the dataset have unclear copyright status
- Code datasets include Salesforce's CodeGen and The Stack from the BigCode project, a scrape built from the GitHub archive
- The pile dataset combines multiple smaller datasets into an 825 gigabyte collection
- Approximately one third of the contents in the pile dataset are duplicated
- LAION is an image dataset from the LAION nonprofit, containing billions of image-caption pairs collected from Common Crawl
- Filtering is necessary to remove explicit content and copyrighted images from LAION
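Deduplication at this scale is typically done by hashing. A minimal exact-match sketch (production pipelines, such as the deduplicated Pile, layer fuzzy methods like MinHash on top):

```python
import hashlib

def dedup(docs: list[str]) -> list[str]:
    """Drop exact duplicates by hashing whitespace-normalized, lowercased text."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha256(normalized.encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world.", "hello   world.", "Something else entirely."]
print(dedup(docs))  # ['Hello world.', 'Something else entirely.']
```

Hashing keeps memory bounded: the set holds fixed-size digests rather than full documents, which matters when the corpus is hundreds of gigabytes.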
Legal and Ethical Considerations
45:16 - 52:01
- Adobe Firefly, Adobe's generative image model, won't generate characters like Spider-Man, which helps customers avoid copyright disputes with rights holders such as Disney.
- Whisper is OpenAI's model for transcription and automatic speech recognition (ASR). It performs well enough that there is little need to train a competing model.
- Hugging Face Hub and Kaggle are recommended platforms for finding training datasets.
- The ratio of dataset producers to dataset consumers is imbalanced, with more people interested in training models than cleaning data.
- Companies like Bloomberg and Notion realize the value of their proprietary datasets and are capitalizing on them.
- There is a counterculture movement advocating for open data sets and replication of instruction tuning datasets.
- Dataset quality is crucial, as low-quality or repetitive data does not contribute much to training models.
- Copyright and privacy issues continue to be important considerations in using datasets, with ongoing litigation surrounding fair use and licensing rights.
Licensing and Data Usage
51:36 - 58:24
- AI providers are being sued over the use of data without specific licensing
- GitHub is offering to pay for lawyers to fight on behalf of users accused of using copyrighted data
- Training data for AI models often comes from sources that have not opted in to be part of the training process
- Licensing issues and permissions are emerging areas that are being litigated
- Duplicate information in training data can cause models to overfit and waste compute resources
- Contamination in benchmarks can lead to inflated performance results
- Data set imbalance is a challenge, particularly with English-centric models
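A common contamination check looks for long n-gram overlap between training documents and benchmark items (GPT-3's paper used 13-grams); a simplified sketch, with a smaller n so the toy strings below are long enough to overlap:

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(train_doc: str, benchmark_item: str, n: int = 8) -> bool:
    """Flag a training document sharing any n-gram with a benchmark item.
    GPT-3's filtering used 13-grams; n=8 here to suit the short toy strings."""
    return bool(ngrams(train_doc, n) & ngrams(benchmark_item, n))

bench = "the quick brown fox jumps over the lazy dog near the river bank"
clean = "an entirely unrelated training document about datasets and tokenizers"
leaked = ("scraped page containing the quick brown fox jumps over "
          "the lazy dog near the river bank verbatim")

print(is_contaminated(clean, bench))   # False
print(is_contaminated(leaked, bench))  # True -> would be removed or flagged
```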
Data Set Imbalance and Language Models
58:01 - 1:00:55
- Gaming benchmarks by conveniently "forgetting" to remove them from training datasets is a common practice.
- Dataset imbalance can make language models more expensive to run in languages with less efficient tokenization.
- Chinese reportedly requires learning on the order of 10,000 characters for full literacy, a strikingly high bar.
- China is making progress on language modeling and there are Chinese data sets available.
- Italian language models are also being developed.
- Languages are used differently in different countries, with some being more oral-driven and voice data being crucial.
- Data sets play a crucial role in training language models and should be given more attention.
- The episode closes with instructions on how to train a model on the Latent Space podcast itself.
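One concrete reason for the cost gap across languages: byte-level tokenizers start from UTF-8 bytes, and non-Latin scripts need more bytes per character before vocabulary effects even kick in. A stdlib-only sketch:

```python
# Byte-level BPE tokenizers operate on UTF-8 bytes, so scripts outside
# ASCII start at a disadvantage even before vocabulary coverage matters.
samples = {
    "English": "hello",
    "Chinese": "你好",   # "hello" in two characters
    "Italian": "ciao",
}
for lang, text in samples.items():
    raw = text.encode("utf-8")
    print(f"{lang}: {len(text)} chars -> {len(raw)} bytes "
          f"({len(raw) / len(text):.0f} bytes/char)")
# Chinese characters take 3 UTF-8 bytes each; an English-heavy vocabulary
# with few CJK merges therefore yields more tokens per sentence, meaning
# higher API cost and less effective context for the same content.
```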