Replit AI Podcast
03: The Next Generation of LLMs with MosaicML
Thu Jun 29 2023
Description
This episode covers large language models (LLMs) and their impact on the industry: the recent release of the 30B model, training larger models with longer context sizes, the challenges and advantages of long contexts in generative models, data mixes and architecture improvements in LLMs, the role of open source in AI and the challenges it faces, creativity and experimentation in AI, exploring and improving existing datasets, the impact of model size and shape on inference performance, and the potential of wide models and the importance of collaboration.
Insights
Jonathan Frankle joining Databricks as Chief Neural Network Scientist
Jonathan Frankle is joining Databricks as the Chief Neural Network Scientist, pending regulatory approval
The recent release of the 30B model and its impact on the industry
The recent release of the 30B model provides a commercially usable artifact for the community. Training bigger models is more expensive but has shown improvements in metrics. The team trained a 7B model and then a 30B model using the same code base. H100 GPUs were used for training and performed exceptionally well, with optimizations made to squeeze as much performance as possible out of the architecture. The team plans to push boundaries in future releases using H100 GPUs.
Exploring the balance between H100s and new chips for best performance
GPUs are getting three times faster every three years. There is a question about the right architecture to balance H100s and new chips for best performance.
Benefits of training larger models with longer context size
Half a trillion tokens were trained on the 7B model over a weekend. The 30B model has an 8K context size. Training a model with a larger context size is not significantly more painful; longer sequence lengths come nearly for free with bigger models, and accuracy does not seem to be affected by the larger context size.
Challenges and advantages of long contexts in generative models
Evaluation of generative models is challenging. Short sequences tend to perform well on certain tasks, but that doesn't mean they are the best models overall. Models currently don't support long contexts, but even if they do, it's unclear if they can effectively use them for meaningful attention across the entire document. Retrieval tasks benefit from longer context lengths, but the main advantage of longer contexts is being able to do attention across the document. Benchmarking long sequence models is challenging, as human evaluation with a 64K sequence length would require reading an entire book before performing a task. Repl applications for long context models can provide user feedback quickly and determine whether longer contexts help or hurt performance. Benchmark evaluations help distinguish between bad and good models, but may not provide clear distinctions among good models. Evaluations have limitations in providing insight into model selection, as different data mixes and metrics may yield similar results.
Exploring data mixes and architecture improvements in LLMs
Evals can help separate a bad model from a good model, but not distinguish among good models. Experimenting with different data mixes for the Replit model could reveal distinctions in popularity and completion rate. There are improvements to be had in both data and architecture for LLMs. Optimizing architectures for H100s and exploring mixture of experts are exciting developments. Lower-precision training, FP8, and low-bit optimizers are being explored to improve training speed and memory usage. LLM Foundry is an open-source GitHub repo that contains all of their LLM code, letting outsiders see their development progress.
The importance of open source in AI and the challenges it faces
An open source implementation of GPT-3 has been added and is being tested. Mosaic is committed to open source and sharing their models. Collaboration among open-source groups is important for innovation. The future of technology relies on open-source AI efforts. Science may slow down without the sharing culture of big labs. Individual researchers should consider working for companies committed to open source. The power of individuals and collaboration can drive progress in the field. Open source alliances are crucial for the future. The AI ecosystem has seen significant growth in contributors and repositories.
The challenges and potential of open source AI models
The AI open source ecosystem has grown significantly, with a large number of contributors and GitHub repositories focused on AI. Open source is still lagging behind in some aspects, but catching up quickly. Long-term investment in open source AI models is challenging, as there is a need for bigger and more unique models. Good human-generated instruction data sets are currently lacking in the open source community. Investing in larger and longer-term artifacts is necessary to achieve high-quality open source models like GPT-4. Money and high-quality data are major challenges in developing advanced open source models. Crowdsourcing diverse conversations and quality checking can help fill the gap in instruction data sets. Copying existing models with minor variations is not interesting or productive; more directed efforts and collaboration are needed to match GPT-3.5 capabilities across different domains. Releasing models that do something different and unique can lead to interesting outcomes.
The importance of creativity and experimentation in AI
Challenging the team to do something different with every model release. Emphasizing the need for creativity and different ideas in the community. Advocating for trying new things and being willing to do weird stuff. Importance of fine tuning as a way to experiment and be different. Driving down the cost of experimentation at Mosaic. Acknowledging the need for more work on data sets and their limitations. Suggesting creating high-quality data sets as a contribution without compute budget. Recognizing that data work is often unglamorous but crucial. Highlighting opportunities for improvement in existing data sets. Comparing and contrasting datasets to learn from them.
Exploring and improving existing datasets for AI models
Comparing and contrasting datasets like Pile Wikipedia and RedPajama Wikipedia: the RedPajama dataset often only has the first paragraph of an article, while the Pile dataset is missing information in tables and bulleted lists. One idea is recreating a Wikipedia dataset in Markdown format. Common Crawl has interesting variations with HTML, JavaScript, etc. JavaScript in Common Crawl is mostly minified and was excluded from the MPT models; work is ongoing on filtering JavaScript data for better results. Joining Databricks provides access to better tools for working with massive datasets. The goal at Mosaic is to produce constantly up-to-date versions of datasets like Common Crawl, Wikipedia, and GitHub, open sourcing the project to benefit the community and attract more customers. This aligns Mosaic's business model with open source and open data initiatives: building the company around a product rather than just selling secrets or efficiency, driving down the cost of training models to expand the market size, challenging the team to make machine learning easier and cheaper for everyone, and revolutionizing open source work on inference infrastructure for production use.
The impact of model size and shape on inference performance
Open source is revolutionizing the work being done on inference. The panel discusses intentional model size choices, such as the 30B model's roughly 30-33 billion parameters, which makes inference on a single GPU possible without low precision or other optimizations. Distillation may allow for separate architectures for training and inference. The choice of the 3B model size was based on latency targets and ablation testing on A100s. The current inference latency is around 250 milliseconds at P95, which is considered an amazing result. Testing on H100s shows improved performance compared to standard inference times. Model shape is becoming important in addition to model size, with wider models showing better GPU utilization, and there is potential for even larger models that can utilize the faster GPUs effectively. Replit plans to open source their technology to make it more accessible and portable. The choice of a 3B model allows it to run locally with limited resources and be used as a code assistant chatbot by the community.
The potential of wide models and the importance of collaboration
The elegance of a 3 billion parameter model that performs well across the board and is open sourceable. Wide models can take advantage of CPUs, and neural network pruning works well on CPUs; there's more we can do beyond quantizing models for CPUs. Excitement about working together with Mosaic plus Databricks. Questions from attendees will be addressed on Twitter.
Chapters
- Jonathan Frankle joining Databricks as Chief Neural Network Scientist
- The recent release of the 30B model and its impact on the industry
- Exploring the balance between H100s and new chips for best performance
- Benefits of training larger models with longer context size
- Challenges and advantages of long contexts in generative models
- Exploring data mixes and architecture improvements in LLMs
- The importance of open source in AI and the challenges it faces
- The challenges and potential of open source AI models
- The importance of creativity and experimentation in AI
- Exploring and improving existing datasets for AI models
- The impact of model size and shape on inference performance
- The potential of wide models and the importance of collaboration
Jonathan Frankle joining Databricks as Chief Neural Network Scientist
00:00 - 06:53
- Jonathan Frankle is joining Databricks as the Chief Neural Network Scientist, pending regulatory approval
The recent release of the 30B model and its impact on the industry
00:00 - 06:53
- The recent release of the 30B model is exciting because it provides a commercially usable artifact for the community
- Training bigger models is more expensive but they have shown improvements in metrics
- The team trained a 7B model and then a 30B model using the same code base
- The H100 GPUs were used for training and performed exceptionally well
- Optimizations were made to squeeze as much power as possible from the architecture
- The team plans to push boundaries in future releases using H100 GPUs
Exploring the balance between H100s and new chips for best performance
06:24 - 12:47
- GPUs are getting three times faster every three years
- Questioning the right architecture to balance H100s and new chips for best performance
Benefits of training larger models with longer context size
06:24 - 12:47
- Trained a half trillion tokens on the 7B model over the weekend
- 30B model has an 8K context size
- Training a model with larger context size is not significantly more painful
- Longer sequence lengths can be achieved for free with bigger models
- Accuracy does not seem to be affected by the larger context size
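A back-of-envelope calculation illustrates why longer sequence lengths come nearly for free on bigger models: the attention score and value matmuls are the only forward FLOPs that grow with sequence length, and they shrink as a fraction of total compute as the weight matrices grow. The layer counts and widths below are rough MPT-style shapes assumed for illustration, not exact configs.

```python
def attn_flop_fraction(n_params: float, n_layers: int, d_model: int, seq_len: int) -> float:
    """Fraction of per-token forward FLOPs spent in the attention
    score (QK^T) and value (attn @ V) matmuls -- the only terms that
    grow with sequence length."""
    weight_flops = 2 * n_params                    # one multiply-add per weight
    attn_flops = 4 * n_layers * seq_len * d_model  # 2 FLOPs each for QK^T and attn@V
    return attn_flops / (weight_flops + attn_flops)

# Rough MPT-style shapes (assumed) at an 8K context:
frac_7b = attn_flop_fraction(7e9, 32, 4096, 8192)    # ~24% of FLOPs
frac_30b = attn_flop_fraction(30e9, 48, 7168, 8192)  # ~16% of FLOPs
```

On these numbers the 8K context adds proportionally less overhead to the 30B model than to the 7B one, matching the intuition that bigger models absorb longer sequences more cheaply.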
Challenges and advantages of long contexts in generative models
12:25 - 19:17
- Evaluation of generative models is challenging
- Short sequences tend to perform well on certain tasks, but that doesn't mean they are the best models overall
- Models currently don't support long contexts, but even if they do, it's unclear if they can effectively use them for meaningful attention across the entire document.
- Retrieval tasks benefit from longer context lengths, but the main advantage of longer contexts is being able to do attention across the document.
- Benchmarking long sequence models is challenging, as human evaluation with a 64K sequence length would require reading an entire book before performing a task.
- Repl applications for long context models can provide user feedback quickly and determine whether longer contexts help or hurt performance.
- Benchmark evaluations help distinguish between bad and good models, but may not provide clear distinctions among good models.
- Evaluations have limitations in providing insight into model selection, as different data mixes and metrics may yield similar results.
Exploring data mixes and architecture improvements in LLMs
18:48 - 25:03
- Evals can help separate a bad model from a good model, but not distinguish among good models.
- Experimenting with different data mixes for the Replit model could reveal distinctions in popularity and completion rate.
- There are improvements to be had in both data and architecture for LLMs.
- Optimizing architectures for H100s and exploring mixture of experts are exciting developments.
- Lower-precision training, FP8, and low-bit optimizers are being explored to improve training speed and memory usage.
- LLM Foundry is an open-source GitHub repo that contains all of their LLM code. It allows outsiders to see their development progress.
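As an illustration of the low-bit-optimizer idea mentioned above (a generic sketch, not MosaicML's actual implementation), block-wise absmax quantization stores a tensor as int8 codes plus one float scale per block, cutting memory roughly 4x versus fp32 state:

```python
import numpy as np

def quantize_blockwise(x: np.ndarray, block: int = 64):
    """Absmax int8 quantization per block (block size must divide len(x)):
    keep int8 codes plus one float32 scale per block."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    codes = np.round(xb / scale).astype(np.int8)
    return codes, scale

def dequantize_blockwise(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original tensor."""
    return (codes.astype(np.float32) * scale).reshape(-1)
```

Per-block scales keep the round-trip error bounded by half a quantization step within each block, which is why optimizer states tolerate this compression well.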
The importance of open source in AI and the challenges it faces
24:37 - 31:35
- An open source implementation of GPT-3 has been added and is being tested
- Mosaic is committed to open source and sharing their models
- Collaboration among open-source groups is important for innovation
- The future of technology relies on open-source AI efforts
- Science may slow down without the sharing culture of big labs
- Individual researchers should consider working for companies committed to open source
- The power of individuals and collaboration can drive progress in the field
- Open source alliances are crucial for the future
- The AI ecosystem has seen significant growth in contributors and repositories
The challenges and potential of open source AI models
31:11 - 38:07
- The AI open source ecosystem has grown significantly, with a large number of contributors and GitHub repositories focused on AI.
- Open source is still lagging behind in some aspects, but catching up quickly.
- Long-term investment in open source AI models is challenging, as there is a need for bigger and more unique models.
- Good human-generated instruction data sets are currently lacking in the open source community.
- Investing in larger and longer-term artifacts is necessary to achieve high-quality open source models like GPT-4.
- Money and high-quality data are major challenges in developing advanced open source models.
- Crowdsourcing diverse conversations and quality checking can help fill the gap in instruction data sets.
- Copying existing models with minor variations is not interesting or productive; more directed efforts and collaboration are needed to match GPT-3.5 capabilities across different domains.
- Releasing models that do something different and unique can lead to interesting outcomes.
The importance of creativity and experimentation in AI
37:38 - 44:23
- Challenging the team to do something different with every model release
- Emphasizing the need for creativity and different ideas in the community
- Advocating for trying new things and being willing to do weird stuff
- Importance of fine tuning as a way to experiment and be different
- Driving down the cost of experimentation at Mosaic
- Acknowledging the need for more work on data sets and their limitations
- Suggesting creating high-quality data sets as a contribution without compute budget
- Recognizing that data work is often unglamorous but crucial
- Highlighting opportunities for improvement in existing data sets
- Comparing and contrasting datasets to learn from them
Exploring and improving existing datasets for AI models
43:53 - 50:33
- Comparing and contrasting datasets like Pile Wikipedia and RedPajama Wikipedia
- The RedPajama dataset often only has the first paragraph of an article
- Pile dataset is missing information in tables and bulleted lists
- Consider recreating a Wikipedia dataset in Markdown format
- Common Crawl has interesting variations with HTML, JavaScript, etc.
- JavaScript in Common Crawl is mostly minified and was excluded from the MPT models
- Working on filtering JavaScript data for better results
- Joining Databricks to have access to better tools for working with massive datasets
- Goal at Mosaic is to produce constantly up-to-date versions of datasets like Common Crawl, Wikipedia, and GitHub
- Open sourcing the project to benefit the community and attract more customers
- Aligning Mosaic's business model with open source and open data initiatives
- Building the company around a product rather than just selling secrets or efficiency
- Driving down the cost of training models to expand the market size
- Challenging the team to make machine learning easier and cheaper for everyone
- Revolutionizing open source work on inference infrastructure for production use
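A minimal sketch of the kind of heuristic that can flag minified JavaScript for exclusion (illustrative only, not the actual MPT data pipeline): minified code packs everything into a few very long lines, so average line length is a cheap signal.

```python
def looks_minified(source: str, max_avg_line_len: float = 200.0) -> bool:
    """Cheap signal: minified JS concentrates code into a few very long
    lines, so a high average non-blank line length flags it."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return False
    avg_len = sum(len(line) for line in lines) / len(lines)
    return avg_len > max_avg_line_len
```

Real pipelines would combine several signals (line length, identifier entropy, whitespace ratio), but even a one-line-length check removes a large share of minified bundles.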
The impact of model size and shape on inference performance
50:13 - 57:47
- Open source is revolutionizing the work being done on inference.
- The panel discusses intentional model size choices, such as the 30B model's roughly 30-33 billion parameters.
- Inference on a single GPU without low precision or other optimizations is possible with the chosen model size.
- Distillation may allow for separate architectures for training and inference.
- The choice of 3B model size was based on latency targets and ablation testing on A100s.
- The current inference latency is around 250 milliseconds P95, which is considered an amazing result.
- Testing on H100s shows improved performance compared to standard inference times.
- Model shape is becoming important in addition to model size, with wider models showing better GPU utilization.
- There is potential for even larger models that can utilize the faster GPUs effectively.
- Replit plans to open source their technology to make it more accessible and portable.
- The choice of a 3B model allows it to run locally with limited resources and be used as a code assistant chatbot by the community.
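A quick sizing check shows why a 3B model is friendly to single-GPU and local inference. The numbers below count weights only and ignore activations and the KV cache, so they are a lower bound on real memory use.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight-only memory footprint in GiB (ignores activations / KV cache)."""
    return n_params * bytes_per_param / 1024**3

fp16_gb = weight_memory_gb(3e9, 2)    # ~5.6 GiB: fits comfortably on one GPU
int4_gb = weight_memory_gb(3e9, 0.5)  # ~1.4 GiB: laptop-class hardware territory
```

Since autoregressive decoding is largely memory-bandwidth bound, shrinking the weights this way also tends to lower per-token latency, which is consistent with the latency-driven sizing discussed above.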
The potential of wide models and the importance of collaboration
57:18 - 59:32
- The elegance of a 3 billion parameter model that performs well across the board and is open sourceable
- Wide models can take advantage of CPUs
- Neural network pruning works well on CPUs
- There's more we can do beyond quantizing models for CPUs
- Excitement about working together with Mosaic plus Databricks
- Questions from attendees will be addressed on Twitter
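A toy sketch of the magnitude pruning mentioned above (illustrative only, not any panelist's actual method): zero out the smallest-magnitude weights so sparse CPU kernels can skip them entirely.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """One-shot magnitude pruning: zero the fraction `sparsity` of weights
    with the smallest absolute values."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)
```

At high sparsity, CPU inference engines can exploit the zeros to skip work, which is part of why pruned and wide models pair well with CPUs.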