Replit AI Podcast
03: The Next Generation of LLMs with MosaicML
Thu Jun 29 2023
Description
This episode covers large language models (LLMs) and their impact on the industry: the recent release of the 30B model, training larger models with longer context sizes, the challenges and advantages of long contexts in generative models, data mixes and architecture improvements in LLMs, the role of open source in AI and the challenges it faces, creativity and experimentation in AI, exploring and improving existing datasets, the impact of model size and shape on inference performance, and the potential of wide models and the importance of collaboration.
Insights
Jonathan Frankle joining Databricks as Chief Neural Network Scientist
Jonathan Frankle is joining Databricks as the Chief Neural Network Scientist, pending regulatory approval
The recent release of the 30B model and its impact on the industry
The recent release of the 30B model provides a commercially usable artifact for the community. Training bigger models is more expensive but has shown improvements in metrics. The team trained a 7B model and then a 30B model using the same code base. H100 GPUs were used for training and performed exceptionally well, with optimizations made to squeeze as much performance as possible out of the architecture. The team plans to push boundaries in future releases using H100 GPUs.
Exploring the balance between H100s and new chips for best performance
GPUs are getting three times faster every three years. There is a question about the right architecture to balance H100s and new chips for best performance.
Benefits of training larger models with longer context size
Half a trillion tokens were trained on the 7B model over a weekend. The 30B model has an 8K context size. Training a model with a larger context size is not significantly more painful; longer sequence lengths come nearly for free with bigger models, and accuracy does not seem to be affected by the larger context size.
Challenges and advantages of long contexts in generative models
Evaluation of generative models is challenging. Short sequences tend to perform well on certain tasks, but that doesn't mean they are the best models overall. Models currently don't support long contexts, but even if they do, it's unclear if they can effectively use them for meaningful attention across the entire document. Retrieval tasks benefit from longer context lengths, but the main advantage of longer contexts is being able to do attention across the document. Benchmarking long sequence models is challenging, as human evaluation with a 64K sequence length would require reading an entire book before performing a task. Repl applications for long context models can provide user feedback quickly and determine whether longer contexts help or hurt performance. Benchmark evaluations help distinguish between bad and good models, but may not provide clear distinctions among good models. Evaluations have limitations in providing insight into model selection, as different data mixes and metrics may yield similar results.
Exploring data mixes and architecture improvements in LLMs
Evals can help separate a bad model from a good model, but not distinguish among good models. Experimenting with different data mixes for the Replit model could reveal distinctions in popularity and completion rate. There are improvements to be had in both data and architecture for LLMs. Optimizing architectures for H100s and exploring mixture of experts are exciting developments. Lower-precision training, FP8, and low-bit optimizers are being explored to improve training speed and memory usage. LLM Foundry is an open-source GitHub repo that contains all of their LLM code, letting outsiders see their development progress.
The importance of open source in AI and the challenges it faces
An open source implementation of GPT-3 has been added and is being tested. Mosaic is committed to open source and sharing their models. Collaboration among open-source groups is important for innovation. The future of technology relies on open-source AI efforts. Science may slow down without the sharing culture of big labs. Individual researchers should consider working for companies committed to open source. The power of individuals and collaboration can drive progress in the field. Open source alliances are crucial for the future. The AI ecosystem has seen significant growth in contributors and repositories.
The challenges and potential of open source AI models
The AI open source ecosystem has grown significantly, with a large number of contributors and GitHub repositories focused on AI. Open source is still lagging behind in some aspects, but catching up quickly. Long-term investment in open source AI models is challenging, as there is a need for bigger and more unique models. Good human-generated instruction data sets are currently lacking in the open source community. Investing in larger and longer-term artifacts is necessary to achieve high-quality open source models like GPT-4. Money and high-quality data are major challenges in developing advanced open source models. Crowdsourcing diverse conversations and quality checking can help fill the gap in instruction data sets. Copying existing models with minor variations is not interesting or productive; more directed efforts and collaboration are needed to match GPT-3.5 capabilities across different domains. Releasing models that do something different and unique can lead to interesting outcomes.
The importance of creativity and experimentation in AI
Challenging the team to do something different with every model release. Emphasizing the need for creativity and different ideas in the community. Advocating for trying new things and being willing to do weird stuff. Importance of fine tuning as a way to experiment and be different. Driving down the cost of experimentation at Mosaic. Acknowledging the need for more work on data sets and their limitations. Suggesting creating high-quality data sets as a contribution without compute budget. Recognizing that data work is often unglamorous but crucial. Highlighting opportunities for improvement in existing data sets. Comparing and contrasting datasets to learn from them.
Exploring and improving existing datasets for AI models
Comparing and contrasting datasets like Pile Wikipedia and RedPajama Wikipedia: the RedPajama dataset often only has the first paragraph of an article, while the Pile dataset is missing information in tables and bulleted lists. One idea is recreating a Wikipedia dataset in Markdown format. Common Crawl has interesting variations with HTML, JavaScript, etc. JavaScript in Common Crawl is mostly minified and was excluded from the MPT models; work is ongoing on filtering JavaScript data for better results. Joining Databricks provides access to better tools for working with massive datasets. The goal at Mosaic is to produce constantly up-to-date versions of datasets like Common Crawl, Wikipedia, and GitHub, open sourcing the project to benefit the community and attract more customers. This aligns Mosaic's business model with open source and open data initiatives: building the company around a product rather than just selling secrets or efficiency, driving down the cost of training models to expand the market size, challenging the team to make machine learning easier and cheaper for everyone, and revolutionizing open source work on inference infrastructure for production use.
The impact of model size and shape on inference performance
Open source is revolutionizing the work being done on inference. The panel discusses intentional model size choices, such as the 30B model's roughly 30-33 billion parameters, which makes inference on a single GPU possible without low precision or other optimizations. Distillation may allow for separate architectures for training and inference. The choice of the 3B model size was based on latency targets and ablation testing on A100s. The current inference latency is around 250 milliseconds at P95, which is considered an amazing result. Testing on H100s shows improved performance compared to standard inference times. Model shape is becoming important in addition to model size, with wider models showing better GPU utilization, and there is potential for even larger models that can utilize the faster GPUs effectively. Replit plans to open source their technology to make it more accessible and portable. The choice of a 3B model allows it to run locally with limited resources and be used as a code assistant chatbot by the community.
The potential of wide models and the importance of collaboration
The elegance of a 3 billion parameter model that performs well across the board and is open sourceable. Wide models can take advantage of CPUs, and neural network pruning works well on CPUs; there's more we can do beyond quantizing models for CPUs. Excitement about working together with Mosaic plus Databricks. Questions from attendees will be addressed on Twitter.
Chapters
- Jonathan Frankle joining Databricks as Chief Neural Network Scientist
- The recent release of the 30B model and its impact on the industry
- Exploring the balance between H100s and new chips for best performance
- Benefits of training larger models with longer context size
- Challenges and advantages of long contexts in generative models
- Exploring data mixes and architecture improvements in LLMs
- The importance of open source in AI and the challenges it faces
- The challenges and potential of open source AI models
- The importance of creativity and experimentation in AI
- Exploring and improving existing datasets for AI models
- The impact of model size and shape on inference performance
- The potential of wide models and the importance of collaboration
Jonathan Frankle joining Databricks as Chief Neural Network Scientist
00:00 - 06:53
- Jonathan Frankle is joining Databricks as the Chief Neural Network Scientist, pending regulatory approval
The recent release of the 30B model and its impact on the industry
00:00 - 06:53
- The recent release of the 30B model is exciting because it provides a commercially usable artifact for the community
- Training bigger models is more expensive but they have shown improvements in metrics
- The team trained a 7B model and then a 30B model using the same code base
- The H100 GPUs were used for training and performed exceptionally well
- Optimizations were made to squeeze as much power as possible from the architecture
- The team plans to push boundaries in future releases using H100 GPUs
Exploring the balance between H100s and new chips for best performance
06:24 - 12:47
- GPUs are getting three times faster every three years
- Questioning the right architecture to balance H100s and new chips for best performance
Benefits of training larger models with longer context size
06:24 - 12:47
- Trained a half trillion tokens on the 7B model over the weekend
- 30B model has an 8K context size
- Training a model with larger context size is not significantly more painful
- Longer sequence lengths can be achieved for free with bigger models
- Accuracy does not seem to be affected by the larger context size
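A back-of-envelope calculation illustrates why longer sequence lengths come nearly for free on bigger models: the attention score and value matmuls are the only forward FLOPs that grow with sequence length, and they shrink as a fraction of total compute as the weight matrices grow. The layer counts and widths below are rough MPT-style shapes assumed for illustration, not exact configs.

```python
def attn_flop_fraction(n_params: float, n_layers: int, d_model: int, seq_len: int) -> float:
    """Fraction of per-token forward FLOPs spent in the attention
    score (QK^T) and value (attn @ V) matmuls -- the only terms that
    grow with sequence length."""
    weight_flops = 2 * n_params                    # one multiply-add per weight
    attn_flops = 4 * n_layers * seq_len * d_model  # 2 FLOPs each for QK^T and attn@V
    return attn_flops / (weight_flops + attn_flops)

# Rough MPT-style shapes (assumed) at an 8K context:
frac_7b = attn_flop_fraction(7e9, 32, 4096, 8192)    # ~24% of FLOPs
frac_30b = attn_flop_fraction(30e9, 48, 7168, 8192)  # ~16% of FLOPs
```

On these numbers the 8K context adds proportionally less overhead to the 30B model than to the 7B one, matching the intuition that bigger models absorb longer sequences more cheaply.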
Challenges and advantages of long contexts in generative models
12:25 - 19:17
- Evaluation of generative models is challenging
- Short sequences tend to perform well on certain tasks, but that doesn't mean they are the best models overall
- Models currently don't support long contexts, but even if they do, it's unclear if they can effectively use them for meaningful attention across the entire document.
- Retrieval tasks benefit from longer context lengths, but the main advantage of longer contexts is being able to do attention across the document.
- Benchmarking long sequence models is challenging, as human evaluation with a 64K sequence length would require reading an entire book before performing a task.
- Repl applications for long context models can provide user feedback quickly and determine whether longer contexts help or hurt performance.
- Benchmark evaluations help distinguish between bad and good models, but may not provide clear distinctions among good models.
- Evaluations have limitations in providing insight into model selection, as different data mixes and metrics may yield similar results.
Exploring data mixes and architecture improvements in LLMs
18:48 - 25:03
- Evals can help separate a bad model from a good model, but not distinguish among good models.
- Experimenting with different data mixes for the Replit model could reveal distinctions in popularity and completion rate.
- There are improvements to be had in both data and architecture for LLMs.
- Optimizing architectures for H100s and exploring mixture of experts are exciting developments.
- Lower-precision training, FP8, and low-bit optimizers are being explored to improve training speed and memory usage.
- LLM Foundry is an open-source GitHub repo that contains all of their LLM code. It allows outsiders to see their development progress.
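As an illustration of the low-bit-optimizer idea mentioned above (a generic sketch, not MosaicML's actual implementation), block-wise absmax quantization stores a tensor as int8 codes plus one float scale per block, cutting memory roughly 4x versus fp32 state:

```python
import numpy as np

def quantize_blockwise(x: np.ndarray, block: int = 64):
    """Absmax int8 quantization per block (block size must divide len(x)):
    keep int8 codes plus one float32 scale per block."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0  # avoid divide-by-zero on all-zero blocks
    codes = np.round(xb / scale).astype(np.int8)
    return codes, scale

def dequantize_blockwise(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original tensor."""
    return (codes.astype(np.float32) * scale).reshape(-1)
```

Per-block scales keep the round-trip error bounded by half a quantization step within each block, which is why optimizer states tolerate this compression well.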
The importance of open source in AI and the challenges it faces
24:37 - 31:35
- An open source implementation of GPT-3 has been added and is being tested
- Mosaic is committed to open source and sharing their models
- Collaboration among open-source groups is important for innovation
- The future of technology relies on open-source AI efforts
- Science may slow down without the sharing culture of big labs
- Individual researchers should consider working for companies committed to open source
- The power of individuals and collaboration can drive progress in the field
- Open source alliances are crucial for the future
- The AI ecosystem has seen significant growth in contributors and repositories
The challenges and potential of open source AI models
31:11 - 38:07
- The AI open source ecosystem has grown significantly, with a large number of contributors and GitHub repositories focused on AI.
- Open source is still lagging behind in some aspects, but catching up quickly.
- Long-term investment in open source AI models is challenging, as there is a need for bigger and more unique models.
- Good human-generated instruction data sets are currently lacking in the open source community.
- Investing in larger and longer-term artifacts is necessary to achieve high-quality open source models like GPT-4.
- Money and high-quality data are major challenges in developing advanced open source models.
- Crowdsourcing diverse conversations and quality checking can help fill the gap in instruction data sets.
- Copying existing models with minor variations is not interesting or productive; more directed efforts and collaboration are needed to match GPT-3.5 capabilities across different domains.
- Releasing models that do something different and unique can lead to interesting outcomes.
The importance of creativity and experimentation in AI
37:38 - 44:23
- Challenging the team to do something different with every model release
- Emphasizing the need for creativity and different ideas in the community
- Advocating for trying new things and being willing to do weird stuff
- Importance of fine tuning as a way to experiment and be different
- Driving down the cost of experimentation at Mosaic
- Acknowledging the need for more work on data sets and their limitations
- Suggesting creating high-quality data sets as a contribution without compute budget
- Recognizing that data work is often unglamorous but crucial
- Highlighting opportunities for improvement in existing data sets
- Comparing and contrasting datasets to learn from them
Exploring and improving existing datasets for AI models
43:53 - 50:33
- Comparing and contrasting datasets like Pile Wikipedia and RedPajama Wikipedia
- The RedPajama dataset often only has the first paragraph of an article
- Pile dataset is missing information in tables and bulleted lists
- Consider recreating a Wikipedia dataset in Markdown format
- Common Crawl has interesting variations with HTML, JavaScript, etc.
- JavaScript in Common Crawl is mostly minified and was excluded from the MPT models
- Working on filtering JavaScript data for better results
- Joining Databricks to have access to better tools for working with massive datasets
- Goal at Mosaic is to produce constantly up-to-date versions of datasets like Common Crawl, Wikipedia, and GitHub
- Open sourcing the project to benefit the community and attract more customers
- Aligning Mosaic's business model with open source and open data initiatives
- Building the company around a product rather than just selling secrets or efficiency
- Driving down the cost of training models to expand the market size
- Challenging the team to make machine learning easier and cheaper for everyone
- Revolutionizing open source work on inference infrastructure for production use
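A minimal sketch of the kind of heuristic that can flag minified JavaScript for exclusion (illustrative only, not the actual MPT data pipeline): minified code packs everything into a few very long lines, so average line length is a cheap signal.

```python
def looks_minified(source: str, max_avg_line_len: float = 200.0) -> bool:
    """Cheap signal: minified JS concentrates code into a few very long
    lines, so a high average non-blank line length flags it."""
    lines = [line for line in source.splitlines() if line.strip()]
    if not lines:
        return False
    avg_len = sum(len(line) for line in lines) / len(lines)
    return avg_len > max_avg_line_len
```

Real pipelines would combine several signals (line length, identifier entropy, whitespace ratio), but even a one-line-length check removes a large share of minified bundles.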
The impact of model size and shape on inference performance
50:13 - 57:47
- Open source is revolutionizing the work being done on inference.
- The panel discusses intentional model size choices, such as the 30B model's roughly 30-33 billion parameters.
- Inference on a single GPU without low precision or other optimizations is possible with the chosen model size.
- Distillation may allow for separate architectures for training and inference.
- The choice of 3B model size was based on latency targets and ablation testing on A100s.
- The current inference latency is around 250 milliseconds P95, which is considered an amazing result.
- Testing on H100s shows improved performance compared to standard inference times.
- Model shape is becoming important in addition to model size, with wider models showing better GPU utilization.
- There is potential for even larger models that can utilize the faster GPUs effectively.
- Replit plans to open source their technology to make it more accessible and portable.
- The choice of a 3B model allows it to run locally with limited resources and be used as a code assistant chatbot by the community.
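A quick sizing check shows why a 3B model is friendly to single-GPU and local inference. The numbers below count weights only and ignore activations and the KV cache, so they are a lower bound on real memory use.

```python
def weight_memory_gb(n_params: float, bytes_per_param: float) -> float:
    """Weight-only memory footprint in GiB (ignores activations / KV cache)."""
    return n_params * bytes_per_param / 1024**3

fp16_gb = weight_memory_gb(3e9, 2)    # ~5.6 GiB: fits comfortably on one GPU
int4_gb = weight_memory_gb(3e9, 0.5)  # ~1.4 GiB: laptop-class hardware territory
```

Since autoregressive decoding is largely memory-bandwidth bound, shrinking the weights this way also tends to lower per-token latency, which is consistent with the latency-driven sizing discussed above.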
The potential of wide models and the importance of collaboration
57:18 - 59:32
- The elegance of a 3 billion parameter model that performs well across the board and is open sourceable
- Wide models can take advantage of CPUs
- Neural network pruning works well on CPUs
- There's more we can do beyond quantizing models for CPUs
- Excitement about working together with Mosaic plus Databricks
- Questions from attendees will be addressed on Twitter
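A toy sketch of the magnitude pruning mentioned above (illustrative only, not any panelist's actual method): zero out the smallest-magnitude weights so sparse CPU kernels can skip them entirely.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float = 0.9) -> np.ndarray:
    """One-shot magnitude pruning: zero the fraction `sparsity` of weights
    with the smallest absolute values."""
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value becomes the pruning threshold
    threshold = np.partition(np.abs(w).ravel(), k - 1)[k - 1]
    return np.where(np.abs(w) <= threshold, 0.0, w)
```

At high sparsity, CPU inference engines can exploit the zeros to skip work, which is part of why pruned and wide models pair well with CPUs.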