
The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

BloombergGPT - an LLM for Finance with David Rosenberg

Mon Jul 24 2023
AI, Machine Learning, Language Model, Training Process, Model Evaluation, Model Training, Contextual AI

Description

David Rosenberg, head of the Machine Learning Strategy team at Bloomberg, discusses the changes in the AI scene over the past five years. The introduction of GPT-3 three years ago was a significant moment in AI. Bloomberg decided to invest in building their own GPT-3-style model, called Bloomberg GPT. The training data set for Bloomberg GPT consisted of general-purpose training data and finance-specific data collected by Bloomberg since 2007. The finance-specific data included news, financial filings, press releases, and call transcripts. Training a large language model like Bloomberg GPT is an adventure with new ground to explore.

Insights

Training a Large Language Model

Training a large language model like Bloomberg GPT is an adventure with new ground to explore. The team at Bloomberg invested in building their own GPT-3-style model, Bloomberg GPT, whose training set consisted of general-purpose training data and finance-specific data collected since 2007.

Training Process and Optimization

The team used the training logs published for OPT and for Hugging Face's BLOOM as a roadmap for their own training process. They made adjustments to the BLOOM model architecture, such as changing the tokenizer and handling numbers differently. After encountering issues with gradient norm spikes, they fixed a bug related to weight decay in their optimizer and successfully trained the model for 42 days.

Evaluation and Performance

The team evaluated their model using validation sets and benchmark datasets. Their model was competitive with other models on general-purpose tasks and significantly better on financial benchmarks. They also experimented with tasks like BQL (Bloomberg Query Language) translation and news headline generation.

Considerations in Model Training

Data set selection, cleaning, relative weightings, vocabulary size, and tokenizer settings are important considerations in model training. Exploring smaller models can be interesting due to their ease of use and potential performance benefits. Optimizing compute budget and getting the best use out of data are ongoing areas of learning.

Training from Scratch, Fine-Tuning, and Context

Training from scratch, fine-tuning, and in-context learning play different, complementary roles. A large context window is useful for adding information that isn't in the base model. Instruction tuning conditions the model to answer specific types of questions. Multi-shot learning with a big context window is helpful for including many examples and supporting information. The model is still in the research and experimentation phase, with ongoing testing to see whether it solves existing problems and to address safety concerns.

Internal Usage and Ethical Considerations

The company is starting with internal usage of the model, focusing first on whether it is functionally useful before tackling the safety and reputational questions that broader deployment would raise. There are interesting questions about reliance on AI for internal tasks and the need for human oversight. Ethical considerations, such as offensive or biased outputs, are a concern, but the company is currently focused on finance topics and basic summarization tests.

Chapters

  1. Changes in the AI Scene
  2. Training Process and Optimization
  3. Training Tools and Evaluation
  4. Considerations in Model Training
  5. Training from Scratch, Fine-Tuning, and Context
  6. Internal Usage and Ethical Considerations

Changes in the AI Scene

00:01 - 07:05

  • David Rosenberg, head of the Machine Learning Strategy team at Bloomberg, discusses the changes in the AI scene over the past five years.
  • The introduction of GPT-3 three years ago was a significant moment in AI.
  • Bloomberg decided to invest in building their own GPT-3 style model called Bloomberg GPT.
  • The training data set for Bloomberg GPT consisted of general purpose training data and finance-specific data collected by Bloomberg since 2007.
  • The finance-specific data included news, financial filings, press releases, and call transcripts.
  • Bloomberg GPT tokenized numerical data differently to ensure consistent representation of numbers (one possible scheme is sketched after this list).
  • Training a large language model like Bloomberg GPT is an adventure with new ground to explore.
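
The episode doesn't spell out the exact number-handling scheme, but a common approach is to split every number into individual digits during pre-tokenization so that the same number is always represented the same way. The sketch below is an illustrative assumption, not Bloomberg's implementation; the split_digits helper and its regex are invented for this example.

    import re

    # Hypothetical pre-tokenization pass: break every run of digits into single
    # digits so "2023" always maps to the same token sequence, regardless of how
    # a subword tokenizer would otherwise merge it. One common scheme, not
    # necessarily the one Bloomberg used.
    _NUMBER = re.compile(r"\d+")

    def split_digits(text: str) -> str:
        """Insert spaces between consecutive digits before subword tokenization."""
        return _NUMBER.sub(lambda m: " ".join(m.group(0)), text)

    if __name__ == "__main__":
        print(split_digits("Revenue rose 12.5% to $4,310M in 2023"))
        # -> Revenue rose 1 2.5% to $4,3 1 0M in 2 0 2 3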

Training Process and Optimization

06:47 - 14:08

  • Training large models like the one discussed in this podcast is still an adventure.
  • The team used the training logs published for OPT and for Hugging Face's BLOOM as a roadmap for their own training process.
  • They copied the BLOOM model architecture with some small tweaks, such as changing the tokenizer and handling numbers differently.
  • In the first version of training, they tried curriculum learning by ordering the data sequentially in time for part of it, but it didn't work well.
  • They then randomly shuffled all the data and started version one of training, which showed improvement initially but later encountered issues with gradient norm spikes.
  • They followed a recipe from the OPT paper to address these spikes but didn't see significant improvement.
  • They discovered that there was a bug related to weight decay in their optimizer, which they fixed in version two of training.
  • They also made other adjustments involving mixed precision and added an extra layer norm at the beginning for additional protection (the general pattern is sketched after this list).
  • Version two was finally successful and ran for 42 days with steady progress before leveling off after about 75% of the dataset.
  • The team had a budget constraint towards the end and decided to stop training since performance on downstream tasks was already good enough.
  • The team consisted of nine people, with four focused on implementation work and three focused on machine learning and data aspects. The rest provided advisory support.
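
The episode doesn't give the exact details of the weight-decay bug or the placement of the added layer norm, so the PyTorch sketch below only illustrates the general pattern under stated assumptions: weight decay is applied to weight matrices but not to biases or norm parameters, and a LayerNorm is placed right after the token embedding for extra stability. The class and function names are invented for illustration, not Bloomberg's code.

    import torch
    from torch import nn

    class EmbeddingWithNorm(nn.Module):
        """Token embedding followed by an extra LayerNorm, a common trick for
        stabilizing early training (an illustrative stand-in, not Bloomberg's code)."""

        def __init__(self, vocab_size: int, hidden_size: int):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden_size)
            self.norm = nn.LayerNorm(hidden_size)

        def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
            return self.norm(self.embed(input_ids))

    def build_optimizer(model: nn.Module, lr: float = 1e-4, weight_decay: float = 0.1):
        """Apply weight decay only to weight matrices; mixing norm/bias parameters
        into the decayed group is a classic source of subtle optimizer bugs."""
        decay, no_decay = [], []
        for name, param in model.named_parameters():
            if param.ndim == 1 or name.endswith(".bias"):
                no_decay.append(param)   # LayerNorm weights/biases and biases
            else:
                decay.append(param)      # linear / embedding weight matrices
        groups = [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ]
        return torch.optim.AdamW(groups, lr=lr)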

Training Tools and Evaluation

13:42 - 21:04

  • The team consisted of members working on machine learning, data aspects, optimization, compute aspects, literature review, and evaluation.
  • They used Amazon SageMaker for training and the SageMaker Model Parallelism (SMP) library for optimization.
  • The team trained their model using PyTorch and made small tweaks to the BLOOM model source code.
  • They purchased 1.3 million GPU hours at a negotiated rate for training.
  • During training, they evaluated the model using validation sets from the last month of training data and randomly chosen sets from the training period (a minimal perplexity-style evaluation is sketched after this list).
  • They also evaluated the model on MMLU and BIG-bench Hard (BBH) tasks.
  • After training, they performed a thorough analysis of performance, comparing their model to other models on both internal and external tasks.
  • Their model was competitive with the BLOOM and OPT models on general-purpose tasks and significantly better on financial benchmarks.
  • They had internal benchmark datasets for sentiment analysis and disambiguation tasks.
  • They experimented with BQL (Bloomberg Query Language) translation task and news headline generation task.
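
As a rough illustration of the held-out evaluation described above (not Bloomberg's actual harness, and using a small public checkpoint as a placeholder since Bloomberg GPT is not publicly released), the sketch below computes average perplexity of a causal language model on a list of validation documents with Hugging Face transformers.

    import math
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    def heldout_perplexity(texts, model_name="gpt2", device="cpu"):
        """Average perplexity over a list of held-out documents.

        "gpt2" is a placeholder checkpoint for illustration only."""
        tok = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForCausalLM.from_pretrained(model_name).to(device).eval()
        losses = []
        with torch.no_grad():
            for text in texts:
                enc = tok(text, return_tensors="pt", truncation=True).to(device)
                out = model(**enc, labels=enc["input_ids"])
                losses.append(out.loss.item())
        return math.exp(sum(losses) / len(losses))

    if __name__ == "__main__":
        print(heldout_perplexity(["The Fed raised rates by 25 basis points."]))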

Considerations in Model Training

20:35 - 27:31

  • There aren't many open source models available for fine-tuning on a commercial scale.
  • The effectiveness of fine-tuning on a large dataset is still unclear compared to training from scratch.
  • Using existing models as an API may not be suitable for those who want to keep data in-house.
  • The learnings from building the current model can be applied to future model architectures.
  • Data set selection, cleaning, and relative weightings are important considerations in model training (a toy weighted-sampling sketch follows this list).
  • Vocabulary size and tokenizer settings are independent of architecture but impact model performance.
  • Exploring smaller models is interesting due to their ease of use and potential performance benefits.
  • There is a lot to learn about optimizing compute budget and getting the best use out of data.
  • The relationship between training from scratch, fine-tuning, and in-context learning is complex and requires experimentation.
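
To make the point about relative weightings concrete, here is a toy sketch of weighted sampling across several corpora; the corpus names, documents, and weights are made up for illustration and are not Bloomberg's actual data mix.

    import random

    # Illustrative corpora and sampling weights only; the real finance/general
    # mix and its proportions were decisions the Bloomberg team tuned themselves.
    CORPORA = {
        "finance_news": (["doc A", "doc B"], 0.5),
        "filings":      (["doc C"], 0.2),
        "web_text":     (["doc D", "doc E", "doc F"], 0.3),
    }

    def sample_batch(n: int, seed: int = 0):
        """Draw n documents, picking a corpus for each draw according to its weight."""
        rng = random.Random(seed)
        names = list(CORPORA)
        weights = [CORPORA[name][1] for name in names]
        batch = []
        for _ in range(n):
            corpus = rng.choices(names, weights=weights, k=1)[0]
            batch.append(rng.choice(CORPORA[corpus][0]))
        return batch

    if __name__ == "__main__":
        print(sample_batch(5))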

Training from Scratch, Fine-Tuning, and Context

27:03 - 34:18

  • Training from scratch, fine-tuning, and in-context learning play different roles.
  • A large context window is useful for supplying information that isn't in the base model.
  • Instruction tuning conditions the model to answer specific types of questions.
  • Multi-shot learning with a big context window is helpful for including many examples and information.
  • A large context window is necessary for interacting with and reasoning about new documents.
  • FinPile data helps with few-shot learning, potentially reducing the need for a larger window.
  • Instruction tuning process uses a combination of publicly available datasets and internal data from Bloomberg.
  • Internal data includes annotation tasks that are reformatted into queries and responses for instruction tuning (a hypothetical example of this reformatting is sketched after this list).
  • No custom instruction tuning dataset development has been done yet, but it's a possible direction.
  • Limitations and challenges include time constraints during model building, which led to skipping important steps like experimentation at a smaller scale.
  • Going back to step one with more resources is the next plan to address these issues.
  • The model is still in research and experimentation phase, not in production yet.
  • Testing if the model can solve existing problems better or help with new use cases is ongoing.
  • Safety concerns around the hallucination problem need to be addressed before the model can be used in production.
  • Starting with internal usage to evaluate usefulness before considering broader deployment.
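
As a hypothetical illustration of reformatting an internal annotation record into a query/response pair for instruction tuning, the sketch below uses an invented record schema and prompt template; Bloomberg's actual internal format is not public.

    # Invented annotation record and prompt template for instruction tuning.
    def to_instruction_pair(record: dict) -> dict:
        prompt = (
            "Classify the sentiment of the following financial headline "
            "as positive, negative, or neutral.\n\n"
            f"Headline: {record['text']}\nAnswer:"
        )
        return {"prompt": prompt, "response": record["label"]}

    if __name__ == "__main__":
        example = {"text": "Acme Corp beats Q2 earnings estimates", "label": "positive"}
        print(to_instruction_pair(example))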

Internal Usage and Ethical Considerations

33:51 - 36:17

  • The company is starting with internal usage, focusing first on whether the model is functionally useful before tackling broader safety and reputational questions.
  • There are interesting questions about reliance on AI for internal tasks and the need for human oversight.
  • Ethical considerations, such as offensive or biased outputs, are a concern, but the company is currently focused on finance topics and basic summarization tests.
  • The guest's project is intriguing and it was great to learn about how they approached it.