Replit AI Podcast

02: Unleashing LLMs in Production: Challenges and Opportunities with Chip Huyen

Wed May 24 2023

AI panels, language models, scaling up, real-time machine learning, injecting vectors, reasoning capabilities, applications of language models, evaluating language models, model performance, open source versus API

Description

The episode is part of a series of AI panels at Replit discussing the challenges and opportunities of putting large language models (LLMs) into production. It explores topics such as scaling up language models, real-time machine learning, injecting vectors into language models, improving reasoning capabilities, applications of language models, and evaluating language models and their performance. Key insights include the importance of engineering work with LLMs, the accessibility of language models, the trade-offs between retraining and adapting models, and the considerations for evaluating model performance. The episode concludes with an invitation to continue the conversation and try LLMs at Replit.

Insights

Engineering work is crucial for working with LLMs

Working with large language models (LLMs) requires a lot of engineering work, not just LLM optimization. Building bespoke LLMs from scratch is challenging and expensive for most companies.

Scaling up language models has become more accessible

The accessibility of language models has improved significantly compared to a year ago. Startups can now scale up their language models and achieve better performance.

Real-time machine learning enables dynamic applications

Real-time machine learning is important for making predictions and leveraging fresh data. It has applications in fraud detection, dynamic pricing, and recommender systems.

Injecting vectors improves language model performance

Injecting vectors into the middle of the feed-forward flow can steer language models (LMs) and improve their performance. This approach updates only the activations of the model, not the weights.

Improving reasoning capabilities in language models

Developing language models with larger context windows can improve their reasoning capabilities. Retrieval models and efficient context-based learning are promising techniques for large-scale transformers.

Applications of language models in various domains

Language models have applications in recommendation systems, fraud detection, and workflow experiences for data scientists. Building bespoke models allows for optimization and customization according to specific use cases.

Evaluating language models requires careful consideration

Evaluating language models involves benchmarking, human evaluation, and A/B testing. It is important to consider real-life use cases and independent evaluation to avoid overfitting to benchmarks.

Considerations for evaluating model performance

Model performance should be evaluated through user testing and comparing responses from different chatbot platforms. Factors like latency, cost, and engineering complexity should be taken into account when building the best product.

Open source versus API in the AI space

The discussion touched on the advantages and disadvantages of open source, bespoke models, and API-based models. Issues like data ownership and inference services need to be considered.

Continuing the conversation and trying LLMs at Replit

The speakers are open to continuing the conversation on Twitter and invite listeners to visit their office and build a community in Silicon Valley. Users can register for Replit Pro to try their LLMs.

Summary

Transcript

AI Panels at Replit

00:00 - 07:08

The podcast is the second installment of a series of AI panels at Replit.
The panelists discuss the challenges and opportunities of putting large language models (LLMs) into production.
Chip Huyen shares her personal experience with deploying LLMs and the challenges she faced.
She emphasizes that working with LLMs requires a lot of engineering work, not just LLM optimization.
Michele from the AI team at Replit mentions their recent release of a bespoke model and the engineering work involved in creating and deploying it.
It took them about a month to build the model, including tasks like data pipelines, vocabulary creation, and ablation studies.

Scaling Up Language Models

06:46 - 14:07

The team initially worked on data pipelines, vocabulary, and ablation studies before deciding to scale up the project.
They trained a code model with 2.7 billion parameters in just three days on 256 GPUs.
The accessibility of language models has improved significantly compared to a year ago.
The team trained smaller models to test their performance before running the larger model.
The results of the larger model exceeded expectations, leading to more ambitious plans for future models.
Engineering at startups often involves taking risks and embracing uncertainty.
Building bespoke language models from scratch is still challenging and expensive for most companies.
Working with MosaicML was like expanding the engineering team tenfold.
Advanced language models have become far more accessible in a short period of time.

Real-Time Machine Learning

13:43 - 21:08

Real-time machine learning is important for making predictions and leveraging fresh data.
Use cases for real-time machine learning include fraud detection, dynamic pricing, and recommender systems.
Updating models in real-time involves leveraging fresh information to update the model weights.
Continual learning is the process of using this information to update the model.
Incorporating context into the model itself yields better performance compared to incorporating it through prompts.
Retraining models from scratch periodically is a common approach, but there are ongoing research efforts to find more efficient solutions.
Fine-tuning and adapter approaches are being explored as alternatives to retraining the entire model.
Efficient retrieval and context-based learning are promising techniques for large-scale transformers, but cost constraints need to be considered.
The approach chosen depends on efficiency, performance, and cost trade-offs.
Different stages of model development offer varying levels of performance at different costs.

Injecting Vectors into Language Models

20:44 - 28:45

The approaches to updating a model can be compared to layers of a pyramid: the higher the layer, the more performance it offers and the more it costs.
In software, there is an analogy where you don't want to change the kernel or ship new application layer software all the time, but you do want to update the data frequently.
There is a technique called vector additions that injects vectors into the middle of the feed-forward flow to steer language models (LMs).
This approach updates only the activations of the model, not the weights.
The concept is similar to prompt tuning and adapters in terms of injecting certain vectors or weights at specific layers during training.
Finding a trade-off between retraining models and adapting them with advanced techniques is crucial due to cost considerations.
Longer token lengths in context may affect a model's ability to remember information from the beginning and process efficiently.
An experiment showed that retrieval models perform similarly to GPT models with longer context windows, suggesting possible retrieval mechanisms at play.
Personal experiments indicate that including more context improves responses but may require breaking prompts into smaller tasks for better understanding.
Adding better reasoning and guiding models towards correct reasoning seems promising for improving performance.
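
The vector-injection idea mentioned above can be illustrated with a minimal sketch. It uses a toy two-layer network in place of a real language model; the layer sizes, the `forward` function, and the way the steering vector is built (a difference of hidden activations for two contrasting inputs) are all assumptions made for illustration, not the method discussed on the panel. The point it shows is that the vector is added to the hidden activations while the weights stay untouched.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    """Random weight matrix; stands in for trained weights (hypothetical)."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(x, W):
    """Multiply a row vector x by a matrix W (len(x) rows)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


W1 = rand_matrix(4, 8)  # input -> hidden
W2 = rand_matrix(8, 3)  # hidden -> output logits


def hidden_activations(x):
    return [math.tanh(v) for v in matvec(x, W1)]


def forward(x, steering_vector=None, scale=1.0):
    """Forward pass that optionally adds a vector to the hidden activations."""
    hidden = hidden_activations(x)
    if steering_vector is not None:
        # Only the activations change here; W1 and W2 are never modified.
        hidden = [h + scale * s for h, s in zip(hidden, steering_vector)]
    return matvec(hidden, W2)


# Assumed recipe: steering vector = activation difference of two contrasting inputs.
x_pos = [1.0, 0.0, 0.0, 0.0]
x_neg = [0.0, 1.0, 0.0, 0.0]
steer = [p - n for p, n in zip(hidden_activations(x_pos), hidden_activations(x_neg))]

W1_before = [row[:] for row in W1]
x = [0.0, 0.0, 1.0, 0.0]
base_logits = forward(x)
steered_logits = forward(x, steering_vector=steer, scale=2.0)
```

Because nothing is trained, the same steering vector can be switched on or off per request, which is part of why this kind of approach is cheaper than retraining.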

Improving Reasoning Capabilities

28:18 - 35:39

Models with larger context windows are being developed to improve reasoning capabilities.
Long documents can be processed by breaking them into smaller parts and generating embeddings for each part.
Large language models today are not conditioned on joint embeddings, but rather on the embeddings of the raw text.
Companies are using retrieval techniques to analyze long documents and generate responses based on related chunks.
Context length is a practical and theoretical constraint in natural language processing.
Claypot offers a solution that combines fuzzy signals from embeddings with more exact signals for efficient analysis.
Combining generation and classification techniques can enhance the analysis of different types of signals.
Embeddings can be thought of as a way to summarize longer histories for scalability.
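
The chunk-and-retrieve pattern described above can be sketched roughly as follows. The `embed` function here is a toy bag-of-words stand-in, an assumption chosen to keep the example self-contained; a real system would call an embedding model instead. The chunk size, the sample document, and the prompt layout are likewise hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=8):
    """Split a long document into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text):
    """Toy embedding: word counts. Stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=1):
    """Return the top_k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:top_k]


document = (
    "Fraud detection relies on fresh transaction data. "
    "Dynamic pricing adjusts prices as demand changes. "
    "Recommender systems rank items using recent user activity."
)
question = "how does pricing react to demand"
chunks = chunk_text(document)
top = retrieve(question, chunks, top_k=1)

# Only the related chunk is placed in the prompt, not the whole document.
prompt = "Context:\n" + "\n".join(top) + "\n\nQuestion: " + question
```

Only the retrieved chunks count against the context window, which is how this pattern works around the context-length constraint.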

Applications of Language Models

35:14 - 43:07

Embeddings can be used to keep long-term information about someone or an item, while recent activities can be used as short-term interest for recommendation systems.
Fraud detection can use transaction data as unstructured data and feed it into a model to detect fraud based on user behavior and transaction characteristics.
Claypot is a service that focuses on the workflow experience for data scientists, allowing them to leverage different data sources and signals with varying latency requirements.
Claypot does not build its own models but allows users to generate signals using language models (LMs) either on-premises or through APIs.
Using APIs extensively has pros such as availability out of the box, but there may be pain points related to pricing and other factors.
Building a bespoke model allows for optimization and customization according to specific use cases, providing better performance and low latency.
There are two visions in the AI space: one where mega models learn everything and provide high-quality tokens regardless of input, and another where domain-specific LMs are engineered into products with different capabilities.
Latency is an important consideration in determining when users are willing to wait longer versus getting immediate results.
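
The recommendation idea from this segment, a long-term embedding combined with short-term interest from recent activity, might look something like the sketch below. The two-dimensional item vectors, the item names, and the `recency_weight` parameter are invented for illustration; the panel did not describe a specific blending formula.

```python
def mean_vector(vectors):
    """Average a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def blend(long_term, recent_item_vectors, recency_weight=0.5):
    """Mix the long-term profile with the mean of recently viewed items."""
    short_term = mean_vector(recent_item_vectors)
    return [
        (1 - recency_weight) * lt + recency_weight * st
        for lt, st in zip(long_term, short_term)
    ]


def rank(profile, item_vectors):
    """Order item names by dot-product score against the profile."""
    def score(name):
        return sum(p * v for p, v in zip(profile, item_vectors[name]))
    return sorted(item_vectors, key=score, reverse=True)


# Hypothetical item embeddings: dimension 0 ~ cooking, dimension 1 ~ outdoors.
item_vectors = {
    "cookbook": [1.0, 0.0],
    "hiking_boots": [0.0, 1.0],
    "camp_stove": [0.6, 0.6],
}
long_term = [0.9, 0.1]                    # history: mostly cooking
recent = [item_vectors["hiking_boots"]]   # this session: outdoors

ranked_recent_heavy = rank(blend(long_term, recent, recency_weight=0.8), item_vectors)
ranked_history_only = rank(blend(long_term, recent, recency_weight=0.0), item_vectors)
```

Raising `recency_weight` lets the current session dominate, while lowering it falls back on the summarized history.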

Evaluating Language Models

42:37 - 50:12

Engineering is required to make a product work with different domain-specific LMs.
Latency is an important consideration for user experience.
Complex products can benefit from using 5 to 50 LMs.
Domain-specific training is often superior for most use cases.
Open source, bespoke models, and API-based models were debated in terms of their advantages and disadvantages.
The fine-tuned Replit model performs similarly to the StarCoder model but is cheaper, faster, and more portable.
Evaluation involves testing on HumanEval, a benchmark released by OpenAI alongside Codex, which consists of natural language prompts and Python code generation tasks.
Benchmarking in the industry lags behind real-life use cases and requires independent evaluation.
A/B tests showed that users received more suggestions from the new model compared to previous models.
Gaming benchmarks has become a concern, for example through prompting tuned for HumanEval.
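
Benchmarks of the kind described above score a model by whether its generated code passes unit tests. The sketch below shows that idea in miniature; the problem, the candidate completions, and the `passes_tests` helper are made up for illustration and are not taken from the benchmark. A real harness would also run candidates in a sandbox with timeouts, which this sketch omits.

```python
def passes_tests(candidate_source, test_source):
    """Return True if the candidate runs and every assertion in the tests holds."""
    namespace = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # run the assertions against it
    except Exception:
        return False
    return True


# Hypothetical problem in the style of a prompt plus hidden unit tests.
problem = {
    "prompt": "Write a function add(a, b) that returns the sum of a and b.",
    "test": "assert add(2, 3) == 5\nassert add(-1, 1) == 0",
}

# Hypothetical model outputs for the prompt.
candidates = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
    "def add(a, b):\n    return sum([a, b])",
]

results = [passes_tests(c, problem["test"]) for c in candidates]
pass_rate = sum(results) / len(results)
```

A score like `pass_rate` only reflects the problems in the benchmark, which is why the panel pairs it with A/B tests on real usage.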

Evaluating Model Performance

50:05 - 57:39

A benchmark starts out useful and over time becomes almost a counter-signal.
Performing really high on a benchmark can mean overfitting to that benchmark and not performing well in real-world usage.
Shipping something and testing it with users is the best way to evaluate its performance.
User testing is a focus for the podcast hosts.
Evaluating chatbots is challenging, but opening tabs with different chatbot platforms and comparing their responses is one way to do it.
Building something more powerful than existing models would be necessary for better evaluation.
Different communities have different approaches to testing philosophy.
Taking questions from the community is the next topic of discussion.
Hallucination is an important topic in natural language processing panels in the Netherlands.
The conversation shifts back to discussing general models versus domain-specific fine-tuning.
Different data mixtures yield different results in pre-training models.
There will likely be many pre-trained models with different strengths and weaknesses.
Fine-tuning models like GPT-3 can still be done for specific tasks or domains.
Building the best product depends on factors like latency and cost, which may favor certain approaches over others.
Engineering costs and maintenance costs are considerations when adding additional layers or classifiers for preprocessing data or routing queries.
Retrieval systems already use similar techniques of embedding queries, running similarities against an index, and post-processing results.
Companies face many decisions when embarking on NLP projects, such as starting from scratch or using base models, choosing base models, and deciding how much fine-tuning is needed for different domains or tasks.
One speaker acknowledges the importance of companies providing tooling and services in this space to simplify these decisions for developers.
Open source versus API is an interesting discussion, considering issues like data ownership and inference services.
Further in-depth discussions are desired to explore topics like open source versus API and data ownership.
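
The query-routing layer mentioned in this segment could be as simple as the sketch below, which sends each query to a domain-specific model based on keyword overlap. The route names and keyword sets are hypothetical, and a production router would more likely be a trained classifier; the sketch is only meant to make the extra engineering layer concrete.

```python
# Hypothetical routes: each name stands for a domain-specific model.
ROUTES = {
    "code": {"python", "function", "bug", "compile"},
    "support": {"refund", "invoice", "billing", "account"},
}
DEFAULT_ROUTE = "general"


def route(query):
    """Pick the route whose keywords overlap most with the query."""
    tokens = set(query.lower().split())
    scores = {name: len(tokens & keywords) for name, keywords in ROUTES.items()}
    best = max(scores, key=scores.get)
    # Fall back to a general model when no keyword matches.
    return best if scores[best] > 0 else DEFAULT_ROUTE


code_route = route("Why does this python function fail?")
support_route = route("I need a refund for my invoice")
fallback_route = route("Tell me a joke")
```

Each added route is another component to maintain, which is the engineering and maintenance cost the speakers point to.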

Closing Remarks

57:12 - 58:44

Open source versus API was not discussed much, but it raises issues like data ownership and inference service.
There are companies emerging to do evaluations of models.
The speakers are open to continuing the conversation and answering questions on Twitter.
They invite listeners to visit their office and build a community in Silicon Valley.
To try their LLMs at Replit, users can register for Replit Pro.