The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
Are Emergent Behaviors in LLMs an Illusion? with Sanmi Koyejo
Sanmi Koyejo, an assistant professor at Stanford University, discusses trustworthy AI systems, emergent properties in large language models, metric choice and model performance, evaluating trustworthiness in language models, challenges in evaluating GPT models, the applicability of evaluation tests to real-world problems, and building relevant evaluation tests.
Emergence and Metric Choice
The choice of metric plays a significant role in determining emergent behavior in models: metrics that give all-or-nothing credit for correct answers produce apparent emergence, while partial credit plays an important role in evaluating models for understanding and calibration.
Predicting Sequences and Properties
The association between model size and the probability of getting the next token right is well established. By multiplying this probability across the tokens of a sequence, we can predict the likelihood of getting multiple tokens in a row correct.
Evaluating black-box models presents challenges due to the lack of public information about the specific model being used and its constant changes and tuning.
Applicability of Evaluation Tests
The speaker questions whether relying on a benchmark or creating custom benchmarks is more suitable for solving real problems. They emphasize the need for better coverage and tailored evaluation mechanisms for domain-specific concerns.
Building Relevant Evaluation Tests
The speaker believes that tests are being built for practice, but there is still a gap in understanding what practice means and how to build relevant tests. They suggest exploring personalization and specialization of tests based on individual values and how people interact with technology.
- Emergent Properties in Large Language Models
- Predicting Sequences and Properties in Models
- Metric Choice and Model Performance
- Estimating Model Performance and Evaluation
- Defining and Evaluating Trustworthiness in Language Models
- Evaluation of GPT Models and Challenges
- Applicability of Evaluation Tests to Real-World Problems
- Building Relevant Evaluation Tests
00:03 - 08:02
- Sanmi Koyejo, an assistant professor at Stanford University, is a guest on the TWIML AI Podcast. Sanmi recently moved to the west coast and is enjoying the environment, colleagues, students, and weather. His research agenda focuses on trustworthy AI systems, including measurement, assessment, and mitigation. Recently, his work has centered on language models and GPT (Generative Pre-trained Transformer) models. Initially more focused on vision in AI, he became fascinated with language due to its growing importance in the field. The goal of his lab is to contribute foundational knowledge that applies broadly beyond specific modalities or applications; finding a balance between going deep into specific areas and extracting broad themes is important for their work.
Emergent Properties in Large Language Models
07:49 - 15:49
- One of the papers discussed during the podcast challenges the idea of emergent properties in large language models (LLMs). The paper explores how LLMs perform poorly on certain tasks at small scales but improve as the model size increases. Specifically, arithmetic tasks were examined as they are tied to reasoning abilities in models.
- Models that can do reasoning have the potential to carry out more consequential tasks, but there are concerns about unpredictable outcomes. The definition of emergence in this context is still being debated and varies between precise and colloquial usage; emergence is characterized by sharp changes in capability and by unpredictability. Metrics used to measure emergence can be of limited value and may not correlate with what is meaningful in the real world. The choice of metric plays a significant role in determining apparent emergent behavior: metrics that give all-or-nothing credit for correct answers contribute to emergence. Arithmetic is an example where the answer is scored entirely right or entirely wrong, but partial credit should also be considered. Scale is required for arithmetic, but it does not necessarily indicate a magical emergent property, and the sharpness and unpredictability attributed to emergence can be argued against. Partial credit plays an important role in evaluating models for understanding and calibration.
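The all-or-nothing versus partial-credit distinction can be made concrete with a tiny example; the prediction and target strings below are made up for illustration:

```python
# Compare an all-or-nothing metric with a partial-credit metric on an
# arithmetic answer (illustrative strings, not real model output).
prediction, target = "123458", "123456"

# All-or-nothing: full credit only if the strings are identical.
exact_match = float(prediction == target)

# Partial credit: fraction of digit positions that match.
digit_acc = sum(a == b for a, b in zip(prediction, target)) / len(target)

print(exact_match)  # 0.0: one wrong digit zeroes the score
print(digit_acc)    # ~0.83: yet most digits are right
```

Under the exact-match metric a model that is nearly right scores identically to one that is completely wrong, which is the mechanism behind the apparent sharpness discussed above.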
Predicting Sequences and Properties in Models
15:26 - 22:34
- Partial credit plays a big role in evaluating models and in seeing their gradual evolution. A simple model of language models as next-token predictors can capture qualitative properties of metrics. The association between model size and the probability of getting the next token right is well established. By multiplying this probability across the tokens of a sequence, we can predict the likelihood of getting multiple tokens in a row correct, and this prediction aligns with observed data on arithmetic problems: the emergence curve for overall accuracy resembles the curve predicted by the simple model. The claimed lack of predictability may therefore not be a strong one, since the simple model can predict behavior for metrics tied to getting everything right.
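The multiplication argument can be sketched in a few lines; the per-token probabilities and sequence length below are illustrative values, not measurements:

```python
import numpy as np

# Toy next-token model from the discussion: each token is predicted
# correctly with probability p, and p improves smoothly with model scale.
per_token_p = np.array([0.5, 0.7, 0.85, 0.95, 0.99])
seq_len = 15  # tokens needed for a full arithmetic answer (assumed)

# All-or-nothing accuracy: every token must be right, so the per-token
# probabilities multiply, i.e. p ** seq_len.
exact_match = per_token_p ** seq_len

for p, em in zip(per_token_p, exact_match):
    print(f"per-token p={p:.2f} -> full-sequence accuracy={em:.6f}")
```

A smooth, gradual improvement in per-token accuracy produces a full-sequence accuracy that sits near zero for most scales and then rises steeply, resembling the reported emergence curves.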
Metric Choice and Model Performance
22:05 - 29:37
- The podcast discusses the importance of metric choice in predicting sequences and properties in models. An experiment was conducted using autoencoders, which are models used for data compression and denoising. The experiment replaced the usual continuous metric with one that gives full credit for close reconstructions and zero credit for distant ones, and showed that this choice of metric significantly affects apparent model performance. Another experiment based on the simple model suggested that even when models are not performing well, there should still be some signal of improvement, which underscores the need for appropriate sampling techniques to capture small values accurately. The discussed claims are based on a toy, representative measurement model designed to capture qualitative properties at scale.
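A minimal sketch of the thresholded-metric effect, assuming normally distributed reconstruction errors whose mean shrinks smoothly with scale (all numbers below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mean reconstruction errors for autoencoders of increasing
# size; the underlying continuous error shrinks smoothly with scale.
mean_errors = [1.0, 0.6, 0.35, 0.2, 0.1]
threshold = 0.25  # "close enough" cutoff for the discretized metric

for mean_err in mean_errors:
    # Simulate per-example reconstruction errors around the mean.
    errors = rng.normal(loc=mean_err, scale=0.05, size=10_000)
    continuous_score = errors.mean()                 # smooth metric
    thresholded_score = (errors < threshold).mean()  # all-or-nothing metric
    print(f"mean error {mean_err:.2f}: continuous={continuous_score:.3f}, "
          f"thresholded accuracy={thresholded_score:.3f}")
```

The continuous metric improves gradually at every step, while the thresholded metric stays near zero and then jumps sharply once the error distribution crosses the cutoff, mirroring the experiment described above.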
Estimating Model Performance and Evaluation
29:19 - 37:13
- The probability of small models answering correctly is low, so sampling more often is necessary to get a good estimate. Small models with more frequent sampling show a transition point after which performance increases monotonically, and the theoretically predicted curves in the paper match reality well. The scaling laws allow estimating the number of parameters from the probability value. The simple model used in the paper may not fully explain more complex tasks. Data fit seems to predict task performance in large-scale models. Metric choice affects model behavior and the inferences made about it; researchers should acknowledge the effect metric choice has on their claims and interpret the strength of those claims accordingly. Careful interpretation is needed when choosing metrics for real-world applications. Emergence may be a fuzzy, metric-specific label rather than something fundamental.
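The sampling point can be illustrated with a quick simulation; the success rate and sample sizes below are made-up values:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.002  # hypothetical success rate of a small model on a hard task
runs = 1_000    # repeated evaluation runs

for n_samples in [100, 1_000, 100_000]:
    # Number of successes observed in each run of n_samples attempts.
    counts = rng.binomial(n_samples, p_true, size=runs)
    zero_frac = (counts == 0).mean()  # runs that observe no success at all
    print(f"n={n_samples:>7}: {zero_frac:.0%} of runs see zero successes")
```

With too few samples, most evaluation runs of a weak model report exactly zero, which looks like "no ability" rather than a small but nonzero one; sampling on the order of 1/p attempts or more recovers the signal of gradual improvement.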
Defining and Evaluating Trustworthiness in Language Models
36:56 - 44:55
- Measurement specificity is important, and overgeneralization should be avoided. Whether emergence is fundamental to models remains a topic of debate, and context-free claims can be concerning, even dangerous, when explaining models to the general public or informing policy. The "DecodingTrust" paper aims to define and evaluate trustworthiness in language models. Eight perspectives were considered: toxicity, stereotype bias, adversarial robustness, out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness. Evaluation metrics were designed to assess model behavior along each of these aspects of trustworthiness, and a toolbox was created for easy evaluation of model trustworthiness, with plans for further development. The contributions of the work include consolidating existing measurements, introducing new ones, and providing a software package for evaluation. GPT models were primarily evaluated and showed overall improvement across the evaluations.
Evaluation of GPT Models and Challenges
44:35 - 52:37
- GPT-3.5 and GPT-4 were evaluated and showed improvements overall. There is a tension in building general-purpose models that follow human instructions, particularly around ethical considerations: GPT-4 looked worse along some dimensions precisely because it followed instructions better, which creates a conundrum. The evaluation showed that GPT-4 follows instructions better than GPT-3.5, though there are limits to what it will respond to, and despite tuning that makes it decline some instructions, it still follows instructions better than GPT-3.5 overall. Evaluating black-box models presents challenges due to the lack of public information about the specific model being served and its constant changes and tuning. Establishing a "GPT weather report" to track performance over time could be done, but there may be variation due to potential A/B tests being run behind the scenes; tracking metrics across a sampled population of queries would provide more comprehensive insight into model performance. Running the full evaluation suite requires fewer resources than model inference or training. The applicability of the benchmark's tasks and metrics to real-world problems is questioned, and the findings have not yet been replicated in other areas of research.
Applicability of Evaluation Tests to Real-World Problems
52:19 - 59:56
- The speaker discusses the applicability of a benchmark's tasks and metrics to real-world problems, questioning whether relying on a standard benchmark or creating custom benchmarks is more suitable for solving real problems. They reflect on what level of generality is appropriate for evaluating models and raise the coverage problem: the tests they come up with are relevant for real-world use, but better coverage is needed. Conversations with healthcare professionals highlight that domain-specific concerns may not be covered by general-purpose evaluations; harm is a specific concern in healthcare and education that requires evaluation mechanisms tailored to those contexts. The speaker is optimistic about building a large suite of evaluation tests from which subsets relevant to specific domains or contexts can be selected, and suggests exploring personalization and specialization of tests based on individual values and how people interact with technology.
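The idea of a large suite with domain-specific subset selection could be sketched as a simple tagged registry; the test names and concern tags below are entirely hypothetical:

```python
# Hypothetical registry of evaluation tests tagged by the concerns they probe.
EVAL_TESTS = [
    {"name": "toxicity_prompts", "concerns": {"toxicity", "safety"}},
    {"name": "privacy_leakage", "concerns": {"privacy"}},
    {"name": "clinical_harm", "concerns": {"harm", "healthcare"}},
    {"name": "stereotype_bias", "concerns": {"fairness"}},
]

def select_tests(domain_concerns):
    """Pick the subset of tests relevant to a deployment context."""
    return [t["name"] for t in EVAL_TESTS if t["concerns"] & domain_concerns]

# A healthcare deployment might prioritize harm and privacy:
print(select_tests({"harm", "privacy", "healthcare"}))
# -> ['privacy_leakage', 'clinical_harm']
```

The point is only structural: a shared suite can stay general while each domain runs the intersection that matches its own concerns.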
Building Relevant Evaluation Tests
59:37 - 1:04:56
- The speaker believes that tests are being built for practice, but there is still a gap in understanding what practice means and how to build relevant tests. The speaker is hopeful that evaluation suites can be extendable and scalable. There is a question about whether the same metrics approach can be applied to different architectures, such as bare LLM versus retrieval augmented scenario. LLMs are considered systems rather than just models because they often have additional components like tool use. It is suggested that separate evaluations of tools could be interesting, allowing for mixing and matching. Evaluations of rag tools can be done independently using existing evaluations of embedding quality. Decoding trust and similar evaluations only focus on input and output from the test's point of view, which may lead to coverage gaps. Understanding training data and data contamination are meaningful gaps in evaluation and testing. There is a concern about models being trained on the test data, which creates incentives for biased results. Examining weights in black box evaluations can provide insights into system behavior, but caveats apply due to lack of transparency in training methods.