
Latent Space: The AI Engineer Podcast — Practitioners talking LLMs, CodeGen, Agents, Multimodality, AI UX, GPU Infra and all things Software 3.0

[Cognitive Revolution] The Tiny Model Revolution with Ronen Eldan and Yuanzhi Li of Microsoft Research

Sat Jul 01 2023
AI engineer, Tiny Stories, GPT-4, language models, reasoning, interpretability
  1. Latent Space Blog Post and Conference on the Rise of the AI Engineer
  2. Tiny Stories Dataset and Models
  3. Creating the Tiny Stories Dataset
  4. Generating Stories with GPT-3.5 and GPT-4 Models
  5. Reasoning Abilities in Language Models
  6. Reasoning and Grammar in Language Models
  7. Consistency and Grammar in Language Models
  8. Size and Coherence of Language Models
  9. Emergence of Reasoning in Language Models
  10. Balancing Breadth and Depth in Language Models
  11. Language Models vs. Human Language Production
  12. Attention Mechanisms and Reasoning in Language Models
  13. Interpreting Attention Mechanisms in Language Models
  14. Interpreting Neurons in Language Models
  15. Interpretability of Neural Networks
  16. Challenges in Interpreting Neural Networks

Latent Space launched a blog post and conference on the rise of the AI Engineer. Ronen Eldan and Yuanzhi Li of Microsoft Research created a small natural language dataset called Tiny Stories, using GPT-4 to systematically generate 1 million children's stories. The dataset aims to keep the authenticity of natural language while reducing overall complexity. GPT-4 is better at generating diverse and creative stories compared to GPT-3.5. The RLHF version of the GPT-4 model is good at following instructions and combining words into a story. Reasoning capability is a rare skill that language models need to learn. One auto-completion example asks for the most common noun seen so far, contrasting 'dog' with 'cat'. Increasing the size of the model can lead to new capabilities. There is a trade-off between the breadth and depth of the dataset or model capabilities. Children have different incentives than language models when producing language. Instructions in language models need multiple iterations of attention to percolate into the tokens of a story. Neurons in a neural network can be thought of as coordinates in a vector space. Interpretability appears as a side effect of small models aligning with meaningful tasks. The theory behind the interpretability of neural networks is still in its early stages.

Latent Space Blog Post and Conference on the Rise of the AI Engineer

00:00 - 07:34

  • Latent Space launched a blog post and conference on the rise of the AI Engineer

Tiny Stories Dataset and Models

07:11 - 15:05

  • Ronen Eldan and Yuanzhi Li of Microsoft Research created a small natural language dataset called Tiny Stories, using GPT-4 to systematically generate 1 million children's stories
  • They trained a series of models ranging in size from 1 million to 33 million parameters to explore language model performance and behavior
  • The models developed reasoning abilities, including understanding grammar, learning facts, and acquiring logical micro-skills like negation and exclusion
  • Distinct attention heads were identified, including distance heads that reflect token distance and semantic attention heads that focus on meaning
  • Individual neurons in small models corresponded to human-interpretable concepts
  • Small models are more interpretable than large models due to their size
  • Curriculum learning approaches have potential for creating specialized, small-scale models that efficiently solve specific problems
  • Playing around with the larger Tiny Stories models on the Hugging Face website can deepen understanding of language model concepts and reasoning ability (a minimal loading sketch follows this list)
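
A minimal sketch for trying one of the released checkpoints yourself, assuming the public roneneldan/TinyStories-33M repo on Hugging Face and the GPT-Neo tokenizer those checkpoints are usually paired with; adjust the names if they differ.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")        # assumed tokenizer pairing
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")  # assumed repo name

prompt = "Once upon a time, a little girl named Lily found a shiny red ball."
inputs = tokenizer(prompt, return_tensors="pt")
# Sample a continuation in the style of the training stories
output = model.generate(**inputs, max_new_tokens=100, do_sample=True, top_p=0.95)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```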

Creating the Tiny Stories Dataset

14:39 - 23:06

  • The main goal is to create a smaller dataset and model that can still provide insights about large language models (LLMs)
  • There have been attempts to create synthetic or non-synthetic datasets that reflect certain aspects of language, but none that integrate all dimensions together in a manageable size
  • The motivation behind creating the tiny stories dataset was to bring back the fast experiment iterations of the past, like with CIFAR for vision models
  • The dataset aims to keep the authenticity of natural language while reducing overall complexity
  • The tiny stories dataset consists of structured stories with key elements of grammar, facts, and reasoning
  • The stories are designed to be understandable by small children and cover a diverse range of knowledge
  • Creating a structured story allows for the combination of grammar, facts, reasoning, and other elements within each story

Generating Stories with GPT-3.5 and GPT-4 Models

22:37 - 30:59

  • Creating a structured dataset of stories using GPT-3.5 and GPT-4 models
  • GPT-4 is better at generating diverse and creative stories compared to GPT-3.5
  • Getting a diverse dataset is challenging as repetitive stories are common
  • A list of simple words was used to generate creative stories from random combinations of those words (see the prompt-construction sketch after this list)
  • Around 1.5 million stories were created using this method
  • GPT-4 can combine three given words fluently in a story, while GPT-3.5 sometimes struggles
  • The repetitive stories about children scared of slides may be due to the model generating the most likely stories without any conditions
  • Access to different versions of GPT-4 with varying safety features was available
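
A hedged sketch of how such random word-combination prompts can be assembled. The word lists and the prompt wording below are illustrative placeholders, not the actual Tiny Stories vocabulary or prompt.

```python
import random

# Illustrative word lists only; the real Tiny Stories vocabulary is not reproduced here.
nouns      = ["ball", "needle", "boat", "slide", "flower"]
verbs      = ["jump", "share", "hide", "find", "cry"]
adjectives = ["shiny", "scared", "tiny", "kind", "loud"]

def make_prompt() -> str:
    noun, verb, adj = random.choice(nouns), random.choice(verbs), random.choice(adjectives)
    return (
        "Write a short story that a 3-year-old could understand. "
        f"The story should use the verb '{verb}', the noun '{noun}' and the adjective '{adj}'."
    )

print(make_prompt())  # each call picks one of len(nouns) * len(verbs) * len(adjectives) combinations
```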

Reasoning Abilities in Language Models

30:34 - 38:09

  • The RLHF version of the GPT-4 model is good at following instructions and combining given words into a story
  • The base GPT-4 model struggles with understanding interactions and requires fine-tuning
  • Over-indexing on examples in a few-shot approach can lead to issues
  • Adding features like plot twists, bad endings, and dialogue adds diversity to the stories
  • The space of possibility for word combinations is about a thousand times larger than the actual data set used to train the models
  • Generating a million stories with GPT-4 would cost around $10,000 at retail price (a back-of-envelope check follows this list)
  • Using a curriculum approach, starting with simple words, is an intriguing concept for training language models
  • Training models on pure logic or abstract algebra could potentially improve reasoning abilities
  • Coding or logic training can help models learn important concepts like looking back or checking surroundings
  • There is some evidence that coding improves reasoning in language models
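
The $10,000 figure is easy to sanity-check with a back-of-envelope calculation. The per-story token counts and the mid-2023 GPT-4 (8K context) retail prices below are assumptions, not numbers stated in the episode.

```python
# Assumed numbers: ~50 prompt + ~150 completion tokens per story,
# $0.03 / 1K prompt tokens and $0.06 / 1K completion tokens (mid-2023 GPT-4 8K pricing).
stories           = 1_000_000
prompt_tokens     = 50
completion_tokens = 150

cost = stories * (prompt_tokens / 1000 * 0.03 + completion_tokens / 1000 * 0.06)
print(f"${cost:,.0f}")  # ≈ $10,500, in the ballpark quoted
```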

Reasoning and Grammar in Language Models

37:47 - 46:16

  • The model's ability to write code does not necessarily mean it can reason
  • The explanations for the model's coding capabilities are simpler than expected
  • Components required for coding are already present in the neural network, but they are simple
  • A synthetic task called LEGO showed that pre-trained BERT models grasp reasoning faster
  • Pre-training gives rise to attention heads that help with reasoning tasks
  • Reasoning is often misunderstood and debated in AI research
  • Reasoning involves consistency and coherence in generating text
  • Basic logics and grammar rules are part of reasoning capabilities
  • Auto-completion examples demonstrate different levels of reasoning abilities

Consistency and Grammar in Language Models

45:50 - 54:30

  • One auto-completion example asks for the most common noun seen so far in a passage, contrasting 'dog' with 'cat'
  • Smaller models tend to complete sentences with the word 'dog' because it appears more frequently
  • Reasoning and planning are intertwined in natural language processing
  • Consistency is a key aspect of reasoning in natural language generation
  • Language models need to generate text that is consistent with the prompt
  • Grammar is an important capability for generating words in a sentence
  • Semantic understanding helps models determine relevant nouns and actions
  • Models vary in their reasoning capabilities, ranging from basic grammar to first- or second-order logic (illustrated by the probe examples after this list)
  • Reasoning can be broken down into micro skills, similar to basketball training
  • There is a continuum of reasoning abilities for both humans and language models
  • 'Emergence' is a term with various meanings, but there may be a process of gradual replacement of memorization by concrete circuits that solve specific challenges
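
To make the "micro-skills" idea concrete, here are a few invented auto-completion probes (not the paper's actual evaluation prompts) targeting different levels, from grammar up to counting and exclusion.

```python
# Invented illustrations of completion probes for different micro-skills.
probes = {
    "grammar":   "Tom and Jane went to the park, and they ___",
    "facts":     "The sun was setting, so the sky turned ___",
    "counting":  "Anna saw a cat, a dog, and another dog; the animal she saw most was a ___",
    "exclusion": "Lily likes every fruit except bananas, so when she was offered a banana she said ___",
}
for skill, prompt in probes.items():
    print(f"{skill:>9}: {prompt}")
```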

Size and Coherence of Language Models

54:02 - 1:02:38

  • Increasing the size of the model can lead to new capabilities
  • Generative models need to be able to speak coherent English
  • Tiny stories provide a smaller dataset for observing model behavior
  • The size of the model affects coherence and reasoning abilities
  • Different levels of difficulty exist in completing text prompts
  • Emerging abilities include following grammatical rules and understanding facts
  • The theory behind these emergent abilities is still unclear
  • GPT-2 struggles with exclusion concepts, while much smaller models trained on Tiny Stories can handle them
  • Data noise may have hindered GPT-2's ability to learn certain things
  • Consistency is a focus for smaller models like Tiny Stories data set
  • Large language corpora prioritize learning facts over reasoning skills
  • Overloading models with facts makes reasoning capabilities more relevant

Emergence of Reasoning in Language Models

1:02:23 - 1:10:39

  • Reasoning capability is a rare skill that language models need to learn
  • Models only start to learn reasoning when the loss is minimized to a certain extent
  • A small difference in loss can indicate an emergent capability for language models
  • The size of the model and the amount of data can affect the emergence of reasoning
  • Language models have limited capacity and need to prioritize learning relevant information
  • Curriculum development and intentional design can help balance knowledge and ability training
  • Web-scale data may not be well-balanced for knowledge and ability training
  • Synthetic data with emphasis on abilities can be designed to improve model training

Balancing Breadth and Depth in Language Models

1:10:28 - 1:18:41

  • There is a trade-off between the breadth and depth of the data set or model capabilities
  • The entire web is very broad with a large vocabulary and requires a lot of knowledge to capture
  • Depth refers to the ability to infer from learning the dataset
  • There should be an optimal ratio between breadth and depth when training a model
  • Teaching reasoning alone without knowledge is an extreme case that may not be effective
  • Combining reasoning with knowledge could lead to powerful techniques for training models
  • Separating knowledge and reasoning into two different modules in language models may not allow for effective combination of abilities
  • Mixing strategies throughout the training process may be necessary to avoid catastrophic forgetting
  • The model has no incentive to combine different modalities if they are trained separately
  • Curriculum learning is a non-trivial task that requires careful design and implementation
  • Grammar tends to emerge before consistency and creativity in language development, but this may vary among individuals

Language Models vs. Human Language Production

1:18:13 - 1:27:02

  • Children have different incentives than language models when producing language
  • Language models prioritize grammar, while children prioritize their desired outcome or incentive
  • Language models only need to be consistent within the same sentence for correct grammar
  • Children have a limited number of active entities in their working memory during a conversation
  • Language models like GPT-2 XL only need to look at the previous sentence to generate completions
  • Language models do not require global consistency beyond consecutive sentences
  • Humans have agency and care more about their desired outcome than grammar
  • The analogy of human children learning with reinforcement learning and parents' feedback is applicable to language model training
  • Depth in language models refers to how many times information can percolate between tokens
  • More layers in language models are important for reasoning consistency and context understanding

Attention Mechanisms and Reasoning in Language Models

1:26:45 - 1:35:43

  • Instructions in language models need multiple iterations of attention to percolate into the tokens of a story
  • Reasoning in language models also requires multiple layers of percolation between tokens
  • Facts, on the other hand, can be retrieved by a single attention block acting as a lookup table
  • The dimension of the vector space in language models allows for more entities and facts to be included
  • Attention relationships are not immediately transitive and require multiple iterations to create transitivity
  • The number of layers in a model is connected to the number of reasoning steps required for completion
  • Models can find sophisticated ways to perform multiple leaps of reasoning within a single layer
  • There are two types of attention heads observed: distance-based and semantic-based
  • The distance-based attention heads resemble the ALiBi positional scheme used for long context windows (a toy sketch of that distance bias follows this list)
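
For reference, a toy sketch of the ALiBi-style linear distance penalty these heads are compared to. The slopes follow the ALiBi paper's recipe for eight heads and are not taken from the Tiny Stories models.

```python
import torch

def alibi_bias(seq_len: int, n_heads: int) -> torch.Tensor:
    """ALiBi-style linear distance penalty added to attention logits, one slope per head."""
    slopes = torch.tensor([2.0 ** -(i + 1) for i in range(n_heads)])  # geometric slopes (ALiBi's choice for 8 heads)
    pos = torch.arange(seq_len)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)             # how far back each key position lies
    return -slopes[:, None, None] * distance                          # shape: (heads, query, key)

scores = torch.randn(8, 128, 128)                              # toy attention logits (causal mask omitted)
weights = torch.softmax(scores + alibi_bias(128, 8), dim=-1)   # nearer tokens are penalized less
```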

Interpreting Attention Mechanisms in Language Models

1:35:19 - 1:43:30

  • The ALiBi work replaces positional embeddings with a position-based (distance-based) attention bias
  • Shorter attention heads are responsible for learning grammar, while longer ones ensure content consistency and association of words
  • The model needs to consider approximate words, recent words, and important entities to complete the next word
  • In ALiBi, attention decays with distance inside the text
  • There is a dichotomy between attention heads that care about distance and those that care about semantics (a measurement sketch follows this list)
  • Semantic attention focuses on main character names and relevant objects in sentences to ensure consistency in generated text
  • It is surprising that the model can give meaning to both attention heads and neurons when it is small enough
  • As transformers become larger and deeper, they become less interpretable
  • Neurons in the middle layer of an MLP block are either activated or not, and their activations correspond to specific tokens in the stories
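
One way to separate the two kinds of heads empirically is to run a sentence through a small checkpoint with attentions enabled and look at each head's average look-back distance: a low value suggests a local, position-driven head, a high value a semantic, long-range one. The checkpoint and tokenizer names below are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")        # assumed pairing
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")  # assumed repo name

text = "Once upon a time, Lily found a shiny red ball in the garden."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs, output_attentions=True).attentions  # one (1, heads, seq, seq) tensor per layer

pos = torch.arange(inputs.input_ids.shape[1]).float()
lookback = (pos[:, None] - pos[None, :]).clamp(min=0)   # token distance for each (query, key) pair
for layer, attn in enumerate(attentions):
    mean_dist = (attn[0] * lookback).sum(-1).mean(-1)   # average attended distance per head
    print(layer, [round(d, 1) for d in mean_dist.tolist()])
```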

Interpreting Neurons in Language Models

1:43:02 - 1:51:20

  • Neurons in a neural network can be thought of as coordinates in a vector space
  • There is no clear pattern or meaning to the activations of neurons in larger models like GPT-2 XL
  • In smaller models, there are neurons that consistently activate when the main character of a story is introduced
  • These neurons likely help the model copy the main character's name to different places during generation
  • Smaller language models may be more interpretable compared to larger ones because they focus on basic tasks
  • The most efficient solution for small models is one that aligns with meaningful tasks, while larger models may settle on more complex and messy solutions (the hook sketch after this list shows how to inspect individual neuron activations)
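
A sketch of the kind of probing this refers to: register a forward hook on one MLP's hidden ("neuron") layer of a small checkpoint and print a single neuron's activation per token. The module path assumes the GPT-Neo block layout used by the public Tiny Stories checkpoints, and the layer and neuron indices are placeholders, not the specific main-character neuron discussed in the episode.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125M")        # assumed pairing
model = AutoModelForCausalLM.from_pretrained("roneneldan/TinyStories-33M")  # assumed repo name

acts = {}
# GPT-Neo layout assumed: block 2's MLP input projection produces the "neuron" layer (pre-activation).
model.transformer.h[2].mlp.c_fc.register_forward_hook(
    lambda module, inp, out: acts.update(fc=out.detach())
)

text = "Once upon a time there was a girl called Mia. Mia loved her red kite."
ids = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    model(**ids)

neuron = 123  # placeholder index
for tok, act in zip(ids.input_ids[0], acts["fc"][0, :, neuron]):
    print(f"{tokenizer.decode(int(tok))!r:>12}  {act.item():+.2f}")
```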

Interpretability of Neural Networks

1:50:52 - 2:00:01

  • Interpretability appears as a side effect of small models aligning with meaningful tasks
  • The question of whether behaviors observed in the tiny stories dataset will translate to LLMs is still open
  • There may be universal phenomena in LLMs that are independent of dataset and architecture
  • The hope is that universality exists to save energy and enable more research on LLMs
  • Future work aims to extend the capability of the tiny stories dataset
  • LLMs are considered a mathematical miracle due to their ability to synthesize new content and show signs of reasoning
  • Tiny stories provide a compact example of interesting generalization and emergence in neural networks
  • The LEGO project inspired this research by exploring the attention mechanism in models
  • 'Transformer Feed-Forward Layers Are Key-Value Memories' is another inspiring paper on interpreting neural networks

Challenges in Interpreting Neural Networks

1:59:34 - 2:05:21

  • The theory behind the interpretability of neural networks is still in its early stages
  • Understanding what's going on inside the model is still a challenge
  • There is no reason to assume that we will ever fully understand neural networks, just as we have limited understanding of how the human brain works
  • The solutions found by neural networks may be messy and not easily interpretable
  • Partial interpretability may be possible for small examples or insights about big networks, but full interpretability is unlikely
  • A different approach, such as talking to the model and studying its intentions, may be needed for interpretability
  • An analogy can be made with horseback riding, where we can align the behavior of horses with our needs without fully understanding their inner workings
  • Despite not fully understanding neural networks, they can still be useful and aligned efficiently