
The Inside View

[JUNE 2022] Aran Komatsuzaki on Scaling, GPT-J and Alignment

Wed Jul 19 2023
Machine Learning, Artificial General Intelligence, Scaling Models, Fine-tuning, Reinforcement Learning, Efficiency Improvements, Alignment Research, Evaluating Models, Challenges in Aligning Large Models, Future Interventions, Post-AGI Scenarios, Future Research

Description

Aran Komatsuzaki and another researcher, known together as "the two AKs", summarize new machine learning papers on Twitter. They read abstracts and skim through papers, tweeting about the ones they find promising; Aran spends up to 30 minutes a day on this, while the other AK spends more time. There is disagreement on Twitter about whether to focus on long-term or near-term AI development. AGI is defined as a model that can recursively improve itself and perform any task as well as humans do. The capability of AGI to do research, particularly machine learning research, is seen as a form of recursive self-improvement: a human-level machine learning model that can do research as well as humans would be considered AGI complete.

Insights

Scaling models is crucial for AGI development

The podcast highlights the importance of scaling models in language modeling and machine learning research. Scaling up has been a key focus in the field, with models like GPT-2 and GPT-3 demonstrating its power. The speaker also discusses the challenges and strategies involved in scaling models to improve the performance-compute trade-off.

Fine-tuning and reinforcement learning enhance model performance

The podcast explores the use of fine-tuning and reinforcement learning techniques to improve model performance. Multitask fine-tuning, as in T0, allows a model to perform well on tasks it was never trained on, while combining reinforcement learning from human feedback with T0-style multitask pre-training shows potential for interesting results.

Efficiency improvements in GPT models

The podcast discusses the efficiency improvements achieved in GPT models through scaling and architectural changes. GPT-J, with 6 billion parameters, performs as well as a similarly sized GPT-3 model. The use of JAX and wider models contributes to the efficiency improvement, resulting in better downstream performance.

Alignment research and evaluating models

Alignment research is highlighted as an important area of study to ensure AI models align with human values. The podcast explores the challenges in defining alignment and evaluating models, especially when they surpass human capabilities. The need for better evaluation metrics and benchmarks is emphasized.

Challenges in aligning large models and future interventions

The podcast discusses the challenges in aligning large models and the need for future interventions. The speaker highlights the potential risks of deception in large models and the complexity it introduces. The timeline for AGI development is uncertain, but collaboration with humans and controlling the pace of improvement are crucial. The speaker also mentions the importance of neuroscience progress and the potential of using smart language models as Oracles.

Post-AGI scenarios and future research

The podcast explores different scenarios for the post-AGI world, including authoritarian AI, rogue AI, and human freedom. Preventing a dictator AI is identified as a priority to maintain democracy, and improving intellectual capacity is seen as a way to continue the democratic process. The future of the post-AGI world is open for individuals to decide.

Chapters

  1. Summarizing Active Machine Learning Papers on Twitter
  2. Long-term vs Near-term AI Development
  3. Artificial General Intelligence (AGI)
  4. Machine Learning Research and AGI
  5. Scaling Models in Language Modeling
  6. Fine-tuning Models and Reinforcement Learning
  7. Scaling GPT Models: The GPT-J Project
  8. Efficiency and Performance of GPT Models
  9. Alignment Research and Evaluating Models
  10. Challenges in Aligning Large Models and Future Interventions
  11. Post-AGI Scenarios and Future Research
Summary

Summarizing Active Machine Learning Papers on Twitter

00:02 - 09:09

  • Aran Komatsuzaki and another researcher, known together as "the two AKs", summarize new machine learning papers on Twitter.
  • They read abstracts and skim through papers to tweet about the ones they find promising.
  • Aran spends up to 30 minutes a day on this, while the other AK spends more time.

Long-term vs Near-term AI Development

00:02 - 09:09

  • There is disagreement on Twitter about focusing on long-term or near-term AI development.

Artificial General Intelligence (AGI)

00:02 - 09:09

  • AGI is defined as a model that can recursively improve itself and perform any task as well as humans do.
  • The capability of AGI to do research, particularly in machine learning, is seen as a form of recursive self-improvement.
  • A human-level machine learning model that can do research as well as humans would be considered AGI complete.

Machine Learning Research and AGI

08:40 - 18:06

  • Machine learning research can lead to AGI without human intervention.
  • Human-level machine learning research is considered AGI complete.
  • ML research includes hardware and software development.
  • Recursive self-improvement may arrive later than the ability to do the kind of ML research published in arXiv papers.
  • Software improvement has come earlier than hardware improvement in AI.
  • Scaling is important, but not the only factor for improvement in AI models.
  • Optimal scaling strategies can lead to significant performance improvements.
  • Some bottlenecks in AI design cannot be overcome by adding more computational resources.

Scaling Models in Language Modeling

17:54 - 27:27

  • Scaling up models has been a key focus in the field of language modeling.
  • GPT-2, released in early 2019, demonstrated the power of scaling.
  • The speaker was not surprised by the results of GPT-3 as they had been working on similar projects with smaller scale models.
  • In 2019, the speaker worked on a project called 'One Epoch Is All You Need', which also involved experimenting with top-k and temperature sampling to improve text generation.
  • The project focused on scaling ideas and experimenting with small amounts of compute.
  • The first idea was to enlarge the pre-training dataset and train for only one or a few epochs, improving the performance-compute trade-off.
  • The second idea was to find, for a given compute budget, the optimal ratio of model size to number of training tokens, based on experiments.
  • Scaling exponents were computed to determine how much dataset size and model size should each be scaled as compute grows (a toy version of this kind of calculation is sketched after this list).
  • Currently, the speaker is interested in InstructGPT and T0, which are language models trained with multitask or instruction-following objectives.
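
The compute-allocation idea in the bullets above can be made concrete with a toy calculation. The sketch below uses the common approximation C ≈ 6·N·D (training FLOPs ≈ 6 × parameters × tokens) and illustrative exponents a and b with a + b = 1; these numbers are assumptions for illustration, not the exponents the speaker computed.

```python
# Toy sketch of compute-optimal scaling under assumed exponents.
# C ≈ 6 * N * D, with N ∝ C**a and D ∝ C**b, a + b = 1.

def optimal_allocation(compute_flops: float, a: float = 0.5, b: float = 0.5):
    """Split a compute budget C into model size N (params) and data size D (tokens)."""
    assert abs(a + b - 1.0) < 1e-9, "exponents must sum to 1 for C = 6*N*D to hold"
    n_params = (compute_flops / 6.0) ** a
    n_tokens = (compute_flops / 6.0) ** b
    return n_params, n_tokens

if __name__ == "__main__":
    for c in (1e21, 1e22, 1e23):
        n, d = optimal_allocation(c)
        print(f"C={c:.0e} FLOPs -> N≈{n:.2e} params, D≈{d:.2e} tokens")
```

With equal exponents, doubling the compute budget grows both model size and token count by a factor of √2; different assumed exponents shift the balance toward bigger models or more data.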

Fine-tuning Models and Reinforcement Learning

26:59 - 37:14

  • T0 is a model fine-tuned on a large collection of natural language understanding datasets, in the same spirit as instruction-tuned versions of GPT-3.
  • Fine-tuning T0 on multiple datasets allows the model to perform well on both the fine-tuned tasks and other tasks it was not trained on.
  • InstructGPT is another model that combines different tasks and uses human feedback to improve text generation.
  • GPT-3 performs reasonably well on various tasks even without few-shot examples, but T0 can do better in this zero-shot setting with much less compute.
  • Combining reinforcement learning from human feedback with T0-style multitask pre-training could lead to interesting results (a minimal sketch of the reward-model loss behind such feedback follows this list).
  • An encoder-decoder architecture is used in models like T0 (built on T5), where the encoder processes the input and the decoder generates the output.
  • The scaling law for these models aims to optimize downstream performance rather than upstream test loss.
  • The goal is to make the models closer to how humans process data throughout their lifetime by increasing capacity and training on more topics.
  • The proposed method involves training a T0-style model with InstructGPT-style instruction tuning, fine-tuning it, testing on held-out tasks, and plotting a curve of optimal scaling as a function of dataset size and model size.
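
One concrete piece of the "reinforcement learning from human feedback" idea mentioned above is the pairwise loss used to train a reward model on human comparisons. The sketch below shows only that loss on toy scores; the scalar rewards stand in for the output of a learned reward model, which in practice would be a fine-tuned language model with a scalar head.

```python
import numpy as np

def preference_loss(reward_chosen: np.ndarray, reward_rejected: np.ndarray) -> float:
    """Average -log sigmoid(r_chosen - r_rejected) over human-labelled comparison pairs."""
    diff = reward_chosen - reward_rejected
    return float(np.mean(np.logaddexp(0.0, -diff)))  # numerically stable -log(sigmoid(diff))

if __name__ == "__main__":
    # Toy reward-model scores for three comparisons (chosen vs. rejected completion).
    chosen = np.array([1.2, 0.4, 2.0])
    rejected = np.array([0.3, 0.5, -1.0])
    print(f"preference loss: {preference_loss(chosen, rejected):.4f}")
```

Minimizing this loss pushes the reward model to score human-preferred completions higher; the resulting rewards are then used to fine-tune the language model with a policy-gradient method such as PPO.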

Scaling GPT Models: The GPT-J Project

36:47 - 46:39

  • The project involved scaling up a GPT-3-style language model, which became GPT-J.
  • The team used JAX, a machine learning framework from Google, which was faster than the previous implementation built on Mesh TensorFlow.
  • The project took several months and was open-sourced by a team of two people.
  • GPT-J performed as well as a 6 billion parameter GPT-3 model, while GPT-Neo with 2.7 billion parameters performed worse than a 1 billion parameter GPT-3 model.
  • To match the performance of GPT-3, the team made architecture changes such as using wider models and placing the feed-forward and attention sublayers in parallel (a schematic sketch follows this list).
  • These architecture changes were also adopted by other researchers working on similar models.
  • Throughput, measured in tokens processed per second, was improved in GPT-J compared to the GPT-3 and GPT-Neo models.
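
One of the architecture changes noted above is running the attention and feed-forward sublayers in parallel off a single layer norm, instead of applying them sequentially. The sketch below is a simplified single-head NumPy schematic of that idea; the omitted causal mask, the ReLU in place of GELU, and the tiny shapes are simplifications, not the actual GPT-J implementation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(x):
    z = x - x.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def parallel_block(x, wq, wk, wv, wo, w1, w2):
    """x: (seq, d_model). Residual gets attention(h) + mlp(h), with h = LN(x)."""
    h = layer_norm(x)
    # Single-head self-attention (causal mask omitted for brevity).
    q, k, v = h @ wq, h @ wk, h @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v @ wo
    # Feed-forward sublayer computed from the *same* normalised input.
    mlp = np.maximum(h @ w1, 0.0) @ w2
    # Both sublayer outputs are added to the residual stream together.
    return x + attn + mlp

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, seq, d_ff = 16, 8, 64
    x = rng.normal(size=(seq, d_model))
    shapes = [(d_model, d_model)] * 4 + [(d_model, d_ff), (d_ff, d_model)]
    weights = [rng.normal(size=s) * 0.1 for s in shapes]
    print(parallel_block(x, *weights).shape)  # (8, 16)
```

Because the two sublayers no longer depend on each other within a block, their matrix multiplications can be fused or overlapped, which helps throughput at scale.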

Efficiency and Performance of GPT Models

46:12 - 55:18

  • The GPT-J 6B model achieves high training throughput, measured in tokens processed per second.
  • GPT-Neo, with 2.7 billion parameters, only reaches a similar throughput despite being less than half the size of GPT-J.
  • The wider model and the use of JAX instead of Mesh TensorFlow contribute to the efficiency improvement.
  • GPT-J has better downstream performance than GPT-Neo.
  • Training took five weeks on 256 TPU cores.
  • GPT-J is easier to work with and fine-tune than models like GPT-Neo, thanks to the use of JAX.
  • There is documentation on how to fine-tune GPT-J, and the code is easier to read (a minimal loading example follows this list).
  • Some people have used GPT-J for projects such as fine-tuning it on Reddit data or generating comments on sites like 4chan.
  • Releasing GPT-J was a small net positive for humanity, as it neither substantially accelerated AI timelines nor caused significant harm.
  • OpenAI's releases of earlier language models like GPT-2 and GPT-3 showed that scaling was important for AGI development.
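
Since the bullets above note that GPT-J is comparatively easy to load and fine-tune, here is a minimal inference sketch using the Hugging Face transformers library. The checkpoint id "EleutherAI/gpt-j-6B" is assumed to point at the hosted weights; this only illustrates loading and generation, not the JAX/TPU training setup described earlier, and the full model needs on the order of 24 GB of memory at float32.

```python
# Minimal sketch: loading the open-source GPT-J weights and sampling a completion.
# The checkpoint id below is an assumption about the hosted weights.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-j-6B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "The scaling laws for language models suggest"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```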

Alignment Research and Evaluating Models

54:54 - 1:04:13

  • Open-sourcing and its effect on large-model timelines
  • Facebook released a GPT-3-scale model in response to GPT-J and the GPT-NeoX model
  • Accelerating open source timelines and AI research
  • Alignment research is about harnessing advanced AI to align with human values
  • The problem of AI not caring about human values and optimizing for its own objectives
  • The difficulty of defining alignment and the time humans have to think about it
  • Building good models that are aligned by default
  • Challenges in dealing with AI deception and conventional machine learning approaches
  • Considering alignment as an inverse scaling problem for models that are too big
  • The challenge of creating good benchmarks for evaluating deceptive models
  • Tricky nature of language model benchmarks and the need for better evaluation metrics
  • Using human judgment and evaluations for short-term benchmarks, but limitations exist
  • Difficulty in evaluating models when they surpass human capabilities and understanding
  • Possible solutions like using labelers' evaluations or having one language model evaluate another (a schematic sketch follows this list)
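
The last bullet raises the idea of one language model evaluating another. The sketch below only shows the shape of such an evaluation loop; both candidate_model and judge_model are hypothetical stand-ins (plain Python functions), where in practice each would be a call to a real language model and the judge would be prompted to grade the answer.

```python
# Schematic of an evaluation loop where one model grades another's answers.
# Both functions are stand-ins, not real models.

def candidate_model(prompt: str) -> str:
    """Stand-in for the model being evaluated."""
    return f"Answer to: {prompt}"

def judge_model(prompt: str, answer: str) -> float:
    """Stand-in for a stronger model prompted to grade the answer in [0, 1]."""
    return 1.0 if prompt in answer else 0.0

def evaluate(prompts) -> float:
    scores = [judge_model(p, candidate_model(p)) for p in prompts]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(evaluate(["What is 2 + 2?", "Name a prime number."]))
```

The obvious limitation, also raised in the episode, is that the judge inherits the same blind spots as the candidate once models surpass human-level understanding.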

Challenges in Aligning Large Models and Future Interventions

1:03:57 - 1:13:47

  • Aligning AI models by using a smaller model to check a bigger one is like bringing a small Godzilla to keep watch over a bigger Godzilla.
  • The problem with aligning or checking for deception in large models is that it introduces more complexity and potential problems.
  • In the long term, AI will surpass human intelligence, so we need an entirely different intervention.
  • One possibility is merging with AI through neural networks or brain-computer interfaces.
  • The timeline for AGI could be days, weeks, or several months after reaching human-level language models.
  • The main bottleneck for AGI development will be collaborating with humans and convincing them to help build more hardware.
  • Controlling the pace of improvement in AI compared to human intelligence is crucial.
  • Neuroscience progress is slower than scaling models, making it challenging to keep up with AI advancements.
  • Using a smart language model as an Oracle can help guide research on brain-computer interfaces and other areas.
  • Developing new hardware tools for neuroscience will take longer than software or neural network advancements.
  • Regulating AGI behavior requires physical access and security measures to prevent rapid scaling of models.

Post-AGI Scenarios and Future Research

1:13:25 - 1:17:08

  • Different scenarios for the post-AGI world include authoritarian AI, rogue AI, and human freedom with varying outcomes.
  • Preventing a dictator AI is a priority to maintain democracy.
  • Improving intellectual capacity can help continue the democratic process.
  • The future of the post-AGI world is open for individuals to decide.
  • Research on scale and alignment discussed in the podcast.
  • Twitter username provided for further updates on machine learning research.