Large language models, explained with a minimum of math and jargon [ELI5 Summary]
Blog Article: https://www.understandingai.org/p/large-language-models-explained-with
Authors: Timothy B. Lee and Sean Trott
Introduction
Introduction of ChatGPT: Last fall, a tool called ChatGPT was introduced. It's built on a Large Language Model (LLM), a kind of "big brain" for computers that helps them understand and generate human-like text. Its launch was a big deal in the tech world, but at the time most people didn't know about LLMs or how powerful they could be.
Popularity and Understanding of LLMs: Now lots of people have heard about LLMs, and many have even used them. But understanding how they work is still a mystery for most. People know that LLMs are trained to "predict the next word" in a sentence and need a lot of text data to do this, but the details are often unclear.
How LLMs are Different: Unlike traditional software that's built by humans giving step-by-step instructions to computers, LLMs like ChatGPT are built on something called a neural network. This network learns from billions of words in ordinary language. Because of this, even the experts don't fully understand how LLMs work on the inside.
The Goal of the Article: The article aims to explain what we do know about how LLMs work, in a way that's easy for everyone to understand, without using technical jargon or complex math.
What the Article Will Cover: The article will explain how LLMs represent and understand language using something called word vectors. It will also talk about the transformer, a basic component of systems like ChatGPT. Lastly, it will explain how these models are trained and why they need so much data to work well.
Word vectors
Word Vectors: To understand how language models work, you first need to know how they represent words. While humans write words as a sequence of letters, like C-A-T for "cat", language models represent each word as a long list of numbers called a "word vector". Such a list might be 300 numbers long, or far longer.
Why Use Word Vectors?: Think of it like coordinates on a map. Just as we can represent a city's location with a pair of numbers (like [38.9, 77] for Washington, DC: 38.9 degrees north, 77 degrees west), we can represent words with coordinates in an imaginary "word space". Words with similar meanings sit close together in this space, so "cat" might be near "dog", "kitten", and "pet". This is useful because it lets the computer reason about relationships between words.
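Code Sketch: To make the "points in word space" idea concrete, here is a minimal sketch in Python using NumPy. The three-dimensional vectors are invented for illustration; real word vectors have hundreds of learned dimensions.

```python
import numpy as np

# Tiny made-up 3-D vectors; real models learn hundreds of dimensions.
words = {
    "cat":    np.array([0.9, 0.8, 0.1]),
    "dog":    np.array([0.8, 0.9, 0.2]),
    "kitten": np.array([0.9, 0.7, 0.1]),
    "car":    np.array([0.1, 0.2, 0.9]),
}

def similarity(a, b):
    # Cosine similarity: close to 1.0 means "nearby" in word space.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(similarity(words["cat"], words["dog"]))  # high: similar meanings
print(similarity(words["cat"], words["car"]))  # low: unrelated meanings
```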
High-Dimensional Space: Words are complex, so we need a lot of dimensions to represent them accurately. While we humans can't really imagine a space with hundreds or thousands of dimensions, computers can handle it just fine.
Word2Vec: This idea of word vectors really became popular when Google introduced a project called word2vec in 2013. They used a lot of text from Google News to train a computer to understand which words appear together often. This helped the computer place similar words close together in the word space.
Vector Arithmetic: An interesting thing about word vectors is that you can do math with them to find relationships between words. For example, if you take the vector for "biggest", subtract "big", and add "small", you get "smallest". This is like saying "big is to biggest as small is to smallest". You can find many other relationships this way.
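Code Sketch: This analogy arithmetic can be tried against the published word2vec vectors using the gensim library. A hedged sketch (it assumes gensim's downloader can fetch the Google News vectors, a download of over a gigabyte on first run):

```python
import gensim.downloader as api

# Word2vec vectors trained on Google News (large download on first use).
model = api.load("word2vec-google-news-300")

# biggest - big + small  ~=  smallest
result = model.most_similar(positive=["biggest", "small"], negative=["big"], topn=1)
print(result)  # expected to rank "smallest" (or a close synonym) first
```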
Biases: Because word vectors are based on how humans use words, they can sometimes reflect human biases. For example, if you subtract "man" from "doctor" and add "woman", you might get "nurse". Researchers are working on ways to reduce these biases.
Usefulness: Despite these challenges, word vectors are very useful for language models. They help the computer understand how words are related and apply knowledge from one word to similar words. For example, if the model learns that a "cat" goes to the vet, it can guess that a "dog" might also go to the vet. Or if it learns that people in Paris speak French, it can guess that people in Berlin speak German.
Word meaning depends on context
Word Meaning and Context: Words can have different meanings depending on the context in which they're used. For example, "bank" can mean a financial institution or the land next to a river. Similarly, "magazine" can refer to a physical publication or an organization that publishes magazines.
Homonyms and Polysemy: When a word has two unrelated meanings, like "bank", linguists call it a homonym. When a word has two closely related meanings, like "magazine", they call it polysemous.
Language Models and Context: Large Language Models (LLMs) like ChatGPT can understand these differences in meaning based on context. They use different vectors (lists of numbers) to represent the different meanings of a word. For example, there's one vector for "bank" as a financial institution and another for "bank" as land next to a river.
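Code Sketch: One way to see context-dependent vectors in practice is to read hidden states out of a pretrained model. The sketch below uses the Hugging Face transformers library with BERT as a convenient stand-in (the article discusses GPT-style models, but the contextual-vector idea is the same): "bank" gets a different vector in each sentence, and the two financial senses should end up closer to each other than to the river sense.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def vector_for_bank(sentence):
    # Return the model's context-dependent vector for the token "bank".
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]
    position = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank"))
    return hidden[position]

money1 = vector_for_bank("He deposited the check at the bank.")
money2 = vector_for_bank("She opened an account at the bank.")
river = vector_for_bank("They sat on the grassy bank of the river.")

cos = torch.nn.functional.cosine_similarity
print(cos(money1, money2, dim=0).item())  # higher: same sense of "bank"
print(cos(money1, river, dim=0).item())   # lower: different sense
```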
Ambiguity in Language: Unlike traditional software that works with clear and unambiguous data, natural language is full of ambiguities. For example, in the sentence "the customer asked the mechanic to fix his car", it's not clear if "his" refers to the customer or the mechanic. People understand these ambiguities based on context and their knowledge of the world, and LLMs try to do the same.
Word Vectors and Context: Word vectors provide a way for language models to understand the precise meaning of a word in a specific context. This is a fundamental part of how language models work.
Transforming word vectors into word predictions
Transforming Word Vectors into Word Predictions: GPT-3, the model behind ChatGPT, is organized into dozens of layers. Each layer takes a sequence of word vectors (the number lists that represent words) and adds information to help clarify the meaning of each word and predict the next word.
Layers and Transformers: Each layer of GPT-3 is a transformer, a type of neural network architecture. The input to the model is a sentence, with each word represented as a word vector. These vectors are fed into the first transformer, which adds context to the words. For example, it might figure out that "wants" and "cash" are verbs in the sentence "John wants his bank to cash the."
Hidden State Vectors: The transformer modifies the word vectors based on the context it has figured out, creating new vectors known as hidden state vectors. These vectors are then passed to the next transformer, which adds more context. For example, the second transformer might clarify that "bank" refers to a financial institution and that "his" refers to John.
Layers and Understanding: GPT-3 has many layers (the most powerful version has 96). The first few layers focus on understanding the syntax of the sentence and resolving ambiguities. Later layers work to develop a high-level understanding of the text as a whole. For example, as GPT-3 reads through a short story, it keeps track of information about the characters, their relationships, locations, and so on.
Large Word Vectors: GPT-3 uses very large word vectors, with each word represented by a list of 12,288 numbers. This large size allows GPT-3 to keep track of a lot of information about the context of each word.
Final Goal: The final goal of the model is for the last layer to output a hidden state for the final word that includes all the necessary information to predict the next word. This means that by the time the model gets to the end of a sentence or a passage, it has a very detailed understanding of the text.
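Code Sketch: In pseudocode, the pipeline described above looks roughly like this. The names are illustrative, not OpenAI's actual code; the point is that every layer reads and rewrites one vector per word, and only the final word's final vector is used for the prediction.

```python
def run_model(words, layers, embed, unembed):
    # embed: look up one vector per word (12,288 numbers each in the largest GPT-3).
    hidden = [embed(w) for w in words]

    # Each transformer layer (up to 96 of them) enriches every vector with context.
    for layer in layers:
        hidden = layer(hidden)

    # The last word's final hidden state carries everything needed to guess
    # what comes next: turn it into scores over the vocabulary.
    return unembed(hidden[-1])
```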
Can I have your attention please
Transformers and Attention: Transformers, the building blocks of GPT-3, have a two-step process for updating the hidden state (the context-enriched word vectors) for each word. The first step is the attention step, where words "look around" for other words that have relevant context and share information with them. The second step is the feed-forward step, where each word "thinks about" the information gathered in the attention step and tries to predict the next word.
Attention Mechanism: The attention mechanism can be thought of as a matchmaking service for words. Each word makes a checklist (a query vector) describing the characteristics of words it's looking for. Each word also makes a checklist (a key vector) describing its own characteristics. The network compares each key vector to each query vector to find the best matches and transfers information between matching words.
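Code Sketch: At its core, the query/key "matchmaking" is a few lines of matrix math. A minimal NumPy sketch of one attention operation (value vectors, which carry the information actually transferred, appear alongside the queries and keys):

```python
import numpy as np

def attention(queries, keys, values):
    # One row per word. Score how well each word's query matches every key...
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    # ...convert the scores into weights that sum to 1 (a softmax)...
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # ...then let each word gather information from its best matches.
    return weights @ values
```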
Attention Heads: Each attention layer has several "attention heads", which means the information-swapping process happens several times at each layer. Each attention head focuses on a different task, such as matching pronouns with nouns, resolving the meaning of homonyms, or linking together two-word phrases. The results of an attention operation in one layer become an input for an attention head in a subsequent layer.
GPT-3's Scale: The largest version of GPT-3 has 96 layers with 96 attention heads each, so it performs 9,216 attention operations each time it predicts a new word. This massive scale allows GPT-3 to handle passages with thousands of words and take full advantage of the parallel processing power of modern GPU chips.
A real-world example
Real-World Example: Scientists at Redwood Research studied how GPT-2, a predecessor to ChatGPT, predicted the next word for the prompt "When Mary and John went to the store, John gave a drink to". GPT-2 predicted "Mary", and the researchers found that three types of attention heads contributed to that prediction.
Name Mover Heads: These attention heads copied information from the word "Mary" to the final input vector (for the word "to"). GPT-2 uses the information in this rightmost vector to predict the next word.
Subject Inhibition Heads: These attention heads marked the second "John" vector in a way that blocked the Name Mover Heads from copying the name "John".
Duplicate Token Heads: These attention heads marked the second "John" vector as a duplicate of the first "John" vector, which helped the Subject Inhibition Heads to decide that "John" shouldn’t be copied.
Complexity of Understanding LLMs: Despite this detailed analysis, we are still far from having a comprehensive explanation for why GPT-2 decided to predict "Mary" as the next word. It could take months or even years of additional effort just to understand the prediction of a single word. The language models underlying ChatGPT—GPT-3.5 and GPT-4—are significantly larger and more complex than GPT-2, making the task of fully explaining how these systems work a huge and ongoing project.
The feed-forward step
Feed-Forward Step: After the attention heads transfer information between word vectors, a feed-forward network analyzes each word vector and tries to predict the next word. This step doesn't exchange information between words but uses any information previously copied by an attention head.
Structure of the Feed-Forward Layer: The feed-forward layer in GPT-3 consists of neurons that compute a weighted sum of their inputs. The layer is powerful because of its huge number of connections: in the largest version of GPT-3, each feed-forward layer has 1.2 billion weight parameters, and the feed-forward layers together account for almost two-thirds of GPT-3's overall total of 175 billion parameters.
Pattern Matching: Researchers found that feed-forward layers work by pattern matching, with each neuron in the hidden layer matching a specific pattern in the input text. The patterns matched by neurons become more abstract in the later layers. Early layers tend to match specific words, whereas later layers match phrases that fall into broader semantic categories.
Adding Information to Word Vector: When a neuron matches one of these patterns, it adds information to the word vector. This information can often be thought of as a tentative prediction about the next word.
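Code Sketch: A hedged sketch of one feed-forward layer, with ReLU standing in for the activation function GPT-3 actually uses and the array shapes matching the largest GPT-3 (12,288-number word vectors and 49,152 hidden neurons, so the two weight matrices alone hold 2 × 12,288 × 49,152 ≈ 1.2 billion parameters):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    # x: one word's vector, shape (12288,).
    # W1: (12288, 49152) -- each hidden neuron holds a pattern to match.
    h = np.maximum(0.0, x @ W1 + b1)  # a neuron "fires" when its pattern matches
    # W2: (49152, 12288) -- firing neurons add information back into the vector.
    return x + h @ W2 + b2            # the residual add keeps the original vector too
```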
Feed-forward networks reason with vector math
Vector Math in Feed-Forward Networks: Feed-forward layers in language models like GPT-2 sometimes use vector arithmetic to predict the next word. This is similar to the way Google's word2vec model uses vector arithmetic to reason by analogy, such as Berlin - Germany + France = Paris.
Real-World Example: Researchers from Brown University found that GPT-2 used this method to answer questions about the capitals of countries. For example, when asked "What is the capital of Poland?", the model started predicting the correct answer, Warsaw, after the 20th layer. The researchers found that the 20th feed-forward layer achieved this by adding a vector that maps country vectors to their corresponding capitals.
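Code Sketch: A toy illustration of that mechanism with invented 2-D vectors (real GPT-2 vectors have hundreds of dimensions, and the offset is learned during training rather than hand-computed): a single "country to capital" offset vector, estimated from known pairs, carries a new country's vector to its capital.

```python
import numpy as np

# Invented 2-D stand-ins for real word vectors.
vec = {
    "germany": np.array([1.0, 0.0]), "berlin": np.array([1.2, 0.9]),
    "france":  np.array([2.0, 0.1]), "paris":  np.array([2.2, 1.0]),
    "poland":  np.array([3.0, 0.2]), "warsaw": np.array([3.2, 1.1]),
}

# Average "capital minus country" offset from two known pairs...
offset = ((vec["berlin"] - vec["germany"]) + (vec["paris"] - vec["france"])) / 2

# ...added to a third country's vector, lands nearest its capital.
guess = vec["poland"] + offset
print(min(vec, key=lambda w: np.linalg.norm(vec[w] - guess)))  # -> warsaw
```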
Other Transformations: The same model used vector arithmetic in its feed-forward layers to transform lower-case words into upper-case words and present-tense words into their past-tense equivalents. This shows the flexibility and power of vector arithmetic in language models.
The attention and feed-forward layers have different jobs
Division of Labor in Language Models: In language models like GPT-2, attention heads and feed-forward layers have different roles. Attention heads are responsible for retrieving information from earlier words in a prompt, while feed-forward layers allow the model to "remember" information that isn't present in the prompt.
Attention Heads: These components of the model focus on the context provided in the prompt. They help the model understand relationships between words in the given text, such as who is doing what to whom.
Feed-Forward Layers: These layers act like a database of information that the model has learned from its training data. They allow the model to recall facts and relationships that aren't explicitly stated in the prompt. For example, they can remember that "Warsaw" is the capital of "Poland" even if this fact isn't mentioned in the prompt.
Layer Complexity: The earlier feed-forward layers tend to encode simpler facts related to specific words, such as "Trump often comes after Donald." The later layers encode more complex relationships, such as "add this vector to convert a country to its capital." This shows that as the model processes the text, it moves from understanding basic word relationships to grasping more complex concepts and facts.
How language models are trained
Training Language Models: Language models like GPT-3 are trained using large amounts of text data. They learn by predicting the next word in a sequence of words. For example, given the sentence "I like my coffee with cream and...", the model might predict "sugar" as the next word.
No Need for Labeled Data: Unlike many early machine learning models that required human-labeled data, language models don't need explicitly labeled data. They can learn from any written material, such as Wikipedia articles or news stories.
Weight Adjustment: Initially, the model's predictions are not very accurate because its weight parameters (which determine how it makes predictions) start off as random numbers. However, as the model processes more and more text data, these weights are gradually adjusted to improve the model's predictions.
The Training Process: The training process involves two steps. The first is a "forward pass," where the model makes a prediction. The second is a "backward pass," where the model adjusts its weight parameters based on how accurate its prediction was. This adjustment process is guided by an algorithm called backpropagation.
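Code Sketch: A minimal version of that loop in PyTorch. Here `model` and `training_stream` are placeholders (any network mapping word ids to next-word scores, and any source of token sequences); real LLM training adds batching, tokenization, and learning-rate schedules on top of this skeleton.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters())

for tokens in training_stream:                 # tokens: a 1-D tensor of word ids
    inputs, targets = tokens[:-1], tokens[1:]  # each word's target is the word after it
    logits = model(inputs)                     # forward pass: make predictions
    loss = F.cross_entropy(logits, targets)    # measure how wrong they were
    optimizer.zero_grad()
    loss.backward()                            # backward pass: backpropagation
    optimizer.step()                           # nudge the weights to do better next time
```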
Scale of Training: Training a model like GPT-3 is a massive task. It involves repeating the forward and backward pass billions of times, once for each word in the training data. This requires a huge amount of computational power and time. For instance, training GPT-3 took months of work for dozens of high-end computer chips and involved over 300 billion trillion calculations.
The surprising performance of GPT-3
Surprising Performance of GPT-3: Language models like GPT-3 have shown impressive performance at tasks like essay writing, drawing analogies, and even coding. This is largely due to the scale of their training: GPT-3 was trained on about 500 billion words, whereas a typical human child encounters only around 100 million words by age 10.
Increasing Model Size: Over the years, OpenAI has increased the size of its language models, which has led to better performance on language tasks. The larger models not only learned more facts but also performed better on tasks requiring abstract reasoning.
Theory of Mind: Recent research has shown that larger models like GPT-3 can perform well on tasks that require understanding the mental states of others, a capability known as "theory of mind". However, there's ongoing debate about whether these results truly indicate a form of understanding or are just the result of the model's ability to predict language patterns.
Artificial General Intelligence: Some researchers argue that language models are showing early "sparks" of artificial general intelligence, the ability to think in a sophisticated, human-like way. For instance, GPT-4 was able to generate code to draw a unicorn, and to correctly place a horn on it when asked, despite its training data being entirely text-based.
Predictability of Language: One reason why language models work so well could be the predictability of language itself. Regularities in language often reflect regularities in the physical world, so when a language model learns about relationships among words, it's often implicitly learning about relationships in the world too.
Prediction as a Foundation of Intelligence: The idea of making predictions is seen as foundational to both biological and artificial intelligence. The human brain is often thought of as a "prediction machine" that makes predictions about our environment to navigate it successfully. Similarly, language models learn a lot about how human language works by figuring out how to best predict the next word. However, this approach has led to systems whose inner workings we don't fully understand.