You are called to the scene of a possible burglary. An empty gun cartridge lies on the floor next to the overturned wastepaper basket (which was full of torn documents). There has clearly been a struggle, as the floor is scattered with objects knocked on their sides: a paperweight, a letter opener in the shape of a snake, a set of car keys and some tacky Mexican Day of the Dead souvenirs. The floor on the other side of the room, next to the sofa, is wet, but it’s not obvious what liquid has been spilled, or perhaps sprayed.
These days, I’m teaching a lot of people about Large Language Models. During every session, we discuss students’ preconceived ideas about what is going on. Some see ChatGPT and Claude as “like Google but less extensive or trustworthy”, “like Google but just one answer” or “powerful but not up-to-date”. These are all true, in one sense, but by thinking about how we use the tools and building robust mental models of what is going on, we should begin to see the immense capabilities and power of this emerging technology. Unlike Google, LLMs generate text rather than retrieving information. They can offer radically different perspectives when guided by well-crafted prompts. And by supplementing prompts with news searches, documents, and relevant context, we can explore even the latest trends and complex issues effectively.
Developing a mental model is hard as there is so much going on: training, feedback, fine-tuning, spotting patterns, forming embeddings, adjusting weights, context windows, RAG and more. I try to break my explanations into a series of simple stories.
And what better stories than detective mysteries? Here, then, is part of an explanation (I have others) of attention and embeddings.
Think of the training phase of a large language model as a detective trying to make sense of a crime scene (like the one above). This detective has seen other cases and has developed a knack for asking questions, linking clues and paying attention to what might have happened.
In a large language model, all text from documents is first turned into embeddings: a representation of the text in a highly coded form (as vectors of numbers). Think of embeddings as encoding the patterns in data or evidence in ways that can be used later, a bit like a case file that can be interrogated in court. The term “embeddings” is not immediately intuitive, but see them as a strategy for capturing meaning or order in the data.
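To make “vectors of numbers” concrete, here is a toy sketch. The words and their three-dimensional vectors are invented purely for illustration (real models learn vectors with hundreds or thousands of dimensions), but the idea holds: words used in similar contexts end up with similar vectors, and we can measure that similarity.

```python
import numpy as np

# Toy, hand-made "embeddings". In a real LLM these vectors are learned
# during training and have hundreds or thousands of dimensions, not three.
embeddings = {
    "cartridge": np.array([0.9, 0.1, 0.0]),
    "gun":       np.array([0.8, 0.2, 0.1]),
    "sofa":      np.array([0.1, 0.9, 0.3]),
}

def cosine_similarity(a, b):
    """Higher means the two vectors point in a more similar direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words that occur in similar contexts end up with similar vectors.
print(cosine_similarity(embeddings["gun"], embeddings["cartridge"]))  # high
print(cosine_similarity(embeddings["gun"], embeddings["sofa"]))       # low
```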
A rookie detective would simply record the visible clues at the scene, but a more experienced one would start to make connections. She might first look at the proximity (or distance) between clues. How often is a body found near a weapon? Over many cases, detectives develop a sense of what is likely to be found nearby. This is similar to looking at how often words appear near each other in the billions of text sequences used to train an LLM.
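As a rough sketch of that intuition, here is a simple co-occurrence count over a made-up three-sentence corpus. This is not how a modern LLM is trained, but it captures the “what tends to appear nearby” idea that embeddings build on.

```python
from collections import Counter

# A tiny, made-up "corpus" standing in for billions of training sentences.
corpus = [
    "the detective found the cartridge near the gun",
    "the gun was hidden under the sofa",
    "the detective questioned the suspect about the gun",
]

window = 2  # how many following words count as "nearby"
pair_counts = Counter()

for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for neighbour in words[i + 1 : i + 1 + window]:
            pair_counts[tuple(sorted((word, neighbour)))] += 1

# The most frequent pairs give a first, crude sense of what "belongs together":
# the rookie detective's notebook, if you like.
print(pair_counts.most_common(5))
```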
Both detectives and LLMs focus attention on bigger relationships involving time (the sequence of events), possible suspects (and their relationships to each other) and even high-level ideas such as revenge, motives, excuses and alibis. In LLM training, it is these long-range relationships that appear to provide the real power of the models, and, as in a detective story, the secret is to understand the setting, the context and the way things normally behave.
The next piece of evidence builds the story.
This is the third suspected burglary in the area in the past week and a pattern is starting to emerge: no forced entry, mess everywhere (but only in one room), and Día de los Muertos objects at all of them.
And this is what happens when additional data is presented to an LLM: the embeddings are updated to reflect the new knowledge. The model now knows more than it did, and so does the detective.
Keys, queries and values
In the neural networks of an LLM, the embeddings are not the whole story. Embeddings are the inputs to the reasoning process and the neural network transforms them into keys, queries and values. The key captures features of the token that are important for other tokens to consider when attending to this token. The query represents what a specific token is looking for in other tokens (what it needs). The value contains the information that should be passed forward for further processing. We need to try and make sense of this using our metaphor.
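Before returning to the detective, here is a minimal sketch of that transformation, with made-up sizes and random numbers standing in for learned weights: each token’s embedding is multiplied by three weight matrices to give its query, key and value.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 8, 4  # made-up sizes, purely for illustration

# One embedding per token (per piece of evidence). Random here; learned in practice.
token_embeddings = rng.normal(size=(3, d_model))

# Three projection matrices (again random here, just to show the shapes).
W_q = rng.normal(size=(d_model, d_k))
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = token_embeddings @ W_q  # queries: what each token is looking for
K = token_embeddings @ W_k  # keys: the features each token offers to be matched against
V = token_embeddings @ W_v  # values: the information each token passes forward
print(Q.shape, K.shape, V.shape)  # (3, 4) (3, 4) (3, 4)
```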
As our detective examines each piece of evidence (token), she approaches it with specific questions in mind (queries). These questions are shaped by her experience and the current context of the investigation.
Each piece of evidence has distinct features (keys) that make it relevant to certain questions. The gun cartridge’s caliber, the contents of the torn documents, the snake-shaped letter opener’s unusual design – these are all ‘keys’ that could unlock important connections.
When a piece of evidence’s features (keys) align well with the detective’s questions (queries), she pays close attention to the information it provides (values). This valuable information is what she’ll use to piece together the full story of the crime.
For instance, when examining the Day of the Dead souvenirs (a key feature), the detective might ask, ‘How does this relate to other recent burglaries?’ The value here is the emerging pattern of these objects at multiple crime scenes, which becomes crucial information for the ongoing investigation.
Just as our detective weighs the importance of different pieces of evidence, the attention mechanism in an LLM uses the relationship between queries, keys, and values to determine which parts of the input are most relevant for generating the next part of the output.
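In code, that weighing-up is the scaled dot-product attention at the heart of a transformer. The sketch below uses random numbers in place of the learned queries, keys and values, but the mechanics are the same: each query is scored against every key, and the scores decide how much of each value flows forward.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Each query scores every key; the scores then weight the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # how well each question matches each clue
    weights = softmax(scores)         # where the detective decides to look
    return weights @ V, weights       # a weighted mix of the information (values)

rng = np.random.default_rng(0)
# Three "tokens" (pieces of evidence), each with a 4-dimensional
# query, key and value, as produced by the projections sketched earlier.
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))

output, weights = attention(Q, K, V)
print(np.round(weights, 2))  # each row sums to 1: how much each token attends to the others
print(output.shape)          # (3, 4): an updated representation of each token
```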
I want to find many more stories to tell that will help people make sense of this technology for themselves. A single story (like a single crime scene) will not be enough. Like an LLM, a learner needs to piece together clues over time. Only then will they start to appreciate what is going on, and that, I believe, will help them use these tools better.