How a Large Language Model works

A deep dive into the real mechanics — and how this project simulates them.

🧠 1. What is a Large Language Model?

A Large Language Model (LLM) is a neural network trained to predict the most likely next word (or token) given everything that came before it. That's it. The surprising thing is that a single objective — predict the next token — turns out to be enough to learn grammar, facts, reasoning patterns, and even code, if you train on enough text.

Modern LLMs are built on the Transformer architecture (introduced by Google in 2017). A Transformer stacks many identical layers, each of which applies a mechanism called self-attention to find relationships between all tokens in the current context simultaneously.

📏 Scale matters
GPT-4 has an estimated 1 trillion parameters; LLaMA 3 (70B) has 70 billion. Every parameter is a number that was tuned during training to minimise a single objective: predict the next token. The "knowledge" of an LLM is entirely encoded in those numbers.

The pipeline from text to answer involves four conceptually distinct stages: tokenize → embed → process → generate. The rest of this page explains each one.

✂️ 2. Tokenization — text becomes numbers

Neural networks can only process numbers, so the first step is converting text into a sequence of integers called token IDs. A tokenizer splits the input string into tokens — units that may be whole words, sub-words, or single characters — and maps each to an integer from a fixed vocabulary.

Real LLMs: Byte-Pair Encoding (BPE)

Most modern LLMs use BPE or a variant (SentencePiece, WordPiece). BPE starts with a vocabulary of individual bytes and repeatedly merges the most frequent adjacent pair. This lets the tokenizer handle any input (it can always fall back to byte-level) while keeping common words as single tokens.

Input: "tokenization"

GPT-4's BPE tokenizer splits this as:

  "token"   → 2963
  "ization" → 2065

So the input becomes the sequence [ 2963, 2065 ].
Vocabulary size: ~100 000 tokens (varies by model).

Why sub-word tokenization?

Pure word-based tokenizers fail on rare words and typos. Character-based tokenizers produce very long sequences and slow down training. Sub-word tokenization is the sweet spot: common words are one token, rare words are split into familiar sub-parts.
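The merge loop at the heart of BPE training can be sketched in a few lines. This is a toy version (not any real tokenizer's code): it starts from individual characters and repeatedly merges the most frequent adjacent pair, exactly the process described above.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters and apply a few merges, as BPE training does.
# "_" stands in for the space so word boundaries stay visible.
tokens = list("low lower lowest".replace(" ", "_"))
for _ in range(4):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a few merges the common stem "low" has become a single token while rarer suffixes remain split, which is the behaviour the paragraph above describes.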

🔬 In this simulation
SimpleTokenizer uses a regex to split on whitespace and punctuation (\w+|[^\w\s]) and assigns a fresh integer ID to every new surface form it encounters. The vocabulary grows dynamically at runtime. This makes the encode/decode pipeline fully transparent and easy to inspect in the trace — but it is not how real tokenizers work.
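A minimal re-creation of that approach might look like this (the class and method names here are illustrative, not the project's actual code):

```python
import re

class SimpleTokenizerSketch:
    """Sketch of a whitespace/punctuation tokenizer with a dynamic vocabulary."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = {}

    def encode(self, text):
        ids = []
        # \w+ matches word runs; [^\w\s] matches single punctuation marks.
        for token in re.findall(r"\w+|[^\w\s]", text):
            if token not in self.token_to_id:
                new_id = len(self.token_to_id)  # vocabulary grows at runtime
                self.token_to_id[token] = new_id
                self.id_to_token[new_id] = token
            ids.append(self.token_to_id[token])
        return ids

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

tok = SimpleTokenizerSketch()
print(tok.encode("Hello, world! Hello again."))  # → [0, 1, 2, 3, 0, 4, 5]
```

Note how the repeated "Hello" reuses ID 0: the mapping is stable within a run, but unlike a real BPE vocabulary it is not fixed in advance.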

🔢 3. Embeddings — numbers become vectors

A token ID like 2963 is just an index. Before the Transformer can work with it, each ID is converted into a dense vector (a list of floating-point numbers, typically 768–8192 dimensions) called an embedding.

Think of an embedding as a point in a very high-dimensional space. The training process arranges these points so that semantically similar tokens end up nearby: "king" and "queen" are close; "banana" and "circuit" are far apart.

Token → Embedding (simplified to 4 dimensions for readability)

  "king"   → [  0.82, -0.31,  0.54, 0.11 ]
  "queen"  → [  0.79, -0.28,  0.53, 0.14 ]  ← very similar
  "banana" → [ -0.12,  0.67, -0.44, 0.89 ]  ← different region

Real models use 768–8192 dimensions.
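Closeness in embedding space is usually measured with cosine similarity. A small sketch using the toy 4-dimensional vectors above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

king   = [0.82, -0.31, 0.54, 0.11]
queen  = [0.79, -0.28, 0.53, 0.14]
banana = [-0.12, 0.67, -0.44, 0.89]

print(cosine_similarity(king, queen))   # close to 1: similar meaning
print(cosine_similarity(king, banana))  # much lower: different region
```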
⚠️ Simulation gap
This project skips embeddings entirely. There are no learned vector representations here — we go directly from token IDs to the scoring step. This is the biggest simplification in the simulation.

👁️ 4. Attention — context shapes meaning

The core innovation of the Transformer is self-attention. At each layer, every token looks at every other token and decides how much it should "attend to" them when building its own representation.

Concretely: for each token, three vectors are computed — Query (Q), Key (K), and Value (V). Attention scores are the dot product of Q with all K vectors, normalised by softmax. The output is a weighted sum of V vectors.

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

where dₖ is the dimension of the key vectors. The √dₖ scaling keeps dot products from growing with dimension; without it the softmax saturates and gradients become vanishingly small.

In plain English: "The word 'it' in 'The animal was tired because it…' attends strongly to 'animal', so the model represents 'it' as meaning 'the animal'." Attention allows the model to resolve such references regardless of how far apart the tokens are.
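The formula above can be written out directly. This is a plain-Python sketch of single-head scaled dot-product attention on tiny hand-made vectors; there are no learned projections here, so it illustrates the mechanics only:

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V are lists of vectors, one per token."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score each key against this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Two tokens, 2-d vectors: each query aligns with one key,
# so each output leans toward the matching value vector.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))
```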

Multi-head attention

In practice, attention is computed h times in parallel with different learned projections (the "heads"). Each head can specialise in a different type of relationship: one might track syntax, another coreference, another semantic similarity. The outputs of all heads are concatenated and projected back to the model dimension.

⚠️ Simulation gap
This project does not implement attention. There are no Q/K/V matrices. The "context" used during generation is simply the list of token IDs that have been generated so far — useful for the repetition penalty, but not a real representation.

🔁 5. The generation loop — one token at a time

LLMs generate text autoregressively: one token at a time, each new token conditioned on everything generated so far.

  1. 📥 Input context: all tokens generated so far (prompt + previous output) form the context window.
  2. 🔢 Forward pass: the full context is fed through all Transformer layers (embedding → attention × N → layer norm).
  3. 📊 Logits: the final layer projects to vocabulary size, producing a raw score (logit) for every possible next token.
  4. 🎲 Sampling: logits are converted to probabilities via softmax (with temperature). One token is sampled from the distribution.
  5. 🔁 Append & repeat: the sampled token is appended to the context. The loop repeats until an EOS token is generated or max length is reached.
📐 Context window
Every forward pass must process the entire context, which is why the context window size is such an important parameter. GPT-4 supports up to 128 000 tokens; Claude 3.5 up to 200 000. Attention's O(n²) complexity makes very long contexts expensive.
🔬 In this simulation
The generation loop in LLMCore.generate() does iterate token-by-token. For each step it:
  1. Draws top_k candidate tokens (the correct one + random distractors from the vocabulary).
  2. Assigns a pseudo-random base score to each candidate.
  3. Applies a repetition penalty to tokens seen in the last 10 context tokens.
  4. Boosts the target token so the demo output remains coherent.
  5. Applies temperature-scaled softmax.
  6. Logs the full probability table to the trace.
The forward pass through a real neural network is replaced by the scoring heuristics above.

🌡️ 6. Softmax & temperature — shaping the distribution

After the forward pass, the model has a raw score (logit) for every token in the vocabulary. Softmax converts those scores into a proper probability distribution: all values become positive and sum to exactly 1.

P(tokenᵢ) = exp(logitᵢ / T) / Σⱼ exp(logitⱼ / T)

where T is the temperature (default 1.0 in most models).

What does temperature do?

Temperature T is applied as a divisor before the exponential:

  • T = 1.0 — the model's learned distribution, unchanged.
  • T < 1 — sharpens the distribution; the top token becomes even more dominant. More deterministic, less creative.
  • T > 1 — flattens the distribution; gives more probability mass to unlikely tokens. More diverse, more "creative" (and more likely to be wrong).
  • T → 0 approaches greedy decoding: always picking the single most probable token.
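The effect is easy to verify numerically. A sketch with made-up logits (the scores below are invented for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before exponentiating: T<1 sharpens, T>1 flattens."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.0, 2.0]  # hypothetical scores for three candidate tokens
for T in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Running this shows the top candidate's probability shrinking as T rises, while the tail candidates gain mass; the distribution always still sums to 1.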

Here's an example showing how temperature changes a concrete distribution:

Context: "The capital of France is" → next token

T = 0.5 (sharp, deterministic):
  "Paris"     96.1%
  "Lyon"       2.8%
  "Nice"       1.1%

T = 1.5 (flat, creative):
  "Paris"     54.3%
  "Lyon"      23.7%
  "Nice"      14.1%
  "Bordeaux"   7.9%

Other sampling strategies

Temperature is often combined with:

  • Top-k sampling — only sample from the k most probable tokens (discard everything else).
  • Top-p (nucleus) sampling — only sample from the smallest set of tokens whose cumulative probability exceeds p (e.g. 0.9).
  • Repetition penalty — reduce the score of tokens that already appeared in the context, to discourage repetitive output.
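All three strategies are straightforward to implement on a probability list. A sketch (simplified: real implementations work on logits and batched tensors):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens; renormalise them to sum to 1."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def apply_repetition_penalty(scores, recent_ids, factor=0.25):
    """Scale down the raw scores of tokens already seen in the context."""
    return [s * factor if i in recent_ids else s for i, s in enumerate(scores)]

probs = [0.5, 0.3, 0.15, 0.04, 0.01]
print(top_k_filter(probs, 2))    # only the two best tokens survive
print(top_p_filter(probs, 0.9))  # tokens 0-2: cumulative 0.95 >= 0.90
```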
🔬 In this simulation
The simulation implements temperature-scaled softmax and a repetition penalty (×0.25 for tokens in the last 10 context positions). The top-k candidate table is stored in the trace at every step — you can inspect it in both the main UI (probability pills) and the Trace Viewer (full table with scores). The default temperature is 0.7.

🤖 7. Agents & tool use

A bare LLM is a text-in / text-out function. It cannot browse the web, run code, or call an API. Agents extend LLMs with the ability to use tools and take multi-step actions.

How tool use works in real systems

Modern frameworks (OpenAI function calling, Anthropic tool use) work roughly like this:

  1. A list of available tools (name, description, JSON schema of parameters) is prepended to the system prompt.
  2. The LLM is fine-tuned to output a structured JSON block when it decides a tool is needed (instead of plain text).
  3. The host application parses that JSON, calls the real tool, and inserts the result back into the conversation as a new message.
  4. The LLM continues generating from there, now with the tool result in its context.
User: "What is the weather in Rome?"
  │
  ▼
LLM (decides a tool is needed):
  { "tool": "get_weather", "args": { "city": "Rome" } }
  │
  ▼
Host calls the real weather API → "18°C, partly cloudy"
  │
  ▼
LLM continues: "The current weather in Rome is 18°C with partly cloudy skies."
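The host side of this round trip can be sketched as follows. Everything here is a stand-in: the tool name, its arguments, and the weather "API" are invented for illustration, not any real framework's schema.

```python
import json

def fake_weather_api(city):
    """Stand-in for a real weather service (hard-coded for the demo)."""
    return {"Rome": "18°C, partly cloudy"}.get(city, "unknown")

def run_tool_call(llm_output):
    """Parse the model's structured tool request and execute the tool."""
    request = json.loads(llm_output)
    if request["tool"] == "get_weather":
        result = fake_weather_api(request["args"]["city"])
        # The result is inserted back into the conversation as a new message,
        # and the LLM continues generating with it in context.
        return {"role": "tool", "content": result}
    raise ValueError(f"unknown tool: {request['tool']}")

llm_output = '{"tool": "get_weather", "args": {"city": "Rome"}}'
print(run_tool_call(llm_output))  # → {'role': 'tool', 'content': '18°C, partly cloudy'}
```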

Chain-of-thought reasoning

Before deciding which tool to call (or whether to call one at all), frontier models often produce a reasoning trace — a scratchpad of intermediate thoughts. In the OpenAI "o" series and DeepSeek-R1, this trace is explicit and inspectable. In older models it was hidden or only possible via prompting ("Let's think step by step").

🔬 In this simulation
ReasoningAgent uses simple rule-based heuristics instead of an LLM:
  • If the query contains an arithmetic expression + a trigger word ("calculate", "what is", …) → call the Calculator tool.
  • If it starts with a factual question word ("what is", "explain", …) → call the FakeSearch tool (in-memory knowledge base).
  • Otherwise → skip tools and go straight to the LLM core.
Every decision step is written explicitly into the trace so you can follow the reasoning in the Trace Viewer — exactly as you would with a real chain-of-thought system.
🔒 Security note: the Calculator tool
CalculatorTool uses Python's eval(), but only after validating the expression against a strict regex whitelist (^[\d\s\+\-\*\/\(\)\.]+$) and executing with __builtins__={} — so there is no access to any Python built-in or variable. Even so, eval() should never be used on untrusted input in a real production system.
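An illustrative re-creation of that validation pattern (not the project's actual CalculatorTool code):

```python
import re

# Whitelist: digits, whitespace, + - * / ( ) and the decimal point only.
SAFE_EXPR = re.compile(r"^[\d\s\+\-\*\/\(\)\.]+$")

def safe_calculate(expression):
    """Evaluate arithmetic only: regex whitelist plus empty builtins."""
    if not SAFE_EXPR.match(expression):
        raise ValueError("expression contains disallowed characters")
    # With no builtins and no variables, eval can only do arithmetic here.
    return eval(expression, {"__builtins__": {}}, {})

print(safe_calculate("2 * (3 + 4.5)"))  # → 15.0
try:
    safe_calculate("__import__('os')")  # rejected by the regex before eval
except ValueError as err:
    print("blocked:", err)
```

Even with both safeguards, this pattern is only defensible in a sandboxed demo; as the note above says, eval() has no place on untrusted input in production.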

🔬 8. What this simulation actually does

The full pipeline, from the moment you hit Ask → to the moment the trace is ready:

browser → POST /run → server.py, which fans out to:

  1. PromptBuilder — system + user text → full prompt
  2. SimpleTokenizer — whitespace split → token IDs
  3. ReasoningAgent — intent detection → tool dispatch
       4a. CalculatorTool — safe eval of arithmetic
       4b. FakeSearchTool — keyword search over the in-memory KB
  5. LLMCore.generate() — for each target token:
       • draw top_k candidates
       • pseudo-random scores
       • repetition penalty
       • softmax + temperature
       • log probability table
  6. final answer + llm_trace.json → browser (JSON), rendered as animated token pills with a "View Trace" button

Every one of these steps writes a TraceStep into the append-only Trace object. At the end the trace is serialised to llm_trace.json and served at GET /llm_trace.json. The Trace Viewer fetches it and renders every step as an expandable card.
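An append-only trace like this is simple to build. The field names below are assumptions for illustration; the project's actual TraceStep schema may differ:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TraceStep:
    """One logged decision (hypothetical fields, not the project's schema)."""
    component: str
    action: str
    data: dict

@dataclass
class Trace:
    """Append-only list of steps, serialisable to JSON for a viewer."""
    steps: list = field(default_factory=list)

    def log(self, component, action, **data):
        self.steps.append(TraceStep(component, action, data))

    def to_json(self):
        return json.dumps([asdict(s) for s in self.steps], indent=2)

trace = Trace()
trace.log("SimpleTokenizer", "encode", text="hello", ids=[0])
trace.log("LLMCore", "softmax", temperature=0.7)
print(trace.to_json())
```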

Real vs. simulated — side by side

| Aspect | Real LLM | This simulation |
|---|---|---|
| Tokenizer | BPE / WordPiece, fixed vocabulary of 32k–200k tokens | Whitespace + punctuation regex, dynamic vocabulary |
| Embeddings | Learned dense vectors (768–8192 dims) | Not implemented (skipped) |
| Attention | Multi-head self-attention across full context | Not implemented; context is just a list of IDs |
| Token scores | Output of the final linear layer (learned logits) | Pseudo-random scores + repetition penalty heuristic |
| Softmax | Applied to all ~100k logits | Applied to top_k=6 candidates (same math, smaller scale) |
| Temperature | Scales logits before softmax | Same formula, same effect (fully functional) |
| Sampling | Draw from distribution (top-p / top-k) | Target token is forced (but probability table is real) |
| Tool use | LLM outputs structured JSON; framework parses it | Rule-based intent detection; same logical structure |
| Reasoning trace | Hidden (or via CoT prompting / o-series) | Explicit, logged to trace at every step |
| Observability | Black box (unless you add logging) | Every decision visible in the Trace Viewer |

⚠️ 9. What this simulation cannot show

This project is deliberately simplified for educational purposes. Here are the most important things that are not represented:

  • Emergent capabilities — the ability to translate, reason about logic, write code, etc. comes from training on enormous datasets with billions of parameters, not from the architecture alone.
  • RLHF / RLAIF — modern assistants (ChatGPT, Claude) are fine-tuned with reinforcement learning from human (or AI) feedback, which fundamentally shapes tone, safety, and instruction-following.
  • True semantic understanding — the simulation's token scores are random; a real model's scores encode the statistical regularities of human language.
  • KV-cache — real inference reuses the key/value tensors from previous tokens to avoid re-computing the full forward pass at every step (huge performance win). Not relevant here since there is no forward pass.
  • Quantization & hardware — running a 70B model requires tens of GB of VRAM. Techniques like 4-bit quantization (GGUF) and speculative decoding are critical in practice.
📚 Going further
To understand the real machinery in depth:
  • Attention Is All You Need — Vaswani et al. (2017), the original Transformer paper
  • The Illustrated GPT-2 by Jay Alammar — incredibly clear visual walkthrough
  • nanoGPT by Andrej Karpathy — a minimal, readable GPT implementation in ~300 lines of Python
  • Let's build GPT — Karpathy's 3-hour video building a GPT from scratch

Ready to see the pipeline in action?

Ask a question →