How a Large Language Model works

A deep dive into the real mechanics — and how this project simulates them.

🧠 1. What is a Large Language Model?

A Large Language Model (LLM) is a neural network trained to predict the most likely next word (or token) given everything that came before it.

Modern LLMs are built on the Transformer architecture (introduced by Google in 2017).

Scale matters
GPT-4 is widely estimated to have on the order of 1 trillion parameters, though OpenAI has not published an official figure...

The pipeline from text to answer involves four conceptually distinct stages: tokenize → embed → process → generate.

✂️ 2. Tokenization — text becomes numbers

Neural networks can only process numbers, so the first step is converting text into a sequence of integers called token IDs.

Real LLMs: Byte-Pair Encoding (BPE)

Most modern LLMs use BPE or a closely related sub-word scheme such as WordPiece or the unigram model, often through the SentencePiece library.

Input: "tokenization"

GPT-4 BPE splits this as:
  "token"    → 2963
  "ization"  → 2065

So the input becomes the sequence: [ 2963, 2065 ]
Vocabulary size: ~100 000 tokens (varies by model)
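To make the split above concrete, here is a minimal sketch of the "apply merges" half of BPE. The merge table `TOY_MERGES` and the two-entry `TOY_VOCAB` are hand-written assumptions chosen so that "tokenization" ends up as the two pieces quoted above; a real tokenizer learns tens of thousands of merges from a corpus and usually works on bytes rather than characters.

```python
def apply_bpe(word, merges):
    """Split `word` into characters, then apply each merge rule in priority order."""
    symbols = list(word)
    for left, right in merges:
        merged = []
        i = 0
        while i < len(symbols):
            # Fuse the adjacent pair (left, right) wherever it occurs.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table (earlier entries have higher priority).
TOY_MERGES = [
    ("t", "o"), ("k", "e"), ("to", "ke"), ("toke", "n"),
    ("i", "z"), ("a", "t"), ("iz", "at"), ("i", "o"), ("io", "n"), ("izat", "ion"),
]

# Toy vocabulary that reuses the two IDs from the example above.
TOY_VOCAB = {"token": 2963, "ization": 2065}

pieces = apply_bpe("tokenization", TOY_MERGES)
ids = [TOY_VOCAB[p] for p in pieces]
print(pieces, ids)
```

The point of the sketch is that the tokenizer never needs "tokenization" in its vocabulary: it only needs merge rules that build the word out of smaller, reusable pieces.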

Why sub-word tokenization?

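Pure word-based tokenizers fail on rare words and typos: any word missing from the vocabulary collapses into a single "unknown" token, while sub-word pieces can still spell it out.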

🔮 In this simulation
SimpleTokenizer uses a regex to split on whitespace and punctuation.
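A tokenizer of this kind can be sketched in a few lines. The class name `ToyRegexTokenizer`, the exact pattern, and the ID assignment scheme below are assumptions for illustration; the project's actual SimpleTokenizer may use a different regex and vocabulary layout.

```python
import re


class ToyRegexTokenizer:
    """Hypothetical stand-in for SimpleTokenizer, not the project's actual code."""

    # Runs of word characters, or any single non-space, non-word character.
    PATTERN = re.compile(r"\w+|[^\w\s]")

    def __init__(self):
        self.vocab = {}  # token string -> ID, grows as new tokens are seen

    def encode(self, text):
        ids = []
        for tok in self.PATTERN.findall(text):
            if tok not in self.vocab:
                self.vocab[tok] = len(self.vocab)  # next free ID
            ids.append(self.vocab[tok])
        return ids


tokenizer = ToyRegexTokenizer()
print(tokenizer.encode("Hello, world! Hello"))
```

Unlike BPE, the vocabulary here is built on the fly, so the same word can receive a different ID from one run to the next depending on what text was seen first.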

🔢 3. Embeddings — numbers become vectors

A token ID like 2963 is just an index. Before the Transformer can work with it, each ID is converted into a dense vector called an embedding.

Think of an embedding as a point in a very high-dimensional space.

Token → Embedding (simplified to 4 dimensions for readability)

  "king"    [ 0.82, -0.31,  0.54,  0.11 ]
  "queen"   [ 0.79, -0.28,  0.53,  0.14 ]   ← very similar
  "banana"  [-0.12,  0.67, -0.44,  0.89 ]   ← different region

Real models use 768 – 8192 dimensions.
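One way to check the "very similar" and "different region" labels is cosine similarity. The sketch below uses the illustrative 4-dimensional vectors from the example; the helper name `cosine` and the dictionary `EMBEDDINGS` are assumptions, and real embeddings are learned rather than written by hand.

```python
import math

# Illustrative vectors copied from the example above.
EMBEDDINGS = {
    "king":   [0.82, -0.31, 0.54, 0.11],
    "queen":  [0.79, -0.28, 0.53, 0.14],
    "banana": [-0.12, 0.67, -0.44, 0.89],
}


def cosine(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, -1 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print("king vs queen :", round(cosine(EMBEDDINGS["king"], EMBEDDINGS["queen"]), 3))
print("king vs banana:", round(cosine(EMBEDDINGS["king"], EMBEDDINGS["banana"]), 3))
```

With these numbers, "king" and "queen" point in almost the same direction, while "king" and "banana" have a negative similarity.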
⚠️ Simulation gap
This project skips embeddings entirely.

👁️ 4. Attention — context shapes meaning

The core innovation of the Transformer is self-attention.

Concretely: for each token, three vectors are computed — Query (Q), Key (K), and Value (V).

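Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V
dₖ = dimension of the key vectors. Dividing by √dₖ keeps the dot products from growing with dₖ, which would otherwise saturate the softmax and leave it with very small gradients.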

In plain English: "The word 'it' in 'The animal was tired because it…' attends strongly to 'animal'."
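The formula can be sketched in plain Python for a single head. The 2-dimensional Q, K and V values below are made up so that the query for "it" lines up with the key for "animal"; in a real model these vectors come from learned projections of the embeddings, and a causal mask hides future tokens. Neither is modelled here.

```python
import math


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors (one row per token)."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query with every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights


# Made-up vectors: keys and values for "animal", "was", "it"; one query for "it".
K = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Q = [[1.0, 0.0]]

out, attn = attention(Q, K, V)
print("attention weights for 'it':", [round(w, 3) for w in attn[0]])
```

The largest weight lands on the first key ("animal"), so the output vector for "it" is pulled toward the value vector of "animal".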

Multi-head attention

In practice, attention is computed h times in parallel with different learned projections (the "heads").

⚠️ Simulation gap
This project does not implement attention. There are no Q/K/V matrices.

5. The generation loop — one token at a time

LLMs generate text autoregressively: one token at a time.

📥 Input context
The prompt plus all tokens generated so far form the context.
🔢 Forward pass
The full context is fed through all Transformer layers.
📊 Logits
The final layer projects to vocabulary size, producing a raw score (logit) for every possible next token.
🎲 Sampling
Logits are converted to probabilities via softmax. One token is sampled.
🔁 Append & repeat
The sampled token is appended to the context. The loop repeats until an end-of-sequence (EOS) token is produced or the maximum length is reached.
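The five steps above can be sketched as a short loop. Everything model-related here is a stand-in: `toy_logits` is a hypothetical function that simply favours the next word in a fixed five-word cycle, and `VOCAB` is a made-up vocabulary. Only the loop structure mirrors what a real LLM does.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat", "down"]  # made-up vocabulary, ID = index


def toy_logits(context):
    """Hypothetical stand-in for a forward pass through the Transformer layers."""
    favoured = (context[-1] + 1) % len(VOCAB)  # "down" is followed by "<eos>"
    return [3.0 if i == favoured else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(prompt_ids, max_new_tokens=10, eos_id=0, rng=None):
    rng = rng or random.Random(0)
    context = list(prompt_ids)              # 1. input context
    for _ in range(max_new_tokens):
        logits = toy_logits(context)        # 2-3. forward pass -> logits
        probs = softmax(logits)             # 4. logits -> probabilities
        next_id = rng.choices(range(len(VOCAB)), weights=probs, k=1)[0]
        context.append(next_id)             # 5. append and repeat
        if next_id == eos_id:
            break
    return context


ids = generate([1])
print(" ".join(VOCAB[i] for i in ids))
```

Because a token is sampled rather than always taking the highest score, different seeds can produce different continuations from the same prompt.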
Context window
Every forward pass must process the entire context.
🔮 In this simulation
The generation loop in LLMCore.generate() does iterate token-by-token.

🌡️ 6. Softmax & temperature — shaping the distribution

After the forward pass, the model has a raw score (logit) for every token. Softmax converts those scores into a proper probability distribution.

P(tokenᵢ) = exp(logitᵢ / T) / Σⱼ exp(logitⱼ / T)
T = temperature (default 1.0 in most models)
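The formula translates almost directly into code. This is a minimal sketch: the function name `softmax_with_temperature` and the three example logits are assumptions, not values taken from any model.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw logits into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


example_logits = [4.0, 2.0, 1.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(f"T = {t}:", [round(p, 3) for p in probs])
```

Running it shows the top candidate's share shrinking as T rises, while the ranking of the candidates stays the same.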

What does temperature do?

Temperature T is applied as a divisor before the exponential:

  • T = 1.0 — the model's learned distribution, unchanged.

Here's an example showing how temperature changes a concrete distribution:

Context: "The capital of France is" → next token

T = 0.5 (sharp, deterministic)
  "Paris"     96.1%
  "Lyon"       2.8%
  "Nice"       1.1%

T = 1.5 (flat, creative)
  "Paris"     54.3%
  "Lyon"      23.7%
  "Nice"      14.1%
  "Bordeaux"   7.9%

Other sampling strategies

Temperature is often combined with:

  • Top-k sampling
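Top-k sampling can be sketched as follows: keep only the k highest-scoring tokens, renormalise over them, and sample. The function name `top_k_sample` and the example logits are assumptions for illustration.

```python
import math
import random


def top_k_sample(logits, k, temperature=1.0, rng=None):
    """Sample a token ID from the k highest-scoring candidates only."""
    rng = rng or random.Random()
    # IDs of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # unnormalised softmax weights
    return rng.choices(top, weights=weights, k=1)[0]


example_logits = [0.1, 2.0, 1.5, -1.0]  # made-up scores for token IDs 0..3
picked = top_k_sample(example_logits, k=2, rng=random.Random(0))
print("sampled token ID:", picked)
```

With k=2 only token IDs 1 and 2 can ever be chosen, which cuts off the long tail of unlikely tokens; k=1 reduces to greedy decoding.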
🔮 In this simulation
The simulation implements temperature-scaled softmax and a repetition penalty.
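A repetition penalty can take several forms. The sketch below shows one common formulation, in which the logit of every token already present in the output is pushed toward "less likely". The function name and the default penalty of 1.2 are assumptions, and the heuristic used in this project may differ.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Return a copy of `logits` with already-generated tokens made less likely."""
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty   # shrink positive scores
        else:
            adjusted[token_id] *= penalty   # push negative scores further down
    return adjusted


example_logits = [2.4, -1.0, 0.5]   # made-up scores for token IDs 0..2
penalised = apply_repetition_penalty(example_logits, generated_ids=[0, 1, 0])
print(penalised)
```

The penalty is applied to the logits before softmax, so it composes naturally with temperature scaling.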

🤖 7. Agents & tool use

A bare LLM is a text-in / text-out function. Agents extend LLMs with the ability to use tools and take multi-step actions.

How tool use works in real systems

Modern frameworks (OpenAI function calling, Anthropic tool use) work roughly like this:

  1. A list of available tools is prepended to the system prompt.
User: "What is the weather in Rome?"
  |
  ↓
LLM (decides tool needed): { "tool": "get_weather", "args": { "city": "Rome" } }
  |
  ↓
Host calls real weather API → "18°C, partly cloudy"
  |
  ↓
LLM continues: "The current weather in Rome is 18°C with partly cloudy skies."
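The host side of this exchange can be sketched as a small dispatcher. The JSON shape is the one from the diagram above; the `TOOLS` registry, the `run_tool_call` helper and the canned `get_weather` function are hypothetical and do not correspond to any specific framework's API.

```python
import json


def get_weather(city):
    """Hypothetical tool: returns a canned string instead of calling a real API."""
    return "18°C, partly cloudy"


# Registry mapping tool names (as the model writes them) to Python callables.
TOOLS = {"get_weather": get_weather}


def run_tool_call(llm_output):
    """Parse the model's JSON tool request and dispatch it to a registered tool."""
    call = json.loads(llm_output)
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call["args"])


llm_output = '{ "tool": "get_weather", "args": { "city": "Rome" } }'
observation = run_tool_call(llm_output)
print(observation)
```

In a full agent loop, the observation would be appended to the conversation and the model called again so it can phrase the final answer.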

Chain-of-thought reasoning

Before deciding which tool to call, frontier models often produce a reasoning trace.

🔮 In this simulation
ReasoningAgent uses simple rule-based heuristics.
🔒 Security note: the Calculator tool
CalculatorTool uses Python's eval().
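Because eval() executes arbitrary Python, a common mitigation is to parse the expression with the standard `ast` module and evaluate only a whitelist of arithmetic nodes. The sketch below is one possible approach, not the project's implementation; the name `safe_eval` and the choice of supported operators are assumptions.

```python
import ast
import operator

# Whitelisted operators; exponentiation is left out to avoid huge computations.
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_eval(expression):
    """Evaluate +, -, *, / on numeric literals; reject everything else."""

    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError("unsupported expression")

    return _eval(ast.parse(expression, mode="eval"))


print(safe_eval("2 + 3 * 4"))
```

Names, attribute access and function calls all fall through to the final `raise`, so an input such as `__import__('os')` is rejected instead of executed.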

🔮 8. What this simulation actually does

The full pipeline, from the moment you hit Ask → to the moment the trace is ready:

LLM-sim pipeline: browser → POST /run → server.py → PromptBuilder / SimpleTokenizer / ReasoningAgent → CalculatorTool / FakeSearchTool → LLMCore.generate() → final answer + llm_trace.json → browser

Every one of these steps writes a TraceStep into the append-only Trace object.
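An append-only trace of this kind might look roughly like the sketch below. Only the names TraceStep and Trace come from the text above; the fields (`name`, `data`), the `add` method and the read-only `steps` property are assumptions about how such an object could be shaped.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class TraceStep:
    """One recorded pipeline step (hypothetical fields)."""
    name: str
    data: Dict[str, Any] = field(default_factory=dict)


class Trace:
    """Append-only log: steps can be added and read, never edited or removed."""

    def __init__(self):
        self._steps: List[TraceStep] = []

    def add(self, name, **data):
        step = TraceStep(name, dict(data))
        self._steps.append(step)
        return step

    @property
    def steps(self) -> Tuple[TraceStep, ...]:
        # Hand out an immutable snapshot so callers cannot reorder the log.
        return tuple(self._steps)


trace = Trace()
trace.add("tokenize", n_tokens=5)
trace.add("generate", temperature=1.0)
print([step.name for step in trace.steps])
```

Keeping the log append-only means the viewer always sees steps in the order they actually happened.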

Real vs. simulated — side by side

| Aspect | Real LLM | This simulation |
|---|---|---|
| Tokenizer | BPE / WordPiece, fixed vocabulary of 32k–200k tokens | Whitespace + punctuation regex, dynamic vocabulary |
| Embeddings | Learned dense vectors (768–8192 dims) | Not implemented (skipped) |
| Attention | Multi-head self-attention across full context | Not implemented (context is just a list of IDs) |
| Token scores | Output of the final linear layer (learned logits) | Pseudo-random scores + repetition penalty heuristic |
| Softmax | Applied to all ~100k logits | Applied to top_k=6 candidates (same math, smaller scale) |
| Temperature | Scales logits before softmax | Same formula, same effect (fully functional) |
| Sampling | Draw from distribution (top-p / top-k) | Target token is forced (but probability table is real) |
| Tool use | LLM outputs structured JSON; framework parses it | Rule-based intent detection; same logical structure |
| Reasoning trace | Hidden (or via CoT prompting / o-series) | Explicit, logged to trace at every step |
| Observability | Black box (unless you add logging) | Every decision visible in the Trace Viewer |

⚠️ 9. What this simulation cannot show

This project is deliberately simplified for educational purposes. Here are the most important things that are not represented:

  • Emergent capabilities
📚 Going further
To understand the real machinery in depth…
