How a Large Language Model works

A deep dive into the real mechanics — and how this project simulates them.

🧠 1. What is a Large Language Model?

A Large Language Model (LLM) is a neural network trained to predict the most likely next word (or token) given everything that came before it. That's it. The surprising thing is that a single objective — predict the next token — turns out to be enough to learn grammar, facts, reasoning patterns, and even code, if you train on enough text.

Modern LLMs are built on the Transformer architecture (introduced by Google in 2017). A Transformer stacks many identical layers, each of which applies a mechanism called self-attention to find relationships between all tokens in the current context simultaneously.

📏 Scale matters
GPT-4 has an estimated 1 trillion parameters; LLaMA 3 (70B) has 70 billion. Every parameter is a number that was tuned during training to minimise a single objective: predict the next token. The "knowledge" of an LLM is entirely encoded in those numbers.

The pipeline from text to answer involves four conceptually distinct stages: tokenize → embed → process → generate. The rest of this page explains each one.

✂️ 2. Tokenization — text becomes numbers

Neural networks can only process numbers, so the first step is converting text into a sequence of integers called token IDs. A tokenizer splits the input string into tokens — units that may be whole words, sub-words, or single characters — and maps each to an integer from a fixed vocabulary.

Real LLMs: Byte-Pair Encoding (BPE)

Most modern LLMs use BPE or a variant (SentencePiece, WordPiece). BPE starts with a vocabulary of individual bytes and repeatedly merges the most frequent adjacent pair. This lets the tokenizer handle any input (it can always fall back to byte-level) while keeping common words as single tokens.

Input: "tokenization"

GPT-4's BPE tokenizer splits this as:

  "token"   → 2963
  "ization" → 2065

So the input becomes the sequence [ 2963, 2065 ].
Vocabulary size: ~100 000 tokens (varies by model).

Why sub-word tokenization?

Pure word-based tokenizers fail on rare words and typos. Character-based tokenizers produce very long sequences and slow down training. Sub-word tokenization is the sweet spot: common words are one token, rare words are split into familiar sub-parts.
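The merge loop at the heart of BPE training can be sketched in a few lines. This is a toy version (not any real tokenizer's code): it starts from individual characters and repeatedly merges the most frequent adjacent pair, exactly the process described above.

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent pairs and return the most frequent one."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

# Start from characters and apply a few merges, as BPE training does.
# "_" stands in for the space so word boundaries stay visible.
tokens = list("low lower lowest".replace(" ", "_"))
for _ in range(4):
    tokens = merge_pair(tokens, most_frequent_pair(tokens))
print(tokens)
```

After a few merges the common stem "low" has become a single token while rarer suffixes remain split, which is the behaviour the paragraph above describes.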

🔬 In this simulation
SimpleTokenizer uses a regex to split on whitespace and punctuation (\w+|[^\w\s]) and assigns a fresh integer ID to every new surface form it encounters. The vocabulary grows dynamically at runtime. This makes the encode/decode pipeline fully transparent and easy to inspect in the trace — but it is not how real tokenizers work.
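A minimal re-creation of that approach might look like this (the class and method names here are illustrative, not the project's actual code):

```python
import re

class SimpleTokenizerSketch:
    """Sketch of a whitespace/punctuation tokenizer with a dynamic vocabulary."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = {}

    def encode(self, text):
        ids = []
        # \w+ matches word runs; [^\w\s] matches single punctuation marks.
        for token in re.findall(r"\w+|[^\w\s]", text):
            if token not in self.token_to_id:
                new_id = len(self.token_to_id)  # vocabulary grows at runtime
                self.token_to_id[token] = new_id
                self.id_to_token[new_id] = token
            ids.append(self.token_to_id[token])
        return ids

    def decode(self, ids):
        return " ".join(self.id_to_token[i] for i in ids)

tok = SimpleTokenizerSketch()
print(tok.encode("Hello, world! Hello again."))  # → [0, 1, 2, 3, 0, 4, 5]
```

Note how the repeated "Hello" reuses ID 0: the mapping is stable within a run, but unlike a real BPE vocabulary it is not fixed in advance.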

🔢 3. Embeddings — numbers become vectors

A token ID like 2963 is just an index. Before the Transformer can work with it, each ID is converted into a dense vector (a list of floating-point numbers, typically 768–8192 dimensions) called an embedding.

Think of an embedding as a point in a very high-dimensional space. The training process arranges these points so that semantically similar tokens end up nearby: "king" and "queen" are close; "banana" and "circuit" are far apart.

Token → Embedding (simplified to 4 dimensions for readability)

  "king"   → [  0.82, -0.31,  0.54, 0.11 ]
  "queen"  → [  0.79, -0.28,  0.53, 0.14 ]  ← very similar
  "banana" → [ -0.12,  0.67, -0.44, 0.89 ]  ← different region

Real models use 768–8192 dimensions.
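Closeness in embedding space is usually measured with cosine similarity. A small sketch using the toy 4-dimensional vectors above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

king   = [0.82, -0.31, 0.54, 0.11]
queen  = [0.79, -0.28, 0.53, 0.14]
banana = [-0.12, 0.67, -0.44, 0.89]

print(cosine_similarity(king, queen))   # close to 1: similar meaning
print(cosine_similarity(king, banana))  # much lower: different region
```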
⚠️ Simulation gap
This project skips embeddings entirely. There are no learned vector representations here — we go directly from token IDs to the scoring step. This is the biggest simplification in the simulation.

👁️ 4. Attention — context shapes meaning

The core innovation of the Transformer is self-attention. At each layer, every token looks at every other token and decides how much it should "attend to" them when building its own representation.

Concretely: for each token, three vectors are computed — Query (Q), Key (K), and Value (V). Attention scores are the dot product of Q with all K vectors, normalised by softmax. The output is a weighted sum of V vectors.

Attention(Q, K, V) = softmax( Q·Kᵀ / √dₖ ) · V

where dₖ is the dimension of the key vectors. The √dₖ scaling keeps dot products from growing with dimension; without it the softmax saturates and gradients become vanishingly small.

In plain English: "The word 'it' in 'The animal was tired because it…' attends strongly to 'animal', so the model represents 'it' as meaning 'the animal'." Attention allows the model to resolve such references regardless of how far apart the tokens are.
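The formula above can be written out directly. This is a plain-Python sketch of single-head scaled dot-product attention on tiny hand-made vectors; there are no learned projections here, so it illustrates the mechanics only:

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head.
    Q, K, V are lists of vectors, one per token."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score each key against this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, V))
                    for i in range(len(V[0]))])
    return out

# Two tokens, 2-d vectors: each query aligns with one key,
# so each output leans toward the matching value vector.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
print(attention(Q, K, V))
```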

Multi-head attention

In practice, attention is computed h times in parallel with different learned projections (the "heads"). Each head can specialise in a different type of relationship: one might track syntax, another coreference, another semantic similarity. The outputs of all heads are concatenated and projected back to the model dimension.

⚠️ Simulation gap
This project does not implement attention. There are no Q/K/V matrices. The "context" used during generation is simply the list of token IDs that have been generated so far — useful for the repetition penalty, but not a real representation.

🔁 5. The generation loop — one token at a time

LLMs generate text autoregressively: one token at a time, each new token conditioned on everything generated so far.

  1. 📥 Input context: all tokens generated so far (prompt + previous output) form the context window.
  2. 🔢 Forward pass: the full context is fed through all Transformer layers (embedding → attention × N → layer norm).
  3. 📊 Logits: the final layer projects to vocabulary size, producing a raw score (logit) for every possible next token.
  4. 🎲 Sampling: logits are converted to probabilities via softmax (with temperature). One token is sampled from the distribution.
  5. 🔁 Append & repeat: the sampled token is appended to the context. The loop repeats until an EOS token is generated or max length is reached.
📐 Context window
Every forward pass must process the entire context, which is why the context window size is such an important parameter. GPT-4 supports up to 128 000 tokens; Claude 3.5 up to 200 000. Attention's O(n²) complexity makes very long contexts expensive.
🔬 In this simulation
The generation loop in LLMCore.generate() does iterate token-by-token. For each step it:
  1. Draws top_k candidate tokens (the correct one + random distractors from the vocabulary).
  2. Assigns a pseudo-random base score to each candidate.
  3. Applies a repetition penalty to tokens seen in the last 10 context tokens.
  4. Boosts the target token so the demo output remains coherent.
  5. Applies temperature-scaled softmax.
  6. Logs the full probability table to the trace.
The forward pass through a real neural network is replaced by the scoring heuristics above.

🌡️ 6. Softmax & temperature — shaping the distribution

After the forward pass, the model has a raw score (logit) for every token in the vocabulary. Softmax converts those scores into a proper probability distribution: all values become positive and sum to exactly 1.

P(tokenᵢ) = exp(logitᵢ / T) / Σⱼ exp(logitⱼ / T)

where T is the temperature (default 1.0 in most models).

What does temperature do?

Temperature T is applied as a divisor before the exponential:

  • T = 1.0 — the model's learned distribution, unchanged.
  • T < 1 — sharpens the distribution; the top token becomes even more dominant. More deterministic, less creative.
  • T > 1 — flattens the distribution; gives more probability mass to unlikely tokens. More diverse, more "creative" (and more likely to be wrong).
  • T → 0 approaches greedy decoding: always picking the single most probable token.
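The effect is easy to verify numerically. A sketch with made-up logits (the scores below are invented for illustration):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T before exponentiating: T<1 sharpens, T>1 flattens."""
    scaled = [l / T for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.0, 2.0]  # hypothetical scores for three candidate tokens
for T in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs])
```

Running this shows the top candidate's probability shrinking as T rises, while the tail candidates gain mass; the distribution always still sums to 1.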

Here's an example showing how temperature changes a concrete distribution:

Context: "The capital of France is" → next token

T = 0.5 (sharp, deterministic):
  "Paris"     96.1%
  "Lyon"       2.8%
  "Nice"       1.1%

T = 1.5 (flat, creative):
  "Paris"     54.3%
  "Lyon"      23.7%
  "Nice"      14.1%
  "Bordeaux"   7.9%

Other sampling strategies

Temperature is often combined with:

  • Top-k sampling — only sample from the k most probable tokens (discard everything else).
  • Top-p (nucleus) sampling — only sample from the smallest set of tokens whose cumulative probability exceeds p (e.g. 0.9).
  • Repetition penalty — reduce the score of tokens that already appeared in the context, to discourage repetitive output.
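All three strategies are straightforward to implement on a probability list. A sketch (simplified: real implementations work on logits and batched tensors):

```python
def top_k_filter(probs, k):
    """Keep only the k most probable tokens; renormalise them to sum to 1."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = [], 0.0
    for i in order:
        keep.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def apply_repetition_penalty(scores, recent_ids, factor=0.25):
    """Scale down the raw scores of tokens already seen in the context."""
    return [s * factor if i in recent_ids else s for i, s in enumerate(scores)]

probs = [0.5, 0.3, 0.15, 0.04, 0.01]
print(top_k_filter(probs, 2))    # only the two best tokens survive
print(top_p_filter(probs, 0.9))  # tokens 0-2: cumulative 0.95 >= 0.90
```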
🔬 In this simulation
The simulation implements temperature-scaled softmax and a repetition penalty (×0.25 for tokens in the last 10 context positions). The top-k candidate table is stored in the trace at every step — you can inspect it in both the main UI (probability pills) and the Trace Viewer (full table with scores). The default temperature is 0.7.

🤖 7. Agents & tool use

A bare LLM is a text-in / text-out function. It cannot browse the web, run code, or call an API. Agents extend LLMs with the ability to use tools and take multi-step actions.

How tool use works in real systems

Modern frameworks (OpenAI function calling, Anthropic tool use) work roughly like this:

  1. A list of available tools (name, description, JSON schema of parameters) is prepended to the system prompt.
  2. The LLM is fine-tuned to output a structured JSON block when it decides a tool is needed (instead of plain text).
  3. The host application parses that JSON, calls the real tool, and inserts the result back into the conversation as a new message.
  4. The LLM continues generating from there, now with the tool result in its context.
User: "What is the weather in Rome?"
  │
  ▼
LLM (decides a tool is needed):
  { "tool": "get_weather", "args": { "city": "Rome" } }
  │
  ▼
Host calls the real weather API → "18°C, partly cloudy"
  │
  ▼
LLM continues: "The current weather in Rome is 18°C with partly cloudy skies."
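The host side of this round trip can be sketched as follows. Everything here is a stand-in: the tool name, its arguments, and the weather "API" are invented for illustration, not any real framework's schema.

```python
import json

def fake_weather_api(city):
    """Stand-in for a real weather service (hard-coded for the demo)."""
    return {"Rome": "18°C, partly cloudy"}.get(city, "unknown")

def run_tool_call(llm_output):
    """Parse the model's structured tool request and execute the tool."""
    request = json.loads(llm_output)
    if request["tool"] == "get_weather":
        result = fake_weather_api(request["args"]["city"])
        # The result is inserted back into the conversation as a new message,
        # and the LLM continues generating with it in context.
        return {"role": "tool", "content": result}
    raise ValueError(f"unknown tool: {request['tool']}")

llm_output = '{"tool": "get_weather", "args": {"city": "Rome"}}'
print(run_tool_call(llm_output))  # → {'role': 'tool', 'content': '18°C, partly cloudy'}
```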

Chain-of-thought reasoning

Before deciding which tool to call (or whether to call one at all), frontier models often produce a reasoning trace — a scratchpad of intermediate thoughts. In the OpenAI "o" series and DeepSeek-R1, this trace is explicit and inspectable. In older models it was hidden or only possible via prompting ("Let's think step by step").

🔬 In this simulation
ReasoningAgent uses simple rule-based heuristics instead of an LLM:
  • If the query contains an arithmetic expression + a trigger word ("calculate", "what is", …) → call the Calculator tool.
  • If it starts with a factual question word ("what is", "explain", …) → call the FakeSearch tool (in-memory knowledge base).
  • Otherwise → skip tools and go straight to the LLM core.
Every decision step is written explicitly into the trace so you can follow the reasoning in the Trace Viewer — exactly as you would with a real chain-of-thought system.
🔒 Security note: the Calculator tool
CalculatorTool uses Python's eval(), but only after validating the expression against a strict regex whitelist (^[\d\s\+\-\*\/\(\)\.]+$) and executing with __builtins__={} — so there is no access to any Python built-in or variable. Even so, eval() should never be used on untrusted input in a real production system.
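An illustrative re-creation of that validation pattern (not the project's actual CalculatorTool code):

```python
import re

# Whitelist: digits, whitespace, + - * / ( ) and the decimal point only.
SAFE_EXPR = re.compile(r"^[\d\s\+\-\*\/\(\)\.]+$")

def safe_calculate(expression):
    """Evaluate arithmetic only: regex whitelist plus empty builtins."""
    if not SAFE_EXPR.match(expression):
        raise ValueError("expression contains disallowed characters")
    # With no builtins and no variables, eval can only do arithmetic here.
    return eval(expression, {"__builtins__": {}}, {})

print(safe_calculate("2 * (3 + 4.5)"))  # → 15.0
try:
    safe_calculate("__import__('os')")  # rejected by the regex before eval
except ValueError as err:
    print("blocked:", err)
```

Even with both safeguards, this pattern is only defensible in a sandboxed demo; as the note above says, eval() has no place on untrusted input in production.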

🔬 8. What this simulation actually does

The full pipeline, from the moment you hit Ask → to the moment the trace is ready:

browser → POST /run → server.py, which fans out to:

  1. PromptBuilder — system + user text → full prompt
  2. SimpleTokenizer — whitespace split → token IDs
  3. ReasoningAgent — intent detection → tool dispatch
       4a. CalculatorTool — safe eval of arithmetic
       4b. FakeSearchTool — keyword search over the in-memory KB
  5. LLMCore.generate() — for each target token:
       • draw top_k candidates
       • pseudo-random scores
       • repetition penalty
       • softmax + temperature
       • log probability table
  6. final answer + llm_trace.json → browser (JSON), rendered as animated token pills with a "View Trace" button

Every one of these steps writes a TraceStep into the append-only Trace object. At the end the trace is serialised to llm_trace.json and served at GET /llm_trace.json. The Trace Viewer fetches it and renders every step as an expandable card.
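An append-only trace like this is simple to build. The field names below are assumptions for illustration; the project's actual TraceStep schema may differ:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class TraceStep:
    """One logged decision (hypothetical fields, not the project's schema)."""
    component: str
    action: str
    data: dict

@dataclass
class Trace:
    """Append-only list of steps, serialisable to JSON for a viewer."""
    steps: list = field(default_factory=list)

    def log(self, component, action, **data):
        self.steps.append(TraceStep(component, action, data))

    def to_json(self):
        return json.dumps([asdict(s) for s in self.steps], indent=2)

trace = Trace()
trace.log("SimpleTokenizer", "encode", text="hello", ids=[0])
trace.log("LLMCore", "softmax", temperature=0.7)
print(trace.to_json())
```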

Real vs. simulated — side by side

| Aspect | Real LLM | This simulation |
|---|---|---|
| Tokenizer | BPE / WordPiece, fixed vocabulary of 32k–200k tokens | Whitespace + punctuation regex, dynamic vocabulary |
| Embeddings | Learned dense vectors (768–8192 dims) | Not implemented (skipped) |
| Attention | Multi-head self-attention across full context | Not implemented; context is just a list of IDs |
| Token scores | Output of the final linear layer (learned logits) | Pseudo-random scores + repetition penalty heuristic |
| Softmax | Applied to all ~100k logits | Applied to top_k=6 candidates (same math, smaller scale) |
| Temperature | Scales logits before softmax | Same formula, same effect (fully functional) |
| Sampling | Draw from distribution (top-p / top-k) | Target token is forced (but probability table is real) |
| Tool use | LLM outputs structured JSON; framework parses it | Rule-based intent detection; same logical structure |
| Reasoning trace | Hidden (or via CoT prompting / o-series) | Explicit, logged to trace at every step |
| Observability | Black box (unless you add logging) | Every decision visible in the Trace Viewer |

⚠️ 9. What this simulation cannot show

This project is deliberately simplified for educational purposes. Here are the most important things that are not represented:

  • Emergent capabilities — the ability to translate, reason about logic, write code, etc. comes from training on enormous datasets with billions of parameters, not from the architecture alone.
  • RLHF / RLAIF — modern assistants (ChatGPT, Claude) are fine-tuned with reinforcement learning from human (or AI) feedback, which fundamentally shapes tone, safety, and instruction-following.
  • True semantic understanding — the simulation's token scores are random; a real model's scores encode the statistical regularities of human language.
  • KV-cache — real inference reuses the key/value tensors from previous tokens to avoid re-computing the full forward pass at every step (huge performance win). Not relevant here since there is no forward pass.
  • Quantization & hardware — running a 70B model requires tens of GB of VRAM. Techniques like 4-bit quantization (GGUF) and speculative decoding are critical in practice.
📚 Going further
To understand the real machinery in depth:
  • Attention Is All You Need — Vaswani et al. (2017), the original Transformer paper
  • The Illustrated GPT-2 by Jay Alammar — incredibly clear visual walkthrough
  • nanoGPT by Andrej Karpathy — a minimal, readable GPT implementation in ~300 lines of Python
  • Let's build GPT — Karpathy's 3-hour video building a GPT from scratch

Ready to see the pipeline in action?

Ask a question →