Back to Fundamentals
It’s been a while since I hacked on large language models, and what better way to refresh than to rebuild from scratch? I’m using MiniMind — a lightweight, no-bullshit implementation — to dissect the essential components of LLM architecture, starting from ground zero.
Tokenizer: The Gateway to Neural Understanding
Tokenization is the critical interface between human-readable text and machine-processable tensors. It’s the bidirectional encoder/decoder that transforms raw strings into token ID sequences (and back), sitting at both ends of every LLM interaction pipeline.
MiniMind implements a Byte Pair Encoding (BPE) tokenizer — one of the most battle-tested sub-word tokenization algorithms in production systems today.
Training Pipeline Breakdown
MiniMind ships with a pre-trained tokenizer but also exposes training scripts for custom vocabulary construction. Key specs:
- Vocabulary size: 6,400 tokens
- Comparison: Qwen2 (151,643) | Llama3 (128,000)
- Design rationale: Compact vocab → fewer embedding parameters → lighter model footprint
This is engineering at its finest: trade-offs between vocabulary coverage and parameter efficiency.
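To make the trade-off concrete, here is a quick back-of-the-envelope comparison of embedding table sizes. The 512-dimensional hidden size is an assumed MiniMind-scale width chosen purely for illustration, not a quoted spec:

```python
# Rough embedding-parameter comparison: params = vocab_size * hidden_dim.
# hidden_dim = 512 is an assumed MiniMind-scale width, for illustration only.
hidden_dim = 512

for name, vocab_size in [("MiniMind", 6_400), ("Qwen2", 151_643), ("Llama3", 128_000)]:
    params = vocab_size * hidden_dim
    print(f"{name:8s} vocab={vocab_size:>7,} -> embedding params ~ {params / 1e6:.1f}M")

# MiniMind vocab=  6,400 -> embedding params ~ 3.3M
# Qwen2    vocab=151,643 -> embedding params ~ 77.6M
# Llama3   vocab=128,000 -> embedding params ~ 65.5M
```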
Reserved Special Tokens (never split, hard-coded at positions 0, 1, 2):
- <|endoftext|> — Text termination signal
- <|im_start|> — Instruction/message boundary (start)
- <|im_end|> — Instruction/message boundary (end)
These tokens act as control flow markers in the token stream.
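For context, here is a minimal sketch of how a tokenizer with these specs could be trained using the HuggingFace tokenizers library. It approximates the general approach rather than reproducing MiniMind's actual training script, and the one-line corpus is a placeholder:

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Byte-level BPE tokenizer skeleton (a sketch, not MiniMind's exact script).
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

special_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]  # added first -> IDs 0, 1, 2

trainer = trainers.BpeTrainer(
    vocab_size=6400,                                       # compact MiniMind-style vocabulary
    special_tokens=special_tokens,                         # never split during training
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # cover all 256 byte values
    show_progress=True,
)

corpus = ["low lower newest widest"]  # placeholder; real training uses a large text corpus
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")
```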
BPE Algorithm: Iterative Merging
The core of BPE is an iterative statistical compression algorithm. Let me walk through a hands-on example.
Training corpus: "low lower newest widest"
Step 1: Initialize Vocabulary (Character-Level Decomposition)
We decompose each word into atomic units (characters), appending a special boundary marker </w> at word endings. This disambiguates cases like “cat” vs “cats” and preserves word boundaries in the encoding.
```
l o w </w>
l o w e r </w>
n e w e s t </w>
w i d e s t </w>
```
Initial vocabulary (unique characters):
{'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', '</w>'}
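A toy sketch of this initialization step in plain Python (not MiniMind's implementation):

```python
# Step 1: split each word into characters and append the '</w>' boundary marker.
corpus = "low lower newest widest"

words = [list(w) + ["</w>"] for w in corpus.split()]
vocab = {symbol for word in words for symbol in word}

print(words[1])      # ['l', 'o', 'w', 'e', 'r', '</w>']
print(sorted(vocab))
# ['</w>', 'd', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w']
```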
Step 2: Frequency Analysis & Greedy Merging
Count adjacent bigram frequencies across the corpus:
- ('l', 'o') → 2 occurrences (in “low” and “lower”)
- ('o', 'w') → 2 occurrences (in “low” and “lower”)
- ('w', 'e') → 2 occurrences (in “lower” and “newest”)
- ('e', 's') → 2 occurrences (in “newest” and “widest”)
- ('s', 't') → 2 occurrences (in “newest” and “widest”)
- ('w', '</w>') → 1 occurrence (in “low”)
- … (other pairs)
Tie-breaking: Several pairs are tied at the maximum frequency of 2. We pick ('l', 'o') arbitrarily (in practice, a deterministic ordering breaks ties) and merge it into a new token: 'lo'.
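In code, this frequency analysis is just an adjacent-pair count; a toy sketch using collections.Counter:

```python
from collections import Counter

# Count adjacent symbol pairs across the character-initialized corpus.
words = [list(w) + ["</w>"] for w in "low lower newest widest".split()]

pair_counts = Counter()
for word in words:
    pair_counts.update(zip(word, word[1:]))

print(pair_counts.most_common(3))
# e.g. [(('l', 'o'), 2), (('o', 'w'), 2), (('w', 'e'), 2)] -- several pairs tie at 2

best_pair = max(pair_counts, key=pair_counts.get)  # ('l', 'o'): first of the tied pairs
```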
Updated corpus representation:
```
lo w </w>         # low
lo w e r </w>     # lower
n e w e s t </w>  # newest
w i d e s t </w>  # widest
```
Updated vocabulary:
{'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', '</w>', 'lo'}
Step 3: Iterate Until Convergence
Stopping criteria: Either hit max merge operations OR reach target vocabulary size.
Next iteration: ('lo', 'w') now appears twice → merge into 'low'.
```
low </w>          # low
low e r </w>      # lower
n e w e s t </w>  # newest
w i d e s t </w>  # widest
```
Vocab update: {..., 'low'}
Next iterations: ('e', 's') appears twice → merge into 'es'; then ('es', 't') appears twice → merge into 'est'.
```
low </w>        # low
low e r </w>    # lower
n e w est </w>  # newest
w i d est </w>  # widest
```
Final vocab: {..., 'low', 'es', 'est'}
Training output: An ordered list of merge rules — essentially a compression codebook mapping character pairs to sub-word tokens. This becomes our inference vocabulary.
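Putting the pieces together, here is a self-contained toy trainer in Python; the `num_merges` cap and the "stop when no pair repeats" condition are illustrative stopping criteria, not MiniMind's exact ones. With first-seen tie-breaking it reproduces the hand-worked merges above and then performs one more merge, ('est', '</w>'), which also occurs twice:

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words."""
    pairs = Counter()
    for word in words:
        pairs.update(zip(word, word[1:]))
    return pairs

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    a, b = pair
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

def train_bpe(corpus, num_merges):
    """Learn an ordered list of merge rules from a whitespace-split corpus."""
    words = [list(w) + ["</w>"] for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties broken by first-seen order
        if pairs[best] < 2:
            break                         # toy stopping rule: no pair repeats anymore
        merges.append(best)
        words = merge_pair(words, best)
    return merges

print(train_bpe("low lower newest widest", num_merges=10))
# [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ('est', '</w>')]
```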
Inference: Encoding New Text
Let’s encode an out-of-vocabulary word: "lowest"
Step 1: Character Initialization
Decompose input: ['l', 'o', 'w', 'e', 's', 't', '</w>']
Step 2: Apply Merge Rules Sequentially
Traverse the learned merge rules in training order:
- Rule ('l', 'o') → 'lo'. Result: ['lo', 'w', 'e', 's', 't', '</w>']
- Rule ('lo', 'w') → 'low'. Result: ['low', 'e', 's', 't', '</w>']
- Rule ('e', 's') → 'es'. Result: ['low', 'es', 't', '</w>']
- Rule ('es', 't') → 'est'. Result: ['low', 'est', '</w>']
- (Continue through remaining rules… no further merges possible)
Step 3: Final Encoding
Token sequence: ['low', 'est', '</w>']
Token IDs (vocab indices): e.g., [13, 14, 11]
Notice how the unseen word “lowest” was decomposed into known sub-words 'low' and 'est' — this is BPE’s zero-shot generalization in action. In production, we transmit token IDs (integers) rather than token strings for computational efficiency.
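For completeness, here is a toy encoder that replays the learned merges in training order; the merge list and vocabulary indices below are hardcoded from this walkthrough and purely illustrative, not MiniMind's actual tables:

```python
def apply_merges(word, merges):
    """Encode one word by replaying merge rules in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merge rules from the walkthrough, in training order.
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]

tokens = apply_merges("lowest", merges)
print(tokens)  # ['low', 'est', '</w>']

# Map tokens to IDs via the vocabulary (indices here are illustrative).
vocab = {tok: idx for idx, tok in enumerate(
    ["l", "o", "w", "e", "r", "n", "s", "t", "i", "d", "</w>", "lo", "low", "es", "est"])}
print([vocab[t] for t in tokens])  # [12, 14, 10]
```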
Why BPE Dominates Modern NLP
BPE strikes an optimal balance across multiple dimensions:
- ✅ OOV resilience: Decomposes rare/unknown words into known sub-words (no <UNK> tokens)
- ✅ Compression efficiency: Balances vocabulary size vs. sequence length (fewer tokens = faster inference)
- ✅ Language-agnostic: Byte-level variants handle any UTF-8 text (no language-specific rules)
- ✅ Morphological awareness: Learns linguistic patterns like affixes (-ing, -tion, un-, etc.)
- ✅ Data-driven: No hand-crafted rules — learns from your domain’s statistical distribution
Next up: We’ll dive into the other parts of building an LLM. Stay tuned. 🔥