Back to Fundamentals
It’s been a while since I hacked on large language models, and what better way to refresh than to rebuild from scratch? I’m using MiniMind — a lightweight, no-bullshit implementation — to dissect the essential components of LLM architecture, starting from ground zero.
Tokenizer: The Gateway to Neural Understanding
Tokenization is the critical interface between human-readable text and machine-processable tensors. It’s the bidirectional encoder/decoder that transforms raw strings into token ID sequences (and back), sitting at both ends of every LLM interaction pipeline.
MiniMind implements a Byte Pair Encoding (BPE) tokenizer — one of the most battle-tested sub-word tokenization algorithms in production systems today.
Training Pipeline Breakdown
MiniMind ships with a pre-trained tokenizer but also exposes training scripts for custom vocabulary construction. Key specs:
- Vocabulary size: 6,400 tokens
- Comparison: Qwen2 (151,643) | Llama3 (128,000)
- Design rationale: Compact vocab → fewer embedding parameters → lighter model footprint
This is engineering at its finest: trade-offs between vocabulary coverage and parameter efficiency.
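To make the trade-off concrete, here is a quick back-of-the-envelope comparison of embedding table sizes. The 512-dimensional hidden size is an assumed MiniMind-scale width chosen purely for illustration, not a quoted spec:

```python
# Rough embedding-parameter comparison: params = vocab_size * hidden_dim.
# hidden_dim = 512 is an assumed MiniMind-scale width, for illustration only.
hidden_dim = 512

for name, vocab_size in [("MiniMind", 6_400), ("Qwen2", 151_643), ("Llama3", 128_000)]:
    params = vocab_size * hidden_dim
    print(f"{name:8s} vocab={vocab_size:>7,} -> embedding params ~ {params / 1e6:.1f}M")

# MiniMind vocab=  6,400 -> embedding params ~ 3.3M
# Qwen2    vocab=151,643 -> embedding params ~ 77.6M
# Llama3   vocab=128,000 -> embedding params ~ 65.5M
```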
Reserved Special Tokens (never split, hard-coded at positions 0, 1, 2):
- <|endoftext|> — Text termination signal
- <|im_start|> — Instruction/message boundary (start)
- <|im_end|> — Instruction/message boundary (end)
These tokens act as control flow markers in the token stream.
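For context, here is a minimal sketch of how a tokenizer with these specs could be trained using the HuggingFace tokenizers library. It approximates the general approach rather than reproducing MiniMind's actual training script, and the one-line corpus is a placeholder:

```python
from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders

# Byte-level BPE tokenizer skeleton (a sketch, not MiniMind's exact script).
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

special_tokens = ["<|endoftext|>", "<|im_start|>", "<|im_end|>"]  # added first -> IDs 0, 1, 2

trainer = trainers.BpeTrainer(
    vocab_size=6400,                                       # compact MiniMind-style vocabulary
    special_tokens=special_tokens,                         # never split during training
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet(),  # cover all 256 byte values
    show_progress=True,
)

corpus = ["low lower newest widest"]  # placeholder; real training uses a large text corpus
tokenizer.train_from_iterator(corpus, trainer=trainer)
tokenizer.save("tokenizer.json")
```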
BPE Algorithm: Iterative Merging
The core of BPE is an iterative statistical compression algorithm. Let me walk through a hands-on example.
Training corpus: "low lower newest widest"
Step 1: Initialize Vocabulary (Character-Level Decomposition)
We decompose each word into atomic units (characters), appending a special boundary marker </w> at word endings. This disambiguates cases like “cat” vs “cats” and preserves word boundaries in the encoding.
```
l o w </w>
l o w e r </w>
n e w e s t </w>
w i d e s t </w>
```
Initial vocabulary (unique characters):
{'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', '</w>'}
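A toy sketch of this initialization step in plain Python (not MiniMind's implementation):

```python
# Step 1: split each word into characters and append the '</w>' boundary marker.
corpus = "low lower newest widest"

words = [list(w) + ["</w>"] for w in corpus.split()]
vocab = {symbol for word in words for symbol in word}

print(words[1])      # ['l', 'o', 'w', 'e', 'r', '</w>']
print(sorted(vocab))
# ['</w>', 'd', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w']
```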
Step 2: Frequency Analysis & Greedy Merging
Count adjacent bigram frequencies across the corpus:
- ('l', 'o') → 2 occurrences (in “low” and “lower”)
- ('o', 'w') → 2 occurrences (in “low” and “lower”)
- ('w', 'e') → 2 occurrences (in “lower” and “newest”)
- ('e', 's') → 2 occurrences (in “newest” and “widest”)
- ('s', 't') → 2 occurrences (in “newest” and “widest”)
- ('w', '</w>') → 1 occurrence (in “low”)
- … (other pairs)
Tie-breaking: Several pairs are tied at the maximum frequency of 2. We pick ('l', 'o') arbitrarily (in practice, a deterministic ordering breaks ties) and merge it into a new token: 'lo'.
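In code, this frequency analysis is just an adjacent-pair count; a toy sketch using collections.Counter:

```python
from collections import Counter

# Count adjacent symbol pairs across the character-initialized corpus.
words = [list(w) + ["</w>"] for w in "low lower newest widest".split()]

pair_counts = Counter()
for word in words:
    pair_counts.update(zip(word, word[1:]))

print(pair_counts.most_common(3))
# e.g. [(('l', 'o'), 2), (('o', 'w'), 2), (('w', 'e'), 2)] -- several pairs tie at 2

best_pair = max(pair_counts, key=pair_counts.get)  # ('l', 'o'): first of the tied pairs
```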
Updated corpus representation:
```
lo w </w>         # low
lo w e r </w>     # lower
n e w e s t </w>  # newest
w i d e s t </w>  # widest
```
Updated vocabulary:
{'l', 'o', 'w', 'e', 'r', 'n', 's', 't', 'i', 'd', '</w>', 'lo'}
Step 3: Iterate Until Convergence
Stopping criteria: Either hit max merge operations OR reach target vocabulary size.
Next iteration: ('lo', 'w') now appears twice → merge into 'low'.
```
low </w>          # low
low e r </w>      # lower
n e w e s t </w>  # newest
w i d e s t </w>  # widest
```
Vocab update: {..., 'low'}
Next iterations: ('e', 's') appears twice → merge into 'es'; then ('es', 't') appears twice → merge into 'est'.
```
low </w>        # low
low e r </w>    # lower
n e w est </w>  # newest
w i d est </w>  # widest
```
Final vocab: {..., 'low', 'es', 'est'}
Training output: An ordered list of merge rules — essentially a compression codebook mapping character pairs to sub-word tokens. This becomes our inference vocabulary.
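Putting the pieces together, here is a self-contained toy trainer in Python; the `num_merges` cap and the "stop when no pair repeats" condition are illustrative stopping criteria, not MiniMind's exact ones. With first-seen tie-breaking it reproduces the hand-worked merges above and then performs one more merge, ('est', '</w>'), which also occurs twice:

```python
from collections import Counter

def count_pairs(words):
    """Count adjacent symbol pairs across all words."""
    pairs = Counter()
    for word in words:
        pairs.update(zip(word, word[1:]))
    return pairs

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    a, b = pair
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

def train_bpe(corpus, num_merges):
    """Learn an ordered list of merge rules from a whitespace-split corpus."""
    words = [list(w) + ["</w>"] for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # ties broken by first-seen order
        if pairs[best] < 2:
            break                         # toy stopping rule: no pair repeats anymore
        merges.append(best)
        words = merge_pair(words, best)
    return merges

print(train_bpe("low lower newest widest", num_merges=10))
# [('l', 'o'), ('lo', 'w'), ('e', 's'), ('es', 't'), ('est', '</w>')]
```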
Inference: Encoding New Text
Let’s encode an out-of-vocabulary word: "lowest"
Step 1: Character Initialization
Decompose input: ['l', 'o', 'w', 'e', 's', 't', '</w>']
Step 2: Apply Merge Rules Sequentially
Traverse the learned merge rules in training order:
- Rule ('l', 'o') → 'lo'. Result: ['lo', 'w', 'e', 's', 't', '</w>']
- Rule ('lo', 'w') → 'low'. Result: ['low', 'e', 's', 't', '</w>']
- Rule ('e', 's') → 'es'. Result: ['low', 'es', 't', '</w>']
- Rule ('es', 't') → 'est'. Result: ['low', 'est', '</w>']
- (Continue through remaining rules… no further merges possible)
Step 3: Final Encoding
Token sequence: ['low', 'est', '</w>']
Token IDs (vocab indices): e.g., [13, 14, 11]
Notice how the unseen word “lowest” was decomposed into known sub-words 'low' and 'est' — this is BPE’s zero-shot generalization in action. In production, we transmit token IDs (integers) rather than token strings for computational efficiency.
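For completeness, here is a toy encoder that replays the learned merges in training order; the merge list and vocabulary indices below are hardcoded from this walkthrough and purely illustrative, not MiniMind's actual tables:

```python
def apply_merges(word, merges):
    """Encode one word by replaying merge rules in the order they were learned."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Merge rules from the walkthrough, in training order.
merges = [("l", "o"), ("lo", "w"), ("e", "s"), ("es", "t")]

tokens = apply_merges("lowest", merges)
print(tokens)  # ['low', 'est', '</w>']

# Map tokens to IDs via the vocabulary (indices here are illustrative).
vocab = {tok: idx for idx, tok in enumerate(
    ["l", "o", "w", "e", "r", "n", "s", "t", "i", "d", "</w>", "lo", "low", "es", "est"])}
print([vocab[t] for t in tokens])  # [12, 14, 10]
```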
Why BPE Dominates Modern NLP
BPE strikes an optimal balance across multiple dimensions:
- ✅ OOV resilience: Decomposes rare/unknown words into known sub-words (no <UNK> tokens)
- ✅ Compression efficiency: Balances vocabulary size vs. sequence length (fewer tokens = faster inference)
- ✅ Language-agnostic: Byte-level variants handle any UTF-8 text (no language-specific rules)
- ✅ Morphological awareness: Learns linguistic patterns like affixes (-ing, -tion, un-, etc.)
- ✅ Data-driven: No hand-crafted rules — learns from your domain’s statistical distribution
Next up: We’ll dive into the other parts of building an LLM. Stay tuned. 🔥