
LLM Deep Dive 2: Positional Embedding

Oct 26, 2025
Tags: LLM, positional embedding, transformers

Revisiting Positional Embedding

I was planning to jump straight into Rotary Positional Embedding (RoPE) — the mechanism powering MiniMind. But as I dug deeper, I realized I had glossed over the foundational sinusoidal encoding from the seminal Attention Is All You Need paper.

There are brilliant explainers out there, but most skip the why behind the math. So here’s my attempt to crystallize the key insights — not as a tutorial, but as an engineering deep-dive into the design decisions that make this encoding scheme so damn clever (and where it breaks down).

The Core Formula:

$$
\begin{align*}
PE(pos,2i)&=\sin(pos/10000^{2i/d_{model}})\\
PE(pos,2i+1)&=\cos(pos/10000^{2i/d_{model}})
\end{align*}
$$

Here $pos$ is your absolute position in the sequence, $d_{model}$ is the embedding dimension (typically 512 or 768), and $i \in [0, d_{model}/2-1]$ indexes dimension pairs. Every pair of dimensions $(2i, 2i+1)$ encodes position using a unique frequency — from high-frequency local jitter to low-frequency global structure. Let’s unpack why this architecture works, and more importantly, why it eventually fails.
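To make the formula concrete, here is a minimal NumPy sketch of the encoding table (the function name and shapes are my own, not from the paper or any particular codebase):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Build a (max_len, d_model) table of sinusoidal positional encodings."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) absolute positions
    i = np.arange(d_model // 2)[None, :]         # (1, d_model/2) pair indices
    angles = pos / base ** (2 * i / d_model)     # one frequency per dimension pair

    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions get sin
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions get cos
    return pe

pe = sinusoidal_pe(max_len=512, d_model=512)
print(pe.shape, pe.min(), pe.max())              # (512, 512), values within [-1, 1]
```

The later snippets in this post reuse this `sinusoidal_pe` helper.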

Q1: Does positional embedding destroy token semantics?

Short answer: Yes, by design. And that’s the whole point.

After tokenization and embedding lookup, you have a vocabulary-indexed tensor — essentially a static dictionary. Think of it as having every word from the Oxford English Dictionary memorized, but with zero understanding of grammar or word order. The problem is fundamental: self-attention is permutation-invariant. Without positional signals, "dog bites man" and "man bites dog" produce identical representations. The model has no concept of sequence.

The solution is deceptively simple: inject positional information directly into the embedding space via addition:

$$
x_{pos} = \text{TE}(token) + PE(pos)
$$

This is intentional signal corruption. You’re encoding two orthogonal pieces of information in the same vector: what the token is (semantic content) and where it appears (structural position). The genius lies in letting the attention mechanism disentangle these two signals through learned projections $W_Q, W_K, W_V$. The model learns to filter and recombine content + position information based on task requirements.
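As a rough sketch of that addition (toy sizes, a random stand-in embedding table, and `sinusoidal_pe` from above; none of this is MiniMind’s actual code):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model, seq_len = 32000, 512, 16

token_embedding = rng.normal(scale=0.02, size=(vocab_size, d_model))  # stand-in TE table
token_ids = rng.integers(0, vocab_size, size=seq_len)                 # toy input sequence

x = token_embedding[token_ids] + sinusoidal_pe(seq_len, d_model)      # x_pos = TE(token) + PE(pos)
print(x.shape)   # (16, 512): content and position now live in the same vector
```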

Why does this work? Training on massive corpora forces the model to learn that the word “bank” at position 0 (sentence start) likely means financial institution, while the same word at position 5 after “river” likely means riverbank. Positional context disambiguates meaning. This is why LLMs need astronomical training data — not just to learn vocabulary, but to learn positionally-conditioned usage patterns. The embedding space becomes a high-dimensional manifold where semantic similarity and positional proximity are both encoded and must be jointly learned.

Q2: Why sinusoidal functions? Why not just position indices?

The naive approach would be to just feed position numbers directly: 0, 1, 2, …, 511. Seems simple enough, right? Wrong. This breaks down in three critical ways.

First, there’s the scale ambiguity problem. For a model trained on sequences of length 512, how should it interpret position 1000? The embedding space has never seen values above 511. There are no extrapolation guarantees. The model would be flying blind.

Second, there’s no inherent structure for relative positions. Position numbers are purely absolute. The model must independently learn that positions 5 and 6 are neighbors, while 5 and 500 are distant. There’s no built-in notion of “closeness” or “distance” — it’s just arbitrary integers.

Third, the arbitrary magnitude problem creates gradient flow issues. Position 500 has 100× the magnitude of position 5. This magnitude explosion destabilizes optimization and makes training a nightmare.
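To put numbers on the magnitude problem, here is a toy comparison (again reusing `sinusoidal_pe` from above):

```python
import numpy as np

# Raw indices: the "encoding" at position 500 really is 100x larger than at position 5.
print(500 / 5)                                   # 100.0

# Sinusoidal rows keep a constant norm no matter the position:
# each of the d_model/2 pairs contributes sin^2 + cos^2 = 1.
pe = sinusoidal_pe(max_len=1024, d_model=512)
print(np.linalg.norm(pe[5]), np.linalg.norm(pe[500]))   # both sqrt(256) = 16.0
```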


The sinusoidal solution is elegant because it addresses all three problems simultaneously. Let me walk you through the properties that make it work.

Bounded outputs everywhere. No matter how long the sequence, $PE(pos, i) \in [-1, 1]$ for all positions and dimensions. This normalization prevents magnitude explosion and keeps gradients stable during training. Your position encoding at token 10,000 has the same magnitude as position 10.

Deterministic extrapolation via parameter-free encoding. The formula contains zero learned weights. This is crucial: sequences of length 1000 use the exact same encoding function as sequences of length 512. There’s no model retraining required for longer contexts — the function just extends naturally. Though as we’ll see later, this theoretical property doesn’t fully translate to practice.

Multi-scale frequency encoding through geometric progression. Here’s the key insight: position isn’t a single scalar — it’s a spectrum of wavelengths. The denominator $10000^{2i/d_{model}}$ creates a geometric progression of frequencies. Low dimensions ($i \approx 0$) have frequency $\approx 1$, causing fast oscillation that captures local patterns between adjacent tokens. High dimensions ($i \approx d_{model}/2$) have frequency $\approx 1/10000$, creating slow oscillation that captures global structure across paragraphs. Think of it as a Fourier-like decomposition: each dimension pair encodes position at a different resolution, from token-level jitter to document-level structure.

Smooth interpolation between adjacent positions. The encoding ensures that $|PE(pos) - PE(pos+1)| \approx O(1/d_{model})$. This smoothness is critical — it means the model learns continuous positional relationships rather than memorizing discrete position IDs. Nearby positions have similar representations, enabling the model to generalize patterns across different absolute positions.
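A quick numeric sanity check of the boundedness and smoothness claims (the exact $O(1/d_{model})$ constant is the post’s claim; this only shows that neighbors sit far closer than distant positions):

```python
import numpy as np

pe = sinusoidal_pe(max_len=2048, d_model=512)

# Bounded everywhere, even well past a typical 512-token training length.
assert pe.min() >= -1.0 and pe.max() <= 1.0

# Smoothness: adjacent positions are far closer than far-apart ones.
print(np.linalg.norm(pe[100] - pe[101]))   # small step between neighbors
print(np.linalg.norm(pe[100] - pe[600]))   # much larger gap across 500 positions
```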

Q3: Why alternate sine and cosine? The rotation matrix trick

Here’s where the magic happens. The $\sin$/$\cos$ pairing isn’t arbitrary — it encodes relative position as a rotation.

The Mathematical Proof

Recall the angle addition formulas from trigonometry:

$$
\begin{align*}
\sin(\alpha+\beta)&=\sin(\alpha)\cos(\beta)+\cos(\alpha)\sin(\beta)\\
\cos(\alpha+\beta)&=\cos(\alpha)\cos(\beta)-\sin(\alpha)\sin(\beta)
\end{align*}
$$

Now let $\alpha = pos / 10000^{2i/d_{model}}$ be the angle at position $pos$, and $\beta = k / 10000^{2i/d_{model}}$ be the angle offset for a relative distance $k$. Here’s the beautiful part: the encoding at position $pos+k$ can be derived directly from the encoding at $pos$ through a linear transformation.

$$
\begin{align*}
PE(pos+k,2i)&=\sin(\alpha+\beta)\\
&=\sin(\alpha)\cos(\beta)+\cos(\alpha)\sin(\beta)\\
&=PE(pos,2i) \cdot \cos(\beta)+PE(pos,2i+1) \cdot \sin(\beta)\\
PE(pos+k,2i+1)&=\cos(\alpha+\beta)\\
&=\cos(\alpha)\cos(\beta)-\sin(\alpha)\sin(\beta)\\
&=PE(pos,2i+1) \cdot \cos(\beta)-PE(pos,2i) \cdot \sin(\beta)
\end{align*}
$$

In matrix form, this becomes:

$$
\begin{bmatrix}
PE(pos+k,2i)\\
PE(pos+k,2i+1)
\end{bmatrix}
=
\begin{bmatrix}
\cos(\beta) & \sin(\beta)\\
-\sin(\beta) & \cos(\beta)
\end{bmatrix}
\cdot
\begin{bmatrix}
PE(pos,2i)\\
PE(pos,2i+1)
\end{bmatrix}
$$

This is a 2D rotation matrix parameterized solely by $k$ (the relative distance)! The absolute positions $pos$ and $pos+k$ don’t matter — only their difference $k$ determines the transformation. This is the theoretical foundation that should enable relative position modeling.
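You can verify the rotation property numerically for any single dimension pair (a small sketch, with the pair index and offsets chosen arbitrarily):

```python
import numpy as np

d_model, base = 512, 10000.0
i = 3                                         # an arbitrary dimension pair
omega = 1.0 / base ** (2 * i / d_model)       # that pair's angular frequency

def pe_pair(pos):
    """The (sin, cos) pair for this frequency at a given position."""
    return np.array([np.sin(pos * omega), np.cos(pos * omega)])

pos, k = 47, 13
beta = k * omega
rotation = np.array([[ np.cos(beta), np.sin(beta)],
                     [-np.sin(beta), np.cos(beta)]])

# PE(pos + k) is exactly PE(pos) rotated by an angle that depends only on k.
print(pe_pair(pos + k))
print(rotation @ pe_pair(pos))                # same vector, up to floating-point error
```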

Why This Matters (In Theory)

The rotation property means that $PE(pos+k)$ can be represented as a linear transformation of $PE(pos)$ where the transformation depends only on the relative offset $k$, not the absolute positions. During attention computation, when comparing a Query at position $m$ with a Key at position $n$, the dot product should theoretically capture their relative distance $|m-n|$ rather than forcing the model to memorize all possible $(m,n)$ combinations.

This should enable two powerful properties. First, generalization: the model learns relationships like “attend to 3 tokens back” instead of “attend from position 47 to position 44”. Second, compositionality: relative position patterns learned on short sequences should transfer seamlessly to longer ones without retraining.

The Reality Check

However — and this is the crucial caveat — this elegant property gets partially destroyed in practice. The culprit is how Transformers actually use these encodings. The position encoding is added to token embeddings, then the combined vector is projected through learnable matrices $W_Q, W_K$ that have no knowledge of the underlying rotational structure. The final attention dot product mixes content and position in complex, learned ways that don’t preserve the rotation property. We’ll dissect exactly how this breaks down in the final section.

Q4: Why exponential frequency progression?

The formula $1/10000^{2i/d_{model}}$ creates a geometric (exponential) series of wavelengths. Why not linear spacing?

Think about coverage first. The first dimension pair ($i=0$) has wavelength $2\pi \approx 6.28$ tokens, completing a full cycle every ~6 tokens to capture immediate neighbors. The last dimension pair ($i=d_{model}/2-1$) has wavelength $2\pi \times 10000 \approx 62,800$ tokens, operating at document-level structure. A geometric progression efficiently spans 4-5 orders of magnitude (from ~6 to ~60,000 tokens) with just 256-512 dimensions. You couldn’t pull this off with linear spacing without wasting dimensions.

There’s also a linguistic argument. Language is hierarchically structured: tokens compose into words (~6 tokens), words into phrases (~20 tokens), phrases into sentences (~50 tokens), sentences into paragraphs (~200 tokens), and paragraphs into documents (~1000+ tokens). The exponential scale naturally aligns with this hierarchical structure, letting different dimension groups specialize for different linguistic levels. It’s almost as if the encoding scheme discovered the inherent structure of language through pure mathematical elegance.

Finally, there’s the redundancy problem. If adjacent dimensions had similar frequencies, they’d encode redundant information. The exponential spacing ensures each dimension pair provides unique positional signal, maximizing information density in the embedding space.

Q5: Why 10,000 specifically?

TL;DR: It’s a hyperparameter tuned through experimentation, not derived from first principles. But it has solid reasoning.

Wavelength range analysis

For $d_{model}=512$, you get a min wavelength of $2\pi \approx 6.28$ (good for local context) and a max wavelength of $2\pi \times 10000 \approx 62,800$ (far exceeds typical training lengths of 512-1024). This ensures uniqueness — even the slowest-changing dimensions don’t complete a full cycle within training sequences — while providing coverage across all scales, from token-to-token changes to global positioning.

What if we used different values?

With base = 100, your max wavelength is only $\approx 628$ tokens. Too short for modeling document structure. You get poor long-range position discrimination. With base = 1,000,000, your max wavelength explodes to $\approx 6.28$ million tokens. Now high-dimensional encodings change so slowly they provide almost no signal on typical sequences. You’d also risk numerical instability.

Frequency spacing

For $d_{model}=512$, consecutive dimension pairs have frequency ratio $10000^{1/256} \approx 1.036$. This ~3.6% frequency difference provides smooth, consistent progression — not too sparse (causing information loss) nor too dense (causing redundancy).
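The arithmetic behind those numbers, as a quick check:

```python
import numpy as np

d_model, base = 512, 10000.0

min_wavelength = 2 * np.pi            # i = 0: ~6.28 tokens per cycle
max_wavelength = 2 * np.pi * base     # i -> d_model/2: ~62,800 tokens per cycle
ratio = base ** (1 / (d_model / 2))   # frequency ratio between consecutive pairs

print(round(min_wavelength, 2), round(max_wavelength))   # 6.28, 62832
print(round(ratio, 4))                                   # 1.0366, i.e. ~3.6% per step
```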

Empirical note: the base is a tunable hyperparameter, not a derived constant; different implementations pick different values, and the sweet spot depends on expected sequence lengths and model capacity.

There is a caveat…

Despite the mathematical elegance, sinusoidal encoding fails to extrapolate well in practice. When sequences exceed training length, performance degrades. The culprit: additive integration and attention’s nonlinear transformations.

The breakdown (step by step)

Consider tokens at positions $m$ and $n$:

Step 1: Addition mixes signals

$$
x_m = w_m + p_m \quad (\text{token} + \text{position})
$$

Content and position are now entangled in the same vector space.

Step 2: Linear projections

$$
\begin{align*}
q_m &= x_m W_Q = (w_m + p_m)W_Q = w_m W_Q + p_m W_Q\\
k_n &= x_n W_K = (w_n + p_n)W_K = w_n W_K + p_n W_K
\end{align*}
$$

The learnable matrices $W_Q, W_K$ have no knowledge of the rotational structure in $p$.

Step 3: Attention dot product destroys rotational property

$$
\begin{align*}
q_m \cdot k_n^T &= (w_m W_Q + p_m W_Q)(w_n W_K + p_n W_K)^T\\
&= \underbrace{w_m W_Q W_K^T w_n^T}_{\text{content-content}} + \underbrace{w_m W_Q W_K^T p_n^T}_{\text{content-pos}}\\
&\quad + \underbrace{p_m W_Q W_K^T w_n^T}_{\text{pos-content}} + \underbrace{p_m W_Q W_K^T p_n^T}_{\text{pos-pos}}
\end{align*}
$$

The beautiful relative-position-as-rotation property only exists in the raw $PE(pos)$ vectors. After projection and mixing, it’s destroyed. The model perceives an entangled combination of token-token similarity, token-position interaction, and position-position interaction. The rotation matrix property is not preserved through $W_Q W_K^T$.
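A small random-matrix demo of that entanglement (toy shapes and Gaussian weights of my own choosing; only the algebra matters):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

w_m, w_n = rng.normal(size=(2, d_model))                        # token content vectors
p_m, p_n = rng.normal(size=(2, d_model))                        # positional encodings
W_Q, W_K = rng.normal(size=(2, d_model, d_model)) / np.sqrt(d_model)

q_m = (w_m + p_m) @ W_Q
k_n = (w_n + p_n) @ W_K

terms = {
    "content-content": (w_m @ W_Q) @ (w_n @ W_K),
    "content-pos":     (w_m @ W_Q) @ (p_n @ W_K),
    "pos-content":     (p_m @ W_Q) @ (w_n @ W_K),
    "pos-pos":         (p_m @ W_Q) @ (p_n @ W_K),
}

# The attention logit is exactly the sum of the four cross terms,
# but the model only ever sees the entangled total.
print(np.isclose(q_m @ k_n, sum(terms.values())))   # True
print(terms)
```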

The solution?

Rotary Positional Embedding (RoPE) solves this by applying rotation after projection (to $q$ and $k$ directly), ensuring the rotation operation commutes with the dot product, and making relative position explicit in attention scores. But that’s a story for the next episode.


Key takeaway: Sinusoidal encoding is theoretically elegant but practically compromised by how Transformers integrate it. It’s a clever hack that works well enough for most use cases, but modern architectures like RoPE improve on its limitations by respecting the geometry throughout the attention computation.
