Superficial Drive

Why transformers normalize, LayerNorm vs RMSNorm internals, pre-norm gradient highways, and BatchNorm side notes—all in a lean, hack-ready walkthrough.
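As a quick preview of the contrast that article draws, here is a minimal sketch (not the article's own code) of LayerNorm versus RMSNorm over the last dimension; function names and shapes are illustrative assumptions.

```python
import torch

def layer_norm(x, gamma, beta, eps=1e-5):
    # LayerNorm: subtract the mean and divide by the standard deviation, then scale and shift.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-6):
    # RMSNorm: skip mean subtraction and rescale by the root-mean-square only.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return gamma * x / rms

x = torch.randn(2, 4, 8)                  # (batch, seq, hidden)
gamma, beta = torch.ones(8), torch.zeros(8)
print(layer_norm(x, gamma, beta).shape, rms_norm(x, gamma).shape)
```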

Engineering dissection of Rotary Positional Embedding (RoPE) mechanics, scaling hacks, and MiniMind implementation details
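For orientation, a minimal sketch of applying rotary embeddings to a query/key tensor; the interleaved rotate-pair formulation below is one common variant and an assumption here, not necessarily MiniMind's exact implementation.

```python
import torch

def rope(x, base=10000.0):
    # x: (batch, seq, heads, head_dim) with even head_dim
    b, t, h, d = x.shape
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))   # (d/2,)
    angles = torch.arange(t).float()[:, None] * inv_freq[None, :]    # (t, d/2)
    cos = angles.cos()[None, :, None, :]                             # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each 2D pair (x1, x2) by a position-dependent angle.
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 16, 4, 64)
print(rope(q).shape)   # (1, 16, 4, 64)
```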

Dissecting sinusoidal positional embeddings in the Transformer model
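A minimal sketch of the sinusoidal encoding from "Attention Is All You Need", PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the helper name and NumPy usage are illustrative assumptions.

```python
import numpy as np

def sinusoidal_pe(seq_len, d_model, base=10000.0):
    # Build the (seq_len, d_model) table of position encodings.
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2), even d_model assumed
    angles = pos / np.power(base, i / d_model)     # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions get cosine
    return pe

print(sinusoidal_pe(128, 512).shape)   # (128, 512)
```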

Dissecting BPE tokenization — the critical first layer between human text and neural networks
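As a taste of what that article covers, a minimal sketch of one BPE training step, counting adjacent symbol pairs and merging the most frequent; the toy corpus and helper names are assumptions for illustration only.

```python
from collections import Counter

def most_frequent_pair(corpus):
    # corpus: list of words, each a tuple of symbols; count all adjacent pairs.
    pairs = Counter()
    for word in corpus:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(corpus, pair):
    # Replace every occurrence of the chosen pair with a single merged symbol.
    merged = []
    for word in corpus:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(tuple(out))
    return merged

corpus = [tuple("lower"), tuple("lowest"), tuple("newer"), tuple("wider")]
for _ in range(3):                       # learn three merges
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    print("merged", pair, "->", corpus)
```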
