⚡ Infrastructure Implication
This 10×10 matrix represents attention for a 10-token sentence. Scale this to production:
The KV cache stores the K and V vectors of every previous token so they are not recomputed at each decoding step. Memory grows linearly with context length, which is why 1M-token windows can require tens of GB of GPU memory per active session.
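A back-of-envelope sketch of that growth, assuming a hypothetical 7B-class configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16); none of these numbers come from the text:

n_layers   = 32      # transformer layers (hypothetical)
n_kv_heads = 8       # KV heads, grouped-query attention (hypothetical)
d_head     = 128     # dimension per head (hypothetical)
bytes_el   = 2       # fp16/bf16 element size

# One K and one V vector per layer, per KV head, per token
bytes_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_el
print(bytes_per_token // 1024, "KB per token")             # 128 KB
print(bytes_per_token * 1_000_000 / 1e9, "GB at 1M tokens")  # ~131 GB

Under these assumptions a full 1M-token cache lands near 131 GB; multi-query attention (a single KV head) or 8-bit cache quantization cuts that by roughly 8-16x, into the tens-of-GB range cited above.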
[Interactive heatmap: attention from the selected token]
Q · K · V Mechanism
Every token generates three vectors from its embedding through learned projection matrices (sketched in code after this list):
Q — "what am I looking for?"
K — "what do I contain?"
V — "what do I pass forward?"
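A minimal numpy sketch of these projections for the 10-token example; the dimensions and random weights are toy stand-ins for parameters a real model learns during training:

import numpy as np

rng = np.random.default_rng(0)
d_model = 8                           # toy embedding width
x = rng.normal(size=(10, d_model))    # one embedding per token

# Learned projection matrices (random here, trained in a real model)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = x @ W_Q    # "what am I looking for?"
K = x @ W_K    # "what do I contain?"
V = x @ W_V    # "what do I pass forward?"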
Attention scores = Q·Kᵀ / √dₖ
Output = softmax(scores) · V
The √dₖ scaling matters because dot products grow with dimension: for unit-variance vectors, q·k has variance dₖ. Unscaled scores would saturate the softmax into near one-hot rows, killing gradients during training.
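Putting both formulas together, a self-contained sketch with the same toy shapes as above; subtracting the row max before exponentiating is a standard numerical-stability trick, not something from the text:

import numpy as np

rng = np.random.default_rng(0)
d_k = 8
Q, K, V = (rng.normal(size=(10, d_k)) for _ in range(3))  # toy projections

scores = Q @ K.T / np.sqrt(d_k)   # (10, 10): row i = token i's scores
# Without / sqrt(d_k), score variance grows with d_k and rows go near one-hot

# Row-wise softmax; subtracting the max avoids overflow, result is unchanged
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)

out = w @ V    # each output row is a weighted mix of the V vectors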
Scaled Dot-Product Attention
softmax( Q·Kᵀ / √dₖ ) · V
// dₖ = head dimension (d/h)
// h = number of heads
// Repeated h times in parallel
// → Multi-Head Attention
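A compact multi-head version, running h toy heads in parallel on one input; shapes and weights are again assumptions for illustration, not from the text:

import numpy as np

rng = np.random.default_rng(0)
d_model, h = 8, 2
d_head = d_model // h                 # dₖ = d/h, as in the comment above
x = rng.normal(size=(10, d_model))    # toy 10-token input

def attention(Q, K, V):
    # softmax( Q·Kᵀ / √dₖ ) · V, applied independently per head
    s = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

# One projection per head for Q, K, V: shapes (h, d_model, d_head)
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V   # each (h, 10, d_head): h heads at once
heads = attention(Q, K, V)            # scaled dot-product attention per head
out = heads.transpose(1, 0, 2).reshape(10, d_model)  # concatenate heads
# A real block would follow with one more learned output projection

Each head sees only d/h dimensions, so the total cost roughly matches single-head attention while letting heads specialize on different relations.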