⚡ Infrastructure Implication
This 10×10 matrix represents attention for a 10-token sentence. Scale this to production:
The KV cache stores the K and V vectors of every previous token so they are not recomputed at each decoding step. Memory grows linearly with context length, which is why 1M-token windows can require tens of GB of GPU memory per active session.
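A back-of-envelope sketch of that growth, assuming a hypothetical 7B-class configuration (32 layers, 8 KV heads via grouped-query attention, head dimension 128, fp16); none of these numbers come from the text:

n_layers   = 32      # transformer layers (hypothetical)
n_kv_heads = 8       # KV heads, grouped-query attention (hypothetical)
d_head     = 128     # dimension per head (hypothetical)
bytes_el   = 2       # fp16/bf16 element size

# One K and one V vector per layer, per KV head, per token
bytes_per_token = 2 * n_layers * n_kv_heads * d_head * bytes_el
print(bytes_per_token // 1024, "KB per token")             # 128 KB
print(bytes_per_token * 1_000_000 / 1e9, "GB at 1M tokens")  # ~131 GB

Under these assumptions a full 1M-token cache lands near 131 GB; multi-query attention (a single KV head) or 8-bit cache quantization cuts that by roughly 8-16x, into the tens-of-GB range cited above.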
[Interactive heatmap: attention from the selected token]
Q · K · V Mechanism
Every token generates three vectors from its embedding through learned projection matrices (sketched in code after this list):
Q — "what am I looking for?"
K — "what do I contain?"
V — "what do I pass forward?"
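A minimal numpy sketch of these projections for the 10-token example; the dimensions and random weights are toy stand-ins for parameters a real model learns during training:

import numpy as np

rng = np.random.default_rng(0)
d_model = 8                           # toy embedding width
x = rng.normal(size=(10, d_model))    # one embedding per token

# Learned projection matrices (random here, trained in a real model)
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_model)) for _ in range(3))

Q = x @ W_Q    # "what am I looking for?"
K = x @ W_K    # "what do I contain?"
V = x @ W_V    # "what do I pass forward?"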
Attention scores = Q·Kᵀ / √dₖ
Output = softmax(scores) · V
The √dₖ scaling matters because dot products grow with dimension: for unit-variance vectors, q·k has variance dₖ. Unscaled scores would saturate the softmax into near one-hot rows, killing gradients during training.
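Putting both formulas together, a self-contained sketch with the same toy shapes as above; subtracting the row max before exponentiating is a standard numerical-stability trick, not something from the text:

import numpy as np

rng = np.random.default_rng(0)
d_k = 8
Q, K, V = (rng.normal(size=(10, d_k)) for _ in range(3))  # toy projections

scores = Q @ K.T / np.sqrt(d_k)   # (10, 10): row i = token i's scores
# Without / sqrt(d_k), score variance grows with d_k and rows go near one-hot

# Row-wise softmax; subtracting the max avoids overflow, result is unchanged
w = np.exp(scores - scores.max(axis=-1, keepdims=True))
w /= w.sum(axis=-1, keepdims=True)

out = w @ V    # each output row is a weighted mix of the V vectors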
Scaled Dot-Product Attention
softmax( Q·Kᵀ / √dₖ ) · V
// dₖ = head dimension (d/h)
// h = number of heads
// Repeated h times in parallel
// → Multi-Head Attention
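A compact multi-head version, running h toy heads in parallel on one input; shapes and weights are again assumptions for illustration, not from the text:

import numpy as np

rng = np.random.default_rng(0)
d_model, h = 8, 2
d_head = d_model // h                 # dₖ = d/h, as in the comment above
x = rng.normal(size=(10, d_model))    # toy 10-token input

def attention(Q, K, V):
    # softmax( Q·Kᵀ / √dₖ ) · V, applied independently per head
    s = Q @ K.swapaxes(-1, -2) / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

# One projection per head for Q, K, V: shapes (h, d_model, d_head)
W_Q, W_K, W_V = (rng.normal(size=(h, d_model, d_head)) for _ in range(3))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V   # each (h, 10, d_head): h heads at once
heads = attention(Q, K, V)            # scaled dot-product attention per head
out = heads.transpose(1, 0, 2).reshape(10, d_model)  # concatenate heads
# A real block would follow with one more learned output projection

Each head sees only d/h dimensions, so the total cost roughly matches single-head attention while letting heads specialize on different relations.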