
Transformer Attention Visualizer

Interactive visualization of the scaled dot-product attention mechanism. Click any token to inspect its attention distribution across the full context window.

Attention Matrix — all tokens
[Figure: 10×10 heatmap of token-to-token attention weights; color scale runs from low to high.]

⚡ Infrastructure Implication

This 10×10 matrix represents attention for a 10-token sentence. Scale this to production:

128K: GPT-4 context window
16B: attention ops per layer at that length
O(n²): compute scaling with sequence length
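
To make the O(n²) figure concrete, here is the back-of-envelope arithmetic behind the numbers above (a sketch that counts one score per query-key pair; real FLOP counts also depend on head dimension):

# One attention score per query-key pair: the score matrix is n × n.
n_demo = 10           # tokens in the demo sentence above
n_prod = 128_000      # GPT-4-class context window ("128K")

print(n_demo ** 2)                 # 100 scores in the 10×10 demo matrix
print(n_prod ** 2)                 # 16,384,000,000 → ~16B scores per layer
print(n_prod ** 2 // n_demo ** 2)  # ~164,000,000× more work than the demo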

The KV cache stores the K and V vectors for every previous token, eliminating recomputation on each new token. Its memory cost grows linearly with context length, which is why 1M-token windows require tens of GB of GPU memory per active session.
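
A minimal sketch of that linear-memory claim. The model shape below (layers, KV heads, head dimension, fp16) is a hypothetical mid-size configuration with grouped-query attention, not any published model's numbers:

# KV cache: 2 tensors (K and V) per layer, per token.
n_layers = 32          # hypothetical model depth
n_kv_heads = 4         # grouped-query attention: few KV heads
head_dim = 128
bytes_per_value = 2    # fp16

bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
for n_tokens in (128_000, 1_000_000):
    gb = n_tokens * bytes_per_token / 1e9
    print(f"{n_tokens:>9,} tokens → {gb:6.1f} GB per session")

At this shape the cache costs ~64 KB per token, so 1M tokens lands in the tens of GB; deeper models, more KV heads, or fp32 caches push it far higher.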

Attention From Selected Token
[Figure: attention distribution from the selected token "she" across the rest of the sentence.]

Q · K · V Mechanism

Every token generates three vectors from its embedding:

Q — "what am I looking for?"
K — "what do I contain?"
V — "what do I pass forward?"

Attention score = Q·Kᵀ / √dₖ
Output = softmax(scores) · V

The √dₖ scaling prevents dot products from growing large enough to saturate the softmax, which would kill gradients during training.
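
A minimal NumPy sketch of the full mechanism described above; the shapes and random inputs are illustrative, not tied to any real model:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q·Kᵀ / √dₖ) · V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) pairwise scores
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ V                              # weighted mix of V vectors

rng = np.random.default_rng(0)
n, d_k = 10, 64        # 10-token sentence, 64-dim head
Q, K, V = (rng.standard_normal((n, d_k)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (10, 64)

Row i of the weights matrix is exactly the attention distribution the visualizer shows when you click token i.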

Scaled Dot-Product Attention

Attention(Q,K,V) =
 softmax( Q·Kᵀ / √dₖ ) · V

// dₖ = head dimension (d/h)
// h = number of heads
// Repeated h times in parallel
// → Multi-Head Attention
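
And a sketch of how the single-head computation above is repeated h times in parallel; the projection matrices here are random placeholders where a real layer has learned weights:

import numpy as np

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """Split d into h heads of size dₖ = d/h, attend per head, concat, project."""
    n, d = x.shape
    d_k = d // h
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                   # (n, d) each
    # Reshape to (h, n, d_k): one slice per head, computed in parallel.
    Q, K, V = (m.reshape(n, h, d_k).transpose(1, 0, 2) for m in (Q, K, V))
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)                 # softmax per head
    heads = w @ V                                      # (h, n, d_k)
    concat = heads.transpose(1, 0, 2).reshape(n, d)    # re-join the heads
    return concat @ Wo                                 # output projection

rng = np.random.default_rng(0)
n, d, h = 10, 512, 8
x = rng.standard_normal((n, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
print(multi_head_attention(x, Wq, Wk, Wv, Wo, h).shape)  # (10, 512)

Each head attends over the same tokens with its own projections, which lets different heads track different relationships at once.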