LLM Explorer – KV Cache Calculator

Model Configuration

Number of Layers (L): transformer blocks in the stack

Query Heads (h_q): number of query heads (multi-head attention)

KV Heads (h_kv): MHA: = h_q | GQA: < h_q | MQA: = 1

Head Dimension (d_k): dimension per head (typically 64, 80, or 128)

Sequence Length (tokens): number of tokens in the context

Batch Size: number of sequences processed in parallel

Precision: bytes per stored element in the cache

KV Cache Size

Cache per token

0 KB

Cache for the sequence

0 MB

Cache with batch

0 GB

K cache

0 GB

V cache

0 GB

KV Cache Formula

Cache = 2 × Layers × KV_Heads × Head_Dim × Seq_Len × Bytes × Batch_Size

Factor 2: K and V are stored separately
GQA benefit: KV_Heads < Q_Heads → the cache shrinks proportionally
MQA extreme: KV_Heads = 1 → maximum savings
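
The formula above translates directly into a few lines of Python. The following is a minimal sketch, not the calculator's actual implementation: the function names and the example configuration (32 layers, 32 KV heads, head dimension 128, FP16) are assumptions chosen for illustration, and sizes are reported in binary units (1 KiB = 1024 bytes).

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len,
                   bytes_per_elem=2, batch_size=1):
    """Total KV cache size in bytes, following the formula above."""
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem * batch_size


def human_readable(num_bytes):
    """Format a byte count with binary units (1 KiB = 1024 bytes)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024


# Hypothetical configuration: 32 layers, 32 KV heads, head dim 128, FP16.
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
per_sequence = kv_cache_bytes(32, 32, 128, seq_len=4096)
with_batch = kv_cache_bytes(32, 32, 128, seq_len=4096, batch_size=8)

print("per token:   ", human_readable(per_token))
print("per sequence:", human_readable(per_sequence))
print("with batch:  ", human_readable(with_batch))
```

The K cache and the V cache each account for exactly half of the total, since the factor 2 is the only place where they are distinguished.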

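Why a KV Cache?

Without a cache: every new token requires recomputing full attention over all previous tokens (O(n²) per token). With a cache: only the new token's query, key, and value are computed, and the K/V of earlier tokens are read from memory (O(n) per token). The resulting speedup is typically on the order of 5-10×, depending on sequence length.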
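
To make the O(n²) versus O(n) difference concrete, the sketch below counts query-key score computations per head and layer. It is a simplified cost model under stated assumptions, not a benchmark: the function names are hypothetical, causal masking is ignored, and projections, MLP layers, and memory bandwidth are left out, so the ratio it prints is not a prediction of wall-clock speedup.

```python
def attention_scores_without_cache(n):
    """Query-key scores computed when generating token n with no cache.

    All n positions are run through attention again, so n queries each
    attend to n keys (causal masking is ignored for simplicity).
    """
    return n * n


def attention_scores_with_cache(n):
    """Query-key scores for token n when earlier K/V are cached.

    Only the single new query is compared against the n stored keys.
    """
    return n


def total_scores(step_cost, num_tokens):
    """Sum the per-token cost over a generation of num_tokens tokens."""
    return sum(step_cost(n) for n in range(1, num_tokens + 1))


num_tokens = 1000
no_cache = total_scores(attention_scores_without_cache, num_tokens)
cached = total_scores(attention_scores_with_cache, num_tokens)

print(f"without cache: {no_cache:,} scores")
print(f"with cache:    {cached:,} scores")
```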

GQA Trade-off

Grouped Query Attention: several query heads share each KV head. Llama 2 70B: 64 query heads → 8 KV heads = 8× cache reduction at a reported quality loss below 1%. This gives quality close to MHA with a cache size much closer to MQA.
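
The 8× figure can be checked with a short calculation. This sketch assumes the published Llama 2 70B shape (80 layers, 64 query heads, head dimension 128) and FP16 storage; the MHA and MQA rows are hypothetical variants of the same model, included only for comparison.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem=2):
    """KV cache bytes needed for one token (factor 2 for K and V)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


LAYERS, Q_HEADS, HEAD_DIM = 80, 64, 128

# Number of KV heads per attention variant; only GQA with 8 KV heads
# matches the real model, the other two are hypothetical comparisons.
variants = {"MHA": Q_HEADS, "GQA": 8, "MQA": 1}

per_token = {
    name: kv_bytes_per_token(LAYERS, kv_heads, HEAD_DIM)
    for name, kv_heads in variants.items()
}

for name, size in per_token.items():
    ratio = per_token["MHA"] // size
    print(f"{name}: {size / 1024:.0f} KiB per token ({ratio}x smaller than MHA)")
```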

Precision Options

FP16 (default): good balance. INT8: 2× memory reduction relative to FP16, minimal quality loss. INT4: 4× reduction, with noticeable but often acceptable loss. FP32: research use only.
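
The reduction factors follow directly from the bytes stored per cached element. The sketch below is a simplification: the table and function names are assumptions for illustration, and real quantized caches also store scaling metadata, so the effective savings are slightly smaller than these nominal values.

```python
# Nominal storage per cached element, in bytes.
BYTES_PER_ELEMENT = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}


def reduction_vs_fp16(precision):
    """Nominal memory reduction factor relative to an FP16 cache."""
    return BYTES_PER_ELEMENT["FP16"] / BYTES_PER_ELEMENT[precision]


for precision in BYTES_PER_ELEMENT:
    print(f"{precision}: {reduction_vs_fp16(precision):g}x vs. FP16")
```

A factor below 1 (as for FP32) means the cache grows rather than shrinks.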

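GPU Limits

A100 80GB: roughly 64K tokens (Llama 2 70B, FP16). H100 80GB: similar. With INT8: roughly 128K. With GQA instead of MHA: 8× more tokens in the same memory. Combining these techniques can make contexts of 1M+ tokens feasible. These are rough figures; the actual limit depends on how much GPU memory remains for the cache after model weights and activations.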
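
Token limits like these come from dividing the memory available for the cache by the cache size per token. The sketch below assumes the published Llama 2 70B shape (80 layers, 8 KV heads, head dimension 128) and a purely hypothetical cache budget of 20 GiB; that budget is an illustrative assumption, not a figure from this page, and the result scales linearly with whatever budget is actually available.

```python
def max_tokens(cache_budget_bytes, layers, kv_heads, head_dim,
               bytes_per_elem=2, batch_size=1):
    """Largest sequence length whose KV cache fits into the given budget."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem * batch_size
    return int(cache_budget_bytes // per_token)


GIB = 1024 ** 3
budget = 20 * GIB  # hypothetical memory left over for the cache

fp16_gqa = max_tokens(budget, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)
int8_gqa = max_tokens(budget, layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1)
fp16_mha = max_tokens(budget, layers=80, kv_heads=64, head_dim=128, bytes_per_elem=2)

print(f"FP16, GQA (8 KV heads):  {fp16_gqa:,} tokens")
print(f"INT8, GQA (8 KV heads):  {int8_gqa:,} tokens")
print(f"FP16, MHA (64 KV heads): {fp16_mha:,} tokens")
```

Under this assumed budget, halving the bytes per element doubles the token limit, and going from 64 to 8 KV heads raises it by a factor of 8.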

MHA vs. GQA vs. MQA Comparison

Cache Growth with Sequence Length