As large language models scale to longer context windows and serve more concurrent users, the key-value (KV) cache has become a primary memory bottleneck in production inference. For a 30-billion-parameter model with a batch size of 128 and an input length of 1,024 tokens, the KV cache can consume roughly 180 GB of memory, about three times the size of the fp16 model weights. Compressing the KV cache reduces memory pressure, allows larger batch sizes, and improves throughput without retraining. The sections below survey leading techniques across three families: token eviction, quantization, and low-rank approximation.
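To see where that figure comes from, here is a back-of-the-envelope sketch, assuming OPT-30B-like shapes (48 layers, hidden size 7,168) and fp16 (2-byte) storage:

```python
# Rough KV cache size estimate; the layer count and hidden size below are
# assumptions matching OPT-30B, not values stated for all 30B models.
def kv_cache_bytes(layers: int, hidden: int, batch: int, seq_len: int,
                   bytes_per_elem: int = 2) -> int:
    # 2x for keys and values: each token stores one hidden-wide K vector
    # and one V vector per layer.
    return 2 * layers * batch * seq_len * hidden * bytes_per_elem

size = kv_cache_bytes(layers=48, hidden=7168, batch=128, seq_len=1024)
print(f"{size / 1e9:.0f} GB")  # -> 180 GB
```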
Token Eviction with H2O (Heavy Hitter Oracle)
H2O (NeurIPS 2023) keeps a fixed-size cache made up of a small set of "heavy hitter" tokens, those that accumulate the majority of attention mass, plus the most recent tokens. Because heavy hitters are identified from attention scores as decoding proceeds, H2O improves generation throughput, up to 29× over leading inference systems on OPT-30B, with minimal accuracy loss, but it does not reduce prefill computation.
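A minimal sketch of the eviction rule for a single attention head follows; the cache class, its budget split, and the greedy victim rule are illustrative assumptions, not the paper's reference implementation:

```python
# H2O-style eviction sketch: tokens accumulate attention mass over decoding
# steps; when the cache is full, the weakest non-recent token is evicted.
class H2OCache:
    def __init__(self, budget: int, recent: int):
        self.budget = budget      # total KV slots to keep
        self.recent = recent      # slots reserved for the newest tokens
        self.keys, self.values = [], []
        self.acc_score = []       # accumulated attention mass per cached token

    def step(self, k, v, attn_row):
        """attn_row: this step's attention weights over the current cache."""
        for i, a in enumerate(attn_row):
            self.acc_score[i] += float(a)   # heavy hitters accumulate mass
        self.keys.append(k)
        self.values.append(v)
        self.acc_score.append(0.0)
        if len(self.keys) > self.budget:
            # Evict the non-recent token with the least accumulated attention;
            # high-scoring "heavy hitters" survive indefinitely.
            victim = min(range(len(self.keys) - self.recent),
                         key=lambda i: self.acc_score[i])
            for buf in (self.keys, self.values, self.acc_score):
                del buf[victim]
```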
StreamingLLM (Attention Sink Retention)
StreamingLLM retains the first few tokens ("attention sinks") together with a sliding window of recent tokens, enabling generation over effectively unbounded streams. It is fast and hardware-friendly, but it discards middle-context tokens regardless of their semantic importance, so it is best suited to streaming dialogue where recency matters most.
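The retention rule is simple enough to state in a few lines; the sizes below are assumptions (the paper keeps roughly four sink tokens):

```python
# StreamingLLM-style retention sketch: keep the first n_sink "attention sink"
# positions plus a sliding window of the most recent positions.
def streaming_llm_keep(seq_len: int, n_sink: int = 4, window: int = 1020) -> list:
    if seq_len <= n_sink + window:
        return list(range(seq_len))                  # nothing to evict yet
    sinks = list(range(n_sink))                      # always-kept sink tokens
    recent = list(range(seq_len - window, seq_len))  # sliding window
    return sinks + recent

print(len(streaming_llm_keep(100_000)))  # cache stays at 1,024 positions
```

Note that StreamingLLM also assigns positional indices relative to the cache rather than the original text, which this index-only sketch omits.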
SnapKV (Observation Window Compression)
SnapKV targets the prefill stage for long prompts: it uses the attention that a small "observation window" of final prompt tokens pays to earlier positions to select, per attention head, the most informative KV entries, clustering its selections with pooling. Compressing the prompt's cache before generation begins reduces both the memory footprint and the per-step cost of subsequent decoding.
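A rough sketch of the selection step for a single head, with a simplified pooling kernel; the window size, budget handling, and function name are illustrative rather than the paper's implementation (it assumes budget > window):

```python
import numpy as np

# SnapKV-style selection sketch: the last `window` prompt tokens vote on
# which earlier KV positions to keep.
def snapkv_select(attn: np.ndarray, window: int, budget: int) -> np.ndarray:
    """attn: (prompt_len, prompt_len) prefill attention weights.
    Returns indices of the prompt KV positions to keep."""
    prefix_len = attn.shape[0] - window
    # Vote: total attention the observation window pays each earlier token.
    votes = attn[-window:, :prefix_len].sum(axis=0)
    # Pool over neighbors so kept positions form coherent clusters.
    votes = np.convolve(votes, np.ones(5) / 5, mode="same")
    keep_prefix = np.sort(np.argsort(votes)[-(budget - window):])
    # The observation window itself is always retained.
    return np.concatenate([keep_prefix, np.arange(prefix_len, attn.shape[0])])
```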
Further methods include quantization techniques such as KIVI and GEAR, which store cache entries in a few bits per value, and low-rank approaches such as LoRA-based cache compression; each offers a different trade-off among compression ratio, accuracy, and speed.
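As a flavor of the quantization family, here is a minimal sketch of asymmetric low-bit quantization in the spirit of KIVI, which quantizes keys per-channel and values per-token; the 2-bit setting and the absence of grouping and residuals are simplifying assumptions:

```python
import numpy as np

# Asymmetric uniform quantization along a chosen axis: per-channel (axis=0)
# for keys, per-token (axis=1) for values, following KIVI's observation that
# key outliers cluster in channels.
def quantize(x: np.ndarray, bits: int = 2, axis: int = 0):
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum((hi - lo) / (2**bits - 1), 1e-8)  # avoid divide-by-zero
    codes = np.round((x - lo) / scale).astype(np.uint8)
    return codes, scale, lo

def dequantize(codes, scale, lo):
    return codes * scale + lo

keys = np.random.randn(1024, 128).astype(np.float32)   # (tokens, head_dim)
k_codes, k_scale, k_lo = quantize(keys, axis=0)        # per-channel for keys
vals_q = quantize(np.random.randn(1024, 128).astype(np.float32), axis=1)
```

At 2 bits plus per-group scale and zero-point, the cache shrinks to roughly an eighth of its fp16 size, which is the source of these methods' high compression ratios.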