
Google TurboQuant achieves 6x LLM KV cache compression at ICLR 2026

2026-04-11 15:55

Google Research published TurboQuant, an algorithm presented at ICLR 2026 that compresses LLM key-value (KV) caches to 3–4 bits with no accuracy loss and no model retraining. The two-stage method combines PolarQuant, which applies a random rotation to vectors before quantization, with a Quantized Johnson-Lindenstrauss (QJL) error corrector, yielding a 6x memory reduction and up to 8x faster attention on H100 GPUs. The technique directly targets one of the primary bottlenecks of long-context and large-batch inference: KV cache memory.
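The two-stage idea described above, rotate then quantize, then quantize the residual error at even lower precision, can be illustrated with a minimal NumPy sketch. This is a hypothetical toy illustration of the general rotate-quantize-correct pattern, not the paper's implementation: the function names, the uniform quantizer, and the bit widths are all assumptions, and the actual PolarQuant rotation and QJL corrector are more sophisticated.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix
    # (stand-in for the rotation step; not PolarQuant's actual transform).
    q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return q

def uniform_quantize(x, bits):
    # Toy symmetric uniform quantizer with 2**bits levels.
    half = 2 ** bits / 2 - 1
    scale = float(np.abs(x).max()) or 1.0
    return np.round(x / scale * half) * scale / half

def two_stage_quantize(v, rot, bits=4, corr_bits=2):
    # Stage 1: rotate, then coarsely quantize. A random rotation spreads
    # energy evenly across coordinates, shrinking per-coordinate outliers.
    r = rot @ v
    q1 = uniform_quantize(r, bits)
    # Stage 2: quantize the stage-1 residual at lower precision,
    # correcting part of the distortion (the error-corrector idea).
    q2 = uniform_quantize(r - q1, corr_bits)
    return q1 + q2  # dequantized approximation, still in rotated space

d = 64
rot = random_rotation(d)
v = rng.normal(size=d)                    # stand-in for one KV vector
approx = rot.T @ two_stage_quantize(v, rot)
rel_err = np.linalg.norm(v - approx) / np.linalg.norm(v)
```

Because the rotation is orthogonal, it can be inverted exactly with its transpose at dequantization time, so the only error comes from the quantizers themselves; the second stage trades a few extra bits for a smaller residual.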
