Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving

Compression Gains, Fully Realized in Real Serving

End-to-end request throughput of Tangram vs. vLLM across KV cache ratios on Qwen3-4B, Llama-3.1-8B, Gemma-3-12B, and GPT-OSS-20B

KV cache compression on vLLM with Ragged Paging
Non-uniform and uniform KV cache compression, natively integrated into vLLM.

Seamless vLLM integration
Fully compatible with paged attention, continuous batching, chunked prefill, and CUDA graph mode.

Real memory reclamation
Compressed KV cache is actually freed, turning memory savings into higher serving throughput.

Zero runtime scheduling overhead
Budget reservation and ahead-of-time (AOT) load balancing keep compression off the critical path.

Abstract

Multi-turn LLM serving accumulates dialogue history whose Key-Value (KV) cache grows with every turn and every user, quickly exceeding the model weights themselves and making memory, not compute, the binding constraint on throughput. Non-uniform KV compression, which allocates heterogeneous budgets across attention heads, preserves accuracy far better than uniform schemes, yet remains impractical: modern serving stacks force every head to the same KV length, so compressing a specific head is structurally impossible, and the memory it would free collapses back into page fragmentation. Systems that realize heterogeneity anyway spend up to 25% of prefill time reclaiming scattered pages and skew GPU workloads, which either inflates decode latency by up to 1.7× or burns 15–20% of each decode step on re-planning.

We observe that this heterogeneity need not be discovered at runtime: head-wise retention follows a two-level structural regularity—an input-invariant head ranking with narrowly bounded per-head ratios—that can be calibrated offline from as few as 50 samples. Building on this insight, we present Tangram, a serving framework that statically resolves what prior systems handle dynamically: (1) Budget Reservation fixes each head's post-compression footprint at scheduling time, eliminating page reclamation; (2) Ragged Paging clusters similar-budget heads into independent page tables, turning fragmentation into reclaimable memory; and (3) Ahead-of-Time (AOT) Load Balancing precomputes balanced GPU partitions with zero runtime planning. Implemented on vLLM, Tangram serves as a drop-in substrate for existing non-uniform compression methods, matching their accuracy while improving end-to-end throughput by up to 2.6× over the full-KV baseline.

KV cache growth and uniform vs. non-uniform compression

Problem Statement: KV Cache Explosion in Multi-turn LLM Serving

In multi-turn serving, each turn appends to the dialogue history (H_t), so the KV cache scales linearly with the number of turns and concurrent users (a). For Qwen2.5-32B, just 16 concurrent sessions surpass the model weights themselves within ten turns and keep growing—making memory, not compute, the binding constraint on batch size and throughput. Compression is therefore essential, but how the budget is distributed across heads decides whether accuracy survives (b): uniform schemes starve the few retrieval heads that carry long-range information, degrading accuracy, whereas non-uniform compression preserves it (c).
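The linear growth described above can be made concrete with a little arithmetic. The sketch below is illustrative only: the layer count, KV-head count, head dimension, 16-bit precision, parameter count, and the 2,000 tokens appended per turn are all assumptions chosen to resemble a Qwen2.5-32B-style model, not figures taken from this page.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    # Keys and values are both cached, hence the leading factor of 2.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_bytes(sessions, turns, tokens_per_turn, per_token_bytes):
    # The history grows by tokens_per_turn each turn, so the cache is linear
    # in both the number of turns and the number of concurrent sessions.
    return sessions * turns * tokens_per_turn * per_token_bytes


# ASSUMED configuration (hypothetical, not stated in the text).
PER_TOKEN = kv_bytes_per_token(num_layers=64, num_kv_heads=8, head_dim=128)
WEIGHT_BYTES = 32.5e9 * 2  # assumed ~32.5B parameters stored in 16 bits


def first_turn_exceeding_weights(sessions, tokens_per_turn, max_turns=100):
    # Return the first turn at which the aggregate KV cache outgrows the weights.
    for turn in range(1, max_turns + 1):
        if kv_cache_bytes(sessions, turn, tokens_per_turn, PER_TOKEN) > WEIGHT_BYTES:
            return turn
    return None


print(PER_TOKEN)                               # 262144 bytes per token
print(first_turn_exceeding_weights(16, 2000))  # 8
```

Under these assumed numbers, 16 sessions cross the weight footprint at turn 8; a different tokens-per-turn value shifts the crossover but not the linear trend.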

Why non-uniform? Per-head KV retention is highly heterogeneous within a layer—a small subset of retrieval heads carries long-range context while most attend only locally. A budget that mirrors this skew preserves accuracy under aggressive compression, exactly where uniform truncation collapses. The question for a real serving system is whether this heterogeneity is predictable enough to plan around—it is, and that observation is what Tangram is built on.

Systemic Challenges of Non-uniform KV Cache

The serving stack—PagedAttention, continuous batching with chunked prefill, and optimized attention kernels (FlashDecoding/FlashInfer)—is architected end-to-end around a single implicit assumption: all attention heads hold KV caches of identical length. Non-uniform compression violates this assumption at every layer of the stack, exposing three fundamental limitations—and each one motivates a corresponding Tangram technique:

1. Monolithic Page Structure → Per-Head Compression Is Impossible

A request's block table records a single KV length, so every head must be managed at the same length; there is simply no way to hand an individual head a shorter cache. Compressing a specific head is therefore structurally impossible: even after a head's entries are evicted, its slots stay pinned to the length of the longest-retaining head, and the memory that compression should have freed collapses into page fragmentation.
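A toy calculation shows how much capacity a single shared length can strand. The function name and the per-head lengths below are hypothetical and chosen only for illustration.

```python
def monolithic_waste(head_lengths):
    """Fraction of allocated KV slots holding no live entry when every head
    is pinned to the length of the longest-retaining head."""
    allocated = max(head_lengths) * len(head_lengths)
    live = sum(head_lengths)
    return (allocated - live) / allocated


# One retrieval-style head keeps 8000 entries; the others keep far fewer.
lengths = [8000, 1000, 1000, 500, 500, 500, 250, 250]
print(monolithic_waste(lengths))  # 0.8125
```

In this made-up case, over 80% of the allocated slots are fragmentation even though compression evicted most entries.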

2. Page Management Overhead

Each head's post-compression footprint is unknown until the forward pass computes importance scores at runtime. The scheduler must therefore over-allocate, then run a costly compress-and-reclaim pass—identifying scattered freed pages, returning them to the pool, and remapping page tables in flight. This control-plane churn consumes up to 25% of prefill execution time.

3. Workload Imbalance across GPU SMs

FlashDecoding's static, uniform KV splits produce stragglers under heterogeneous lengths, inflating decode attention latency by up to 1.7×. FlashInfer's dynamic rebalancing restores utilization but loses plan reuse: a unique partition must be recomputed for every layer at every step, burning 15–20% of each decode iteration on the CPU.

**Workload imbalance on decode attention.** Heterogeneous per-head KV lengths **(b)** skew per-thread-block work versus the uniform case **(a)**; the decode step is gated by the heaviest blocks, inflating attention latency by up to **1.7×** at the same total KV size **(c)**.

Core Methodologies

Because head-wise retention is a stable, model-intrinsic structure, Tangram calibrates it once, offline, and turns every runtime burden of non-uniformity into a deterministic, pre-scheduled decision. Each limitation above maps to exactly one technique:

Monolithic Page Structure (one KV length forced on every head; per-head compression impossible) → Ragged Paging (per-group page tables reclaim the freed capacity)

Page Management Overhead (runtime compress-and-reclaim, up to 25% of prefill) → Budget Reservation (exact pages reserved before execution)

Workload Imbalance (stragglers at 1.7×, or 15–20% re-planning) → AOT Load Balancing (partition precomputed offline, zero runtime cost)

1. Budget Reservation

Fixes each head's budget to its offline-calibrated value, so the scheduler reserves exactly the post-compression pages at scheduling time—eliminating over-allocation and the entire compress-and-reclaim path.

2. Ragged Paging

Breaks the single-length constraint: per-group page tables let similar-budget heads be managed, and compressed, at their own length, so the capacity that compression frees becomes physically reclaimable. A Vectorized Block Table (OpenMP + SIMD) keeps the added control-plane cost negligible.

3. AOT Load Balancing

Precomputes a workload partition map from the reserved budget profiles, delivering balanced SM utilization with zero runtime planning overhead.

**Tangram system overview.** The three techniques compose end-to-end: **Budget Reservation** reserves each head's static budget before execution, **Ragged Paging** gives similar-budget head groups independent page tables to reclaim fragmented memory, and **AOT Load Balancing** precomputes a balanced Split-KV partition.

Budget Reservation

Tangram eliminates the dynamic compress-and-reclaim bottleneck by replacing runtime-decided compression with a static, offline-calibrated budget for every head, letting the scheduler reserve exactly the required pages before execution.

Observation: A Two-Level Structural Regularity

Head-wise retention is intrinsic to the model, not driven by the input: the ranking of heads by retention demand is essentially input-invariant, and each head's absolute ratio varies only within a narrow, estimable band. In the figure, heads are sorted by their calibrated budget; the per-task markers (Summary / Code / Retrieve) rise monotonically along that single ordering rather than reshuffling it, and the narrow boxes show low variance across 50 samples—reframing "runtime uncertainty" as a statically resolvable model property.

Per-head KV retention with reserved static budget

Per-head retention rates (KVzip, 50% target ratio) across Llama-3.1-8B, Gemma-3-12B, and GPT-OSS-20B. Dashed lines mark the static budget Tangram reserves with safety coefficient α=2.

Each head's static budget is calibrated from just 50 pilot samples with a small safety margin (α=2), which absorbs per-input deviation. Because Tangram fixes only the budget and leaves each method's scoring function untouched, it is a drop-in substrate for any non-uniform compression method—adding no cost to the serving path.
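The following is a minimal sketch of what offline calibration followed by up-front page reservation could look like. The page does not spell out how the safety coefficient α enters the budget, so the rule used here (mean retention plus α standard deviations, capped at 1.0) is an assumption, as are the function names and the page size of 16 tokens.

```python
import math
import statistics


def calibrate_budgets(samples, alpha=2.0):
    """samples[i][h] is head h's observed retention ratio on pilot sample i.

    ASSUMPTION: budget = mean + alpha * population stdev, capped at 1.0.
    The actual calibration rule may differ.
    """
    num_heads = len(samples[0])
    budgets = []
    for h in range(num_heads):
        ratios = [sample[h] for sample in samples]
        budget = statistics.mean(ratios) + alpha * statistics.pstdev(ratios)
        budgets.append(min(1.0, budget))
    return budgets


def pages_to_reserve(budgets, num_tokens, page_size=16):
    """Pages each head needs after compression, computable before the
    forward pass because the budgets are fixed offline."""
    return [math.ceil(b * num_tokens / page_size) for b in budgets]


pilot = [[0.9, 0.1], [0.8, 0.2]]  # two pilot samples, two heads (toy data)
print(calibrate_budgets(pilot))
print(pages_to_reserve([0.5, 0.25], num_tokens=1024))  # [32, 16]
```

Because the reservation depends only on the prompt length and the fixed budgets, the scheduler never needs to wait for runtime importance scores before sizing the allocation.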

Ragged Paging

The monolithic page forces one KV length on every head. Ragged Paging lifts that constraint—giving similar-budget head groups their own page tables—so a head can finally be compressed at its own length, and the freed memory is actually returned to the pool.

Budget-Aware Head Clustering

Heads are sorted by their offline budget B_ℓ,h and partitioned into groups of H_p heads. Each group's footprint is bounded by its local maximum, tightly tracking the retention its members actually need. Co-locating similar-budget heads—rather than adjacent ones—reclaims an additional 12–25% of the full KV cache.
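The effect of sorting before grouping can be sketched in a few lines. The budgets below are made-up token counts, and the function names are hypothetical; the only rule taken from the text is that each group is sized by its largest member.

```python
def group_footprint(budgets, order, heads_per_group):
    """Total reserved KV length when heads are grouped in the given order and
    every group is sized by its largest member."""
    total = 0
    for start in range(0, len(order), heads_per_group):
        group = order[start:start + heads_per_group]
        total += max(budgets[h] for h in group) * len(group)
    return total


def compare_grouping(budgets, heads_per_group):
    # Adjacent grouping keeps heads in their native index order.
    adjacent = list(range(len(budgets)))
    # Budget-aware grouping sorts heads by budget first.
    by_budget = sorted(adjacent, key=lambda h: budgets[h], reverse=True)
    return (group_footprint(budgets, adjacent, heads_per_group),
            group_footprint(budgets, by_budget, heads_per_group))


budgets = [900, 100, 800, 100, 200, 700, 100, 100]  # toy per-head budgets
print(compare_grouping(budgets, heads_per_group=4))  # (6400, 4000)
```

In this toy case the large-budget heads are scattered across both adjacent groups, so each group is sized by a heavy head; sorting concentrates them into one group and shrinks the second group's footprint to that of its light members.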

Vectorized Block Table

Fine-grained grouping naively raises CPU scheduling cost to O(N_req × H/H_p). Tangram aggregates block-table operations with OpenMP across groups and SIMD (AVX-512) within a group, keeping CPU overhead negligible even at small H_p.

Fragmentation vs. management-overhead trade-off — **(a)** A unified page (vLLM) is sized by the longest head, so most capacity is fragmentation. **(b)** Ragged Paging (H_p=4) allocates and reclaims per group. **(c)** H_p trades fragmentation against block-table overhead; the Vectorized Block Table lowers that curve, making small H_p practical.

Memory reclaimed with budget-aware clustering — **Effectiveness of Budget-Aware Clustering.** Grouping heads of similar budget reclaims an additional **12–25%** of the full KV cache over grouping adjacent heads at the same H_p, consistently across Qwen3-4B, Llama-3.1-8B, Gemma-3-12B, and GPT-OSS-20B.

Ahead-of-Time (AOT) Load Balancing

Heterogeneous KV lengths skew per-thread-block workloads, inflating decode attention latency by up to 1.7×. Dynamic rebalancing recovers SM utilization but loses plan reuse across layers, paying 15–20% of decode time per step. Tangram avoids both:

Static Workload Partition Table: Because per-head budgets are fixed, CTAs are distributed across head groups proportional to aggregated group budget, computed once offline.

Zero Runtime Planning: The kernel simply reads the precomputed table at each step, eliminating per-layer planner cost while preserving balanced SM execution.

The result: load-balanced GPU utilization without the latency penalty of online planning.
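A proportional partition table of this kind could be built offline roughly as follows. This is a sketch under stated assumptions, not the actual kernel planner: the largest-remainder rounding, the guarantee of at least one CTA per group, and the function names are all illustrative choices, and group budgets are assumed to be positive.

```python
def build_partition_table(group_budgets, total_ctas):
    """Assign CTAs to head groups in proportion to each group's aggregated
    budget. ASSUMPTION: every group gets one CTA first, and the rest are
    split proportionally with largest-remainder rounding."""
    n = len(group_budgets)
    if total_ctas < n:
        raise ValueError("need at least one CTA per group")
    total = sum(group_budgets)
    spare = total_ctas - n
    exact = [b * spare / total for b in group_budgets]
    ctas = [1 + int(x) for x in exact]
    leftover = total_ctas - sum(ctas)
    # Hand remaining CTAs to the groups with the largest fractional share.
    by_remainder = sorted(range(n), key=lambda g: exact[g] - int(exact[g]),
                          reverse=True)
    for g in by_remainder[:leftover]:
        ctas[g] += 1
    return ctas


def heaviest_cta_load(group_budgets, ctas):
    # The decode step finishes only when the most loaded CTA finishes.
    return max(b / c for b, c in zip(group_budgets, ctas))


groups = [3600, 400]  # toy aggregated budgets for two head groups
table = build_partition_table(groups, total_ctas=8)
print(table)                                # [6, 2]
print(heaviest_cta_load(groups, table))     # 600.0
print(heaviest_cta_load(groups, [4, 4]))    # 900.0 with a uniform split
```

Since the budgets are fixed, the table is computed once and simply looked up at each decode step; in this toy example the heaviest CTA carries 600 units of work instead of 900 under a uniform split.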

Decode attention latency: FlashInfer vs FlashDecoding vs AOT-LB — **Decode attention latency** across compression rates (batch size 4). AOT Load Balancing consistently achieves the lowest latency: FlashDecoding suffers stragglers from heuristic static partitioning, while FlashInfer pays to recompute per-layer partitions every step. Tangram pre-computes the partition offline, balancing SM utilization at zero runtime cost.

Evaluation Setup

Models

Five dense & Mixture-of-Experts models, all with >100K context: Qwen3-4B, Llama-3.1-8B, Gemma-3-12B, GPT-OSS-20B, Qwen3-30B-A3B.

Workload

SCBench — shared-context, multi-turn tasks split into Short (<20K), Mid (20–100K), and Long (>100K). Budgets calibrated offline from 50 pilot samples (α=2).

Methods & Baselines

Drop-in over Ada-SnapKV, Expected Attention, FastKVzip; compared against full-KV vLLM and the FlashDecoding / FlashInfer kernels.

Built on vLLM with custom FlashAttention CUDA kernels, on 4× NVIDIA A100 80GB. All throughput, latency, and fragmentation numbers are measured on real hardware — not simulated.

Accuracy Results

Throughput gains are only meaningful if accuracy holds. A natural worry is that freezing budgets offline sacrifices the input-adaptivity that makes non-uniform compression accurate—our results show the opposite. For each method (Ada-SnapKV, Expected Attention, FastKVzip), running it under Tangram (w/ Tangram) closely tracks its original implementation (w/o Tangram) across compression ratios—and in some cases exceeds it—while running free of the memory inefficiencies that made it impractical to serve.

Accuracy: w/ Tangram vs w/o Tangram vs Full-KV — **Accuracy vs. compression ratio** on SCBench, across all five models and three methods. Despite pinning each head group to a static offline budget, *w/ Tangram* matches the original *w/o Tangram* accuracy—a faithful substrate, not a different compressor.

Throughput Breakdown

Built on top of vLLM, Tangram is evaluated end-to-end against the vanilla baseline across multi-turn workloads. By reclaiming fragmented memory and removing control-plane overhead, it converts a method's algorithmic compression into real serving throughput—up to 2.6× over the full-KV baseline, with the gain growing as context length increases from Short to Long.

Key Evaluation Insights

Compression, fully realized: the 2.6× gain combines the capacity freed by compression with the system efficiency Tangram adds. Tangram's contribution is not the compression ratio (inherited from the underlying method) but the conversion of that ratio into realized throughput.
Additive techniques: Budget Reservation, Ragged Paging, and AOT Load Balancing each contribute independently—together bridging the gap between theoretical KV-cache reduction and served performance.
Latency under load: under 75% eviction, Tangram sustains low TTFT as request rate grows, whereas vLLM's TTFT rises sharply.

Prefill latency breakdown: static vs dynamic allocation — **Page reclamation, eliminated.** Dynamic allocation (D) pays page-reclaim overhead (orange) that grows with the eviction rate—up to **+24.9%** prefill latency (Qwen3-4B) and **+20.0%** (Llama-3.1-8B). Budget Reservation (S) reserves the exact footprint up front: **zero** reclaim cost.

**Throughput vs. heads per page (H_p).** Small H_p cuts fragmentation but adds page-table cost; large H_p reclaims less. The sweet spot **H_p=4–8** holds across all models and eviction rates, and the Vectorized Block Table makes small H_p practical.

TTFT versus request rate — **TTFT under load** (30K avg length, 75% evicted). The capacity Tangram frees keeps time-to-first-token flat as request rate climbs, while vLLM's TTFT spikes.

BibTeX

@misc{kim2026tangramunlockingnonuniformkv,
      title={Tangram: Unlocking Non-Uniform KV Cache Compression for Efficient Multi-turn LLM Serving}, 
      author={Hyungmin Kim and Minsoo Kim and Hongseok Kim and Jungwook Choi},
      year={2026},
      eprint={2606.06302},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2606.06302}, 
}