TANGRAM Icon Tangram: Unlocking Non-Uniform KV Cache
for Efficient Multi-turn LLM Serving

Hyungmin Kim*, Minsoo Kim*, Jungwook Choi
Hanyang University
{kong4274, minsoo2333, choij}@hanyang.ac.kr

Evaluation Highlights

2.6×

Throughput Improvement

Memory Savings

Minimal

Accuracy Loss

Evaluated on Qwen3-4B, Qwen2.5-7B-1M, and Qwen2.5-32B across SCBench, LoCoMo, RealTalk, and LongMemEval, Tangram preserves conversational accuracy while significantly outperforming state-of-the-art serving frameworks such as vLLM.

Abstract

Multi-turn Large Language Model (LLM) serving is critical for consistent user experiences, yet the linear growth of the Key-Value (KV) cache imposes significant pressure on GPU memory and bandwidth. Non-uniform KV compression effectively preserves more information by considering the individual importance of each KV cache. However, such KV cache heterogeneity introduces various systemic challenges—including memory fragmentation, scheduling complexities, and diminished kernel utilization—which collectively lead to significant inefficiencies in existing LLM serving systems.

To overcome these challenges, we present Tangram, a novel serving system designed to make Non-uniform KV caches practical. Tangram addresses systemic inefficiencies through three core techniques: (1) Deterministic Budget Allocation assigns a static memory footprint to each head based on its intrinsic pattern, entirely eliminating dynamic scheduling overhead; (2) Head Group Page clusters attention heads with similar retention demands and manages them with independent, vectorized page tables, thereby maximizing physical memory reclamation; and (3) Ahead-of-Time (AOT) Load Balancing leverages static budget profiles to ensure uniform GPU utilization without runtime overhead. Experimental results show that Tangram improves throughput by up to 2.6× compared to existing baselines, while fully preserving model accuracy.

Challenge: Multi-turn History Growth

Problem Statement: KV Cache Explosion in Multi-turn LLM Serving

In multi-turn serving, each turn appends to the dialogue history (Ht), making the KV cache scale linearly with the number of turns. Even at moderate batch sizes, this footprint surpasses the model weights themselves, becoming the primary bottleneck for scalability and throughput. KV cache compression is therefore essential, but uniform compression discards context-essential information, motivating Non-uniform KV compression.

Why Non-uniform KV Cache?

Head-wise Heterogeneity

The Insight: Head-wise Concentration Heterogeneity

Attention heads exhibit diverse concentration patterns: some focus on a few critical tokens while others spread across the context. By assigning each head a budget proportional to its retrieval role, Non-uniform compression produces up to a 42× disparity in per-head KV size while preserving conversational accuracy at aggressive compression rates.

Systemic Challenges of Non-uniform KV Cache

Systemic Challenges

Modern serving stacks (PagedAttention, Continuous Batching, FlashDecoding/FlashInfer) are co-designed under a uniform-KV assumption. Non-uniform compression breaks this assumption, exposing three fundamental limitations:

1. Monolithic Page Structure → Page Fragmentation

A unified physical block spans all layers and heads simultaneously, so per-head heterogeneity cannot be reclaimed. Each page is sized by the longest-retaining head (Lmax), leaving the others mostly empty and locking compressed memory inside structural dead space.

2. Dynamic Reclamation Bottleneck

Reclaiming scattered pages on-the-fly forces the scheduler to track and re-plan compressed footprints at runtime. This CPU overhead degrades overall system throughput.

3. Workload Imbalance across GPU SMs

Uniform KV splits become "stragglers" under heterogeneous lengths, inflating decode attention latency by up to 1.7×. Dynamic rebalancing (FlashInfer-style) restores utilization but loses plan reuse across layers, costing 15–20% of decode time per step.

Core Methodologies

1. Deterministic Budget Allocation

Replaces runtime-decided compression with offline-profiled, per-head static budgets, eliminating the dynamic compress-and-reclaim bottleneck and enabling precise page planning.

2. Head Group PagedAttention

Clusters heads by budget similarity and assigns each group an independent page table. A Vectorized Block Table (SIMD + OpenMP) keeps fine-grained grouping CPU-cheap.

3. AOT Load Balancing

Precomputes a per-layer Workload Partition Table from static budget profiles, achieving balanced SM utilization with zero runtime planning overhead.

Tangram System Overview
Tangram system overview. Three methodologies cooperate end-to-end across the serving stack: (a) Deterministic Budget Allocation profiles each head's KV retention offline and assigns a static, non-uniform budget per head; (d) Head Group Page clusters heads by budget similarity and gives each group its own page table — bounding allocation by the local maximum instead of the global one and reclaiming structural dead space; (e) AOT Load Balancing precomputes a Split-KV table from the static budgets so GPU SMs receive balanced work with zero per-step planning cost.

Deterministic Budget Allocation

Tangram eliminates the dynamic evict-and-reclaim bottleneck by replacing runtime-decided compression with a static, offline-profiled budget for every head.

Observation: Head-wise Budgets are Model-Intrinsic

Per-head retention rates are highly heterogeneous within a layer, yet each head's budget is input-independent and stable across samples and domains (verified across Qwen2.5-7B, Qwen3-4B, LLaMA3.1-8B). This reframes "runtime uncertainty" as a statically resolvable model property.

Stability of Head-Wise Budgets

Per-head retained KV sizes are nearly invariant across 50 sample inputs.

Each head's static budget is profiled from just 50 pilot samples, enabling precise planning and fused prefill+compression without runtime overhead.

Head Group PagedAttention

Illustration of Head Group PagedAttention
Head Group PagedAttention. Heads are sorted by static budget and clustered into groups of size G; each group manages its own independent page table, so allocation is dictated by the local — not global — maximum.

Budget-Aware Head Clustering

For each layer, heads are sorted by their offline budget Bℓ,h and partitioned into H/G groups. Page capacity is bounded by the local maximum within a group, tightly aligning allocations to actual demand and reclaiming the dead space caused by short-retention heads.

Vectorized Block Table

Fine-grained grouping naively raises CPU scheduling cost to O(Nreq × H/G). Tangram batches block-table operations with OpenMP across groups and SIMD (AVX-512) within a group, keeping CPU overhead negligible even at small G.

Fragmentation vs. Overhead Trade-off
Group size G trades off fragmentation against block-table overhead. The Vectorized Block Table shifts the overhead curve down, letting Tangram pick a small G for maximum memory reclamation without throughput loss.

Ahead-of-Time (AOT) Load Balancing

Heterogeneous KV lengths skew per-thread-block workloads, inflating decode attention latency by up to 1.7×. Dynamic rebalancing recovers SM utilization but loses plan reuse across layers, paying 15–20% of decode time per step. Tangram avoids both:

Static Workload Partition Table: Because per-head budgets are fixed, CTAs are distributed across head groups proportional to aggregated group budget, computed once offline.
Zero Runtime Planning: The kernel simply reads the precomputed table at each step, eliminating per-layer planner cost while preserving balanced SM execution.

The result: load-balanced GPU utilization without the latency penalty of online planning.

Evaluation Highlights

2.6×

Throughput Improvement

Memory Savings

Minimal

Accuracy Loss

Evaluated on Qwen3-4B, Qwen2.5-7B-1M, and Qwen2.5-32B across SCBench, LoCoMo, RealTalk, and LongMemEval, Tangram preserves conversational accuracy while significantly outperforming state-of-the-art serving frameworks such as vLLM.

Accuracy Results

Throughput gains are only meaningful if accuracy holds. Compared against uniform compression (SnapKV) and non-uniform compression (KVzip, FastKVzip), Tangram's static-budget allocation tracks the full-KV baseline across Short, Mid, and Long context scales — even at aggressive retention rates.

Accuracy Results
Accuracy vs. KV Retention Rate on SCBench / LoCoMo / RealTalk / LongMemEval across Qwen3-4B, Qwen2.5-7B-1M, and Qwen2.5-32B (TP=2). Tangram maintains near-baseline accuracy where SnapKV and KVzip variants degrade.

Throughput Breakdown

Built on top of vLLM, Tangram is evaluated end-to-end against the vanilla baseline across multi-turn workloads. By reclaiming structural dead space and eliminating control-plane overhead, it converts theoretical compression gains into real serving throughput.

Memory Usage and Throughput Comparison
Memory usage and end-to-end throughput on Qwen3-4B, Qwen2.5-7B-1M, and Qwen2.5-32B (TP=2).

Key Evaluation Insights

  • Breaking the Memory Cap: vLLM's monolithic page structure caps the runnable batch size early; Head Group PagedAttention reclaims dead space to sustain far more concurrent requests at the same memory footprint.
  • Saturation Resilience: While vLLM saturates as KV pressure rises, Tangram continues to scale, achieving up to 2.6× throughput under identical KV cache budgets.
  • Effective Memory Savings: When pushed to high compression, Tangram delivers up to effective memory savings — translating algorithmic compression into system-level gains.
Latency Breakdown across Compression Rates (Top) Throughput by Head Group Size (Bottom)
Decomposing where Tangram's gains come from.

(Top) Latency breakdown across various compression rates. Dynamic allocation (Dyn. Comp.) incurs significant page reclamation overhead that scales with the eviction rate — adding up to +24.9% latency on Qwen3-4B and +20.0% on Qwen2.5-7B at 75% compression. In contrast, Tangram's deterministic budget allocation operates with zero page reclaim overhead, keeping end-to-end latency flat across all compression rates and isolating model execution as the only meaningful cost.

(Bottom) Throughput across various compression rates by head group size (G). The group size G governs the fragmentation–overhead trade-off: small G tightly packs per-head budgets to minimize fragmentation but inflates block-table management cost, while large G amortizes management overhead yet fails to effectively reclaim memory from short-retention heads. The sweet spot sits at moderate G across all three models (Qwen3-4B, Qwen2.5-7B-1M, Qwen2.5-32B with TP=2), confirming that Tangram's Vectorized Block Table makes small-G regimes practical without paying the CPU overhead penalty.

BibTeX

-