KV Cache Headroom Calculator

How much HBM is left for KV cache after loading model weights — and what that means for concurrent requests and max context. AMD MI355X (288 GB) vs NVIDIA B200 (192 GB), memory capacity only.

671B params (37B active) · 61 layers · MLA (rank 512 + 64 RoPE) · config.json
64K tok
MI355X: +100% more concurrent requests per node
602 vs 301 requests at 64K context, DeepSeek-R1-0528 (671B, MoE)

MI355X

FITS
288 GB HBM3E · 8.0 TB/s · 8-GPU node
Usable HBM (×0.9)2074 GB
Model weights671 GB
Free KV-cache memory1387 GB
KV per request @ 64K2.3 GB
Max concurrent requests @ 64K602
Max context @ batch = 1160K (model limit)

B200

FITS
192 GB HBM3E · 8.0 TB/s · 8-GPU node
Usable HBM (×0.9)1382 GB
Model weights671 GB
Free KV-cache memory695 GB
KV per request @ 64K2.3 GB
Max concurrent requests @ 64K301
Max context @ batch = 1160K (model limit)
~100% more concurrent requests per node at comparable node cost (directional; pricing varies by deal).
Same MLA architecture as DeepSeek-V3. Reasoning traces consume context, so long-context headroom matters more here.
Assumptions & formulas

Weight memory

  • weights = total_params × bytes_per_param with FP4 = 0.5 B, FP6 = 0.75 B, FP8 = 1 B per parameter.
  • MoE models keep all experts resident in HBM, so total — not active — parameters are charged. Active params only affect compute, which this tool does not model.
  • Real checkpoints keep some tensors (embeddings, lm_head, layernorms, sometimes attention) at higher precision than the headline quant, so actual footprints run a few percent above this estimate. GPT-OSS ships MXFP4 (~4.25 bits incl. block scales) on MoE weights only.

KV cache per token

  • MHA/GQA: 2 × num_layers × num_kv_heads × head_dim × kv_bytes (the 2 is K and V), with FP8 = 1 B, FP16 = 2 B per element.
  • MLA (DeepSeek V3/R1, Kimi K2): num_layers × (kv_lora_rank + qk_rope_head_dim) × kv_bytes. K and V are reconstructed from one shared compressed latent per token per layer (512 + 64 = 576 elements), so there is no ×2 and no per-head multiplier. For DeepSeek V3 at FP8 that is ~35 KB/token vs ~69 MB/token if you (incorrectly) applied the MHA formula to its 128 heads.
  • Sliding-window / chunked layers (GPT-OSS, Gemma 3, Llama 4): when the toggle is on, those layers are charged min(context, window) tokens instead of full context. This is the theoretical minimum and requires a hybrid KV allocator (vLLM v1 and recent TRT-LLM/SGLang support this); engines without one allocate full-length KV on every layer — use the toggle off for that case.

Free KV memory

  • free_kv = (HBM × 0.9 × num_gpus) − weights − (2 GB × num_gpus)
  • The 0.9 factor mirrors vLLM's default gpu_memory_utilization=0.9.
  • The 2 GB/GPU covers runtime context, activation workspace, CUDA graphs and collective-comms buffers. It is a typical floor; large-batch or long-prefill deployments can use more.
  • 8-GPU mode assumes tensor/expert-parallel sharding with weights and KV spread evenly across the node; per-GPU replication overheads (e.g. duplicated embeddings at high TP) are ignored.

Derived outputs

  • max_concurrent = floor(free_kv ÷ kv_per_request(context)), assuming every request sits at the full chosen context. Real serving with mixed-length requests, prefix caching, and paged-KV block reuse does better; this is the conservative dense-occupancy bound.
  • Max context @ batch=1 is the largest context whose KV fits in free memory, capped at the model's own max_position_embeddings (annotated "model limit" when capped).
  • The headline compares max_concurrent between the two GPUs at identical model, precision and context settings.
  • The TCO line is directional only: it restates the concurrency edge as throughput-per-node and assumes comparable node cost. It does not model real pricing, power, or utilization — those vary by deal.

Hardware

  • MI355X: 288 GB HBM3E, 8.0 TB/s (AMD spec). B200: 192 GB HBM3E, 8.0 TB/s (NVIDIA spec). Bandwidth is shown for context only — throughput is not modeled.
  • B200 capacity caveat: the B200 chip datasheet says 192 GB, but DGX B200 materials list 1,440 GB per 8 GPUs (180 GB/GPU) and nvidia-smi on shipping parts reports ~183 GiB. If your SKU exposes 180 GB, B200 results here are optimistic by ~6%.
  • All capacities use decimal GB (1 GB = 10⁹ bytes), consistent with both vendors' specs and with parameter counts in decimal billions.
  • Where specs are ambiguous, this tool resolves in the B200's favor: it is credited with the full nominal 192 GB rather than the 180 GB that DGX/HGX system materials list as usable, and both GPUs are charged identical utilization factors, overheads and formulas.

Not modeled

  • Throughput, latency, interconnect, or compute differences (FLOPS).
  • Paged-KV block fragmentation (typically <2% waste with 16-token blocks).
  • Prefix caching / shared-prompt deduplication.
  • Speculative-decoding draft models or LoRA adapters resident in HBM.
  • CPU/NVMe KV offload — this is HBM-only headroom.

Model configs

  • Every architecture value comes from the model's official HuggingFace config.json, linked under the model selector. Gated repos (Meta, Mistral, Google) were cross-checked against verbatim public mirrors; anything not directly confirmed is marked TODO_VERIFY in lib/models.ts and carries an unverified badge above.