KV Cache Headroom Calculator

How much HBM is left for KV cache after loading model weights — and what that means for concurrent requests and max context. AMD MI355X (288 GB) vs NVIDIA B200 (192 GB), memory capacity only.

Model

671B params (37B active) · 61 layers · MLA (rank 512 + 64 RoPE) · config.json

Weight precision

KV cache precision

GPUs

Context length per request

64K tok

MI355X: +100% more concurrent requests per node

602 vs 301 requests at 64K context, DeepSeek-R1-0528 (671B, MoE)

MI355X

FITS

288 GB HBM3E · 8.0 TB/s · 8-GPU node

Usable HBM (×0.9)2074 GB

Model weights671 GB

Free KV-cache memory1387 GB

KV per request @ 64K2.3 GB

Max concurrent requests @ 64K602

Max context @ batch = 1160K (model limit)

B200

FITS

192 GB HBM3E · 8.0 TB/s · 8-GPU node

Usable HBM (×0.9)1382 GB

Model weights671 GB

Free KV-cache memory695 GB

KV per request @ 64K2.3 GB

Max concurrent requests @ 64K301

Max context @ batch = 1160K (model limit)

~100% more concurrent requests per node at comparable node cost (directional; pricing varies by deal).

ℹ Same MLA architecture as DeepSeek-V3. Reasoning traces consume context, so long-context headroom matters more here.

Assumptions & formulas

Weight memory

weights = total_params × bytes_per_param with FP4 = 0.5 B, FP6 = 0.75 B, FP8 = 1 B per parameter.
MoE models keep all experts resident in HBM, so total — not active — parameters are charged. Active params only affect compute, which this tool does not model.
Real checkpoints keep some tensors (embeddings, lm_head, layernorms, sometimes attention) at higher precision than the headline quant, so actual footprints run a few percent above this estimate. GPT-OSS ships MXFP4 (~4.25 bits incl. block scales) on MoE weights only.

KV cache per token

MHA/GQA: 2 × num_layers × num_kv_heads × head_dim × kv_bytes (the 2 is K and V), with FP8 = 1 B, FP16 = 2 B per element.
MLA (DeepSeek V3/R1, Kimi K2): num_layers × (kv_lora_rank + qk_rope_head_dim) × kv_bytes. K and V are reconstructed from one shared compressed latent per token per layer (512 + 64 = 576 elements), so there is no ×2 and no per-head multiplier. For DeepSeek V3 at FP8 that is ~35 KB/token vs ~69 MB/token if you (incorrectly) applied the MHA formula to its 128 heads.
Sliding-window / chunked layers (GPT-OSS, Gemma 3, Llama 4): when the toggle is on, those layers are charged min(context, window) tokens instead of full context. This is the theoretical minimum and requires a hybrid KV allocator (vLLM v1 and recent TRT-LLM/SGLang support this); engines without one allocate full-length KV on every layer — use the toggle off for that case.

Free KV memory

free_kv = (HBM × 0.9 × num_gpus) − weights − (2 GB × num_gpus)
The 0.9 factor mirrors vLLM's default gpu_memory_utilization=0.9.
The 2 GB/GPU covers runtime context, activation workspace, CUDA graphs and collective-comms buffers. It is a typical floor; large-batch or long-prefill deployments can use more.
8-GPU mode assumes tensor/expert-parallel sharding with weights and KV spread evenly across the node; per-GPU replication overheads (e.g. duplicated embeddings at high TP) are ignored.

Derived outputs

max_concurrent = floor(free_kv ÷ kv_per_request(context)), assuming every request sits at the full chosen context. Real serving with mixed-length requests, prefix caching, and paged-KV block reuse does better; this is the conservative dense-occupancy bound.
Max context @ batch=1 is the largest context whose KV fits in free memory, capped at the model's own max_position_embeddings (annotated "model limit" when capped).
The headline compares max_concurrent between the two GPUs at identical model, precision and context settings.
The TCO line is directional only: it restates the concurrency edge as throughput-per-node and assumes comparable node cost. It does not model real pricing, power, or utilization — those vary by deal.

Hardware

MI355X: 288 GB HBM3E, 8.0 TB/s (AMD spec). B200: 192 GB HBM3E, 8.0 TB/s (NVIDIA spec). Bandwidth is shown for context only — throughput is not modeled.
B200 capacity caveat: the B200 chip datasheet says 192 GB, but DGX B200 materials list 1,440 GB per 8 GPUs (180 GB/GPU) and nvidia-smi on shipping parts reports ~183 GiB. If your SKU exposes 180 GB, B200 results here are optimistic by ~6%.
All capacities use decimal GB (1 GB = 10⁹ bytes), consistent with both vendors' specs and with parameter counts in decimal billions.
Where specs are ambiguous, this tool resolves in the B200's favor: it is credited with the full nominal 192 GB rather than the 180 GB that DGX/HGX system materials list as usable, and both GPUs are charged identical utilization factors, overheads and formulas.

Not modeled

Throughput, latency, interconnect, or compute differences (FLOPS).
Paged-KV block fragmentation (typically <2% waste with 16-token blocks).
Prefix caching / shared-prompt deduplication.
Speculative-decoding draft models or LoRA adapters resident in HBM.
CPU/NVMe KV offload — this is HBM-only headroom.

Model configs

Every architecture value comes from the model's official HuggingFace config.json, linked under the model selector. Gated repos (Meta, Mistral, Google) were cross-checked against verbatim public mirrors; anything not directly confirmed is marked TODO_VERIFY in lib/models.ts and carries an unverified badge above.

The Fit — ROCm vs CUDA workload qualifier

Answer seven questions; a hand-authored rules engine (lib/rules.ts, no model calls) scores where AMD/ROCm is a real fit, where it depends, and where Nvidia is simply the right call today. The verdicts are meant to be honest, not flattering.

1. Workload · select all that apply

2. Serving stack · select all that apply

3. Custom CUDA kernels or kernel-level dependencies?

4. Model mix

5. Scale

6. Team depth

7. Will you need new GPU capacity on short notice (within a quarter) at any point in the next 1–2 years?

8 AMD / parity1 depends0 Nvidia

Custom CUDA kernel portability*

AMD / parity

No custom CUDA kernels, so there's nothing Nvidia-specific to port. ROCm runs the standard framework ops, and this isn't a blocker for you.