How much HBM is left for KV cache after loading model weights — and what that means for concurrent requests and max context. AMD MI355X (288 GB) vs NVIDIA B200 (192 GB), memory capacity only.
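As a rough illustration of the arithmetic behind this comparison, here is a minimal Python sketch of the capacity formulas described in the methodology. The function names are hypothetical and are not taken from the page's own code. The dense model in the example (70B parameters, 80 layers, 8 KV heads, head_dim 128) is an illustrative assumption, and the DeepSeek V3 layer count of 61 is assumed from its published config.json. GB is treated as 10^9 bytes.

```python
import math

GB = 10**9  # decimal gigabytes (assumption: HBM figures are quoted in GB, not GiB)


def weight_bytes(total_params, bytes_per_param):
    """weights = total_params x bytes_per_param (FP4 = 0.5, FP6 = 0.75, FP8 = 1)."""
    return total_params * bytes_per_param


def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_bytes):
    """Standard attention: the leading 2 covers K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * kv_bytes


def mla_kv_bytes_per_token(num_layers, kv_lora_rank, qk_rope_head_dim, kv_bytes):
    """Latent attention: one shared compressed latent per token per layer."""
    return num_layers * (kv_lora_rank + qk_rope_head_dim) * kv_bytes


def free_kv_bytes(hbm_gb, num_gpus, weights, util=0.9, reserve_gb=2):
    """free_kv = (HBM x 0.9 x num_gpus) - weights - (2 GB x num_gpus)."""
    return hbm_gb * GB * util * num_gpus - weights - reserve_gb * GB * num_gpus


def max_concurrent(free_kv, kv_per_token, context):
    """Dense-occupancy bound: every request sits at the full context."""
    if free_kv <= 0:
        return 0
    return math.floor(free_kv / (kv_per_token * context))


# Illustrative dense model at FP8 weights and FP8 KV, one GPU, 32K context.
weights = weight_bytes(70 * 10**9, 1.0)
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, kv_bytes=1)
for name, hbm_gb in [("MI355X", 288), ("B200", 192)]:
    free = free_kv_bytes(hbm_gb, num_gpus=1, weights=weights)
    print(name, max_concurrent(free, per_token, context=32768))

# DeepSeek V3 latent KV at FP8 (61 layers assumed from its config.json).
print(mla_kv_bytes_per_token(61, 512, 64, 1))
```

Under these assumptions the dense example fits 34 full-length requests on one MI355X and 18 on one B200, and the DeepSeek V3 latent cache comes to 35,136 bytes per token. Sliding-window layers and the max_position_embeddings cap are left out of the sketch for brevity.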
Weights: weights = total_params × bytes_per_param, with FP4 = 0.5 B, FP6 = 0.75 B, FP8 = 1 B per parameter.

KV cache per token, standard attention (the MHA formula): 2 × num_layers × num_kv_heads × head_dim × kv_bytes (the 2 is K and V), with FP8 = 1 B, FP16 = 2 B per element.

KV cache per token, latent attention (MLA, as in DeepSeek V3): num_layers × (kv_lora_rank + qk_rope_head_dim) × kv_bytes. K and V are reconstructed from one shared compressed latent per token per layer (512 + 64 = 576 elements), so there is no ×2 and no per-head multiplier. For DeepSeek V3 at FP8 that is ~35 KB/token vs ~2 MB/token if you (incorrectly) applied the MHA formula to its 128 heads (taking head_dim = 128).

Sliding window: windowed layers are counted at min(context, window) tokens instead of full context. This is the theoretical minimum and requires a hybrid KV allocator (vLLM v1 and recent TRT-LLM/SGLang support this); engines without one allocate full-length KV on every layer, so turn the toggle off for that case.

Free KV: free_kv = (HBM × 0.9 × num_gpus) − weights − (2 GB × num_gpus). The 0.9 factor corresponds to gpu_memory_utilization=0.9.

Max concurrent: max_concurrent = floor(free_kv ÷ kv_per_request(context)), assuming every request sits at the full chosen context. Real serving with mixed-length requests, prefix caching, and paged-KV block reuse does better; this is the conservative dense-occupancy bound.

Max context: capped at max_position_embeddings (annotated "model limit" when capped).

Comparison: max_concurrent between the two GPUs at identical model, precision and context settings.

B200 capacity: this page uses 192 GB; nvidia-smi on shipping parts reports ~183 GiB. If your SKU exposes 180 GB, B200 results here are optimistic by ~6%.

Model parameters: taken from each model's config.json, linked under the model selector. Gated repos (Meta, Mistral, Google) were cross-checked against verbatim public mirrors; anything not directly confirmed is marked TODO_VERIFY in lib/models.ts and carries an unverified badge above.

Answer seven questions; a hand-authored rules engine (lib/rules.ts, no model calls) scores where AMD/ROCm is a real fit, where it depends, and where NVIDIA is simply the right call today. The verdicts are meant to be honest, not flattering.
Rows marked * rest on general knowledge of the ROCm/CUDA ecosystem as of 2026-06-12, not a live source. The ecosystem moves monthly — re-check these before they drive a real decision: