AI Inference Saturation Analyzer
rack2cloud — AI Infrastructure

Inference Saturation Analyzer

GPU utilization is not interactive capacity. This tool surfaces the concurrency knee (the Interaction Collapse Point) where latency stops growing in proportion to load and starts amplifying, under your explicit serving assumptions.

Transparent parametric model. Assumptions exposed.
Not a benchmark. Not a forecaster.
Runs entirely in your browser.
Layer 01: Prefill (TTFT)
Layer 02: Decode (TPOT)
Layer 03: Queue Amplification
Layer 04: KV-Cache Ceiling
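
To make the queue-amplification layer concrete, here is a minimal sketch of how a concurrency knee can appear. It is not the analyzer's own model: it assumes a simple single-queue approximation in which latency is amplified by 1 / (1 - utilization), and the names `amplified_latency`, `collapse_point`, `base_s`, `capacity`, and `slo_s` are hypothetical.

```python
def amplified_latency(base_s, concurrency, capacity):
    """Latency under load, assuming an M/M/1-style 1 / (1 - rho) amplification.

    base_s      -- unloaded per-token latency in seconds (assumed input)
    concurrency -- number of concurrent sessions
    capacity    -- concurrency at which the server is fully utilized (assumed input)
    """
    rho = concurrency / capacity
    if rho >= 1.0:
        # At or beyond full utilization the queue grows without bound.
        return float("inf")
    return base_s / (1.0 - rho)


def collapse_point(base_s, capacity, slo_s, max_concurrency=10_000):
    """Return the smallest concurrency whose amplified latency exceeds the SLO.

    Returns None if the SLO holds for every concurrency up to max_concurrency.
    """
    for n in range(1, max_concurrency + 1):
        if amplified_latency(base_s, n, capacity) > slo_s:
            return n
    return None


if __name__ == "__main__":
    # Illustrative numbers only: 20 ms unloaded latency, 50 ms SLO.
    for n in (64, 128, 192, 240):
        print(n, round(amplified_latency(0.02, n, 256), 4))
    print("first concurrency over SLO:", collapse_point(0.02, 256, 0.05))
```

Under this assumption latency stays close to its unloaded value at low utilization and rises sharply as utilization approaches 1, which is the shape the knee describes. Real serving stacks with continuous batching will follow a different curve.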
01 Model & Hardware Profile
Drives parameter footprint and KV-cache bytes-per-token. MoE uses active-param decode cost; MLA models carry a smaller KV footprint.
Weight quantization frees VRAM for KV cache — raising the concurrency ceiling.
Maximum addressable context — caps per-sequence KV residency.
Sets VRAM budget and memory bandwidth — the decode roofline.
GPUs dedicated to this model's serving deployment.
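
The profile inputs above can be sketched as back-of-the-envelope arithmetic. This is a rough illustration under stated assumptions, not the analyzer's implementation: it assumes a standard multi-head or grouped-query KV layout (one K and one V vector per layer), a fixed fraction of VRAM reserved for runtime overhead, and a decode step that streams the active weights from memory once. All function and parameter names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, kv_elem_bytes=2.0):
    """KV-cache bytes stored per token: K and V, for every layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * kv_elem_bytes


def kv_token_budget(total_params, weight_elem_bytes, vram_bytes,
                    per_token_kv_bytes, overhead_fraction=0.10):
    """Tokens of KV cache that fit after weights and an assumed overhead reserve."""
    weights = total_params * weight_elem_bytes
    free = vram_bytes * (1.0 - overhead_fraction) - weights
    return max(0, int(free // per_token_kv_bytes))


def decode_floor_seconds(active_params, weight_elem_bytes, bandwidth_bytes_per_s):
    """Bandwidth-bound lower limit on one decode step.

    For an MoE model, pass the active parameter count rather than the total.
    """
    return active_params * weight_elem_bytes / bandwidth_bytes_per_s


if __name__ == "__main__":
    # Illustrative model: 8e9 params, 32 layers, 8 KV heads, head_dim 128.
    per_token = kv_bytes_per_token(32, 8, 128)
    fp16 = kv_token_budget(8e9, 2.0, 80e9, per_token)
    int4 = kv_token_budget(8e9, 0.5, 80e9, per_token)
    print("KV bytes per token:", per_token)
    print("KV token budget, FP16 weights:", fp16)
    print("KV token budget, 4-bit weights:", int4)
    print("decode floor (s) at 2e12 B/s:", decode_floor_seconds(8e9, 2.0, 2e12))
```

Dividing the token budget by the expected tokens per sequence gives a rough sequence ceiling, and comparing the FP16 and 4-bit budgets shows how weight quantization leaves more room for KV cache.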
02 Serving Architecture
The single largest determinant of the decode saturation curve. Static batching saturates far earlier.
GPUs per replica. Replica count = GPU count ÷ TP. Higher TP adds interconnect overhead with diminishing returns.
Paged attention reclaims VRAM otherwise lost to fragmentation, leaving more usable VRAM for concurrent sequences.
Lifts decode throughput at low-to-moderate concurrency; benefit tapers as the batch fills.
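
The replica arithmetic above (replica count = GPU count ÷ TP) can be sketched as follows. This is a minimal sketch that assumes sessions are spread evenly across replicas; the function name `replica_plan` and its return shape are hypothetical, and the figure of 180 sessions is only an example value.

```python
import math


def replica_plan(gpu_count, tensor_parallel, target_sessions):
    """Split a GPU pool into tensor-parallel replicas.

    Returns (replicas, leftover_gpus, sessions_per_replica), assuming
    sessions are routed evenly across replicas.
    """
    if tensor_parallel < 1:
        raise ValueError("tensor_parallel must be at least 1")
    if gpu_count < tensor_parallel:
        raise ValueError("need at least one full replica")
    replicas = gpu_count // tensor_parallel
    leftover_gpus = gpu_count % tensor_parallel
    sessions_per_replica = math.ceil(target_sessions / replicas)
    return replicas, leftover_gpus, sessions_per_replica


if __name__ == "__main__":
    for tp in (1, 2, 4, 8):
        print("TP", tp, "->", replica_plan(8, tp, 180))
```

Raising TP gives each replica more VRAM and bandwidth but leaves fewer replicas, so each one must carry a larger share of the target concurrency.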
03 Traffic Shape
Mean prompt length — drives prefill compute and TTFT floor.
Mean generation length — drives KV growth and total decode time.
The concurrent interactive sessions you intend this tier to support. This is the number the analyzer tests against reality.
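
As a rough illustration of how traffic shape feeds the model, the sketch below estimates a compute-bound TTFT floor and the peak KV residency for a given concurrency. It assumes the common approximation of about 2 FLOPs per active parameter per prompt token (ignoring the attention term that grows with context length) and an assumed hardware efficiency factor. The names `prefill_ttft_floor`, `peak_kv_tokens`, and `efficiency` are hypothetical.

```python
def prefill_ttft_floor(active_params, prompt_tokens, peak_flops, efficiency=0.5):
    """Compute-bound lower limit on time-to-first-token, in seconds.

    Uses ~2 FLOPs per active parameter per prompt token; `efficiency` is an
    assumed fraction of peak FLOPs actually achieved during prefill.
    """
    flops = 2.0 * active_params * prompt_tokens
    return flops / (peak_flops * efficiency)


def peak_kv_tokens(prompt_tokens, output_tokens, sessions):
    """Worst-case resident KV tokens if every session reaches full length at once."""
    return (prompt_tokens + output_tokens) * sessions


if __name__ == "__main__":
    # Illustrative inputs: 8e9 active params, 1000-token prompts, 1e15 peak FLOPs.
    print("TTFT floor (s):", prefill_ttft_floor(8e9, 1000, 1e15))
    print("peak KV tokens:", peak_kv_tokens(1000, 500, 180))
```

Queueing and batching push real TTFT above this floor, so the value is best read as a lower bound rather than a prediction.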
04 Operational Objectives (SLO)
Acceptable time-to-first-token for interactive use.
Acceptable inter-token latency. ~50 ms per token (about 20 tokens/s) is a comfortable reading pace.
End-to-end p95 ceiling: TTFT plus full generation.
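
The three objectives relate through simple arithmetic, sketched below. This assumes end-to-end latency is approximated as TTFT plus one inter-token interval per generated token, following the description above; the function names and the example thresholds are hypothetical.

```python
def tokens_per_second(tpot_s):
    """Per-session streaming rate implied by an inter-token latency."""
    return 1.0 / tpot_s


def end_to_end_seconds(ttft_s, tpot_s, output_tokens):
    """Approximate end-to-end latency: TTFT plus full generation."""
    return ttft_s + tpot_s * output_tokens


def slo_report(ttft_s, tpot_s, output_tokens, ttft_slo_s, tpot_slo_s, e2e_slo_s):
    """Return a dict of pass/fail flags, one per objective."""
    return {
        "ttft": ttft_s <= ttft_slo_s,
        "tpot": tpot_s <= tpot_slo_s,
        "e2e": end_to_end_seconds(ttft_s, tpot_s, output_tokens) <= e2e_slo_s,
    }


if __name__ == "__main__":
    # Illustrative values: 0.5 s TTFT, 50 ms per token, 400 output tokens.
    print("tokens/s:", tokens_per_second(0.05))
    print("end-to-end (s):", end_to_end_seconds(0.5, 0.05, 400))
    print(slo_report(0.5, 0.05, 400, ttft_slo_s=1.0, tpot_slo_s=0.05, e2e_slo_s=15.0))
```

A deployment can meet both the TTFT and inter-token targets and still miss the end-to-end ceiling when generations are long, which is why the three objectives are checked separately.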

Block A — Recognition
Block B — Degradation Curves
Block C — Saturation Explanation
Block D — Economics & Routing

Architecture Review

The analyzer surfaces the collapse point under stated assumptions. A structured review maps it to your real traffic distribution, replica topology, and SLO commitments — and identifies the serving-architecture changes that move the knee without adding GPUs.
