AI Inference Saturation Analyzer

01 Model & Hardware Profile


Model Family

Drives parameter footprint and KV-cache bytes-per-token. MoE uses active-param decode cost; MLA models carry a smaller KV footprint.
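As a rough illustration of the KV-cache bytes-per-token figure, the sketch below assumes a standard multi-head or grouped-query attention layout, where each layer stores one key and one value vector per KV head. The function name and the example model shape are hypothetical, not taken from the analyzer. MLA models store a compressed latent instead, so this formula would overstate their footprint.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """KV-cache bytes one token occupies across all layers (MHA/GQA layout)."""
    # Factor of 2: one key vector and one value vector per KV head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


# Hypothetical shape: 32 layers, 8 KV heads of dimension 128, fp16 cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token
```

Multiplying this figure by the context window gives the worst-case KV residency of a single sequence.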

Quantization

Weight quantization frees VRAM for KV cache — raising the concurrency ceiling.
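A minimal sketch of why smaller weights raise the concurrency ceiling, under simplifying assumptions: a single GPU, a fixed fraction of VRAM reserved for activations and runtime overhead, and every sequence holding a full context window of KV cache. All names and numbers here are illustrative, including the 128 KiB-per-token KV figure.

```python
def max_concurrent_sequences(vram_gb: float, params_billion: float,
                             bytes_per_weight: float, kv_bytes_per_token: int,
                             context_tokens: int,
                             reserved_fraction: float = 0.10) -> int:
    """Full-context sequences that fit in the VRAM left over after weights."""
    usable = vram_gb * 1e9 * (1.0 - reserved_fraction)
    weight_bytes = params_billion * 1e9 * bytes_per_weight
    kv_budget = usable - weight_bytes
    if kv_budget <= 0:
        return 0  # weights alone do not fit
    return int(kv_budget // (kv_bytes_per_token * context_tokens))


# Hypothetical 8B-parameter model on an 80 GB GPU, 8192-token context.
fp16 = max_concurrent_sequences(80, 8, 2.0, 131072, 8192)
int4 = max_concurrent_sequences(80, 8, 0.5, 131072, 8192)
print(fp16, int4)  # 52 63
```

In this toy setup, moving from 16-bit to 4-bit weights frees 12 GB, which translates into eleven more full-context sequences. Real ceilings are usually higher because few sequences fill the whole window.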

Context Window (tokens)

Maximum addressable context — caps per-sequence KV residency.

GPU Type

Sets VRAM budget and memory bandwidth — the decode roofline.
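The decode roofline can be approximated with a bandwidth-bound model: each decode step has to stream the weights once plus the KV cache of every sequence in the batch. The sketch below assumes decode is purely memory-bound and ignores compute, kernel overhead, and interconnect; the function names and the 2 TB/s bandwidth figure are hypothetical.

```python
def tpot_floor_ms(weight_bytes: float, kv_bytes_resident: float,
                  bandwidth_bytes_per_s: float) -> float:
    """Lower bound on time per decode step when memory bandwidth is the limit."""
    return 1000.0 * (weight_bytes + kv_bytes_resident) / bandwidth_bytes_per_s


def decode_tokens_per_s(batch_size: int, tpot_ms: float) -> float:
    """Aggregate tokens per second: one token per sequence per step."""
    return batch_size * 1000.0 / tpot_ms


WEIGHTS = 16e9     # hypothetical 8B-parameter model in fp16
BANDWIDTH = 2e12   # hypothetical 2 TB/s of memory bandwidth

single = tpot_floor_ms(WEIGHTS, 0.0, BANDWIDTH)
# 32 sequences, 2048 cached tokens each, 128 KiB of KV per token (assumed).
batched = tpot_floor_ms(WEIGHTS, 32 * 2048 * 131072, BANDWIDTH)

print(f"batch 1:  {single:.2f} ms/token, {decode_tokens_per_s(1, single):.0f} tok/s")
print(f"batch 32: {batched:.2f} ms/token, {decode_tokens_per_s(32, batched):.0f} tok/s")
```

Batching amortizes the weight read across sequences, so aggregate throughput rises while per-token latency creeps up as resident KV grows.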

Total GPU Count (this serving tier)

GPUs dedicated to this model's serving deployment.

02 Serving Architecture


Serving Stack

Batching Policy

The single largest determinant of the decode saturation curve. Static batching saturates far earlier than continuous batching.

Tensor Parallelism Degree

GPUs per replica. Replica count = GPU count ÷ TP. Higher TP adds interconnect overhead with diminishing returns.
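The replica arithmetic is simple enough to sketch directly. The function name and the 16-GPU tier are hypothetical; the 180-session target is the advertised peak used elsewhere on this page. Integer division is assumed, so GPUs that do not fill a whole replica are reported as stranded.

```python
def replica_count(total_gpus: int, tp_degree: int) -> tuple[int, int]:
    """Return (replicas, stranded_gpus) for a given tensor-parallel degree."""
    if tp_degree <= 0:
        raise ValueError("tp_degree must be a positive integer")
    return divmod(total_gpus, tp_degree)


replicas, stranded = replica_count(total_gpus=16, tp_degree=4)
per_replica_sessions = 180 / replicas  # advertised peak spread evenly

print(replicas, stranded, per_replica_sessions)  # 4 0 45.0
```

Raising TP from 4 to 8 on the same tier would halve the replica count, so each replica would have to carry twice the concurrent sessions.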

KV Cache Policy

Paged attention reclaims VRAM otherwise lost to KV-cache fragmentation, leaving more usable memory for concurrent sequences.

Speculative Decoding

Lifts decode throughput at low-to-moderate concurrency; benefit tapers as the batch fills.
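One common way to estimate the low-concurrency lift is the expected number of tokens produced per verification step. The sketch assumes each drafted token is accepted independently with the same probability, which is an idealization, and it ignores the draft model's own cost. It does not model the taper at high concurrency, where the target model no longer has spare compute for verification. Names and numbers are illustrative.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model step with i.i.d. acceptance."""
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)  # every draft token plus the bonus token
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


print(round(expected_tokens_per_step(0.8, 4), 4))  # 3.3616
```

With no speculation (draft_len of 0) the function returns 1.0, the ordinary one-token-per-step baseline.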

03 Traffic Shape


Avg Input Tokens (prefill load)

Mean prompt length — drives prefill compute and TTFT floor.
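A back-of-the-envelope TTFT floor can be sketched from prefill compute alone, using the common approximation of about 2 FLOPs per parameter per token for a forward pass. Queueing, attention's quadratic term, and scheduling overhead are ignored, so real TTFT will be higher. The function name, the peak-FLOPs figure, and the utilization factor are all assumptions.

```python
def prefill_ttft_floor_s(params: float, input_tokens: int,
                         peak_flops: float, utilization: float = 0.5) -> float:
    """Compute-bound lower bound on time-to-first-token for one prompt."""
    # ~2 FLOPs per parameter per token; for MoE pass the active parameter count.
    prefill_flops = 2.0 * params * input_tokens
    return prefill_flops / (peak_flops * utilization)


# Hypothetical: 8B parameters, 1000-token prompt, 1 PFLOP/s peak at 50% utilization.
floor = prefill_ttft_floor_s(params=8e9, input_tokens=1000, peak_flops=1e15)
print(f"{floor * 1000:.0f} ms")  # 32 ms
```

Because the floor scales linearly with prompt length, doubling the mean input tokens doubles it.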

Avg Output Tokens (decode load)

Mean generation length — drives KV growth and total decode time.

Peak Concurrency Target (advertised)

180

The concurrent interactive sessions you intend this tier to support. This is the number the analyzer tests against reality.

Traffic Variance

Burst Behavior

04 Operational Objectives (SLO)


TTFT Target (seconds)

Acceptable time-to-first-token for interactive use.

TPOT Target (ms / token)

Acceptable inter-token latency. About 50 ms per token (20 tokens per second) is a comfortable reading pace.

Max Acceptable p95 Latency (seconds)

End-to-end p95 ceiling: TTFT plus full generation.
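The relationship between the three latency fields can be sketched as follows, assuming the first token arrives at TTFT and each later token takes one TPOT. This is a point estimate for a single request, not a percentile; checking a real p95 needs a measured latency distribution. The function names and example values are hypothetical.

```python
def end_to_end_latency_s(ttft_s: float, output_tokens: int, tpot_ms: float) -> float:
    """TTFT plus generation time for the remaining output tokens."""
    remaining = max(output_tokens - 1, 0)  # first token is covered by TTFT
    return ttft_s + remaining * tpot_ms / 1000.0


def within_ceiling(latency_s: float, ceiling_s: float) -> bool:
    """True when the estimated latency fits under the stated ceiling."""
    return latency_s <= ceiling_s


# Hypothetical request: 1.0 s TTFT, 500 output tokens at 50 ms per token.
latency = end_to_end_latency_s(ttft_s=1.0, output_tokens=500, tpot_ms=50.0)
print(f"{latency:.2f} s", within_ceiling(latency, ceiling_s=30.0))
```

Since generation time dominates for long outputs, a latency ceiling effectively bounds the product of output length and TPOT.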

Priority Mode

Inference Saturation Analysis

Block A — Recognition

Phantom Throughput Detection

Throughput Illusion Index

Block B — Degradation Curves

TTFT & TPOT vs Concurrency

Block C — Saturation Explanation

First Saturation Driver

Queue Amplification Signal

Concurrency Elasticity Signal

Block D — Economics & Routing

Persistent Inference Density Signal

Placement Pressure Signal

GPU Yield Dependency

Architecture Review

The analyzer surfaces the collapse point under stated assumptions. A structured review maps it to your real traffic distribution, replica topology, and SLO commitments — and identifies the serving-architecture changes that move the knee without adding GPUs.

Work With The Architect →
