Runbook-first
Evidence over opinions
No fluff
Operator notes for making inference boring again.
Practical reliability writing: what breaks, how to triage fast, and how to leave behind guardrails. If it can’t survive a 2 AM incident bridge, it doesn’t belong here.
What I operate
A self-hosted inference platform (Kubernetes + GPUs + high-speed network + observability) and production responsibilities. Built to be measurable and debuggable — not “demo-ready.”
How I debug
Decompose the problem: queue vs compute, then prefill vs decode.
Validate the story with metrics/logs, isolate the layer, then ship a guardrail so it doesn’t recur.
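The decomposition above can be sketched as a small triage helper. This is a minimal illustration, not the actual platform's code: the `RequestTrace` fields and the `decompose` function are hypothetical names, standing in for whatever per-request timestamps your serving stack exposes.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    """Hypothetical per-request timestamps (seconds); field names are illustrative."""
    enqueued: float      # request accepted into the queue
    scheduled: float     # picked up by a worker (queue exit)
    first_token: float   # first output token emitted (end of prefill)
    finished: float      # last token emitted
    output_tokens: int

def decompose(t: RequestTrace) -> dict[str, float]:
    """Split end-to-end latency into queue vs compute, then prefill vs decode."""
    queue_wait = t.scheduled - t.enqueued
    prefill = t.first_token - t.scheduled    # compute: prompt processing
    decode = t.finished - t.first_token      # compute: token generation
    return {
        "queue_wait": queue_wait,
        "prefill": prefill,
        "decode": decode,
        "ttft": t.first_token - t.enqueued,  # time to first token, as the caller sees it
        "decode_tok_per_s": t.output_tokens / decode if decode > 0 else 0.0,
    }

# Example: waited 0.8 s in queue, prefilled for 0.3 s, decoded 120 tokens in 4.0 s.
parts = decompose(RequestTrace(0.0, 0.8, 1.1, 5.1, 120))
```

The point of the split is that each bucket has its own suspects: if queue_wait dominates, look at admission and placement; if prefill dominates, look at prompt length and batching; if decode dominates, look at KV cache pressure and per-token throughput.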
Topics: queue_wait · TTFT · tokens/sec · KV cache · placement · tail latency
Notes