Self-hosted
Real hardware
Work in progress
Notes from building and operating an inference platform.
What I’m learning running GPU inference on my own hardware: what breaks, why, and what I do about it.
What I operate
Self-funded inference platform in a commercial office space. k3s cluster with GPU workers, Arista/Nexus switching, Prometheus + Grafana, and a model router at a Virginia edge node. I built it to understand the failure modes firsthand.
How I debug
Decompose the problem: queue vs compute, then prefill vs decode.
Validate the story against metrics and logs, isolate the failing layer, then ship a guardrail so the issue doesn’t recur.
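The decomposition step can be sketched as a small script. This is a minimal illustration, not my production tooling: the trace field names (enqueued_at, scheduled_at, first_token_at, finished_at) are hypothetical stand-ins for whatever timestamps a serving stack actually records.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    # Hypothetical per-request timestamps (seconds); real field names
    # depend on the serving stack's instrumentation.
    enqueued_at: float     # request accepted into the queue
    scheduled_at: float    # picked up by a GPU worker
    first_token_at: float  # first output token emitted
    finished_at: float     # last output token emitted
    output_tokens: int

def decompose(t: RequestTrace) -> dict:
    """Split end-to-end latency: queue vs compute, then prefill vs decode."""
    queue_wait = t.scheduled_at - t.enqueued_at
    prefill = t.first_token_at - t.scheduled_at  # compute-side share of TTFT
    decode = t.finished_at - t.first_token_at
    tok_per_sec = t.output_tokens / decode if decode > 0 else float("inf")
    return {
        "queue_wait_s": round(queue_wait, 3),
        "prefill_s": round(prefill, 3),
        "decode_s": round(decode, 3),
        "decode_tok_per_s": round(tok_per_sec, 1),
    }

trace = RequestTrace(enqueued_at=0.0, scheduled_at=0.8,
                     first_token_at=1.3, finished_at=5.3,
                     output_tokens=200)
print(decompose(trace))
```

If queue_wait dominates TTFT, the fix lives in admission or placement; if prefill dominates, it lives in the model or batch configuration — two different on-call paths from one split.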
queue_wait
TTFT
tokens/sec
KV cache
placement
tail latency