Self-hosted
Real hardware
Work in progress
Notes from building and operating an inference platform.
What I’m learning running GPU inference on my own hardware: what breaks, why, and what I do about it.
What I operate
Self-funded inference platform in a commercial office space. k3s cluster with GPU workers, Arista/Nexus switching, Prometheus + Grafana, and a model router at a Virginia edge node. I built it to understand the failure modes firsthand.
How I debug
Decompose the problem: queue vs compute, then prefill vs decode.
Validate the story against metrics and logs, isolate the failing layer, then ship a guardrail so the issue doesn’t recur.
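The decomposition step can be sketched as a small script. This is a minimal illustration, not my production tooling: the trace field names (enqueued_at, scheduled_at, first_token_at, finished_at) are hypothetical stand-ins for whatever timestamps a serving stack actually records.

```python
from dataclasses import dataclass

@dataclass
class RequestTrace:
    # Hypothetical per-request timestamps (seconds); real field names
    # depend on the serving stack's instrumentation.
    enqueued_at: float     # request accepted into the queue
    scheduled_at: float    # picked up by a GPU worker
    first_token_at: float  # first output token emitted
    finished_at: float     # last output token emitted
    output_tokens: int

def decompose(t: RequestTrace) -> dict:
    """Split end-to-end latency: queue vs compute, then prefill vs decode."""
    queue_wait = t.scheduled_at - t.enqueued_at
    prefill = t.first_token_at - t.scheduled_at  # compute-side share of TTFT
    decode = t.finished_at - t.first_token_at
    tok_per_sec = t.output_tokens / decode if decode > 0 else float("inf")
    return {
        "queue_wait_s": round(queue_wait, 3),
        "prefill_s": round(prefill, 3),
        "decode_s": round(decode, 3),
        "decode_tok_per_s": round(tok_per_sec, 1),
    }

trace = RequestTrace(enqueued_at=0.0, scheduled_at=0.8,
                     first_token_at=1.3, finished_at=5.3,
                     output_tokens=200)
print(decompose(trace))
```

If queue_wait dominates TTFT, the fix lives in admission or placement; if prefill dominates, it lives in the model or batch configuration — two different on-call paths from one split.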
queue_wait
TTFT
tokens/sec
KV cache
placement
tail latency