10 GPU Inference
Note: Self-hosted vLLM on consumer GPUs. CDI for rootless GPU access. Envoy for L7 load balancing.
10.1 Why self-host inference
- API costs scale with usage; a local GPU is a fixed cost after purchase
- Privacy: prompts and completions stay on-premises
- Useful for the whole family: summarization, writing assistance, homework help, code generation
- The OpenAI-compatible API is the interface; clients don’t know or care where inference runs
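In practice that means any OpenAI-style client just gets a different base URL. A minimal sketch, assuming the loopback endpoint published in 10.3 and a hypothetical model name and token variable:

```sh
# Same request shape as api.openai.com — only the base URL differs.
# Port, token variable, and model name are assumptions for illustration.
# (Add --cacert for the internal CA if Caddy serves a private cert.)
curl -s https://127.0.0.1:8450/v1/chat/completions \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "Summarize: ..."}]
      }'
```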
10.2 The GPU fleet
- Consumer GPUs: mixed NVIDIA cards with varying VRAM
- NVIDIA open kernel modules compiled at image build time (tier 2 base image)
- NVIDIA Container Toolkit provides CDI specs generated on first boot via a custom systemd service
- CDI lets rootless Podman access GPUs without `--privileged` or device bind mounts (see the sketch below)
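A minimal sketch of that first-boot flow; the `nvidia-ctk` subcommands are the standard Container Toolkit ones, while the CUDA image tag is an illustrative choice:

```sh
# Generate a CDI spec describing every NVIDIA device on the host
# (this is what the first-boot systemd service runs)
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names CDI now exposes, e.g. nvidia.com/gpu=0 ... =all
nvidia-ctk cdi list

# Verify rootless access: no --privileged, no manual /dev bind mounts
podman run --rm --device nvidia.com/gpu=all \
  docker.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi
```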
10.3 vLLM deployment
- vLLM runs as a rootless pod: vllm-server + Caddy sidecar
- Quadlet uses `--device=nvidia.com/gpu=all` to pass GPUs via CDI (see the sketch after this list)
- Model weights stored on the ZFS dataset `/zfs/safe/generative`
- Pod publishes `127.0.0.1:8450` → `443`
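A sketch of what the Quadlet units might look like (two files shown in one block; requires Podman 5.0+ for the `.pod` unit). Unit names, the image tag, and the model path are assumptions; the Caddy sidecar unit is omitted:

```ini
# vllm.pod — publish the API only on loopback (hypothetical unit)
[Pod]
PodName=vllm
PublishPort=127.0.0.1:8450:443

# vllm-server.container — joins the pod, gets GPUs via CDI
[Container]
Image=docker.io/vllm/vllm-openai:latest
Pod=vllm.pod
# Quadlet's AddDevice= key translates to podman's --device flag
AddDevice=nvidia.com/gpu=all
Volume=/zfs/safe/generative:/models:z
Exec=--model /models/qwen2.5-7b-instruct

[Service]
Restart=always
```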
10.4 Envoy L7 load balancing
- For inference, Envoy proxies at L7 (unlike other services, which get L4 SNI passthrough)
- STRICT_DNS cluster with health checks against vLLM’s `/health` endpoint (see the sketch after this list)
- Circuit breakers prevent overloading a single instance
- This is the path to multi-GPU or multi-host inference: add more vLLM instances, Envoy distributes requests
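A sketch of the relevant Envoy (v3 API) cluster stanza, assuming vLLM is reachable on the published port above; the thresholds are illustrative, and TLS toward the Caddy sidecar is omitted for brevity:

```yaml
# Envoy cluster fragment; names and limits are assumptions
clusters:
  - name: vllm_pool
    type: STRICT_DNS               # re-resolves names, so new instances appear
    lb_policy: LEAST_REQUEST
    load_assignment:
      cluster_name: vllm_pool
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: 127.0.0.1, port_value: 8450 }
            # multi-GPU / multi-host: add more endpoints here
    health_checks:
      - timeout: 2s
        interval: 5s
        unhealthy_threshold: 2
        healthy_threshold: 1
        http_health_check:
          path: /health            # vLLM's health endpoint
    circuit_breakers:
      thresholds:
        - max_requests: 64         # cap in-flight requests to the cluster
          max_pending_requests: 16 # shed load instead of queueing forever
```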
10.5 Authentication for the OpenAI API
- Caddy sidecar handles TLS; forward auth via Authentik gates access
- API clients authenticate with an Authentik API token or browser session
- No vLLM-level authentication; the proxy layer handles it entirely
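A sketch of the Caddy side, following Authentik’s documented forward-auth pattern for Caddy; the hostnames and upstream ports are assumptions:

```caddyfile
# Caddyfile fragment — hostnames and ports are illustrative
llm.home.arpa {
	# Every request must pass Authentik before reaching vLLM
	forward_auth authentik-server:9000 {
		uri /outpost.goauthentik.io/auth/caddy
		copy_headers X-Authentik-Username X-Authentik-Email
	}
	# vLLM itself runs with no auth; only proxied traffic reaches it
	reverse_proxy 127.0.0.1:8000
}
```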
10.6 ROCm and CUDA coexistence
- AMD GPUs via ROCm are possible but currently blocked: `amdgpu-dkms` fails to compile against kernel 6.12+ (ROCm#5111)
- The image build system supports both vendors; `base/centos/amd/` exists but is disabled pending the upstream fix
- When ROCm works again, mixed AMD+NVIDIA inference is architecturally straightforward: separate vLLM instances per GPU vendor, with Envoy load-balancing across both