10  GPU Inference

Note

Self-hosted vLLM on consumer GPUs. CDI for rootless GPU access. Envoy for L7 load balancing.

10.1 Why self-host inference

  • API costs scale with usage; a local GPU is a fixed cost after purchase
  • Privacy: prompts and completions stay on-premises
  • Useful for the whole family: summarization, writing assistance, homework help, code generation
  • The OpenAI-compatible API is the interface; clients don’t know or care where inference runs
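Because the interface is the OpenAI-compatible API, any stock client or plain `curl` works against the local endpoint. A usage sketch, assuming the loopback publish described below; the model name and token variable are illustrative, not taken from the actual deployment:

```sh
# Point any OpenAI-compatible client at the local endpoint.
# Model name and token variable are placeholders.
curl -sk https://127.0.0.1:8450/v1/chat/completions \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "local-model",
        "messages": [{"role": "user", "content": "Summarize this paragraph: ..."}]
      }'
```

Clients configured with a custom base URL (e.g. the official OpenAI SDKs) need no other changes.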

10.2 The GPU fleet

  • Consumer GPUs: mixed NVIDIA cards with varying VRAM
  • NVIDIA open kernel modules compiled at image build time (tier 2 base image)
  • NVIDIA Container Toolkit provides CDI specs generated on first boot via a custom systemd service
  • CDI lets rootless Podman access GPUs without --privileged or device bind mounts
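The first-boot CDI generation can be a small oneshot unit wrapping `nvidia-ctk cdi generate`. A minimal sketch; the unit name, ordering, and output path are assumptions, not the exact service used here:

```ini
# /etc/systemd/system/nvidia-cdi-generate.service (illustrative name/paths)
[Unit]
Description=Generate CDI spec for NVIDIA GPUs
# Run after kernel modules are loaded; skip if a spec already exists
After=systemd-modules-load.service
ConditionPathExists=!/etc/cdi/nvidia.yaml

[Service]
Type=oneshot
# nvidia-ctk ships with the NVIDIA Container Toolkit
ExecStart=/usr/bin/nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```

Once `/etc/cdi/nvidia.yaml` exists, rootless Podman resolves device references like `nvidia.com/gpu=all` without any privilege escalation.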

10.3 vLLM deployment

  • vLLM runs as a rootless pod: vllm-server + Caddy sidecar
  • Quadlet uses --device=nvidia.com/gpu=all to pass GPUs via CDI
  • Model weights stored on ZFS dataset /zfs/safe/generative
  • Pod publishes 127.0.0.1:8450→443
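The layout above can be sketched as a pair of Quadlet units. Image tag, model path, container port, and flags are assumptions for illustration; only the CDI device reference, the ZFS dataset, and the loopback publish come from the setup described here:

```ini
# ~/.config/containers/systemd/vllm.pod (illustrative)
[Pod]
# Loopback-only publish: host 8450 -> Caddy sidecar's 443
PublishPort=127.0.0.1:8450:443

# ~/.config/containers/systemd/vllm-server.container (illustrative)
[Container]
Image=docker.io/vllm/vllm-openai:latest
Pod=vllm.pod
# CDI device reference; requires the generated /etc/cdi/nvidia.yaml
AddDevice=nvidia.com/gpu=all
# Model weights live on the ZFS dataset
Volume=/zfs/safe/generative:/models:Z
Exec=--model /models/local-model --port 8000

[Install]
WantedBy=default.target
```

The Caddy sidecar would be a second `.container` unit in the same pod, terminating TLS on 443 and proxying to vLLM on localhost:8000.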

10.4 Envoy L7 load balancing

  • For inference, Envoy proxies at L7, rather than the L4 SNI passthrough used for other services
  • STRICT_DNS cluster with health checks against vLLM’s /health endpoint
  • Circuit breakers prevent overloading a single instance
  • This is the path to multi-GPU or multi-host inference: add more vLLM instances, Envoy distributes requests
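The cluster described above looks roughly like the following Envoy v3 fragment. Names, timeouts, and circuit-breaker thresholds are illustrative assumptions; the STRICT_DNS type, `/health` health check, and circuit breakers are from the design:

```yaml
# Envoy cluster sketch (names and numeric values are illustrative)
clusters:
- name: vllm
  type: STRICT_DNS
  lb_policy: LEAST_REQUEST
  # Active health checking against vLLM's /health endpoint
  health_checks:
  - timeout: 2s
    interval: 5s
    unhealthy_threshold: 2
    healthy_threshold: 1
    http_health_check:
      path: /health
  # Cap in-flight requests so a single instance is never overloaded
  circuit_breakers:
    thresholds:
    - max_requests: 4
      max_pending_requests: 16
  load_assignment:
    cluster_name: vllm
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address: { address: 127.0.0.1, port_value: 8450 }
```

Scaling out is then a matter of appending more `lb_endpoints` (or pointing STRICT_DNS at a name that resolves to multiple hosts); Envoy spreads requests and routes around unhealthy instances automatically.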

10.5 Authentication for the OpenAI API

  • Caddy sidecar handles TLS; forward auth via Authentik gates access
  • API clients authenticate with an Authentik API token or browser session
  • No vLLM-level authentication; the proxy layer handles it entirely
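The proxy-layer gate can be expressed as a Caddyfile using Caddy's `forward_auth` directive against an Authentik outpost. A minimal sketch; the site hostname and outpost address are assumptions, while the `/outpost.goauthentik.io/auth/caddy` URI is Authentik's documented Caddy integration endpoint:

```caddyfile
# Caddy sidecar sketch (hostnames are illustrative)
llm.example.internal {
    # Delegate authentication to the Authentik outpost;
    # unauthenticated requests are redirected or rejected here
    forward_auth authentik:9000 {
        uri /outpost.goauthentik.io/auth/caddy
        copy_headers X-Authentik-Username
    }
    # Only authenticated traffic reaches vLLM inside the pod
    reverse_proxy localhost:8000
}
```

Browser sessions flow through Authentik's login; API clients present a token, and vLLM itself never sees an unauthenticated request.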

10.6 ROCm and CUDA coexistence

  • AMD GPUs via ROCm are possible but currently blocked: amdgpu-dkms fails to compile against kernel 6.12+ (ROCm#5111)
  • The image build system supports both; base/centos/amd/ exists but is disabled pending the upstream fix
  • When ROCm works again, mixed AMD+NVIDIA inference is architecturally straightforward: separate vLLM instances per GPU vendor, Envoy load-balances across both