10 GPU Inference
Note: Self-hosted vLLM on consumer GPUs. CDI for rootless GPU access. Envoy for L7 load balancing.
10.1 Why self-host inference
- API costs scale with usage; a local GPU is a fixed cost after purchase
- Privacy: prompts and completions stay on-premises
- Useful for the whole family: summarization, writing assistance, homework help, code generation
- The OpenAI-compatible API is the interface; clients don’t know or care where inference runs
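In practice that means any OpenAI-style client just gets a different base URL. A minimal sketch, assuming the loopback endpoint published in 10.3 and a hypothetical model name and token variable:

```sh
# Same request shape as api.openai.com — only the base URL differs.
# Port, token variable, and model name are assumptions for illustration.
# (Add --cacert for the internal CA if Caddy serves a private cert.)
curl -s https://127.0.0.1:8450/v1/chat/completions \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b-instruct",
        "messages": [{"role": "user", "content": "Summarize: ..."}]
      }'
```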
10.2 The GPU fleet
- Consumer GPUs: mixed NVIDIA cards with varying VRAM
- NVIDIA open kernel modules compiled at image build time (tier 2 base image)
- NVIDIA Container Toolkit provides CDI specs generated on first boot via a custom systemd service
- CDI lets rootless Podman access GPUs without `--privileged` or device bind mounts (see the sketch below)
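A minimal sketch of that first-boot flow; the `nvidia-ctk` subcommands are the standard Container Toolkit ones, while the CUDA image tag is an illustrative choice:

```sh
# Generate a CDI spec describing every NVIDIA device on the host
# (this is what the first-boot systemd service runs)
nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

# List the device names CDI now exposes, e.g. nvidia.com/gpu=0 ... =all
nvidia-ctk cdi list

# Verify rootless access: no --privileged, no manual /dev bind mounts
podman run --rm --device nvidia.com/gpu=all \
  docker.io/nvidia/cuda:12.4.1-base-ubi9 nvidia-smi
```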
10.3 vLLM deployment
- vLLM runs as a rootless pod: vllm-server + Caddy sidecar
- Quadlet uses `--device=nvidia.com/gpu=all` to pass GPUs via CDI (see the sketch after this list)
- Model weights stored on the ZFS dataset `/zfs/safe/generative`
- Pod publishes `127.0.0.1:8450` → `443`
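A sketch of what the Quadlet units might look like (two files shown in one block; requires Podman 5.0+ for the `.pod` unit). Unit names, the image tag, and the model path are assumptions; the Caddy sidecar unit is omitted:

```ini
# vllm.pod — publish the API only on loopback (hypothetical unit)
[Pod]
PodName=vllm
PublishPort=127.0.0.1:8450:443

# vllm-server.container — joins the pod, gets GPUs via CDI
[Container]
Image=docker.io/vllm/vllm-openai:latest
Pod=vllm.pod
# Quadlet's AddDevice= key translates to podman's --device flag
AddDevice=nvidia.com/gpu=all
Volume=/zfs/safe/generative:/models:z
Exec=--model /models/qwen2.5-7b-instruct

[Service]
Restart=always
```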
10.4 Envoy L7 load balancing
- For inference, Envoy proxies at L7 (unlike other services, which get L4 SNI passthrough)
- STRICT_DNS cluster with health checks against vLLM’s `/health` endpoint (see the sketch after this list)
- Circuit breakers prevent overloading a single instance
- This is the path to multi-GPU or multi-host inference: add more vLLM instances, Envoy distributes requests
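A sketch of the relevant Envoy (v3 API) cluster stanza, assuming vLLM is reachable on the published port above; the thresholds are illustrative, and TLS toward the Caddy sidecar is omitted for brevity:

```yaml
# Envoy cluster fragment; names and limits are assumptions
clusters:
  - name: vllm_pool
    type: STRICT_DNS               # re-resolves names, so new instances appear
    lb_policy: LEAST_REQUEST
    load_assignment:
      cluster_name: vllm_pool
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address: { address: 127.0.0.1, port_value: 8450 }
            # multi-GPU / multi-host: add more endpoints here
    health_checks:
      - timeout: 2s
        interval: 5s
        unhealthy_threshold: 2
        healthy_threshold: 1
        http_health_check:
          path: /health            # vLLM's health endpoint
    circuit_breakers:
      thresholds:
        - max_requests: 64         # cap in-flight requests to the cluster
          max_pending_requests: 16 # shed load instead of queueing forever
```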
10.5 Authentication for the OpenAI API
- Caddy sidecar handles TLS; forward auth via Authentik gates access
- API clients authenticate with an Authentik API token or browser session
- No vLLM-level authentication; the proxy layer handles it entirely
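A sketch of the Caddy side, following Authentik’s documented forward-auth pattern for Caddy; the hostnames and upstream ports are assumptions:

```caddyfile
# Caddyfile fragment — hostnames and ports are illustrative
llm.home.arpa {
	# Every request must pass Authentik before reaching vLLM
	forward_auth authentik-server:9000 {
		uri /outpost.goauthentik.io/auth/caddy
		copy_headers X-Authentik-Username X-Authentik-Email
	}
	# vLLM itself runs with no auth; only proxied traffic reaches it
	reverse_proxy 127.0.0.1:8000
}
```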
10.6 ROCm and CUDA coexistence
- AMD GPUs via ROCm are possible but currently blocked: `amdgpu-dkms` fails to compile against kernel 6.12+ (ROCm#5111)
- The image build system supports both vendors; `base/centos/amd/` exists but is disabled pending the upstream fix
- When ROCm works again, mixed AMD+NVIDIA inference is architecturally straightforward: separate vLLM instances per GPU vendor, with Envoy load-balancing across both