10 GPU Inference

Note

This chapter covers self-hosting vLLM across a heterogeneous GPU fleet: mixed VRAM sizes, mixed vendors (NVIDIA and AMD), and Envoy for load balancing.

10.1 The GPU fleet
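Inventorying a mixed-vendor fleet means querying each vendor's own tooling. A minimal sketch, assuming `nvidia-smi` is available on the NVIDIA hosts and `rocm-smi` on the AMD hosts:

```shell
# NVIDIA hosts: list each GPU's model name and total VRAM as CSV
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader

# AMD hosts: report VRAM per GPU via ROCm's system management interface
rocm-smi --showmeminfo vram
```

The CSV output from the NVIDIA side is easy to aggregate across hosts when deciding which models fit on which cards.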

10.2 vLLM deployment
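vLLM ships an OpenAI-compatible HTTP server. A minimal launch sketch, where the model id, port, and tuning values are placeholders to adapt per host:

```shell
# Serve a model behind vLLM's OpenAI-compatible API (model id is a placeholder).
# --gpu-memory-utilization caps the fraction of VRAM vLLM will claim,
# which matters on a fleet with mixed VRAM sizes;
# --max-model-len bounds context length (and thus KV-cache size) on smaller cards.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```

Smaller-VRAM hosts can keep the same command and lower `--gpu-memory-utilization` or `--max-model-len` rather than running a different model.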

10.3 Envoy L7 load balancing
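At L7, Envoy can spread OpenAI-API traffic over the vLLM replicas. A sketch of the cluster half of a static Envoy v3 config, assuming two hypothetical backend hostnames; `LEAST_REQUEST` sends each request to the replica with the fewest in-flight requests, which suits the highly variable latency of LLM completions better than round-robin:

```yaml
static_resources:
  clusters:
  - name: vllm_pool
    type: STRICT_DNS
    lb_policy: LEAST_REQUEST        # prefer the least-loaded replica
    load_assignment:
      cluster_name: vllm_pool
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address: { address: gpu-node-a.internal, port_value: 8000 }
        - endpoint:
            address:
              socket_address: { address: gpu-node-b.internal, port_value: 8000 }
```

A listener with an `http_connection_manager` filter routing to `vllm_pool` completes the picture; long response timeouts are needed for streaming completions.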

10.4 Authentication for the OpenAI API
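The OpenAI API convention is a bearer token in the `Authorization` header, so the gateway only needs to validate that one header. A minimal Python sketch of the client side, using only the standard library; the gateway URL, API key, and model id are placeholders:

```python
import json
import urllib.request

# Hypothetical internal gateway and key; the OpenAI API expects the key
# as "Authorization: Bearer <key>" on every request.
BASE_URL = "http://gateway.internal:8080/v1"
API_KEY = "sk-local-example"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions request with bearer auth."""
    body = json.dumps({
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {API_KEY}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_chat_request("hello")
```

Official OpenAI SDKs work unchanged against such a gateway by pointing their base URL at it, since they attach the same header.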

10.5 ROCm and CUDA coexistence
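Both runtimes expose an environment variable for GPU visibility, so per-host device pinning looks symmetric across vendors. A sketch, with the model id as a placeholder:

```shell
# NVIDIA host: restrict the process to GPUs 0 and 1 via the CUDA runtime
CUDA_VISIBLE_DEVICES=0,1 vllm serve meta-llama/Llama-3.1-8B-Instruct

# AMD host: the equivalent knob for the ROCm/HIP runtime
HIP_VISIBLE_DEVICES=0,1 vllm serve meta-llama/Llama-3.1-8B-Instruct
```

Keeping deployment scripts parameterized on which variable to set is usually the only vendor-specific branch needed at launch time, since vLLM ships separate CUDA and ROCm builds.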