9 Observability
Note
Grafana, Loki, Prometheus, Alloy. journald as the log source. All rootful, all on-host.
9.1 The observability stack
- Grafana for dashboards and alerting
- Loki for log aggregation
- Prometheus for metrics
- Grafana Alloy as the unified collection agent (replaces Promtail + node_exporter)
- All four run on the host: Grafana, Loki, and Prometheus in a rootless pod, Alloy as a rootful container (it needs access to journald and every service user's container logs)
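A minimal sketch of that split, assuming default image names and ports (the pod name and published ports are illustrative):

```shell
# Rootless pod for the three server components (run as an unprivileged user)
podman pod create --name obs -p 3000:3000 -p 3100:3100 -p 9090:9090
podman run -d --pod obs --name grafana    docker.io/grafana/grafana:latest
podman run -d --pod obs --name loki       docker.io/grafana/loki:latest
podman run -d --pod obs --name prometheus docker.io/prom/prometheus:latest
```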
9.2 Alloy as the collection agent
- Alloy collects from two log sources: the systemd journal (host and service events) and Podman container logs (overlay storage ctr.log files)
- Pushes logs to Loki on localhost:3100
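The two sources map onto a small Alloy pipeline. A hedged sketch (the rootless storage glob and labels are assumptions about this layout, not fixed paths):

```alloy
// Ship the host journal to the local Loki.
loki.source.journal "host" {
  path       = "/var/log/journal"
  labels     = { source = "journald" }
  forward_to = [loki.write.local.receiver]
}

// Pick up Podman's per-container ctr.log files for richer metadata.
// Assumes rootless storage under each service user's home directory.
local.file_match "podman" {
  path_targets = [{ __path__ = "/home/*/.local/share/containers/storage/overlay-containers/*/userdata/ctr.log" }]
}

loki.source.file "podman" {
  targets    = local.file_match.podman.targets
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"
  }
}
```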
- Scrapes Prometheus metrics from: node_exporter (host metrics, embedded in Alloy), Authentik (:9300), Envoy admin (:9901), and zrepl (:9811)
- Runs with host network and bind mounts for /var/log/journal and /home (service users' container storage)
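The corresponding rootful invocation might look like this (image tag and config path are illustrative; mounts are read-only since Alloy only collects):

```shell
podman run -d --name alloy --network host \
  -v /var/log/journal:/var/log/journal:ro \
  -v /home:/home:ro \
  -v /etc/alloy:/etc/alloy:ro \
  docker.io/grafana/alloy:latest \
    run /etc/alloy/config.alloy
```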
9.3 Why journald is the only log source
- Every rootless Podman container logs to journald (Podman's journald log driver, units managed by systemd)
- Alloy also reads Podman’s internal container logs for richer metadata
- No syslog, no file-based logs, no log rotation config — journald handles retention and structure
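One payoff of journald-only logging is structured, per-container queries straight from the host. For example, assuming a container named `grafana` using the journald log driver:

```shell
# CONTAINER_NAME is a journal field set by Podman's journald log driver
journalctl CONTAINER_NAME=grafana -o json-pretty --since "-1h"
```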
9.4 Prometheus metrics
- Static scrape targets defined in prometheus.yml, deployed by pyinfra
- Authentik exposes metrics on :9300 (application health, flow execution, policy evaluation)
- Envoy exposes metrics on :9901 (connection counts, latency histograms, circuit breaker state)
- zrepl exposes metrics on :9811 (snapshot counts, replication lag)
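A prometheus.yml fragment matching the ports above might look like this (job names are illustrative; Envoy's admin interface serves metrics at /stats/prometheus rather than the default /metrics):

```yaml
scrape_configs:
  - job_name: authentik
    static_configs: [{ targets: ["localhost:9300"] }]
  - job_name: envoy
    metrics_path: /stats/prometheus
    static_configs: [{ targets: ["localhost:9901"] }]
  - job_name: zrepl
    static_configs: [{ targets: ["localhost:9811"] }]
```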
9.5 Grafana dashboards
- OIDC login via Authentik (Grafana is one of the OIDC-integrated services)
- Dashboards for: host resource usage, per-service container health, Envoy traffic, ZFS pool state
- Alerting via Grafana’s built-in alertmanager (email, webhook)
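The OIDC login is Grafana's generic OAuth provider pointed at Authentik's standard endpoints. A sketch, with a placeholder domain and credentials:

```ini
# grafana.ini — generic OAuth against Authentik
[auth.generic_oauth]
enabled       = true
name          = Authentik
client_id     = <client-id-from-authentik>
client_secret = <client-secret>
scopes        = openid profile email
auth_url      = https://auth.example.com/application/o/authorize/
token_url     = https://auth.example.com/application/o/token/
api_url       = https://auth.example.com/application/o/userinfo/
```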
9.6 LLM-assisted triage
- Journal logs and Grafana alerts can be fed to the self-hosted vLLM instance for summarization
- Useful for quickly understanding multi-service failures from a single log dump
- Experimental; the integration is manual today
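Since the integration is manual today, "feeding logs to vLLM" amounts to a short script. A sketch, assuming vLLM serves an OpenAI-compatible API on a hypothetical localhost:8000 endpoint:

```python
import json
import subprocess
import urllib.request

# Hypothetical local vLLM endpoint; adjust to your deployment.
VLLM_URL = "http://localhost:8000/v1/chat/completions"


def build_triage_payload(journal_lines, model="local-model"):
    """Wrap raw journal lines in an OpenAI-style chat request for summarization."""
    prompt = (
        "Summarize these journal entries; identify the failing service(s) "
        "and the most likely root cause:\n\n" + "\n".join(journal_lines)
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.2,
    }


def triage_last_hour():
    """Send the last hour of error-level journal entries to the local vLLM."""
    lines = subprocess.run(
        ["journalctl", "-p", "err", "--since", "-1h", "-o", "cat"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(build_triage_payload(lines)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same payload builder works for Grafana alert webhooks: dump the alert JSON into the prompt instead of journal lines.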
9.7 Uptime monitoring (break-glass)
- External uptime monitoring from outside the network (e.g., Uptime Kuma or a SaaS provider)
- If the host is down, on-host observability is useless; external checks are the break-glass path