9  Observability

Note

Grafana, Loki, Prometheus, Alloy. journald as the log source. All on-host; everything runs rootless except Alloy, which is rootful.

9.1 The observability stack

  • Grafana for dashboards and alerting
  • Loki for log aggregation
  • Prometheus for metrics
  • Grafana Alloy as the unified collection agent (replaces Promtail + node_exporter)
  • All four run on the host: Grafana, Loki, and Prometheus in a rootless pod; Alloy as a rootful container, since it needs read access to journald and every service user’s container logs
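
The rootless side can be expressed as Podman Quadlet units. A hedged sketch of one member of the pod; the image tag, volume name, and pod unit name (obs.pod) are assumptions, not the deployed files:

```ini
# ~/.config/containers/systemd/grafana.container — illustrative only
[Unit]
Description=Grafana (rootless, member of the observability pod)

[Container]
Image=docker.io/grafana/grafana:latest
Pod=obs.pod
Volume=grafana-data:/var/lib/grafana

[Install]
WantedBy=default.target
```

Quadlet generates the systemd service at user-session start, so the container lifecycle is managed like any other unit.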

9.2 Alloy as the collection agent

  • Alloy collects from two log sources: systemd journal (host and service events) and Podman container logs (overlay storage ctr.log files)
  • Pushes logs to Loki on localhost:3100
  • Scrapes Prometheus metrics from: the host (via its embedded node_exporter replacement), Authentik, Envoy admin (:9901), zrepl
  • Runs with host network and bind mounts for /var/log/journal, /home (service user container storage)
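
The bullets above map onto an Alloy configuration roughly like the following. A minimal sketch: the file-match glob, label names, and the remote-write receiver (which assumes Prometheus runs with --web.enable-remote-write-receiver) are assumptions, not the deployed config:

```alloy
// /etc/alloy/config.alloy — illustrative only

// Host and service events from the systemd journal.
loki.source.journal "host" {
  path       = "/var/log/journal"
  forward_to = [loki.write.local.receiver]
}

// Rootless containers' ctr.log files under each service user's storage.
local.file_match "podman" {
  path_targets = [{__path__ = "/home/*/.local/share/containers/storage/overlay-containers/*/userdata/ctr.log"}]
}

loki.source.file "podman" {
  targets    = local.file_match.podman.targets
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"
  }
}

// Host metrics, replacing node_exporter.
prometheus.exporter.unix "host" { }

prometheus.scrape "host" {
  targets    = prometheus.exporter.unix.host.targets
  forward_to = [prometheus.remote_write.local.receiver]
}

prometheus.remote_write "local" {
  endpoint {
    url = "http://localhost:9090/api/v1/write"
  }
}
```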

9.3 Why journald is the only log source

  • Every rootless Podman container logs to journald via systemd
  • Alloy also reads Podman’s internal container logs for richer metadata
  • No syslog, no file-based logs, no log rotation config — journald handles retention and structure
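
Retention lives in one place. A sketch of the journald side; the size and age limits are example values, not the deployed ones:

```ini
# /etc/systemd/journald.conf (or a drop-in under journald.conf.d/)
[Journal]
Storage=persistent
SystemMaxUse=2G
MaxRetentionSec=1month
```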

9.4 Prometheus metrics

  • Static scrape targets defined in prometheus.yml, deployed by pyinfra
  • Authentik exposes metrics on :9300 (application health, flow execution, policy evaluation)
  • Envoy exposes metrics on :9901 (connection counts, latency histograms, circuit breaker state)
  • zrepl exposes metrics on :9811 (snapshot counts, replication lag)
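
A sketch of the corresponding prometheus.yml scrape config; the 127.0.0.1 addresses are assumptions (everything is on-host), and note Envoy serves metrics on /stats/prometheus rather than the default /metrics:

```yaml
scrape_configs:
  - job_name: authentik
    static_configs:
      - targets: ['127.0.0.1:9300']
  - job_name: envoy
    metrics_path: /stats/prometheus
    static_configs:
      - targets: ['127.0.0.1:9901']
  - job_name: zrepl
    static_configs:
      - targets: ['127.0.0.1:9811']
```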

9.5 Grafana dashboards

  • OIDC login via Authentik (Grafana is one of the OIDC-integrated services)
  • Dashboards for: host resource usage, per-service container health, Envoy traffic, ZFS pool state
  • Alerting via Grafana’s built-in Alertmanager, with email and webhook contact points
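
The OIDC login uses Grafana's generic OAuth support against Authentik's standard endpoints. A sketch; the auth.example hostname and client credentials are placeholders:

```ini
# grafana.ini — illustrative only
[auth.generic_oauth]
enabled = true
name = Authentik
client_id = grafana
client_secret = ${GF_OAUTH_SECRET}
scopes = openid profile email
auth_url = https://auth.example/application/o/authorize/
token_url = https://auth.example/application/o/token/
api_url = https://auth.example/application/o/userinfo/
```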

9.6 LLM-assisted triage

  • Journal logs and Grafana alerts can be fed to the self-hosted vLLM instance for summarization
  • Useful for quickly understanding multi-service failures from a single log dump
  • Experimental; the integration is manual today
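
Since vLLM exposes an OpenAI-compatible API, the manual triage step amounts to wrapping a journal dump in a chat-completion request. A minimal sketch, assuming vLLM on localhost:8000; the model id and the truncation limit are hypothetical:

```python
# Sketch of LLM-assisted triage: build an OpenAI-style chat request for the
# self-hosted vLLM instance from a journal dump.
VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's OpenAI-compatible API
MODEL = "local-model"  # hypothetical model id — use the deployed model's name

def build_triage_request(log_text: str, max_chars: int = 16000) -> dict:
    """Keep the most recent max_chars of a journal dump and wrap it
    in a chat-completion payload."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system",
             "content": "You summarize systemd journal logs and identify the failing service."},
            {"role": "user",
             "content": "Summarize these logs:\n" + log_text[-max_chars:]},
        ],
        "temperature": 0.2,
    }

# Manual flow today:
#   journalctl --since "-1 hour" -p warning -o cat > dump.txt
#   then POST json.dumps(build_triage_request(dump)) to VLLM_URL.
```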

9.7 Uptime monitoring (break-glass)

  • External uptime monitoring from outside the network (e.g., Uptime Kuma or a SaaS provider)
  • If the host is down, on-host observability is useless; external checks are the break-glass path
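
At its simplest, the break-glass path is a cron job on a machine outside the network. A sketch; the URL is a placeholder and notify-down.sh is a hypothetical alert hook (any SaaS monitor or Uptime Kuma fills the same role):

```
# crontab on an off-site machine — every 5 minutes
*/5 * * * * curl -fsS -m 10 https://home.example/healthz >/dev/null || /usr/local/bin/notify-down.sh
```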