9  Observability

Note

Grafana, Loki, Prometheus, Alloy. journald as the log source. All on-host; everything runs rootless except Alloy, which is rootful.

9.1 The observability stack

  • Grafana for dashboards and alerting
  • Loki for log aggregation
  • Prometheus for metrics
  • Grafana Alloy as the unified collection agent (replaces Promtail + node_exporter)
  • All four run on the host: Grafana, Loki, and Prometheus in a rootless pod; Alloy as a rootful container, since it needs read access to journald and every service user’s container logs
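
The rootless side can be expressed as Podman Quadlet units. A hedged sketch of one member of the pod; the image tag, volume name, and pod unit name (obs.pod) are assumptions, not the deployed files:

```ini
# ~/.config/containers/systemd/grafana.container — illustrative only
[Unit]
Description=Grafana (rootless, member of the observability pod)

[Container]
Image=docker.io/grafana/grafana:latest
Pod=obs.pod
Volume=grafana-data:/var/lib/grafana

[Install]
WantedBy=default.target
```

Quadlet generates the systemd service at user-session start, so the container lifecycle is managed like any other unit.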

9.2 Alloy as the collection agent

  • Alloy collects from two log sources: systemd journal (host and service events) and Podman container logs (overlay storage ctr.log files)
  • Pushes logs to Loki on localhost:3100
  • Scrapes Prometheus metrics from: the host (via its embedded node_exporter replacement), Authentik, Envoy admin (:9901), zrepl
  • Runs with host network and bind mounts for /var/log/journal, /home (service user container storage)
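
The bullets above map onto an Alloy configuration roughly like the following. A minimal sketch: the file-match glob, label names, and the remote-write receiver (which assumes Prometheus runs with --web.enable-remote-write-receiver) are assumptions, not the deployed config:

```alloy
// /etc/alloy/config.alloy — illustrative only

// Host and service events from the systemd journal.
loki.source.journal "host" {
  path       = "/var/log/journal"
  forward_to = [loki.write.local.receiver]
}

// Rootless containers' ctr.log files under each service user's storage.
local.file_match "podman" {
  path_targets = [{__path__ = "/home/*/.local/share/containers/storage/overlay-containers/*/userdata/ctr.log"}]
}

loki.source.file "podman" {
  targets    = local.file_match.podman.targets
  forward_to = [loki.write.local.receiver]
}

loki.write "local" {
  endpoint {
    url = "http://localhost:3100/loki/api/v1/push"
  }
}

// Host metrics, replacing node_exporter.
prometheus.exporter.unix "host" { }

prometheus.scrape "host" {
  targets    = prometheus.exporter.unix.host.targets
  forward_to = [prometheus.remote_write.local.receiver]
}

prometheus.remote_write "local" {
  endpoint {
    url = "http://localhost:9090/api/v1/write"
  }
}
```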

9.3 Why journald is the only log source

  • Every rootless Podman container logs to journald via systemd
  • Alloy also reads Podman’s internal container logs for richer metadata
  • No syslog, no file-based logs, no log rotation config — journald handles retention and structure
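
Retention lives in one place. A sketch of the journald side; the size and age limits are example values, not the deployed ones:

```ini
# /etc/systemd/journald.conf (or a drop-in under journald.conf.d/)
[Journal]
Storage=persistent
SystemMaxUse=2G
MaxRetentionSec=1month
```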

9.4 Prometheus metrics

  • Static scrape targets defined in prometheus.yml, deployed by pyinfra
  • Authentik exposes metrics on :9300 (application health, flow execution, policy evaluation)
  • Envoy exposes metrics on :9901 (connection counts, latency histograms, circuit breaker state)
  • zrepl exposes metrics on :9811 (snapshot counts, replication lag)
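
A sketch of the corresponding prometheus.yml scrape config; the 127.0.0.1 addresses are assumptions (everything is on-host), and note Envoy serves metrics on /stats/prometheus rather than the default /metrics:

```yaml
scrape_configs:
  - job_name: authentik
    static_configs:
      - targets: ['127.0.0.1:9300']
  - job_name: envoy
    metrics_path: /stats/prometheus
    static_configs:
      - targets: ['127.0.0.1:9901']
  - job_name: zrepl
    static_configs:
      - targets: ['127.0.0.1:9811']
```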

9.5 Grafana dashboards

  • OIDC login via Authentik (Grafana is one of the OIDC-integrated services)
  • Dashboards for: host resource usage, per-service container health, Envoy traffic, ZFS pool state
  • Alerting via Grafana’s built-in Alertmanager, with email and webhook contact points
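
The OIDC login uses Grafana's generic OAuth support against Authentik's standard endpoints. A sketch; the auth.example hostname and client credentials are placeholders:

```ini
# grafana.ini — illustrative only
[auth.generic_oauth]
enabled = true
name = Authentik
client_id = grafana
client_secret = ${GF_OAUTH_SECRET}
scopes = openid profile email
auth_url = https://auth.example/application/o/authorize/
token_url = https://auth.example/application/o/token/
api_url = https://auth.example/application/o/userinfo/
```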

9.6 LLM-assisted triage

  • Journal logs and Grafana alerts can be fed to the self-hosted vLLM instance for summarization
  • Useful for quickly understanding multi-service failures from a single log dump
  • Experimental; the integration is manual today
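
Since vLLM exposes an OpenAI-compatible API, the manual triage step amounts to wrapping a journal dump in a chat-completion request. A minimal sketch, assuming vLLM on localhost:8000; the model id and the truncation limit are hypothetical:

```python
# Sketch of LLM-assisted triage: build an OpenAI-style chat request for the
# self-hosted vLLM instance from a journal dump.
VLLM_URL = "http://localhost:8000/v1/chat/completions"  # vLLM's OpenAI-compatible API
MODEL = "local-model"  # hypothetical model id — use the deployed model's name

def build_triage_request(log_text: str, max_chars: int = 16000) -> dict:
    """Keep the most recent max_chars of a journal dump and wrap it
    in a chat-completion payload."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system",
             "content": "You summarize systemd journal logs and identify the failing service."},
            {"role": "user",
             "content": "Summarize these logs:\n" + log_text[-max_chars:]},
        ],
        "temperature": 0.2,
    }

# Manual flow today:
#   journalctl --since "-1 hour" -p warning -o cat > dump.txt
#   then POST json.dumps(build_triage_request(dump)) to VLLM_URL.
```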

9.7 Uptime monitoring (break-glass)

  • External uptime monitoring from outside the network (e.g., Uptime Kuma or a SaaS provider)
  • If the host is down, on-host observability is useless; external checks are the break-glass path
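
At its simplest, the break-glass path is a cron job on a machine outside the network. A sketch; the URL is a placeholder and notify-down.sh is a hypothetical alert hook (any SaaS monitor or Uptime Kuma fills the same role):

```
# crontab on an off-site machine — every 5 minutes
*/5 * * * * curl -fsS -m 10 https://home.example/healthz >/dev/null || /usr/local/bin/notify-down.sh
```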