16  Day Two Operations

Note

What happens after everything is deployed — updates, adding services, adding users, and what breaks.

16.1 Update strategy

  • Host OS: bootc upgrade pulls the latest instance image from the registry; a reboot applies it. Rollback is bootc rollback, which reverts to the previous image.
  • Containers: podman auto-update checks for new :latest digests and restarts containers whose image changed. A ZFS snapshot taken before the update provides rollback.
  • Base images: CI rebuilds nightly; instance images rebuild when the base changes. The host pulls the new image on the next bootc upgrade.
  • Cadence: host updates are manual (a reboot is required); container updates are automatic and continuous (both paths are sketched below).
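
A minimal command-level sketch of both paths. The tank/services dataset layout is an assumption for illustration, not taken from the actual configuration:

    # Host OS: pull the new instance image, reboot into it, and fall back to the
    # previous deployment only if the new image misbehaves.
    sudo bootc upgrade
    sudo systemctl reboot
    sudo bootc rollback

    # Containers: snapshot first, then let podman pull new digests and restart the
    # units whose image changed (containers carry the io.containers.autoupdate=registry label).
    sudo zfs snapshot -r tank/services@pre-update-$(date +%F)
    podman auto-update --dry-run    # preview which containers would be updated
    podman auto-update

Automatic, continuous container updates typically come from enabling podman's podman-auto-update.timer rather than running the command by hand.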

16.2 Adding a new service

  1. Create a service user with a static UID in the instance Containerfile
  2. Create a ZFS dataset for the service
  3. Add quadlet files (.pod, .container, Caddyfile) to quadlets/{service}/
  4. Add the service definition to pyinfra/services.py (user, UID, FQDN, port, secrets, volumes)
  5. Add an SNI filter chain to Envoy’s config for the new FQDN
  6. Create an Authentik application and provider (OIDC, LDAP, or forward auth)
  7. Create a Cloudflare DNS record for the FQDN
  8. make deploy-{service} (a condensed sketch of these steps follows)
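
For a hypothetical service named example, the file-level steps look roughly like this; the pool name, quota, and dataset layout are illustrative, and the quadlet and services.py contents are omitted:

    # Step 2: dataset for the service's persistent data
    sudo zfs create -o quota=50G tank/services/example
    # Step 3: quadlet files plus the Caddyfile
    mkdir -p quadlets/example
    $EDITOR quadlets/example/example.pod \
            quadlets/example/example.container \
            quadlets/example/Caddyfile
    # Steps 1 and 4-7: static-UID user in the Containerfile, entry in
    # pyinfra/services.py, Envoy SNI filter chain, Authentik provider, DNS record
    # Step 8: deploy
    make deploy-example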

16.3 Adding a new user

  • Create user in Authentik, send enrollment link
  • User sets password, registers passkey
  • Access to services controlled by Authentik group membership and per-application policies
  • No per-service account creation needed for OIDC-integrated services (accounts auto-provisioned on first login)

16.4 Decommissioning a service

  • Stop the pod: systemctl --user stop {pod}
  • Remove the quadlet files, the service entry in pyinfra/services.py, and the Envoy SNI entry
  • The ZFS dataset is preserved (snapshots retain the data); destroy it explicitly once you are confident the data is no longer needed (see the sketch after this list)
  • Remove Authentik application and DNS record
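
A rough sequence for the same hypothetical example service; the unit name follows quadlet's {name}-pod.service convention, and paths and dataset names are the same assumptions as above:

    # Stop the pod's generated systemd user unit
    systemctl --user stop example-pod.service
    # Remove the quadlet files from the repo (the deployed copies under the service
    # user's ~/.config/containers/systemd/ need removing too), then reload
    rm -r quadlets/example
    systemctl --user daemon-reload
    # Data and snapshots stay in place until destroyed explicitly
    zfs list -t snapshot -r tank/services/example
    sudo zfs destroy -r tank/services/example    # only once nothing needs the data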

16.5 What breaks and why

  • Authentik down: every service behind forward auth or OIDC becomes inaccessible. This is the single point of failure by design; Authentik reliability is critical.
  • Envoy down: all external traffic stops. Restart is automatic (systemd); recovery is fast.
  • ZFS pool degraded: service continues on the surviving mirror member. Replace the failed drive and resilver (see the sketch after this list). zrepl alerts on replication failures.
  • DNS/Cloudflare outage: external access fails. Internal access via /etc/hosts entries still works for services that need to reach each other.
  • Power loss: bootc boots to last-known image, ZFS imports pools, systemd starts all lingering user services. Designed for unattended recovery.
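
For the degraded-pool case, recovery is the standard ZFS drive swap; pool and device names are placeholders:

    zpool status -x                    # identify the degraded vdev and the failed device
    sudo zpool replace tank sdX sdY    # swap the failed disk for its replacement
    zpool status tank                  # watch resilver progress until the pool is ONLINE again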

16.6 Capacity planning

  • Monitor via Grafana: disk usage per ZFS dataset, CPU/memory per pod, GPU utilization
  • ZFS dataset quotas prevent one service from consuming the whole pool
  • Adding storage: new drives are added to existing pools (zpool add) or new pools are created for new workload classes (the mechanics are sketched below)
  • Adding compute: the single-host model has a ceiling; multi-host would require rethinking Envoy ingress and service discovery (not yet needed)
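
The quota and expansion mechanics, again with illustrative pool, dataset, and device names:

    # Cap a service's dataset so runaway growth cannot fill the pool
    sudo zfs set quota=100G tank/services/example
    zfs get quota,used,available tank/services/example
    # Grow an existing pool by adding another mirror vdev
    sudo zpool add tank mirror sdX sdY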