16 Day Two Operations
Note
What happens after everything is deployed — updates, adding services, adding users, and what breaks.
16.1 Update strategy
- Host OS: `bootc upgrade` pulls the latest instance image from the registry; a reboot applies it. Rollback is `bootc rollback` to the previous image.
- Containers: `podman auto-update` checks for new `:latest` digests and restarts containers that changed. A ZFS snapshot taken before the update provides rollback.
- Base images: CI rebuilds nightly; instance images rebuild when the base changes. The host pulls the new image on the next `bootc upgrade`.
- Cadence: host updates are manual (reboot required); container updates are automatic and continuous.
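In shell terms the policy above comes down to a handful of commands; a rough sketch (exact invocations and timer setup on the host may differ):

```bash
# Host OS: manual, reboot required
sudo bootc upgrade          # pull the latest instance image from the registry
sudo systemctl reboot       # apply it by booting into the new deployment
sudo bootc rollback         # if needed, queue the previous image for the next boot

# Containers: normally driven by the podman-auto-update timer, shown here by hand
podman auto-update --dry-run   # list containers whose :latest digest changed
podman auto-update             # pull new digests and restart the changed containers
```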
16.2 Adding a new service
- Create a service user with a static UID in the instance Containerfile
- Create a ZFS dataset for the service
- Add quadlet files (`.pod`, `.container`, `Caddyfile`) to `quadlets/{service}/`
- Add the service definition to `pyinfra/services.py` (user, UID, FQDN, port, secrets, volumes)
- Add an SNI filter chain to Envoy's config for the new FQDN
- Create an Authentik application and provider (OIDC, LDAP, or forward auth)
- Create a Cloudflare DNS record for the FQDN
- Run `make deploy-{service}` (see the sketch below)
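A hedged sketch of the storage and deploy steps, assuming a hypothetical pool named `tank`, a `tank/services/{service}` dataset layout, and a new service called `miniflux`; the real names, quota, and make target come from the repo and `pyinfra/services.py`:

```bash
# Hypothetical pool, dataset layout, quota, and service name.
sudo zfs create -o quota=50G tank/services/miniflux   # per-service dataset, capped
make deploy-miniflux   # assumed to drive pyinfra for this service's quadlets and secrets
```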
16.3 Adding a new user
- Create user in Authentik, send enrollment link
- User sets password, registers passkey
- Access to services controlled by Authentik group membership and per-application policies
- No per-service account creation needed for OIDC-integrated services (accounts auto-provisioned on first login)
16.4 Decommissioning a service
- Stop the pod: `systemctl --user stop {pod}`
- Remove the quadlet files, remove the service from `pyinfra/services.py`, remove the Envoy SNI entry
- The ZFS dataset is preserved (snapshots retain the data); destroy it explicitly when confident
- Remove Authentik application and DNS record
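The same steps from the shell, with hypothetical unit and dataset names for a service called `miniflux`:

```bash
# Hypothetical unit and dataset names.
systemctl --user stop miniflux-pod.service      # stop the running pod
systemctl --user daemon-reload                  # after deleting the quadlet files, drop the generated units
zfs list -t snapshot -r tank/services/miniflux  # confirm snapshots still hold the data
sudo zfs destroy -r tank/services/miniflux      # only when confident; this also removes the snapshots
```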
16.5 What breaks and why
- Authentik down: every service behind forward auth or OIDC becomes inaccessible. This is the single point of failure by design; Authentik reliability is critical.
- Envoy down: all external traffic stops. Restart is automatic (systemd); recovery is fast.
- ZFS pool degraded: services continue on the remaining mirror vdev. Replace the failed drive and resilver (see the sketch after this list). zrepl alerts on replication failures.
- DNS/Cloudflare outage: external access fails. Internal access via `/etc/hosts` entries still works for services that need to reach each other.
- Power loss: bootc boots the last-known image, ZFS imports the pools, systemd starts all lingering user services. Designed for unattended recovery.
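Drive replacement on a degraded mirror is standard ZFS procedure; a sketch with hypothetical pool and device names:

```bash
# Hypothetical pool and device paths; take the real ones from `zpool status`.
zpool status -x tank                              # identify the degraded vdev and the failed disk
sudo zpool replace tank \
  /dev/disk/by-id/ata-OLD /dev/disk/by-id/ata-NEW
zpool status tank                                 # watch the resilver until the pool reports ONLINE
```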
16.6 Capacity planning
- Monitor via Grafana: disk usage per ZFS dataset, CPU/memory per pod, GPU utilization
- ZFS datasets with quotas prevent one service from consuming all disk
- Adding storage: new drives are added to existing pools (`zpool add`) or new pools are created for new workload classes (sketch below)
- Adding compute: the single-host model has a ceiling; multi-host would require rethinking Envoy ingress and service discovery (not yet needed)
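Quotas and pool growth are plain ZFS operations; hypothetical dataset, pool, and device names throughout:

```bash
# Hypothetical names and sizes, for illustration only.
sudo zfs set quota=200G tank/services/immich     # keep one service from filling the pool
zfs get -r quota tank/services                   # review current quotas per dataset
sudo zpool add tank mirror \
  /dev/disk/by-id/ata-NEW1 /dev/disk/by-id/ata-NEW2   # grow the pool with another mirror vdev
```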