16  Day Two Operations

Note

What happens after everything is deployed — updates, adding services, adding users, and what breaks.

16.1 Update strategy

  • Host OS: bootc upgrade pulls the latest instance image from the registry; a reboot applies it. Rollback is bootc rollback, which reverts to the previous image.
  • Containers: podman auto-update checks for new :latest digests and restarts containers whose image changed. A ZFS snapshot taken before the update provides rollback.
  • Base images: CI rebuilds nightly; instance images rebuild when the base changes. The host pulls the new image on the next bootc upgrade.
  • Cadence: host updates are manual (a reboot is required); container updates are automatic and continuous (both paths are sketched below).
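
A minimal command-level sketch of both paths. The tank/services dataset layout is an assumption for illustration, not taken from the actual configuration:

    # Host OS: pull the new instance image, reboot into it, and fall back to the
    # previous deployment only if the new image misbehaves.
    sudo bootc upgrade
    sudo systemctl reboot
    sudo bootc rollback

    # Containers: snapshot first, then let podman pull new digests and restart the
    # units whose image changed (containers carry the io.containers.autoupdate=registry label).
    sudo zfs snapshot -r tank/services@pre-update-$(date +%F)
    podman auto-update --dry-run    # preview which containers would be updated
    podman auto-update

Automatic, continuous container updates typically come from enabling podman's podman-auto-update.timer rather than running the command by hand.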

16.2 Adding a new service

  1. Create a service user with a static UID in the instance Containerfile
  2. Create a ZFS dataset for the service
  3. Add quadlet files (.pod, .container, Caddyfile) to quadlets/{service}/
  4. Add the service definition to pyinfra/services.py (user, UID, FQDN, port, secrets, volumes)
  5. Add an SNI filter chain to Envoy’s config for the new FQDN
  6. Create an Authentik application and provider (OIDC, LDAP, or forward auth)
  7. Create a Cloudflare DNS record for the FQDN
  8. make deploy-{service} (a condensed sketch of these steps follows)
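
For a hypothetical service named example, the file-level steps look roughly like this; the pool name, quota, and dataset layout are illustrative, and the quadlet and services.py contents are omitted:

    # Step 2: dataset for the service's persistent data
    sudo zfs create -o quota=50G tank/services/example
    # Step 3: quadlet files plus the Caddyfile
    mkdir -p quadlets/example
    $EDITOR quadlets/example/example.pod \
            quadlets/example/example.container \
            quadlets/example/Caddyfile
    # Steps 1 and 4-7: static-UID user in the Containerfile, entry in
    # pyinfra/services.py, Envoy SNI filter chain, Authentik provider, DNS record
    # Step 8: deploy
    make deploy-example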

16.3 Adding a new user

  • Create user in Authentik, send enrollment link
  • User sets password, registers passkey
  • Access to services controlled by Authentik group membership and per-application policies
  • No per-service account creation needed for OIDC-integrated services (accounts auto-provisioned on first login)

16.4 Decommissioning a service

  • Stop the pod: systemctl --user stop {pod}
  • Remove the quadlet files, the service entry in pyinfra/services.py, and the Envoy SNI entry
  • The ZFS dataset is preserved (snapshots retain the data); destroy it explicitly once you are confident the data is no longer needed (see the sketch after this list)
  • Remove Authentik application and DNS record
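
A rough sequence for the same hypothetical example service; the unit name follows quadlet's {name}-pod.service convention, and paths and dataset names are the same assumptions as above:

    # Stop the pod's generated systemd user unit
    systemctl --user stop example-pod.service
    # Remove the quadlet files from the repo (the deployed copies under the service
    # user's ~/.config/containers/systemd/ need removing too), then reload
    rm -r quadlets/example
    systemctl --user daemon-reload
    # Data and snapshots stay in place until destroyed explicitly
    zfs list -t snapshot -r tank/services/example
    sudo zfs destroy -r tank/services/example    # only once nothing needs the data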

16.5 What breaks and why

  • Authentik down: every service behind forward auth or OIDC becomes inaccessible. This is the single point of failure by design; Authentik reliability is critical.
  • Envoy down: all external traffic stops. Restart is automatic (systemd); recovery is fast.
  • ZFS pool degraded: service continues on the surviving mirror member. Replace the failed drive and resilver (see the sketch after this list). zrepl alerts on replication failures.
  • DNS/Cloudflare outage: external access fails. Internal access via /etc/hosts entries still works for services that need to reach each other.
  • Power loss: bootc boots to last-known image, ZFS imports pools, systemd starts all lingering user services. Designed for unattended recovery.
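
For the degraded-pool case, recovery is the standard ZFS drive swap; pool and device names are placeholders:

    zpool status -x                    # identify the degraded vdev and the failed device
    sudo zpool replace tank sdX sdY    # swap the failed disk for its replacement
    zpool status tank                  # watch resilver progress until the pool is ONLINE again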

16.6 Capacity planning

  • Monitor via Grafana: disk usage per ZFS dataset, CPU/memory per pod, GPU utilization
  • ZFS dataset quotas prevent one service from consuming the whole pool
  • Adding storage: new drives are added to existing pools (zpool add) or new pools are created for new workload classes (the mechanics are sketched below)
  • Adding compute: the single-host model has a ceiling; multi-host would require rethinking Envoy ingress and service discovery (not yet needed)
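
The quota and expansion mechanics, again with illustrative pool, dataset, and device names:

    # Cap a service's dataset so runaway growth cannot fill the pool
    sudo zfs set quota=100G tank/services/example
    zfs get quota,used,available tank/services/example
    # Grow an existing pool by adding another mirror vdev
    sudo zpool add tank mirror sdX sdY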