202510272132 Homelab Cluster Drain Playbook
Today I had an unplanned outage, but my APC UPS kept everything running long enough for me to bring down my nodes and other servers gracefully. The great part is that I did all of this within 20 minutes (perhaps even less!).
Therefore, I am writing this playbook for my future self. The goal of the playbook is simple:
- An easy-to-follow set of steps to bring down the cluster and services in the event of an outage
- The guide need not be followed entirely.
Playbook
1. Suspend Flux Reconciliation
This is the most vital step. As the cluster is reconciled via Flux against a Git artifact, Flux will always try to revert any configuration drift. Since we will scale some deployments to 0, we want to stop Flux from interfering.
    # Run in any order
    flux suspend kustomization apps
    flux suspend kustomization infrastructure
    flux suspend kustomization flux-system
    flux suspend hr --all -n homelab
    flux suspend hr --all -n traefik
    flux suspend hr --all -n tailscale
    flux suspend hr --all -n flux-system
    flux suspend hr --all -n infisical
    flux suspend hr --all -n longhorn-system
2. Scale all services and apps to 0
First, scale all applications in these namespaces:
- flux-system
- homelab
- infisical
- longhorn-system
- tailscale
- traefik
Furthermore, the Longhorn documentation suggests additional steps to detach the Longhorn volumes.
The goal here is to gracefully scale all applications to 0 and prevent applications and services from being initialized or provisioned on the nodes.
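A minimal sketch of this step, assuming the workloads in these namespaces are Deployments and StatefulSets (DaemonSets cannot be scaled this way). The helper names `run`, `scale_down` and `show_longhorn_volumes` are made up for this note, and the volume check assumes Longhorn's `volumes.longhorn.io` custom resource lives in `longhorn-system`. Commands are only printed unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Scale every Deployment and StatefulSet in the given namespaces to 0 replicas.
# kubectl may report an error for a namespace that has no matching resources;
# the loop carries on to the next command regardless.
scale_down() {
  for ns in "$@"; do
    run kubectl scale deployment --all --replicas=0 -n "$ns"
    run kubectl scale statefulset --all --replicas=0 -n "$ns"
  done
}

# Assumption: Longhorn volumes are listed through the volumes.longhorn.io
# resource, which should let me check whether they have detached.
show_longhorn_volumes() {
  run kubectl get volumes.longhorn.io -n longhorn-system
}
```

Usage would look like `scale_down flux-system homelab infisical tailscale traefik longhorn-system`, followed by `show_longhorn_volumes`. Namespaces are processed in argument order, so putting `longhorn-system` last is probably the safer choice: the apps release their volumes before the storage components go away.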
3. Drain the worker nodes
At this point, no applications or services should be running. If step 1 was done right, Flux will not attempt to reconcile either.
After confirming, drain the nodes.
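A rough sketch of the confirm-then-drain sequence. The helper names `run`, `show_state` and `drain_workers` are hypothetical, as are any node names passed in. `show_state` only runs read-only commands: the Flux listings should show the suspended objects, and the pod listing should show nothing but core services. Commands are only printed unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Read-only checks before draining: Flux objects and the remaining pods.
show_state() {
  run flux get kustomizations
  run flux get helmreleases --all-namespaces
  run kubectl get pods --all-namespaces
}

# Drain each named worker node. DaemonSet pods are left in place and
# emptyDir data on evicted pods is discarded.
drain_workers() {
  for node in "$@"; do
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data
  done
}
```

Assuming the control planes carry the standard `node-role.kubernetes.io/control-plane` label, the worker names can be collected with `kubectl get nodes -l '!node-role.kubernetes.io/control-plane' -o jsonpath='{.items[*].metadata.name}'` and passed to `drain_workers`.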
4. Gracefully shut down the nodes
SSH into the nodes and run the shutdown command.
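For reference, a small sketch of doing this in one loop. It assumes key-based SSH access and passwordless sudo on each node; `run` and `shutdown_nodes` are made-up helper names and the hostnames are placeholders. Commands are only printed unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Power off each host over SSH, in the order given.
# ssh may exit non-zero when the host drops the connection while going down;
# that is expected and the loop moves on to the next host.
shutdown_nodes() {
  for host in "$@"; do
    run ssh "$host" sudo shutdown -h now
  done
}
```

A call such as `shutdown_nodes worker-1 worker-2` (hypothetical hostnames) first prints what would run, and `DRY_RUN=0 shutdown_nodes worker-1 worker-2` then does it for real.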
5. Shut down the control planes
After shutting down the worker nodes, we can shut down the control planes. Because the control planes are tainted to host nothing except core services, powering them down should be straightforward.
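Before powering off, it may be worth a quick check that the taints are still in place and that only core services remain on a control plane. A sketch with hypothetical helper names (`run`, `show_taints`, `pods_on_node`) and a placeholder node name; both checks are read-only and need the API server to still be up, so they belong before the first control plane goes down. Commands are only printed unless `DRY_RUN=0` is set.

```shell
# Print commands by default; set DRY_RUN=0 to execute them for real.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# List every node alongside the keys of its taints.
show_taints() {
  run kubectl get nodes -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

# List the pods still scheduled on one node, across all namespaces.
pods_on_node() {
  run kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}
```

For example, `show_taints` followed by `pods_on_node cp-1` (placeholder name) for each control plane; once the output looks right, the same shutdown command as on the workers applies.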