
202510272132 Homelab Cluster Drain Playbook

Today I had an unplanned outage, and my APC UPS kept everything running long enough for me to bring down my nodes and other servers gracefully. The great part is that I did all of this within 20 minutes (perhaps even less!).

Therefore, I am writing this playbook for my future self. The goals of the playbook are simple:

  1. Provide an easy-to-follow set of steps for bringing down the cluster and services in the event of an outage.
  2. Allow steps to be used selectively; the guide need not be followed in its entirety.

Playbook

1. Suspend Flux Reconciliation

This is the most vital step. Because the cluster is reconciled by Flux against a Git artifact, Flux will always try to correct any configuration drift. Since we are about to scale deployments to 0, we need to stop Flux from reverting those changes.

# Run in any order
flux suspend kustomization apps
flux suspend kustomization infrastructure
flux suspend kustomization flux-system
flux suspend hr --all -n homelab
flux suspend hr --all -n traefik
flux suspend hr --all -n tailscale
flux suspend hr --all -n flux-system
flux suspend hr --all -n infisical
flux suspend hr --all -n longhorn-system
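The commands above can also be collapsed into a loop, followed by a quick check that everything really is suspended. This is only a sketch: the `run` helper and the `DRY_RUN` switch are illustrative additions (not part of Flux), and dry-run is the default, so nothing changes until `DRY_RUN=0` is set. The Kustomization and namespace names are the ones listed above.

```shell
#!/bin/sh
# Sketch: suspend Flux in a loop, then list the suspension state.
# DRY_RUN and run() are illustrative helpers, not part of the Flux CLI.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

suspend_flux() {
  for ks in apps infrastructure flux-system; do
    run flux suspend kustomization "$ks"
  done
  for ns in homelab traefik tailscale flux-system infisical longhorn-system; do
    run flux suspend hr --all -n "$ns"
  done
}

show_flux_state() {
  # In the real output, the SUSPENDED column should read True for every row.
  run flux get kustomizations --all-namespaces
  run flux get helmreleases --all-namespaces
}

suspend_flux
show_flux_state
```

No `set -e` is used on purpose: during an outage it seems better to keep going if one namespace has nothing to suspend than to abort halfway.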

2. Scale all services and apps to 0

First, scale all applications in these namespaces to 0:

  1. flux-system
  2. homelab
  3. infisical
  4. longhorn-system
  5. tailscale
  6. traefik

Furthermore, the Longhorn documentation suggests additional steps to detach the Longhorn volumes.

The goal here is to gracefully scale all applications to 0 and to prevent any applications or services from being initialized or provisioned on the nodes.
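One possible way to script the scale-down is sketched below. It assumes the workloads are Deployments and StatefulSets (DaemonSets cannot be scaled this way), and the ordering, with `longhorn-system` left for last, is my own assumption so that volume consumers stop before the storage layer does. `run` and `DRY_RUN` are illustrative helpers; dry-run is the default.

```shell
#!/bin/sh
# Sketch: scale Deployments and StatefulSets to 0, namespace by namespace.
# DRY_RUN and run() are illustrative helpers, not part of kubectl.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

scale_to_zero() {
  # Usage: scale_to_zero <namespace>...
  for ns in "$@"; do
    run kubectl scale deployment --all --replicas=0 -n "$ns"
    run kubectl scale statefulset --all --replicas=0 -n "$ns"
  done
}

# Assumed order: everything that may hold a volume goes first.
scale_to_zero homelab infisical tailscale traefik flux-system

# VolumeAttachment objects should disappear as CSI volumes detach;
# an empty list is a reasonable hint that the volumes are no longer in use.
run kubectl get volumeattachments
```

Once that list is empty, `scale_to_zero longhorn-system` would finish the step. This check does not replace the detach procedure in the Longhorn documentation.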

3. Drain the worker nodes

At this point, no applications or services should be running. If step 1 was done correctly, Flux will not attempt to reconcile them either.

After confirming, drain the nodes.
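A sketch of the confirm-then-drain step follows. The node names `worker-1` and `worker-2` are hypothetical, and `list_workers` assumes the control planes carry the standard `node-role.kubernetes.io/control-plane` label. `run` and `DRY_RUN` are illustrative helpers; dry-run is the default.

```shell
#!/bin/sh
# Sketch: confirm the namespaces are empty, then drain the worker nodes.
# DRY_RUN and run() are illustrative helpers, not part of kubectl.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

show_pods() {
  # Anything still listed (other than DaemonSet pods) has not stopped yet.
  for ns in "$@"; do
    run kubectl get pods -n "$ns"
  done
}

list_workers() {
  # Always queries the cluster; prints names in the form node/<name>.
  kubectl get nodes -l '!node-role.kubernetes.io/control-plane' -o name
}

drain_nodes() {
  # kubectl drain cordons the node before evicting its pods.
  for node in "$@"; do
    node="${node#node/}"   # accept both "worker-1" and "node/worker-1"
    run kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=120s
  done
}

show_pods homelab infisical tailscale traefik
drain_nodes worker-1 worker-2   # hypothetical node names
```

Against a live cluster, `DRY_RUN=0 drain_nodes $(list_workers)` would drain every node that is not labelled as a control plane.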

4. Gracefully shut down the worker nodes

SSH into each worker node and run the shutdown command.
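As a sketch, the SSH step could be looped over the hosts. The host names are hypothetical, and the snippet assumes key-based SSH access and passwordless `sudo` on each node. `run` and `DRY_RUN` are illustrative helpers; dry-run is the default.

```shell
#!/bin/sh
# Sketch: shut down a list of hosts over SSH.
# DRY_RUN and run() are illustrative helpers.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

shutdown_hosts() {
  for host in "$@"; do
    # ssh may exit non-zero when the host drops the connection while
    # powering off; the loop simply moves on to the next host.
    run ssh -o ConnectTimeout=10 "$host" 'sudo shutdown -h now'
  done
}

shutdown_hosts worker-1 worker-2   # hypothetical host names
```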

5. Shut down the control planes

Once the worker nodes are down, we can shut down the control planes. Because the control planes are tainted so that they host only core services and no apps, they can simply be powered down.
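Before powering down, it may be worth confirming that the taints are in place, since the API server stops answering once the last control plane is off. The sketch below assumes the standard `node-role.kubernetes.io/control-plane` label, and the host names `cp-1` to `cp-3` are hypothetical. `run` and `DRY_RUN` are illustrative helpers; dry-run is the default.

```shell
#!/bin/sh
# Sketch: list control-plane taints, then shut the control planes down last.
# DRY_RUN and run() are illustrative helpers.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

show_control_plane_taints() {
  # Prints one row per control-plane node with the keys of its taints.
  run kubectl get nodes -l node-role.kubernetes.io/control-plane \
    -o 'custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key'
}

shutdown_control_planes() {
  for host in "$@"; do
    run ssh -o ConnectTimeout=10 "$host" 'sudo shutdown -h now'
  done
}

show_control_plane_taints
shutdown_control_planes cp-1 cp-2 cp-3   # hypothetical host names
```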