On a quiet Sunday morning, I ran a teardown script for a test cluster. It deleted resources from the staging cluster instead. Here’s the full story — how it happened, why GitOps saved us, and what I changed so it can never happen again.
What Happened
I was working on an integration cluster (Hetzner bare-metal) and had a staging cluster context loaded from earlier analysis work. The integration repo had a teardown.sh script to clean up test resources:
#!/bin/bash
kubectl delete -k core/
kubectl delete -k platform/
kubectl delete -k apps/
The script ran. Against staging. Because teardown.sh was the only script in the repo that didn’t set KUBECONFIG.
Every other script had:
export KUBECONFIG="${KUBECONFIG:-$REPO_ROOT/kubeconfig.yaml}"
This one didn’t. It used whatever kubectl context was active — which was the staging cluster I’d switched to earlier with kubie ctx.
The Damage
The kubectl delete -k removed Flux Kustomization resources, HelmReleases, and other managed objects from the staging cluster. Services went down. The team found out Monday morning via GKE audit logs.
The Recovery
Here’s the thing about GitOps: the cluster state is declared in git. Flux’s reconciliation loop noticed that resources were missing and recreated them from the fluxor-staging repository. Within about 10 minutes, staging was back to its declared state.
No manual intervention. No “which YAML did we apply last?” No restore-from-backup. Flux just did its job — if the git repo says a resource should exist, it exists.
The only things that didn’t auto-recover were stateful resources (PVCs, databases) — but those weren’t deleted because kubectl delete -k only removes what the kustomization defines, and our kustomizations reference configs and deployments, not persistent volumes.
Why It Happened
One missing line in one script. The mental model was “I’m working on the integration cluster” but the shell context said otherwise. kubectl defaults are dangerous when you work across multiple clusters daily.
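The failure mode can be made mechanical: treat the active context as untrusted input and compare it against what the script expects before doing anything. A minimal sketch — the function and the `integration` context name here are hypothetical, not from the actual repo:

```shell
# Hypothetical guard: refuse to run unless the active context matches the
# one this script was written for.
require_context() {
  local active="$1" expected="$2"
  if [ "$active" != "$expected" ]; then
    echo "[!] Active context is '$active', expected '$expected' -- aborting" >&2
    return 1
  fi
}

# On a real machine, feed in the live context:
#   require_context "$(kubectl config current-context)" integration || exit 1
```

Taking the active context as an argument (rather than calling kubectl inside) keeps the check testable without a cluster.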
The Fixes
1. KUBECONFIG guard in every script
export KUBECONFIG="${KUBECONFIG:-$REPO_ROOT/kubeconfig.yaml}"
if [ ! -f "$KUBECONFIG" ]; then
  echo "[!] Kubeconfig not found at $KUBECONFIG"
  exit 1
fi
2. Cluster identity validation
CLUSTER_NODES=$(kubectl get nodes --no-headers | awk '{print $1}' | head -3)
if ! echo "$CLUSTER_NODES" | grep -qE "bvt-|integration-"; then
  echo "[!] SAFETY CHECK FAILED: Nodes don't match expected cluster"
  echo "[!] Nodes: $CLUSTER_NODES"
  exit 1
fi
Before any destructive operation, the script checks that the node names match the expected cluster pattern. If you’re supposed to target bvt-worker-0 but you’re connected to gke-staging-node-pool-abc123, the script aborts.
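Factored into a function, the same check can be exercised against a mocked node list instead of a live cluster (the function name is mine; this is a sketch of the pattern above, not the literal script):

```shell
# Return success iff any node name matches the expected cluster pattern.
nodes_match_cluster() {
  local nodes="$1" pattern="$2"
  echo "$nodes" | grep -qE "$pattern"
}

# nodes_match_cluster "bvt-worker-0"                 "bvt-|integration-"  -> matches
# nodes_match_cluster "gke-staging-node-pool-abc123" "bvt-|integration-"  -> no match
```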
3. Interactive confirmation
echo "[*] Target cluster:"
echo " Nodes: $CLUSTER_NODES"
echo " Endpoint: $CLUSTER_ENDPOINT"
read -r -p " Proceed? (yes/no): " confirm
if [ "$confirm" != "yes" ]; then
  echo "[*] Aborted"
  exit 0
fi
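The prompt works as a small function too; because it reads stdin, it can be driven non-interactively in tests or by a wrapper (a sketch, not the exact script):

```shell
# Return success only on a literal "yes" from stdin.
confirm_or_abort() {
  local answer
  read -r -p "    Proceed? (yes/no): " answer
  [ "$answer" = "yes" ]
}
```

Anything other than a literal `yes` — an empty line, `y`, or EOF — fails the check, which is the safe default for a destructive script.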
4. Never trust the default context
The rule now: every script that touches a cluster explicitly sets its own KUBECONFIG. No script inherits the shell’s active context. If you forget, it fails loudly instead of silently deleting the wrong cluster.
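Put together, a sketch of the preamble such a script might start with — wrapped in a function here so it can be tested; the repo layout and kubeconfig file name are assumptions:

```shell
# Pin KUBECONFIG to the repo's own file. An explicitly exported KUBECONFIG
# still wins, but kubectl's implicit ~/.kube/config context is never consulted.
setup_kubeconfig() {
  local repo_root="$1"
  export KUBECONFIG="${KUBECONFIG:-$repo_root/kubeconfig.yaml}"
  if [ ! -f "$KUBECONFIG" ]; then
    echo "[!] Kubeconfig not found at $KUBECONFIG" >&2
    return 1
  fi
}
```

Failing when the file is missing is the point: a script that can't find its own kubeconfig should stop, not fall back to whatever cluster the shell happens to be pointed at.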
Takeaway
GitOps is your safety net, not your prevention layer. Flux recovered staging because the desired state was in git. But the incident still happened because a script trusted the default kubectl context.
Defensive scripting for multi-cluster environments: always set KUBECONFIG explicitly, always validate the target cluster identity before destructive operations, and always require confirmation. The cost of a few extra lines of bash is zero. The cost of deleting the wrong cluster is a Monday morning you don’t want.