Kubelet CSRs Stuck on Bare Metal — Nodes Went Dark

Three worker nodes in a bare-metal Kubernetes cluster suddenly stopped responding. They showed NotReady, and all pods on them were evicted. The kubelet logs showed TLS handshake failures. The kubelet certificates had expired after weeks of renewal requests that nobody approved.

The Background

Kubelets use TLS certificates to communicate with the API server, and those certificates expire. On managed clusters (GKE, EKS), certificate rotation is automatic and invisible. On bare-metal clusters built with KubeOne or kubeadm, rotation depends on every Certificate Signing Request (CSR) the kubelet submits actually being approved.

When a kubelet’s certificate is about to expire, it generates a new CSR and submits it to the API server. Something needs to approve that CSR. If nothing approves it, the kubelet keeps using the old cert until it expires, then loses API server connectivity.

What Happened

The cluster had been running fine for months. Nodes showed Ready. No alerts fired. Then one morning, three nodes went NotReady simultaneously.

The kubelet logs told the story:

TLS handshake error: remote error: tls: certificate has expired

I checked for pending CSRs:

kubectl get csr

Dozens of CSRs in Pending state, some weeks old. The kubelets had been requesting new certificates for weeks. Nobody approved them.
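
With dozens of requests on the list, it helps to group them by requestor to see which nodes are waiting. The function below is a rough sketch, not a command from the incident: the name pending_csrs_by_requestor is made up, and it assumes the four-column layout produced by the custom-columns expression shown in the comment.

```shell
# Count pending CSRs per requestor, busiest requestor first.
# Expects rows of "NAME SIGNER REQUESTOR CONDITIONS" on stdin, for example from:
#   kubectl get csr --no-headers -o custom-columns='NAME:.metadata.name,SIGNER:.spec.signerName,USER:.spec.username,COND:.status.conditions[*].type'
# Assumption: a CSR with neither an Approved nor a Denied condition is pending
# (kubectl prints <none> in that column when the conditions list is empty).
pending_csrs_by_requestor() {
  awk '$4 !~ /Approved|Denied/ { count[$3]++ }
       END { for (user in count) print count[user], user }' | sort -rn
}
```

Piping the kubectl command from the comment into pending_csrs_by_requestor prints one line per requestor, such as a count followed by system:node:NODE-NAME.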

Why It Was Silent

The nodes stayed Ready until the exact moment the old certificate expired. Kubernetes has no node condition along the lines of “certificate expiring in 7 days.” The kubelet doesn’t surface the unanswered CSRs as errors; it just keeps retrying quietly. And the monitoring in place watched node conditions (Ready/NotReady), not certificate expiration dates.
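
Node conditions hide the problem, but the certificate file on the node does not. As a rough sketch, the remaining lifetime can be checked directly with openssl. The function name is hypothetical, and the path in the usage comment is the kubeadm default, which is an assumption about your setup.

```shell
# Return 0 if the certificate in file $1 expires within $2 days,
# 1 if it does not, and 2 if the file cannot be parsed as a certificate.
cert_expires_within_days() {
  cert_file=$1
  days=$2
  # Make sure openssl can read the file before trusting -checkend.
  openssl x509 -in "$cert_file" -noout >/dev/null 2>&1 || return 2
  # -checkend N exits 0 when the certificate is still valid N seconds from now.
  if openssl x509 -in "$cert_file" -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Example, run on a worker node (path assumed, kubeadm default):
#   cert_expires_within_days /var/lib/kubelet/pki/kubelet-client-current.pem 7 \
#     && echo "kubelet client certificate expires within 7 days"
```

Run from cron or a node-level check, this gives the early warning that the Ready condition never will.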

The Fix

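Bulk-approve the pending CSRs, after checking that the list contains only requests from your own nodes. The -r flag (GNU xargs) skips the approve call when nothing is pending:

kubectl get csr --no-headers | awk '$NF == "Pending" {print $1}' | xargs -r kubectl certificate approve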

Nodes reconnected to the API server within seconds. All evicted pods were rescheduled.

For the longer-term fix, I added an auto-approver that watches for kubelet CSRs and approves them automatically. Managed Kubernetes does the same thing behind the scenes, but nothing like it exists by default on bare-metal clusters.
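
The idea behind such an approver can be sketched as a polling loop around kubectl. This is an illustration under assumptions, not the approver deployed in this cluster: the function names and the 60-second default interval are invented, and the filter looks only at the signer name and the requestor.

```shell
# Print the names of pending kubelet CSRs.
# Expects rows of "NAME SIGNER REQUESTOR CONDITIONS" on stdin.
select_pending_kubelet_csrs() {
  awk '
    $2 ~ /^kubernetes\.io\/(kube-apiserver-client-kubelet|kubelet-serving)$/ &&
    $3 ~ /^system:node:/ &&
    $4 !~ /Approved|Denied/ { print $1 }
  '
}

# Approve matching CSRs once.
# Requires kubectl and credentials that are allowed to approve CSRs.
approve_pending_kubelet_csrs() {
  kubectl get csr --no-headers \
    -o custom-columns='NAME:.metadata.name,SIGNER:.spec.signerName,USER:.spec.username,COND:.status.conditions[*].type' \
    | select_pending_kubelet_csrs \
    | while read -r name; do
        kubectl certificate approve "$name"
      done
}

# Poll forever. The interval in seconds is an arbitrary choice.
run_approver() {
  interval=${1:-60}
  while true; do
    approve_pending_kubelet_csrs
    sleep "$interval"
  done
}
```

Calling run_approver 60 from a host with suitable credentials would approve renewals from existing nodes once a minute. Requests from bootstrap tokens are deliberately left out. A production approver should also verify that the names requested in a serving certificate really belong to the requesting node, which this sketch does not do.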

The Monitoring Gap

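After this incident, I added a Prometheus alert for pending CSRs. It assumes kube-state-metrics, which reports only the approved and denied conditions for each CSR, so a pending request is one where neither is set:

count(sum by (certificatesigningrequest) (kube_certificatesigningrequest_condition) == 0) > 0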

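And a certificate expiration alert, built on the kubelet’s own metrics (this assumes Prometheus scrapes the kubelet metrics endpoint):

kubelet_certificate_manager_client_ttl_seconds < 7 * 24 * 3600

A companion alert catches renewals that keep failing:

increase(kubelet_certificate_manager_client_expiration_renew_errors[1h]) > 0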

Takeaway

If you run bare-metal Kubernetes (KubeOne, kubeadm, Rancher), check whether anything approves kubelet CSRs automatically. Managed clusters handle this invisibly. Bare-metal clusters often don’t. Your nodes will work fine until the day the certificates expire, then go dark with no warning. Add monitoring for pending CSRs and certificate expiration, because the default Kubernetes setup doesn’t alert on either.