Never Miss a Crashing Pod Again

GitHub: jdtate101/cluster-monitor (simple multi-cluster monitor with email alerts)

A lightweight Docker daemon that watches all your Kubernetes clusters and emails you the moment something stays broken for more than ten minutes.


If you run multiple Kubernetes clusters — an OpenShift lab, an RKE2 node, a K3s edge cluster — you know the experience: you SSH in two hours later to find a pod that has been silently crash-looping since before lunch. No alert, no email, just a wall of CrashLoopBackOff in the event log and a service that has been down the whole time.

K8s Pod Monitor is the fix. It's a single Docker container that polls every cluster's API every 60 seconds, tracks how long each pod has been unhealthy, and fires a Gmail alert once anything has been broken for ten minutes. When the pod recovers, a follow-up email tells you it's back.


How it works

The daemon hits /api/v1/pods directly on each cluster's API server using a ServiceAccount bearer token. Failure state is kept in memory — nothing to persist, nothing to configure. A restart resets all the timers, which is fine: if a pod was failing before the restart it'll re-enter the failure window within a poll cycle or two.
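That per-cluster request can be sketched with the standard library. The helper name and the use of urllib are my assumptions, not the project's actual code; the endpoint and bearer-token header are as described above.

```python
import urllib.request

def build_pods_request(api_url, token):
    """Build the GET /api/v1/pods call made on every poll.

    Hypothetical helper: the real daemon may use a different HTTP
    client, but the wire format is the same.
    """
    return urllib.request.Request(
        f"{api_url}/api/v1/pods",
        headers={"Authorization": f"Bearer {token}"},
    )
```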

The ten-minute threshold exists to avoid alert fatigue from transient blips: a pod that restarts cleanly after one OOMKilled is not worth a 3am email. One that's been stuck in ImagePullBackOff for fifteen minutes definitely is.
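The debounce logic above fits in a few lines. This is an illustrative sketch under my own naming, not the project's code: remember when each pod first failed, alert once past the threshold, and emit a recovery event when it clears.

```python
import time

ALERT_THRESHOLD_SECONDS = 10 * 60

# pod key -> timestamp of first observed failure; in-memory only,
# so a daemon restart simply resets the timers.
first_seen_failing = {}
alert_sent = set()

def observe(pod_key, is_failing, now=None):
    """Return 'alert' or 'recovery' when an email should fire, else None."""
    now = time.time() if now is None else now
    if is_failing:
        start = first_seen_failing.setdefault(pod_key, now)
        if now - start >= ALERT_THRESHOLD_SECONDS and pod_key not in alert_sent:
            alert_sent.add(pod_key)
            return "alert"
        return None
    # Pod is healthy again: forget the timer, send recovery if we alerted.
    first_seen_failing.pop(pod_key, None)
    if pod_key in alert_sent:
        alert_sent.discard(pod_key)
        return "recovery"
    return None
```

A transient blip (one failing poll, then healthy) never reaches the threshold, which is exactly the alert-fatigue behaviour described above.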


What counts as a failure

| Trigger | Detail |
| --- | --- |
| Pod phase | Failed, Unknown |
| Container waiting | CrashLoopBackOff, OOMKilled, ImagePullBackOff, ErrImagePull, CreateContainerConfigError, and others |
| Container terminated non-zero | Any exit code ≠ 0 |
| Init container failures | Same reasons, prefixed with Init: |
| Succeeded pods | Ignored — completed jobs are not errors |

The system namespaces (kube-system, kube-public, kube-node-lease) are skipped by default. You can add your own via SKIP_NAMESPACES.
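Those rules can be expressed as a single classification function. The following is my reconstruction of the table, not the repository's code; field paths match the Kubernetes pod status schema.

```python
SKIP_NAMESPACES = {"kube-system", "kube-public", "kube-node-lease"}
BAD_WAITING = {"CrashLoopBackOff", "OOMKilled", "ImagePullBackOff",
               "ErrImagePull", "CreateContainerConfigError"}

def pod_failure_reason(pod):
    """Return a failure reason string, or None if the pod is healthy or ignored."""
    ns = pod["metadata"]["namespace"]
    phase = pod["status"].get("phase")
    if ns in SKIP_NAMESPACES or phase == "Succeeded":
        return None  # system namespace, or a completed Job
    if phase in ("Failed", "Unknown"):
        return phase
    # Init containers first, so their failures get the Init: prefix.
    for field, prefix in (("initContainerStatuses", "Init:"),
                          ("containerStatuses", "")):
        for cs in pod["status"].get(field, []):
            state = cs.get("state", {})
            waiting = state.get("waiting", {}).get("reason")
            if waiting in BAD_WAITING:
                return prefix + waiting
            term = state.get("terminated")
            if term and term.get("exitCode", 0) != 0:
                return prefix + f"Exit:{term['exitCode']}"
    return None
```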


Prerequisites

  • Docker + Docker Compose on the host running the monitor
  • kubectl access to each cluster for the initial RBAC setup
  • A Gmail account with 2FA enabled (required for App Passwords)
  • Network line-of-sight from the Docker host to each cluster's API server on port 6443

Setup in five steps

Step 1 — Create ServiceAccounts on each cluster

Apply the included rbac.yaml to every cluster. It creates a pod-monitor namespace and ServiceAccount, a pod-monitor-reader ClusterRole with read-only pod access, and a long-lived token Secret.
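For orientation, a manifest with those pieces looks roughly like this. This is a reconstruction from the names mentioned above, not the file from the repo — apply the repo's rbac.yaml, not this sketch:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: pod-monitor
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pod-monitor
  namespace: pod-monitor
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: pod-monitor-reader
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: pod-monitor-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: pod-monitor-reader
subjects:
  - kind: ServiceAccount
    name: pod-monitor
    namespace: pod-monitor
---
# Long-lived token Secret; since Kubernetes 1.24 these are no longer
# auto-created for ServiceAccounts, hence the explicit Secret.
apiVersion: v1
kind: Secret
metadata:
  name: pod-monitor-token
  namespace: pod-monitor
  annotations:
    kubernetes.io/service-account.name: pod-monitor
type: kubernetes.io/service-account-token
```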

oc apply -f rbac.yaml                                      # OpenShift
kubectl apply -f rbac.yaml --kubeconfig ~/.kube/rke2.yaml  # RKE2
kubectl apply -f rbac.yaml                                 # K3s

RKE2 + Rancher quirk: if Rancher's webhook service is missing, namespace creation will be blocked. Patch the failing webhooks' failurePolicy to Ignore first, then apply the manifest normally.

Step 2 — Extract tokens

Run this on each cluster to get the decoded bearer token. The | base64 -d is not optional — the raw Secret data is base64-encoded, and pasting it without decoding will return a 401 on every poll.

# OpenShift
oc get secret pod-monitor-token -n pod-monitor \
  -o jsonpath='{.data.token}' | base64 -d

# RKE2
kubectl get secret pod-monitor-token -n pod-monitor \
  --kubeconfig ~/.kube/rke2.yaml \
  -o jsonpath='{.data.token}' | base64 -d

# K3s
kubectl get secret pod-monitor-token -n pod-monitor \
  -o jsonpath='{.data.token}' | base64 -d

Step 3 — Gmail App Password

Go to myaccount.google.com → Security → 2-Step Verification → App passwords, create a new password named something like "K8s Monitor", and copy the 16-character string. You won't see it again.

Step 4 — Configure docker-compose.yml

Paste your credentials and tokens into the compose file. All three lab clusters use self-signed API server certificates, so set INSECURE: "true" for each:

GMAIL_USER: "you@gmail.com"
GMAIL_APP_PASS: "abcd efgh ijkl mnop"
ALERT_TO: "you@gmail.com"

CLUSTER_1_NAME: "openshift"
CLUSTER_1_URL: "https://api.openshift2.lab.home:6443"
CLUSTER_1_TOKEN: "eyJhbGci..."
CLUSTER_1_INSECURE: "true"

CLUSTER_2_NAME: "rke2"
CLUSTER_2_URL: "https://192.168.1.99:6443"
CLUSTER_2_TOKEN: "eyJhbGci..."
CLUSTER_2_INSECURE: "true"

CLUSTER_3_NAME: "k3s"
CLUSTER_3_URL: "https://192.168.1.105:6443"
CLUSTER_3_TOKEN: "eyJhbGci..."
CLUSTER_3_INSECURE: "true"

If you have a CA bundle instead, supply the path and mount it as a read-only volume — no code changes needed:

CLUSTER_1_CA_CERT: "/certs/my-ca.crt"
volumes:
  - ~/certs/my-ca.crt:/certs/my-ca.crt:ro
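The CA-versus-insecure decision maps naturally onto a single TLS verification setting, as used by HTTP clients such as requests. A sketch, with illustrative names rather than the project's actual code:

```python
def tls_verify_setting(env, n):
    """Resolve the TLS verification setting for cluster number n.

    - CA bundle path set -> verify against that CA
    - INSECURE "true"    -> skip verification (self-signed certs)
    - otherwise          -> normal system-trust verification
    """
    ca = env.get(f"CLUSTER_{n}_CA_CERT")
    if ca:
        return ca
    if env.get(f"CLUSTER_{n}_INSECURE", "").lower() == "true":
        return False
    return True
```

Note that a CA bundle takes precedence here: once you mount a CA, there is no reason to disable verification.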

Step 5 — Build and run

cd k8s-pod-monitor
docker compose up --build -d
docker compose logs -f

A healthy startup looks like:

[INFO] clusters: Registered cluster 1: openshift → https://api.openshift2.lab.home:6443 (insecure=True)
[INFO] clusters: Registered cluster 2: rke2 → https://192.168.1.99:6443 (insecure=True)
[INFO] clusters: Registered cluster 3: k3s → https://192.168.1.105:6443 (insecure=True)
[INFO] monitor: Starting pod monitor | clusters=3 | poll=60s | threshold=10min

Testing it

The easiest trigger is pointing a deployment at a tag that doesn't exist:

kubectl set image deployment/<name> -n <namespace> <container>=nginx:thistagdoesnotexist999

This produces ImagePullBackOff immediately. With LOG_LEVEL: "DEBUG" you'll see the failure timer counting up each poll:

[DEBUG] monitor: [k3s] retro-game/arcadians-api-xxx failing (ImagePullBackOff) for 1.0 min — alert_sent=False
[DEBUG] monitor: [k3s] retro-game/arcadians-api-xxx failing (ImagePullBackOff) for 2.0 min — alert_sent=False
...
[INFO]  monitor: ALERT: [k3s] retro-game/arcadians-api-xxx (ImagePullBackOff) failing for 10.0 min

After ten minutes you get the alert email, and when you roll back with kubectl rollout undo the recovery email follows.

kubectl rollout undo deployment/<name> -n <namespace>

Tuning

| Variable | Default | Description |
| --- | --- | --- |
| POLL_INTERVAL_SECONDS | 60 | How often to query all clusters |
| ALERT_THRESHOLD_MINUTES | 10 | Minutes before an alert fires |
| SKIP_NAMESPACES | (empty) | Comma-separated extra namespaces to ignore |
| LOG_LEVEL | INFO | Set DEBUG for per-pod poll output |
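Reading those knobs with their documented defaults is one small function. A sketch (the function name and returned dict shape are mine, not the project's):

```python
import os

def load_tuning(env=None):
    """Read the tuning environment variables with their documented defaults."""
    env = os.environ if env is None else env
    # SKIP_NAMESPACES is comma-separated; ignore blanks and stray spaces.
    skip = [ns.strip() for ns in env.get("SKIP_NAMESPACES", "").split(",") if ns.strip()]
    return {
        "poll_interval": int(env.get("POLL_INTERVAL_SECONDS", "60")),
        "alert_threshold_min": int(env.get("ALERT_THRESHOLD_MINUTES", "10")),
        "skip_namespaces": skip,
        "log_level": env.get("LOG_LEVEL", "INFO"),
    }
```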

Adding a cluster

Add a new numbered block to docker-compose.yml — no code changes, no rebuild needed:

CLUSTER_4_NAME: "my-cluster"
CLUSTER_4_URL: "https://192.168.1.x:6443"
CLUSTER_4_TOKEN: "eyJhbGci..."
CLUSTER_4_INSECURE: "true"
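The "just add another numbered block" behaviour implies the daemon walks CLUSTER_1, CLUSTER_2, ... until a number is missing. A sketch of that discovery loop, with my own names rather than the project's:

```python
def discover_clusters(env):
    """Collect CLUSTER_N_* blocks until the first missing number."""
    clusters = []
    n = 1
    while f"CLUSTER_{n}_NAME" in env:
        clusters.append({
            "name": env[f"CLUSTER_{n}_NAME"],
            "url": env[f"CLUSTER_{n}_URL"],
            "token": env[f"CLUSTER_{n}_TOKEN"],
            "insecure": env.get(f"CLUSTER_{n}_INSECURE", "").lower() == "true",
        })
        n += 1
    return clusters
```

One consequence of this pattern: the numbering must be contiguous, since a gap stops the scan.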

Networking note

The container runs with network_mode: host so it can reach LAN addresses and internal DNS names like api.openshift2.lab.home without any extra routing config. The monitor only makes outbound HTTPS calls to port 6443 — nothing is exposed inbound.


State is in-memory only. A container restart resets all failure timers — but any pod that was already broken will re-enter the alert window within a poll cycle or two, so in practice this is rarely an issue.