Kasten Backup Alerter

https://github.com/jdtate101/kasten-alerter-service

A lightweight Kubernetes CronJob that monitors Kasten K10 backup jobs across multiple clusters and sends email alerts when failures are detected. Runs independently of the Kasten Backup Summary Dashboard — no browser required.


Overview

The alerter runs on a configurable schedule (default: hourly), queries Kasten's backupactions CRDs directly across all configured clusters, and sends an HTML email summarising any new failures. A deduplication mechanism ensures each failed run only triggers one alert, regardless of how many times the CronJob executes.

┌─────────────────────────────────────────────────────┐
│              OpenShift CronJob (hourly)              │
│                                                      │
│  alerter.py                                          │
│    ├── Query backupactions CRD (all clusters)        │
│    ├── Filter failures within lookback window        │
│    ├── Deduplicate against kasten-alerter-sent CM    │
│    ├── Send HTML email via Gmail SMTP                │
│    └── Update dedup ConfigMap                        │
└─────────────────────────────────────────────────────┘
         │                    │                    │
    K8s API             K8s API             K8s API
   (OpenShift)          (RKE2)              (K3s)
   in-cluster SA        SA token            SA token

Features

  • Multi-cluster — checks all clusters configured via CLUSTER_N_* env vars
  • Root cause extraction — drills into Kasten's nested error JSON to surface the actual failure reason
  • Deduplication — tracks sent alerts in a ConfigMap, preventing repeat emails for the same failure
  • Auto-expiry — dedup entries older than 7 days are pruned automatically
  • HTML email — clean Veeam-branded table showing cluster, run action, policy, namespaces, and error
  • Configurable lookback — set how far back to check (default 2h, slightly more than the run interval)
  • Minimal dependencies — httpx is the only third-party package; everything else is Python stdlib
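Cluster discovery is driven entirely by the numbered CLUSTER_N_* env vars. A minimal sketch of how such discovery can work, assuming numbering stops at the first gap (the function name and dict shape are illustrative; the real logic lives in alerter.py):

```python
import os

def discover_clusters(env=None):
    """Collect CLUSTER_N_* variables into cluster dicts, stopping at the first gap."""
    env = env if env is not None else os.environ
    clusters = []
    n = 1
    while f"CLUSTER_{n}_NAME" in env:
        clusters.append({
            "name": env[f"CLUSTER_{n}_NAME"],
            "label": env.get(f"CLUSTER_{n}_LABEL", env[f"CLUSTER_{n}_NAME"]),
            "api_url": env[f"CLUSTER_{n}_API_URL"],
            # In-cluster access uses the pod's own SA; remote clusters need a token
            "in_cluster": env.get(f"CLUSTER_{n}_IN_CLUSTER", "false") == "true",
            "token": env.get(f"CLUSTER_{n}_TOKEN"),
        })
        n += 1
    return clusters
```

Because the loop stops at the first missing CLUSTER_N_NAME, new clusters must be numbered consecutively.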

Project Structure

alerter/
├── Dockerfile          # Python 3.12 Alpine image
├── alerter.py          # Main script
├── requirements.txt    # httpx only
└── k8s/
    ├── configmap.yaml  # SMTP settings, cluster URLs, lookback window
    ├── secret.yaml     # Gmail App Password + remote cluster SA tokens
    ├── rbac.yaml       # Role + RoleBinding for ConfigMap read/write
    └── cronjob.yaml    # CronJob definition — runs hourly by default

File Descriptions

File                   Purpose
alerter.py             Discovers clusters from env vars, queries the K8s API for failed backupactions, extracts root-cause errors, deduplicates, sends the alert email
Dockerfile             Minimal Alpine Python image (no nginx, no FastAPI, just the script)
k8s/configmap.yaml     All non-sensitive config: SMTP host/port/user, recipient addresses, lookback window, cluster API URLs
k8s/secret.yaml        Sensitive values: Gmail App Password, remote cluster SA tokens
k8s/rbac.yaml          Namespace-scoped Role allowing the pod to read and write the dedup ConfigMap
k8s/cronjob.yaml       CronJob spec: schedule, image, resource limits, env injection

Prerequisites

  • OpenShift cluster with Kasten K10 installed in kasten-io namespace
  • The kasten-dashboard-reader ClusterRole already applied (from the main dashboard deployment) — the alerter reuses it
  • A Gmail account with 2FA enabled
  • A Gmail App Password generated at https://myaccount.google.com/apppasswords
  • Remote clusters already prepared with the dashboardbff-svc service account and token (see main dashboard README)

Deployment

1. Prepare Remote Clusters

If not already done from the main dashboard setup, apply the remote cluster RBAC on each K3s/RKE2 cluster and retrieve the SA token:

# Apply on each remote cluster
kubectl apply -f ../k8s/remote-cluster-rbac.yaml

# Retrieve the token
kubectl -n kasten-io get secret dashboardbff-svc-token \
  -o jsonpath='{.data.token}' | base64 -d

2. Generate a Gmail App Password

  1. Ensure 2FA is enabled on your Gmail account
  2. Go to https://myaccount.google.com/apppasswords
  3. Select app: Mail, device: Other (name it "Kasten Alerter")
  4. Copy the 16-character password — you won't see it again

3. Configure the ConfigMap

Edit k8s/configmap.yaml:

data:
  SMTP_HOST: "smtp.gmail.com"
  SMTP_PORT: "587"
  SMTP_USER: "your-gmail@gmail.com"
  ALERT_FROM: "your-gmail@gmail.com"
  ALERT_TO: "recipient@example.com,another@example.com"
  ALERT_SUBJECT_PREFIX: "[Kasten Backup]"
  LOOKBACK_HOURS: "2"

  CLUSTER_1_NAME:       "openshift"
  CLUSTER_1_LABEL:      "OpenShift"
  CLUSTER_1_API_URL:    "https://kubernetes.default.svc"
  CLUSTER_1_IN_CLUSTER: "true"

  CLUSTER_2_NAME:       "rke2"
  CLUSTER_2_LABEL:      "RKE2"
  CLUSTER_2_API_URL:    "https://192.168.1.99:6443"

  CLUSTER_3_NAME:       "k3s"
  CLUSTER_3_LABEL:      "K3s"
  CLUSTER_3_API_URL:    "https://192.168.1.105:6443"

LOOKBACK_HOURS should be slightly longer than the CronJob interval so consecutive windows overlap and no failure falls between runs. With an hourly schedule, 2 works; with a 30-minute schedule, use 1.
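The lookback filter amounts to comparing each action's completion timestamp against a cutoff. A minimal sketch under that assumption (the helper name and ISO-8601 timestamp handling are illustrative, not alerter.py's actual code):

```python
from datetime import datetime, timedelta, timezone

def within_lookback(completion_time: str, lookback_hours: float, now=None) -> bool:
    """Return True if an ISO-8601 completion timestamp falls inside the lookback window."""
    now = now or datetime.now(timezone.utc)
    # Kubernetes timestamps end in "Z"; fromisoformat (pre-3.11) needs an explicit offset
    ts = datetime.fromisoformat(completion_time.replace("Z", "+00:00"))
    return now - ts <= timedelta(hours=lookback_hours)
```

With LOOKBACK_HOURS=2 and an hourly schedule, every failure is seen by at least two runs; deduplication prevents the second sighting from emailing again.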

4. Configure the Secret

Edit k8s/secret.yaml — never commit real values to git:

stringData:
  SMTP_PASSWORD: "abcd efgh ijkl mnop"   # Gmail App Password
  CLUSTER_2_TOKEN: "eyJhbGci..."          # RKE2 SA token
  CLUSTER_3_TOKEN: "eyJhbGci..."          # K3s SA token

Or apply directly without touching the file:

oc create secret generic kasten-alerter-secrets \
  --from-literal=SMTP_PASSWORD="your-app-password" \
  --from-literal=CLUSTER_2_TOKEN="your-rke2-token" \
  --from-literal=CLUSTER_3_TOKEN="your-k3s-token" \
  -n kasten-io --dry-run=client -o yaml | oc apply -f -

5. Update the Image Reference

Edit k8s/cronjob.yaml and set your registry:

image: harbor.your.domain/kasten-dashboard/kasten-alerter:latest

6. Build and Push

cd alerter/
docker build -t harbor.your.domain/kasten-dashboard/kasten-alerter:latest .
docker push harbor.your.domain/kasten-dashboard/kasten-alerter:latest

7. Deploy to OpenShift

oc apply -f k8s/rbac.yaml
oc apply -f k8s/configmap.yaml
oc apply -f k8s/secret.yaml
oc apply -f k8s/cronjob.yaml

Verify the CronJob is registered:

oc get cronjob -n kasten-io kasten-alerter

Testing

Run immediately without waiting for the schedule

oc create job kasten-alerter-test --from=cronjob/kasten-alerter -n kasten-io
oc logs -n kasten-io job/kasten-alerter-test -f

Force an alert email using a longer lookback window

If there are no recent failures to trigger an alert, temporarily extend the lookback to find historical failures:

# Extend lookback to 24 hours
oc patch configmap kasten-alerter-config -n kasten-io \
  --type=merge -p '{"data":{"LOOKBACK_HOURS":"24"}}'

# Run a test job
oc create job kasten-alerter-emailtest --from=cronjob/kasten-alerter -n kasten-io
oc logs -n kasten-io job/kasten-alerter-emailtest -f

# Reset lookback to normal
oc patch configmap kasten-alerter-config -n kasten-io \
  --type=merge -p '{"data":{"LOOKBACK_HOURS":"2"}}'

Clear the dedup cache (to re-send alerts for known failures)

oc delete configmap kasten-alerter-sent -n kasten-io

Changing the Schedule

Edit k8s/cronjob.yaml and update the schedule field using standard cron syntax:

Schedule              Cron expression
Every hour            0 * * * *
Every 30 minutes      */30 * * * *
Every 6 hours         0 */6 * * *
Daily at 07:00 UTC    0 7 * * *

After editing, re-apply the manifest (the next scheduled run picks up the change):

oc apply -f k8s/cronjob.yaml

Remember to adjust LOOKBACK_HOURS in the ConfigMap to match: it should be slightly longer than the new interval to prevent gaps.

Adding a New Cluster

Step 1 — Prepare the cluster

Follow the remote cluster RBAC steps in the main dashboard README (../README.md) to create the service account, ClusterRole, ClusterRoleBinding, and token secret.

Step 2 — Update the ConfigMap

oc edit configmap kasten-alerter-config -n kasten-io

Add the new cluster entries:

  CLUSTER_4_NAME:       "harvester"
  CLUSTER_4_LABEL:      "Harvester"
  CLUSTER_4_API_URL:    "https://192.168.1.200:6443"

Step 3 — Add the token to the Secret

oc patch secret kasten-alerter-secrets -n kasten-io \
  --type=merge \
  -p '{"stringData":{"CLUSTER_4_TOKEN":"eyJhbGci..."}}'

Step 4 — Test

oc create job kasten-alerter-newcluster --from=cronjob/kasten-alerter -n kasten-io
oc logs -n kasten-io job/kasten-alerter-newcluster -f

No rebuild required.


Deduplication

The alerter stores sent alert IDs in a ConfigMap named kasten-alerter-sent in the kasten-io namespace. Each alert ID has the format:

{cluster}:{run-action-name}:{date}

For example: k3s:run-f2776nnj4l:2026-04-02

This means:

  • The same failed run will only trigger one email per day
  • If a failure persists across multiple days it will alert again each day
  • Entries older than 7 days are pruned automatically on each run
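The ID format and the 7-day pruning described above can be sketched in Python as follows (function names and the dict-shaped cache are illustrative; see alerter.py for the real implementation):

```python
from datetime import date, timedelta

def alert_id(cluster: str, run_action: str, day: date) -> str:
    """Build the {cluster}:{run-action-name}:{date} dedup key."""
    return f"{cluster}:{run_action}:{day.isoformat()}"

def prune_sent(sent: dict, today: date, max_age_days: int = 7) -> dict:
    """Drop dedup entries whose date component is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return {
        key: val for key, val in sent.items()
        # The date is the last colon-separated component of the key
        if date.fromisoformat(key.rsplit(":", 1)[1]) >= cutoff
    }
```

Including the date in the key is what makes a persistent failure re-alert once per day rather than being suppressed forever.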

To inspect the dedup cache:

oc get configmap kasten-alerter-sent -n kasten-io -o jsonpath='{.data.sent}' | python3 -m json.tool

Email Format

The alert email contains an HTML table with:

Column        Description
Cluster       The display label from CLUSTER_N_LABEL
Run Action    The Kasten run action name (e.g. run-f2776nnj4l)
Policy        The policy that triggered the run
Namespaces    All namespaces affected by this run
Error         Root cause extracted from Kasten's nested error chain

A plain-text fallback is also included for email clients that don't render HTML.
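Root-cause extraction can be sketched as a walk down the error chain. Note the structure assumed here, a 'message' at each level with the next level embedded as a JSON string under 'cause', is an assumption about K10's error format and may differ between versions; alerter.py holds the real logic:

```python
import json

def root_cause(error: dict) -> str:
    """Walk a nested Kasten-style error chain to its innermost message.

    Assumes each level has a 'message' and may embed the next level as a
    JSON string (or dict) under 'cause'.
    """
    message = error.get("message", "")
    cause = error.get("cause")
    while cause:
        if isinstance(cause, str):
            try:
                cause = json.loads(cause)
            except ValueError:
                break  # opaque string cause; keep the last good message
        message = cause.get("message", message)
        cause = cause.get("cause")
    return message
```

Surfacing the innermost message is what keeps the Error column readable instead of showing the generic top-level "Job failed" wrapper.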


Troubleshooting

401 Unauthorized on a remote cluster

The SA token in the secret is invalid or expired. Retrieve a fresh token:

kubectl -n kasten-io get secret dashboardbff-svc-token \
  -o jsonpath='{.data.token}' | base64 -d

Then update the secret and re-run.

Email not sending

  • Check SMTP_USER, SMTP_PASSWORD and ALERT_TO are set correctly
  • Ensure 2FA is enabled on the Gmail account and the App Password is current
  • Check the job logs for the specific SMTP error
  • Test SMTP connectivity from a running job pod:
    oc exec -n kasten-io job/<job-name> -- python3 -c "import smtplib; s=smtplib.SMTP('smtp.gmail.com',587); s.starttls(); print('OK')"

No failures found despite known failures

  • Increase LOOKBACK_HOURS beyond the age of the failure
  • Check the failure state: Kasten uses Failed (capital F) — verify with the debug endpoint on the dashboard: /api/debug/{cluster}?path=apis/actions.kio.kasten.io/v1alpha1/backupactions

View recent job history

oc get jobs -n kasten-io | grep alerter
oc logs -n kasten-io job/<job-name>

Updating the Alerter

# Make changes to alerter.py, then:
docker build -t harbor.your.domain/kasten-dashboard/kasten-alerter:latest .
docker push harbor.your.domain/kasten-dashboard/kasten-alerter:latest

# The next CronJob execution will pull the new image automatically
# (imagePullPolicy: Always is set in cronjob.yaml)

License

MIT