Kubernetes Events Analysis Prompt
Filter, aggregate, and decode Kubernetes events — FailedScheduling, BackOff, ProvisioningFailed — to diagnose cluster-wide issues from noisy event streams.
- Target user: Kubernetes engineers debugging cluster health
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior Kubernetes engineer who can read `kubectl get events` to spot cluster-wide trouble fast. You know which events are noise (NodeReady), which are signal (FailedScheduling, FailedMount), and how to deduplicate.

I will provide:
- The investigation context (cluster-wide health check, specific namespace, specific workload)
- Recent event dump: `kubectl get events -A --sort-by='.lastTimestamp'` (recent few hundred)
- Optional: timeframe of interest

Your job:

1. **Filter to Warning events first**: `kubectl get events -A --field-selector type=Warning`

2. **Identify event categories**:
   - **Scheduling**: `FailedScheduling`, `Preempted`, `NotTriggerScaleUp`
   - **Image / Container**: `Failed`, `BackOff`, `ImagePullBackOff`, `ErrImagePull`, `InspectFailed`
   - **Volume**: `FailedMount`, `FailedAttachVolume`, `VolumeFailedAlreadyOnNode`, `ProvisioningFailed`
   - **Node**: `NodeNotReady`, `NodeHasInsufficientMemory`, `NodeHasDiskPressure`, `Rebooted`
   - **Pod lifecycle**: `Killing`, `Unhealthy`, `BackOff`, `Pulled`, `Created`, `Started`
   - **Admission**: webhook errors, validation failures
   - **Autoscaler**: scale-up/down decisions, `ScaleUpFailed`
   - **Helm / GitOps controller** events (ArgoCD, Flux): `SyncFailed`, `OutOfSync`

3. **Deduplicate**: the same event from many objects often indicates a single root cause:
   - 50 pods with `FailedScheduling: 0/3 nodes are available: 3 Insufficient cpu` → cluster CPU exhausted
   - 20 pods with `FailedMount` on the same PV → CSI driver issue
   - All `aws-load-balancer-controller` events failing → controller down

4. **For each notable cluster of events**:
   - **What** is the event reason
   - **Who** is affected (count, namespaces, workloads)
   - **When** did it start (timing pattern)
   - **Why** (likely root cause)
   - **Next step** (where to look deeper)

5. **Cross-reference timing**:
   - Many events in the same minute → cluster-wide trigger (deploy, node death, autoscaler decision)
   - Periodic events (every 5 min) → cron-like; CronJob or controller reconcile
   - Recurring same-object events → loop (e.g. a failing Helm rollout retrying)

6. **For event noise control**:
   - Default event TTL: 1 hour; older events are garbage-collected
   - Set `--event-ttl` on kube-apiserver to adjust retention
   - Use an event-shipping tool (Eventrouter, kubernetes-event-exporter) for longer retention

7. **For "no events but problem exists"**:
   - Events may have aged out (>1h)
   - The object's controller might not emit events (some custom controllers are silent)
   - Use the controller's logs instead

Mark DESTRUCTIVE: clearing all events (`kubectl delete events -A --all`), interpreting normal events (Created, Pulled) as warnings, attempting cluster-wide fixes from a noisy event stream without root-cause analysis.

---

Investigation context: [DESCRIBE]

Recent events (last few hundred):
```
[PASTE `kubectl get events -A --sort-by='.lastTimestamp' | tail -200`]
```

Or filtered: `kubectl get events -A --field-selector type=Warning`:
```
[PASTE]
```

Timeframe of interest: [DESCRIBE]
Why this prompt works
`kubectl get events` is the cluster's stream of consciousness: what the scheduler decided, what the kubelet rejected, which controller failed. Many engineers skip events because they are noisy. This prompt forces a filtered, categorized analysis.
How to use it
- Filter to Warning first. Normal events are noise for problem-solving.
- Group by reason and object. Patterns emerge.
- Look at the first event in a chain, not the latest.
- For retention beyond 1 hour, you need an event shipper.
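The advice to look at the first event in a chain can be sketched as a small shell helper. This is a minimal sketch, not a definitive tool: it assumes the events have already been flattened to tab-separated lines of `firstTimestamp`, `namespace/name`, and `reason`, and the function name `first_event_per_object` is hypothetical.

```shell
# Reads TSV lines: <firstTimestamp>\t<namespace/name>\t<reason>
# Prints the earliest event seen for each object, ordered by time.
first_event_per_object() {
  awk -F'\t' '
    {
      key = $2
      # Force string comparison; ISO 8601 UTC timestamps sort lexicographically.
      if (!(key in first) || ($1 "") < (first[key] "")) {
        first[key] = $1
        line[key] = $0
      }
    }
    END { for (key in line) print line[key] }
  ' | sort
}

# Example feed (assumes kubectl and jq are installed):
# kubectl get events -A --field-selector type=Warning -o json |
#   jq -r '.items[] | [(.firstTimestamp // .eventTime),
#     "\(.involvedObject.namespace)/\(.involvedObject.name)", .reason] | @tsv' |
#   first_event_per_object
```

Keeping the parsing in `jq` and the selection in `awk` means the helper can be exercised on saved event dumps without cluster access.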
Useful commands
# Sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -100
kubectl get events -A --sort-by='.firstTimestamp'
# Warnings only
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'
# By namespace
kubectl get events -n <ns>
# By specific object
# (-n selects the namespace to query; without it only the current namespace is searched)
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>
# By reason
kubectl get events -A --field-selector reason=FailedScheduling
kubectl get events -A --field-selector reason=FailedMount
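# JSON for tooling (sorted in jq so `tail` really shows the newest; falls back to eventTime when lastTimestamp is null)
kubectl get events -A -o json | jq -r '.items | sort_by(.lastTimestamp // .eventTime) | .[] | "\(.lastTimestamp // .eventTime) \(.type) \(.reason) \(.involvedObject.namespace)/\(.involvedObject.name): \(.message)"' | tail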
# Count events by reason
kubectl get events -A -o json | jq -r '.items[].reason' | sort | uniq -c | sort -nr
# Count events by reason + namespace
kubectl get events -A -o json | \
jq -r '.items[] | "\(.involvedObject.namespace) \(.reason)"' | \
sort | uniq -c | sort -nr | head
# Watch live
kubectl get events -A --watch-only # only new events
# Most recent warning per object (grouped by namespace + name so same-named objects in different namespaces stay separate)
kubectl get events -A --field-selector type=Warning -o json | \
jq -r '.items | group_by([.involvedObject.namespace, .involvedObject.name]) | .[] | sort_by(.lastTimestamp) | .[-1] | "\(.lastTimestamp) \(.involvedObject.namespace)/\(.involvedObject.name) \(.reason): \(.message)"'
Event categories
| Reason | What it means | Where to look |
|---|---|---|
| FailedScheduling | Scheduler couldn't place pod | Node resources, taints, affinity |
| Preempted | Higher-priority pod evicted this one | PriorityClass usage |
| FailedMount | Volume mount failed | PVC binding, CSI driver |
| ProvisioningFailed | PV couldn't be created | StorageClass provisioner, cloud quotas |
| ErrImagePull / ImagePullBackOff | Image fetch failed (usually surfaces as a `Failed` or `BackOff` event whose message names the error) | Registry, pull secret, network |
| BackOff | Container restart back-off (CrashLoopBackOff) or image pull back-off | Pod logs, image reference |
| Unhealthy | Probe failed | Probe config + app state |
| NodeNotReady | Node went NotReady | Kubelet, container runtime |
| NodeHasDiskPressure | Node disk filling | Image GC, log volume |
| Killing | Container being terminated | Eviction, rollout, failed liveness probe |
| FailedKillPod | Couldn't terminate; stuck | Finalizer, stuck mount |
| Created / Pulled / Started | Normal lifecycle | (noise during normal ops) |
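The table can be turned into a lookup so that counted reasons are labeled with a category automatically. The following is a rough sketch: the function name `event_category` is hypothetical, the category names follow the prompt's list, and `BackOff` is filed under pod lifecycle even though it can also mean an image pull back-off.

```shell
# Maps an event reason to a coarse category taken from the table and prompt.
# BackOff is ambiguous (restart vs. image pull); check the message to be sure.
event_category() {
  case "$1" in
    FailedScheduling|Preempted|NotTriggerScaleUp) echo "Scheduling" ;;
    Failed|ErrImagePull|ImagePullBackOff|InspectFailed) echo "Image/Container" ;;
    FailedMount|FailedAttachVolume|ProvisioningFailed) echo "Volume" ;;
    NodeNotReady|NodeHasInsufficientMemory|NodeHasDiskPressure|Rebooted) echo "Node" ;;
    BackOff|Killing|Unhealthy|FailedKillPod) echo "Pod lifecycle" ;;
    Pulled|Created|Started) echo "Normal lifecycle" ;;
    *) echo "Other" ;;
  esac
}

# Example: label each reason from a cluster (assumes kubectl and jq):
# kubectl get events -A -o json | jq -r '.items[].reason' | sort -u |
#   while read -r reason; do
#     printf '%s\t%s\n' "$reason" "$(event_category "$reason")"
#   done
```

Reasons from custom controllers fall through to `Other`, which is a useful prompt to look at the raw message.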
Analysis patterns
Burst at a single timestamp
kubectl get events -A -o json | \
jq -r '.items[].lastTimestamp' | \
cut -c1-16 | \
sort | uniq -c | sort -nr | head
# Spikes at one minute = cluster event (deploy, node death)
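To show only the minutes that qualify as a burst, the per-minute counts can be filtered against a threshold. This is a sketch under assumptions: the helper name `burst_minutes` and the default threshold of 20 are illustrative, and the input is one ISO 8601 timestamp per line.

```shell
# Reads one timestamp per line (e.g. 2024-05-01T10:01:30Z) and prints
# "<minute>\t<count>" for every minute with at least <threshold> events.
burst_minutes() {
  threshold="${1:-20}"
  awk -v t="$threshold" '
    # Skip "null" and other short lines that carry no usable timestamp.
    length($0) >= 16 { n[substr($0, 1, 16)]++ }
    END { for (m in n) if (n[m] >= t) printf "%s\t%d\n", m, n[m] }
  ' | sort
}

# Example (assumes kubectl and jq):
# kubectl get events -A -o json | jq -r '.items[].lastTimestamp' | burst_minutes 50
```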
Recurring events on one object (controller loop)
kubectl get events -n <ns> --field-selector involvedObject.name=<pod> -o json | \
jq -r '.items | sort_by(.firstTimestamp) | .[] | "\(.firstTimestamp) \(.count)x \(.reason)"'
# `count` field high = same event over and over; controller retry loop
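The same idea extends across objects: keep only events whose `count` exceeds a threshold, which tends to surface retry loops. A minimal sketch, assuming tab-separated input of `count`, `namespace/name`, and `reason`; the name `retry_loops` and the default of 10 are hypothetical.

```shell
# Reads TSV lines: <count>\t<namespace/name>\t<reason>
# Prints lines whose count is at least <min_count>, highest count first.
retry_loops() {
  min_count="${1:-10}"
  awk -F'\t' -v min="$min_count" '($1 + 0) >= (min + 0) { print }' |
    sort -t "$(printf '\t')" -k1,1nr
}

# Example (assumes kubectl and jq; a missing count is treated as 1):
# kubectl get events -A -o json |
#   jq -r '.items[] | [(.count // 1),
#     "\(.involvedObject.namespace)/\(.involvedObject.name)", .reason] | @tsv' |
#   retry_loops 20
```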
Cluster-wide problem detection
# Count Warning events by reason in the last 10 minutes
# (`date -d` is GNU syntax; on macOS/BSD use: date -u -v-10M +%Y-%m-%dT%H:%M:%SZ)
kubectl get events -A --field-selector type=Warning -o json | \
jq -r --arg cutoff "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
'.items[] | select(.lastTimestamp > $cutoff) | .reason' | \
sort | uniq -c | sort -nr
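Because the relative-date flag differs between GNU and BSD `date`, the cutoff can be wrapped in a helper that tries both. This is a sketch, not a complete portability layer: `cutoff_utc` is a hypothetical name, and other `date` implementations (BusyBox, for instance) may support neither form.

```shell
# Prints the UTC timestamp from <minutes> ago in the format Kubernetes uses
# for lastTimestamp, trying GNU date first and BSD/macOS date second.
cutoff_utc() {
  minutes="${1:-10}"
  if date -u -d "${minutes} minutes ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null; then
    return 0
  fi
  date -u -v-"${minutes}"M +%Y-%m-%dT%H:%M:%SZ
}

# Example: pass the cutoff to jq instead of calling date inline
# kubectl get events -A --field-selector type=Warning -o json |
#   jq -r --arg cutoff "$(cutoff_utc 10)" \
#     '.items[] | select(.lastTimestamp > $cutoff) | .reason' |
#   sort | uniq -c | sort -nr
```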
Common findings this catches
- 50 pods with `FailedScheduling: 0/N nodes are available: N Insufficient cpu` → cluster out of CPU; add nodes or evict noisy workloads.
- All pods in a namespace with `FailedMount` → CSI driver / PVC issue affecting that namespace.
- `NodeHasDiskPressure` on multiple nodes → image cleanup not running; check kubelet image GC.
- Cluster-wide `FailedKillPod` → kubelet or container runtime issue.
- Cluster autoscaler `ScaleUpFailed` → cloud quota / IAM issue.
- `BackOff` events repeating every 5 min → CrashLoopBackOff retry interval (kubelet back-off, capped at 5 minutes).
- Periodic `Killing` of Job pods → CronJob `concurrencyPolicy: Replace` killing the previous run.
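Findings like these become visible once near-identical messages are collapsed into one line. The sketch below normalizes digit runs to `N` and counts what remains. The function name `cluster_messages` is hypothetical, and the normalization is deliberately crude (it does not strip pod-name hashes, for example).

```shell
# Reads one "<reason>: <message>" line per event, replaces every run of
# digits with N, and prints "<count>\t<normalized line>", most frequent first.
cluster_messages() {
  sed -E 's/[0-9]+/N/g' |
    awk '{ seen[$0]++ } END { for (m in seen) printf "%d\t%s\n", seen[m], m }' |
    sort -rn
}

# Example (assumes kubectl and jq):
# kubectl get events -A --field-selector type=Warning -o json |
#   jq -r '.items[] | "\(.reason): \(.message)"' | cluster_messages | head
```

A single high-count line spanning many pods is usually a better starting point than any individual event.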
Event retention beyond 1 hour
# kube-event-exporter to Elastic / Slack / log file
# https://github.com/resmoio/kubernetes-event-exporter
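apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
spec:
  replicas: 1
  selector:
    matchLabels:
      app: event-exporter   # selector is required by apps/v1; label value is illustrative
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      # the pod also needs a ServiceAccount with RBAC permission to list/watch events
      containers:
        - name: event-exporter
          # pin a released tag instead of :latest for reproducible rollouts
          image: ghcr.io/resmoio/kubernetes-event-exporter:latest
          # config in ConfigMap routes events to receivers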
When to escalate
- Cluster-wide event burst correlating with a control-plane issue — engage cluster admin.
- Same event reason flooded from a specific controller — coordinate with controller’s team.
- Loss of historical events for an incident — install an event shipper before the next incident.
Related prompts
- Kubernetes Deployment Rollout Debug Prompt: Diagnose stuck Deployment rollouts, including `ProgressDeadlineExceeded`, replica set churn, maxSurge/maxUnavailable misconfig, image pull pacing, and stuck-mid-rollout recovery.
- Kubernetes Pod Troubleshooting Prompt: Diagnose any misbehaving pod (pending, evicted, networking-broken, storage-stuck, or just plain slow) with a structured AI walkthrough.
- Kubernetes `FailedScheduling` Debug Prompt: Diagnose `FailedScheduling` events, including taints/tolerations mismatch, node affinity, topology spread skew, resource fit failures, and PV zone constraints.