
Kubernetes Events Analysis Prompt

Filter, aggregate, and decode Kubernetes events — FailedScheduling, BackOff, ProvisioningFailed — to diagnose cluster-wide issues from noisy event streams.

Target user
Kubernetes engineers debugging cluster health
Difficulty
Intermediate
Tools
Claude, ChatGPT

The prompt

You are a senior Kubernetes engineer who can read `kubectl get events` to spot cluster-wide trouble fast. You know which events are noise (NodeReady), which are signal (FailedScheduling, FailedMount), and how to deduplicate.

I will provide:
- The investigation context (cluster-wide health check, specific namespace, specific workload)
- Recent event dump: `kubectl get events -A --sort-by='.lastTimestamp'` (recent few hundred)
- Optional: timeframe of interest

Your job:

1. **Filter to Warning events first**: `kubectl get events -A --field-selector type=Warning`
2. **Identify event categories**:
   - **Scheduling**: `FailedScheduling`, `Preempted`, `NotTriggerScaleUp`
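   - **Image / Container**: `Failed`, `BackOff`, `InspectFailed` (the pod states `ErrImagePull` and `ImagePullBackOff` usually show up in the message of `Failed` / `BackOff` events rather than as the reason)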
   - **Volume**: `FailedMount`, `FailedAttachVolume`, `VolumeFailedAlreadyOnNode`, `ProvisioningFailed`
   - **Node**: `NodeNotReady`, `NodeHasInsufficientMemory`, `NodeHasDiskPressure`, `Rebooted`
   - **Pod lifecycle**: `Killing`, `Unhealthy`, `BackOff`, `Pulled`, `Created`, `Started`
   - **Admission**: webhook errors, validation failures
   - **Autoscaler**: scale-up/down decisions, `ScaleUpFailed`
   - **Helm / GitOps controller** events (ArgoCD, Flux): `SyncFailed`, `OutOfSync`
3. **Deduplicate** — same event from many objects often indicates a single root cause:
   - 50 pods with `FailedScheduling: 0/3 nodes are available: 3 Insufficient cpu` → cluster CPU exhausted
   - 20 pods with `FailedMount` on same PV → CSI driver issue
   - All `aws-load-balancer-controller` events failing → controller down
4. **For each notable cluster of events**:
   - **What** is the event reason
   - **Who** is affected (count, namespaces, workloads)
   - **When** did it start (timing pattern)
   - **Why** (likely root cause)
   - **Next step** (where to look deeper)
5. **Cross-reference timing**:
   - Many events at the same minute → cluster-wide trigger (deploy, node death, autoscaler decision)
   - Periodic events (every 5 min) → cron-like; CronJob or controller reconcile
   - Recurring events on the same object → retry loop (e.g. a failing Helm rollout retrying)
6. **For event noise control**:
   - Default event TTL: 1 hour; older events are dropped
   - Set `--event-ttl` on kube-apiserver to adjust retention
   - Event-shipping tools (Eventrouter, kubernetes-event-exporter) for longer retention
7. **For "no events but problem exists"**:
   - Events may have aged out (>1h)
   - Object's controller might not emit events (some custom controllers are silent)
   - Use logs from controller instead

Mark DESTRUCTIVE: clearing all events (`kubectl delete events -A --all`), interpreting normal events (Created, Pulled) as warnings, attempting cluster-wide fixes from a noisy event stream without root-cause analysis.

---

Investigation context: [DESCRIBE]
Recent events (last few hundred):
```
[PASTE `kubectl get events -A --sort-by='.lastTimestamp' | tail -200`]
```
Or filtered: `kubectl get events -A --field-selector type=Warning`:
```
[PASTE]
```
Timeframe of interest: [DESCRIBE]

Why this prompt works

kubectl get events is the cluster’s stream of consciousness: what the scheduler decided, what the kubelet rejected, which controller failed. Many engineers skip events because they’re noisy. This prompt forces a filtered, categorized analysis.

How to use it

  1. Filter to Warning first. Normal events are noise for problem-solving.
  2. Group by reason and object. Patterns emerge.
  3. Look at the first event in a chain, not the latest.
  4. For retention beyond 1 hour, you need an event shipper.
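Steps 1 and 2 can also be applied to a pasted text dump when JSON output is not at hand. The following is a minimal sketch, assuming the default `kubectl get events -A` column layout (NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE). `warning_summary` is a hypothetical helper name, and the column positions would shift with `-o wide` or custom columns.

```shell
# warning_summary: read `kubectl get events -A` text on stdin and print
# "count namespace reason" for Warning events, most frequent first.
# Assumes columns: NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE.
# The header row is skipped implicitly: its third field is "SEEN".
warning_summary() {
  awk '$3 == "Warning" { print $1, $4 }' | sort | uniq -c | sort -nr
}
```

A possible invocation is `kubectl get events -A | warning_summary | head`. A single reason dominating one namespace is usually the first thing worth chasing.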

Useful commands

```
# Sorted by time
kubectl get events -A --sort-by='.lastTimestamp' | tail -100
kubectl get events -A --sort-by='.firstTimestamp'

# Warnings only
kubectl get events -A --field-selector type=Warning --sort-by='.lastTimestamp'

# By namespace
kubectl get events -n <ns>

# By specific object (events are namespaced, so select the namespace with -n)
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>

# By reason
kubectl get events -A --field-selector reason=FailedScheduling
kubectl get events -A --field-selector reason=FailedMount

# JSON for tooling
kubectl get events -A -o json | jq -r '.items[] | "\(.lastTimestamp) \(.type) \(.reason) \(.involvedObject.namespace)/\(.involvedObject.name): \(.message)"' | tail

# Count events by reason
kubectl get events -A -o json | jq -r '.items[].reason' | sort | uniq -c | sort -nr

# Count events by reason + namespace
kubectl get events -A -o json | \
  jq -r '.items[] | "\(.involvedObject.namespace) \(.reason)"' | \
  sort | uniq -c | sort -nr | head

# Watch live
kubectl get events -A --watch-only       # only new events

# Most recent warning per object (grouped by namespace + name so that
# same-named objects in different namespaces are not merged)
kubectl get events -A --field-selector type=Warning -o json | \
  jq -r '.items | group_by([.involvedObject.namespace, .involvedObject.name]) | .[] | sort_by(.lastTimestamp) | .[-1] | "\(.lastTimestamp) \(.involvedObject.namespace)/\(.involvedObject.name) \(.reason): \(.message)"'
```

Event categories

| Reason | What it means | Where to look |
|---|---|---|
| FailedScheduling | Scheduler couldn’t place pod | Node resources, taints, affinity |
| Preempted | Higher-priority pod evicted this one | PriorityClass usage |
| FailedMount | Volume mount failed | PVC binding, CSI driver |
| ProvisioningFailed | PV couldn’t be created | StorageClass provisioner, cloud quotas |
| ImagePullBackOff / ErrImagePull (usually in the message of Failed / BackOff events) | Image fetch failed | Registry, secret, network |
| BackOff | Container in CrashLoopBackOff (or image pull back-off) | Pod logs |
| Unhealthy | Probe failed | Probe config + app state |
| NodeNotReady | Node went NotReady | Kubelet, container runtime |
| NodeHasDiskPressure | Node disk filling | Image GC, log volume |
| Killing | Container being terminated | Eviction, rollout, OOM |
| FailedKillPod | Couldn’t terminate; stuck | Finalizer, stuck mount |
| Created / Pulled / Started | Normal lifecycle | (noise during normal ops) |
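The table can be turned into a small lookup for annotating a list of reasons. This is only a sketch covering a subset of the rows. `where_to_look` is a hypothetical helper name, and the fallback text for unknown reasons is an assumption, not part of the table.

```shell
# where_to_look: map an event reason to the "Where to look" hint
# from the table above. Unknown reasons get a generic fallback.
where_to_look() {
  case "$1" in
    FailedScheduling)       echo "Node resources, taints, affinity" ;;
    Preempted)              echo "PriorityClass usage" ;;
    FailedMount)            echo "PVC binding, CSI driver" ;;
    ProvisioningFailed)     echo "StorageClass provisioner, cloud quotas" ;;
    BackOff)                echo "Pod logs" ;;
    Unhealthy)              echo "Probe config + app state" ;;
    NodeNotReady)           echo "Kubelet, container runtime" ;;
    NodeHasDiskPressure)    echo "Image GC, log volume" ;;
    FailedKillPod)          echo "Finalizer, stuck mount" ;;
    Created|Pulled|Started) echo "(noise during normal ops)" ;;
    *)                      echo "unclassified: read the event message" ;;
  esac
}
```

For example, `where_to_look FailedMount` prints the hint for that row. Feeding it the distinct reasons from a Warning-only dump gives a quick first-pass triage list.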

Analysis patterns

Burst at a single timestamp

```
kubectl get events -A -o json | \
  jq -r '.items[].lastTimestamp' | \
  cut -c1-16 | \
  sort | uniq -c | sort -nr | head
# Spikes at one minute = cluster event (deploy, node death)
```

Recurring events on one object (controller loop)

```
# Events are namespaced: pass -n <ns>, otherwise only the current namespace is searched
kubectl get events -n <ns> --field-selector involvedObject.name=<pod> -o json | \
  jq -r '.items | sort_by(.firstTimestamp) | .[] | "\(.firstTimestamp) \(.count)x \(.reason)"'
# `count` field high = same event over and over; controller retry loop
```
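When jq is not available, the same high-`count` check can be approximated on column output. A minimal sketch, assuming the input rows are "COUNT REASON OBJECT", for instance from `kubectl get events -n <ns> --no-headers -o custom-columns=COUNT:.count,REASON:.reason,OBJECT:.involvedObject.name`. `repeated_events` and its default threshold of 10 are hypothetical choices.

```shell
# repeated_events: keep rows whose first column (COUNT) is at least the
# given threshold, highest count first. Rows where the count is missing
# (printed as "<none>") evaluate to 0 and are dropped.
repeated_events() {
  awk -v min="${1:-10}" '$1 + 0 >= min + 0 { print }' | sort -k1,1nr
}
```

Piping the custom-columns output into `repeated_events 20` would list only events seen at least 20 times, which is typically where retry loops surface.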

Cluster-wide problem detection

```
# Count Warning events by reason in last 10 minutes
# (GNU date syntax; on macOS/BSD use: date -u -v-10M +%Y-%m-%dT%H:%M:%SZ)
kubectl get events -A --field-selector type=Warning -o json | \
  jq -r --arg cutoff "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  '.items[] | select(.lastTimestamp > $cutoff) | .reason' | \
  sort | uniq -c | sort -nr
```
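The cutoff timestamp in the command above relies on `date -d`, which is GNU-specific. A small wrapper can make the same pipeline usable on macOS as well. This is a sketch only: `cutoff_ts` is a hypothetical helper, and it assumes either GNU date or BSD date is installed (other implementations, such as BusyBox, may accept neither form).

```shell
# cutoff_ts: print the UTC timestamp N minutes ago (default 10) in the
# same format as event timestamps, e.g. 2024-01-01T12:00:00Z.
# Tries GNU date first, then falls back to BSD/macOS date.
cutoff_ts() {
  minutes="${1:-10}"
  date -u -d "${minutes} minutes ago" +%Y-%m-%dT%H:%M:%SZ 2>/dev/null ||
    date -u -v-"${minutes}"M +%Y-%m-%dT%H:%M:%SZ
}
```

It can be dropped into the jq call as `--arg cutoff "$(cutoff_ts 10)"`. The string comparison in `select(.lastTimestamp > $cutoff)` works because both sides use the same fixed-width UTC format.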

Common findings this catches

  • 50 pods FailedScheduling: 0/N nodes are available: N Insufficient cpu → cluster out of CPU; add nodes or evict noisy workloads.
  • All pods in a namespace FailedMount → CSI driver / PVC issue affecting that namespace.
  • NodeHasDiskPressure on multiple nodes → image cleanup not running; check kubelet image GC.
  • Cluster-wide FailedKillPod → kubelet container runtime issue.
  • Cluster autoscaler ScaleUpFailed → cloud quota / IAM issue.
  • BackOff events whose gaps grow and then settle at about 5min → CrashLoopBackOff; the kubelet restart delay doubles from 10s up to a 5-minute cap.
  • Periodic Killing of Job pods → CronJob concurrencyPolicy: Replace killing previous run.
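The roughly 5-minute cadence of BackOff events in the list above follows from the kubelet's documented restart back-off: the delay starts at 10 seconds, doubles after each restart, and is capped at 300 seconds. The arithmetic can be sketched as follows (`backoff_schedule` is a hypothetical helper, and the sketch ignores the reset that happens after a container has run cleanly for a while):

```shell
# backoff_schedule: print the restart delay in seconds before each of the
# first N restarts (default 8): starts at 10, doubles, capped at 300.
backoff_schedule() {
  n="${1:-8}"
  delay=10
  i=1
  while [ "$i" -le "$n" ]; do
    echo "$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
}
```

`backoff_schedule 7` prints 10, 20, 40, 80, 160, 300, 300 on separate lines. A pod whose BackOff events are already 5 minutes apart has therefore been crashing for several cycles, so the first event in the chain is the one to date the incident by.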

Event retention beyond 1 hour

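```
# kubernetes-event-exporter to Elastic / Slack / log file
# https://github.com/resmoio/kubernetes-event-exporter
apiVersion: apps/v1
kind: Deployment
metadata:
  name: event-exporter
spec:
  replicas: 1
  selector:                 # required for a valid Deployment; label name is illustrative
    matchLabels:
      app: event-exporter
  template:
    metadata:
      labels:
        app: event-exporter
    spec:
      # needs a ServiceAccount with cluster-wide read access to events (not shown)
      containers:
      - name: event-exporter
        image: ghcr.io/resmoio/kubernetes-event-exporter:latest  # pin a release tag in production
        # config in ConfigMap routes events to receivers
```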

When to escalate

  • Cluster-wide event burst correlating with a control-plane issue — engage cluster admin.
  • Same event reason flooded from a specific controller — coordinate with controller’s team.
  • Loss of historical events for an incident — install event shipper before next incident.
