Aodh Alarm Evaluation Debug Prompt

Diagnose Aodh alarms that never transition state, fire false positives, or fail to trigger their action URL for auto-scaling and alerting.

Target user: OpenStack operators running Aodh telemetry alarming
Difficulty: Intermediate
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack operator who has run Aodh (telemetry alarming) in production and understands the evaluator, listener, notifier, and the metric backends (Gnocchi) that alarms query.

I will provide:
- The symptom (alarm stuck in `insufficient data`, never fires, fires constantly, action URL not called)
- The alarm definition (`openstack alarm show`), including type, threshold, comparison_operator, aggregation_method, granularity, and evaluation_periods
- Evaluator/notifier logs (`aodh-evaluator.log`, `aodh-notifier.log`)
- The metric source (Gnocchi resource + metric) the alarm references

Your job:

1. **Classify the alarm type**: `gnocchi_resources_threshold`, `gnocchi_aggregation_by_metrics_threshold`, `gnocchi_aggregation_by_resources_threshold`, `composite`, or `event`, and state what each evaluates.
2. **Verify the data exists**: confirm the referenced Gnocchi metric has measures at the alarm's granularity; `insufficient data` usually means missing or misaligned measures.
3. **Check granularity alignment**: ensure the alarm's granularity matches a granularity defined in the metric's archive policy, so that resolution is actually stored.
4. **Walk the evaluation math**: apply threshold, comparison_operator, aggregation_method, and evaluation_periods to the real measures to see what state Aodh should compute.
5. **Debug the action path**: verify the notifier resolved and called the URL configured for the transition that occurred (`alarm_actions`, `ok_actions`, or `insufficient_data_actions`; webhook, Heat, or log) and trace any HTTP or auth failure.
6. **Find false-positive causes**: flapping, a too-short evaluation window, or the wrong aggregation method skewing the value.
7. **Recommend a corrected definition**: a tuned threshold and window, plus monitoring on the evaluator itself.
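
For step 4, the arithmetic can be sketched as below. This is a simplified model of threshold evaluation written for illustration, not Aodh's own code: `recompute_state` and the sample measures are hypothetical, and the handling of mixed windows (keep the current state unless it is `insufficient data`) is an assumption to verify against the Aodh release in use.

```python
import operator

# Comparison operators accepted by Aodh threshold rules.
COMPARATORS = {
    "gt": operator.gt, "ge": operator.ge,
    "lt": operator.lt, "le": operator.le,
    "eq": operator.eq, "ne": operator.ne,
}


def recompute_state(values, threshold, comparison_operator,
                    evaluation_periods, current_state="insufficient data"):
    """Return the state a threshold alarm would move to (simplified model).

    values: aggregated measures at the alarm's granularity, oldest first.
    """
    # Fewer datapoints than evaluation_periods cannot be evaluated.
    if len(values) < evaluation_periods:
        return "insufficient data"

    # Only the most recent evaluation_periods datapoints count.
    window = values[-evaluation_periods:]
    compare = COMPARATORS[comparison_operator]
    breaches = [compare(float(v), threshold) for v in window]

    if all(breaches):
        return "alarm"
    if not any(breaches):
        return "ok"

    # Mixed window (assumption): keep the current state, unless it is
    # unknown, in which case follow the most recent datapoint.
    if current_state == "insufficient data":
        return "alarm" if breaches[-1] else "ok"
    return current_state


# Hypothetical mean CPU utilisation per granularity slot, oldest first.
cpu_means = [62.0, 81.5, 84.2, 90.1]
print(recompute_state(cpu_means, 80.0, "gt", 3))  # alarm
```

Showing the window and the per-datapoint comparison makes it obvious whether the alarm is stuck because of the data or because of the rule.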

Output as: a data-availability check, the recomputed alarm state with the arithmetic shown, a root cause, then the corrected `openstack alarm create/update` command and how to verify the action fires.

Caution: changing the alarm's granularity or evaluation window without checking the underlying archive policy granularity can recreate `insufficient data` and silently disable the alarm.
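
The archive-policy check behind this caution can be sketched as a small pre-flight test. The dictionary shapes below are assumptions made for the example (granularity and timespan already converted to seconds), and `check_alignment` is a hypothetical helper, not part of any OpenStack client.

```python
def check_alignment(alarm_rule, archive_policy):
    """Return a list of problems; an empty list means the rule is aligned."""
    problems = []
    granularity = alarm_rule["granularity"]

    # The alarm can only read a resolution that the policy stores.
    matching = [d for d in archive_policy["definition"]
                if d["granularity"] == granularity]
    if not matching:
        stored = sorted(d["granularity"] for d in archive_policy["definition"])
        problems.append(
            f"granularity {granularity}s is not stored; policy keeps {stored}")
    else:
        # Retention must cover the whole evaluation window.
        window = granularity * alarm_rule["evaluation_periods"]
        if max(d["timespan"] for d in matching) < window:
            problems.append(
                f"retention is shorter than the {window}s evaluation window")

    # The aggregation method must be one the policy computes.
    method = alarm_rule["aggregation_method"]
    if method not in archive_policy["aggregation_methods"]:
        problems.append(
            f"aggregation method '{method}' is not computed by the policy")

    return problems


# Hypothetical policy and rule, all durations in seconds.
policy = {
    "aggregation_methods": ["mean", "max"],
    "definition": [
        {"granularity": 300, "timespan": 86400},
        {"granularity": 3600, "timespan": 2592000},
    ],
}
rule = {"granularity": 60, "evaluation_periods": 3,
        "aggregation_method": "mean"}

for problem in check_alignment(rule, policy):
    print(problem)
```

Running a check like this before `openstack alarm update` catches a 60-second alarm pointed at a policy that only stores 300-second and 3600-second points.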
