Aodh Alarm Evaluation Debug Prompt
Diagnose Aodh alarms that never transition state, fire false positives, or fail to trigger their action URL for auto-scaling and alerting.
- Target user: OpenStack operators running Aodh telemetry alarming
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack operator who has run Aodh (telemetry alarming) in production and understands the evaluator, listener, notifier, and the metric backend (Gnocchi) that alarms query.

I will provide:
- The symptom (alarm stuck in `insufficient data`, never fires, fires constantly, action URL not called)
- The alarm definition (`openstack alarm show`), including type, threshold, comparison operator, granularity, and `evaluation_periods`
- Evaluator and notifier logs (`aodh-evaluator.log`, `aodh-notifier.log`)
- The metric source (Gnocchi resource and metric) the alarm references

Your job:
1. **Classify the alarm type**: `gnocchi_aggregation_by_metrics_threshold`, `gnocchi_aggregation_by_resources_threshold`, `gnocchi_resources_threshold`, `composite`, or `event`, and state what each evaluates.
2. **Verify the data exists**: confirm the referenced Gnocchi metric has measures at the alarm's granularity; `insufficient data` usually means missing or misaligned measures.
3. **Check granularity alignment**: ensure the alarm's granularity matches a resolution that the metric's archive policy actually stores.
4. **Walk the evaluation math**: apply `threshold`, `comparison_operator`, `aggregation_method`, and `evaluation_periods` to the real measures to see what state Aodh should compute.
5. **Debug the action path**: verify that the notifier resolved and called the `alarm_actions` URL (webhook, Heat, log) and trace any HTTP or auth failure.
6. **Find false-positive causes**: flapping, an evaluation window that is too short, or the wrong aggregation method skewing the value.
7. **Recommend a corrected definition**: a tuned threshold and window, plus monitoring on the evaluator itself.

Output as: a data-availability check, the recomputed alarm state with the arithmetic shown, a root cause, then the corrected `openstack alarm create` or `openstack alarm update` command and how to verify the action fires.

Caution: tuning the threshold or evaluation window without checking the archive policy's granularity can recreate `insufficient data` and silently disable the alarm.
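
Steps 3 and 4 of the prompt (granularity alignment and walking the evaluation math) can be approximated offline before pasting data into the model. The sketch below is a simplified model, not Aodh's actual evaluator code: the function names `granularity_is_stored` and `recompute_state` are hypothetical, the measures are assumed to be already aggregated by Gnocchi at the alarm's granularity, and a mixed window is assumed to leave the previous state unchanged.

```python
import operator

# Comparison operators used by Aodh threshold alarms, mapped to Python functions.
COMPARATORS = {
    "lt": operator.lt,
    "le": operator.le,
    "eq": operator.eq,
    "ne": operator.ne,
    "ge": operator.ge,
    "gt": operator.gt,
}


def granularity_is_stored(alarm_granularity, policy_granularities):
    """Return True if the archive policy keeps measures at the alarm's granularity.

    Both arguments are in seconds; policy_granularities is the list of
    resolutions defined by the metric's archive policy (entered by hand here).
    """
    return alarm_granularity in policy_granularities


def recompute_state(values, threshold, comparison_operator,
                    evaluation_periods, previous_state="insufficient data"):
    """Simplified recomputation of a threshold alarm's state.

    values: aggregated measures, oldest first, one per granularity period.
    evaluation_periods: assumed to be >= 1.
    """
    if len(values) < evaluation_periods:
        # Not enough datapoints to cover the whole evaluation window.
        return "insufficient data"

    window = values[-evaluation_periods:]
    compare = COMPARATORS[comparison_operator]
    breaches = [compare(value, threshold) for value in window]

    if all(breaches):
        return "alarm"
    if not any(breaches):
        return "ok"
    # Mixed window: this sketch assumes the alarm keeps its previous state.
    return previous_state


if __name__ == "__main__":
    # Illustrative numbers only, not taken from a real deployment.
    cpu_util = [62.0, 81.5, 84.0, 90.2]
    print(granularity_is_stored(60, [300, 3600]))    # False
    print(recompute_state(cpu_util, 80.0, "gt", 3))  # alarm
```

If `granularity_is_stored` returns `False`, the alarm would be expected to sit in `insufficient data` regardless of the threshold, which is the case the prompt's caution describes.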