OpenStack Telemetry and Alarming with Ceilometer and Aodh

OpenStack telemetry has a reputation for being fiddly, and honestly it earns it. The pipeline has three moving parts that people constantly conflate: Ceilometer gathers the measurements, Gnocchi stores them as time series, and Aodh evaluates alarms against that data. When an alarm “doesn’t work,” the failure could be in any of the three, and they fail in completely different ways. I run this stack for autoscaling and capacity alerts, and this guide is the mental model and the commands that keep me from chasing the wrong component.

Understand the three-stage pipeline

Before any debugging, get clear on who does what. Ceilometer polls and listens for events. It does not store anything long-term — it publishes to Gnocchi. Gnocchi is the time-series database. Aodh reads Gnocchi and triggers actions.

openstack metric status
gnocchi resource list --type instance | head
openstack alarm list

If openstack metric status shows a small, stable backlog of measures to process, Gnocchi is keeping up, and the next question is whether fresh measures are arriving at all. If they are, your collection path is fine and the problem is in Aodh. If Gnocchi has a growing backlog of unprocessed measures, collection is outrunning storage and no alarm built on fresh data will be reliable. Knowing which side of Gnocchi you are on saves the most time.

Confirm data is actually being measured

An alarm on a metric that has no datapoints will sit in insufficient data forever and never fire. So the first concrete check is whether the metric exists and is moving.

gnocchi metric list --resource-id <instance-uuid>
gnocchi measures show --resource-id <instance-uuid> cpu --aggregation rate:mean

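The single most common telemetry ticket is “my CPU alarm never fires.” Nine times out of ten the cause is an aggregation mismatch. The cpu metric is a raw, monotonically increasing counter of CPU time in nanoseconds, while a usable alarm evaluates rate:mean, the change in that counter per period. If the alarm’s aggregation method and threshold do not both describe a rate, or the metric’s archive policy does not compute rate:mean at all, the alarm is comparing the wrong thing. Always look at the actual measures before blaming Aodh.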

Pro Tip: Gnocchi only stores the granularities and aggregation methods listed in a metric’s archive policy. If the policy does not include the granularity or the aggregation method your alarm queries, you get empty results and insufficient data. Match the alarm’s granularity and aggregation method to what the policy actually stores.
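
The check in that tip is mechanical enough to sketch in a few lines. This is a minimal illustration, not a Gnocchi or Aodh API: ArchivePolicy and alarm_mismatches are hypothetical names, and the policy values are ones you would copy by hand from gnocchi archive-policy show.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchivePolicy:
    """Hypothetical hand-filled summary of `gnocchi archive-policy show`."""
    granularities: frozenset        # seconds, e.g. {300, 3600}
    aggregation_methods: frozenset  # e.g. {"mean", "rate:mean"}


def alarm_mismatches(policy, granularity, aggregation_method):
    """Return the reasons an alarm query against this policy would come back empty."""
    problems = []
    if granularity not in policy.granularities:
        problems.append(
            f"granularity {granularity}s is not stored; "
            f"policy has {sorted(policy.granularities)}"
        )
    if aggregation_method not in policy.aggregation_methods:
        problems.append(
            f"aggregation method {aggregation_method!r} is not computed; "
            f"policy has {sorted(policy.aggregation_methods)}"
        )
    return problems


# Made-up policy values, for illustration only.
policy = ArchivePolicy(frozenset({300, 3600}), frozenset({"mean", "rate:mean"}))
print(alarm_mismatches(policy, 300, "rate:mean"))  # compatible: empty list
print(alarm_mismatches(policy, 60, "rate:mean"))   # granularity not stored
```

An empty list means the alarm can at least find data at that granularity and aggregation; it says nothing about whether measures are actually arriving.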

Build an alarm that fires

Once the data is flowing, the alarm itself is straightforward — but the threshold, granularity, and evaluation periods have to agree.

openstack alarm create \
  --name high-cpu \
  --type gnocchi_resources_threshold \
  --metric cpu \
  --resource-type instance \
  --resource-id <instance-uuid> \
  --aggregation-method rate:mean \
  --threshold 60000000000 \
  --comparison-operator gt \
  --granularity 300 \
  --evaluation-periods 3
openstack alarm show high-cpu -f value -c state -c state_reason

The state_reason field is gold. It tells you exactly what value Aodh evaluated and why the state is what it is — no guessing whether the threshold was crossed.
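
The threshold in the alarm above is easier to review as a percentage. Here is a rough conversion, assuming one sample per granularity period so that rate:mean is the CPU time in nanoseconds consumed during that period. If your polling interval differs from the granularity, the numbers will not line up this neatly. The function names are my own.

```python
NANOSECONDS_PER_SECOND = 1_000_000_000


def cpu_rate_threshold(percent, granularity_s, vcpus=1):
    """CPU nanoseconds per granularity period at the given utilisation.

    Integer arithmetic avoids float rounding in very large thresholds.
    Assumes one sample per granularity period.
    """
    return percent * granularity_s * NANOSECONDS_PER_SECOND * vcpus // 100


def threshold_as_percent(threshold_ns, granularity_s, vcpus=1):
    """Inverse: the utilisation an existing threshold corresponds to."""
    return threshold_ns * 100 / (granularity_s * NANOSECONDS_PER_SECOND * vcpus)


print(cpu_rate_threshold(80, 300))                # 240000000000
print(threshold_as_percent(60_000_000_000, 300))  # 20.0
```

Under that assumption, a threshold of 60000000000 at a granularity of 300 corresponds to about 20% of one vCPU, sustained across three periods (15 minutes) before the alarm fires. Worth checking against what you mean by “high.”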

Debug an alarm stuck in “insufficient data”

This is the canonical telemetry failure. Work outward from the evaluator: confirm Aodh is evaluating, confirm it can query Gnocchi, confirm the query returns points.

journalctl -u aodh-evaluator -n 100 --no-pager
openstack alarm show high-cpu -f value -c rule
# Run the same query Aodh runs, by hand:
gnocchi aggregates --resource-type instance --granularity 300 '(metric cpu rate:mean)' id=<instance-uuid>

If the hand-run aggregate returns numbers but the alarm says insufficient data, your granularity or evaluation window does not line up with the archive policy. If the aggregate is empty too, the problem is upstream in collection, not in Aodh.
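
When the hand-run aggregate does return numbers, the next question is whether enough of them fall inside the evaluation window at the alarm’s granularity. The sketch below is a simplified model of my own, not Aodh’s evaluator code. It assumes measures shaped as (timestamp, granularity, value) tuples, with each timestamp marking the start of its period.

```python
from datetime import datetime, timedelta, timezone


def points_in_window(measures, granularity_s, periods, now):
    """Measures at the alarm's granularity whose period starts inside the window."""
    start = now - timedelta(seconds=granularity_s * periods)
    return [
        (ts, gran, value)
        for ts, gran, value in measures
        if gran == granularity_s and start <= ts < now
    ]


def has_enough_data(measures, granularity_s, periods, now):
    """True when there is at least one point per evaluation period."""
    return len(points_in_window(measures, granularity_s, periods, now)) >= periods


# Made-up sample: three 5-minute points leading up to noon UTC.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
measures = [(now - timedelta(minutes=m), 300, 1.0e9) for m in (15, 10, 5)]

print(has_enough_data(measures, 300, 3, now))  # points exist at 300s
print(has_enough_data(measures, 60, 3, now))   # nothing stored at 60s
```

The second call is the mismatch case: the data is there, just not at the granularity the alarm asks for.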

Where AI speeds up the loop

Telemetry debugging is a lot of cross-referencing: alarm rule versus archive policy versus actual measures. That correlation work is tedious and exactly what an AI assistant is good at as a fast junior engineer. I paste the openstack alarm show rule, the gnocchi archive-policy show output, and a sample of measures, and ask the model to tell me whether the granularities and aggregation methods are compatible. It catches mismatches faster than I do.

The boundaries matter, though. I give it sanitized resource IDs and config, never my Keystone admin token and never the real clouds.yaml. The model explains the mismatch and proposes a corrected alarm; I create the alarm myself after reviewing it, because a wrong threshold on an autoscaling alarm can scale a cluster into the ground. The prompt library has config-diffing prompts, and the monitoring alerts dashboard is where I keep the alarm-review workflow.

gnocchi archive-policy show low  # the granularities the model needs to see

A tool like ChatGPT is reliable for “do these granularities match,” but I never let it pull the trigger on an alarm that fans out to an autoscaling group.
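
Sanitizing resource IDs before pasting can be partly automated. This is a minimal sketch that only handles UUIDs; the sanitize function and the placeholder format are my own invention. It does nothing for tokens, passwords, hostnames, or IP addresses, so those still need a manual pass.

```python
import re

# Matches the canonical 8-4-4-4-12 hex UUID form.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def sanitize(text):
    """Replace each distinct UUID with a stable placeholder such as <id-1>."""
    seen = {}

    def placeholder(match):
        key = match.group(0).lower()
        if key not in seen:
            seen[key] = f"<id-{len(seen) + 1}>"
        return seen[key]

    return UUID_RE.sub(placeholder, text)


# Made-up IDs for illustration.
sample = (
    "resource_id: 11111111-2222-3333-4444-555555555555 "
    "metric on 11111111-2222-3333-4444-555555555555 "
    "project aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee"
)
print(sanitize(sample))
```

Because the same UUID always maps to the same placeholder, the model can still see that two lines refer to the same resource without seeing the real ID.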

Tame Gnocchi before it eats your disks

The quiet killer of OpenStack telemetry is Gnocchi storage growth. Every metric, at every granularity in its archive policy, is stored for the policy’s retention. Multiply that by thousands of instances and a chatty default policy, and Gnocchi can fill a disk faster than anyone expects. Capacity planning here is not optional.

gnocchi archive-policy list
gnocchi archive-policy show low -f value -c definition
gnocchi status   # measures backlog and processing health

The archive policy definition is where the cost lives. A policy keeping 5-second granularity for a month stores roughly 250 times as many points per metric as one keeping 5-minute granularity for a week (518,400 versus 2,016, for each aggregation method). Match retention to what your alarms and dashboards actually query; nobody alarms on month-old 5-second data, so do not pay to store it. I audit archive policies whenever I add a new metric source, because the default policies are tuned for demos, not for a cloud with real instance counts.
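
The comparison between those two policies is plain arithmetic, and running it for your own definitions is a quick sanity check. This sketch counts retained points only; it makes no claim about bytes on disk, which depend on the storage driver and compression. The function names and the fleet numbers are illustrative assumptions.

```python
DAY = 86_400  # seconds


def points_retained(granularity_s, retention_s):
    """Points kept per metric, per aggregation method, for one definition entry."""
    return retention_s // granularity_s


def total_points(definition, metrics, aggregation_methods):
    """Points across a fleet; definition is a list of (granularity_s, retention_s)."""
    per_metric = sum(points_retained(g, r) for g, r in definition)
    return per_metric * metrics * aggregation_methods


fine = points_retained(5, 30 * DAY)
coarse = points_retained(300, 7 * DAY)
print(fine, coarse, round(fine / coarse))  # 518400 2016 257

# Hypothetical fleet: 10,000 metrics, 2 aggregation methods, coarse policy only.
print(total_points([(300, 7 * DAY)], metrics=10_000, aggregation_methods=2))
```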

When measures stop arriving

The most alarming telemetry failure is measures simply stopping — every alarm drifts to insufficient data at once. That points squarely at collection, not alarming. The notification agent and the polling agents are the usual suspects, and the AMQP bus between them is the common failure point.

# Unit names vary by distribution and deployment tool; adjust these to match yours.
journalctl -u ceilometer-polling -n 80 --no-pager | grep -iE 'error|timeout'
journalctl -u ceilometer-notification -n 80 --no-pager | grep -iE 'error|drop'
gnocchi status -f value -c 'storage/total number of measures to process'

A climbing measures-to-process count with healthy polling means Gnocchi cannot keep up and is queuing; a flat-zero count with errors in the polling agent means nothing is being collected at all. The two look identical from the alarm side, since both produce insufficient data, but they are opposite problems with opposite fixes. Always check which side of Gnocchi the failure is on before you touch anything.
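
That decision can be written down as a tiny triage rule. This is a sketch under stated assumptions: backlog_samples is a list of successive readings of the measures-to-process count that you collect yourself, polling_errors is a count of matching log lines, and the labels are made up for illustration.

```python
def triage(backlog_samples, polling_errors):
    """Classify a fleet-wide 'insufficient data' event from two observations."""
    growing = len(backlog_samples) >= 2 and backlog_samples[-1] > backlog_samples[0]
    all_zero = bool(backlog_samples) and all(s == 0 for s in backlog_samples)

    if growing and polling_errors == 0:
        return "gnocchi-behind"    # measures arrive but are not processed fast enough
    if all_zero and polling_errors > 0:
        return "collection-down"   # nothing is reaching Gnocchi at all
    return "inconclusive"          # mixed signals: keep digging before changing anything


print(triage([100, 400, 900], polling_errors=0))
print(triage([0, 0, 0], polling_errors=5))
print(triage([0, 0, 0], polling_errors=0))
```

The “inconclusive” branch matters as much as the other two: a zero backlog with clean polling logs usually means the break is somewhere else, such as the notification agent or the message bus.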

Wire alarms to real actions

An alarm that only changes state is useless; you want it to call a webhook or trigger Heat autoscaling. On a state transition, Aodh sends an HTTP POST to each action URL registered for the new state.

openstack alarm update high-cpu \
  --alarm-action 'http://heat-endpoint/scale-out' \
  --ok-action 'http://heat-endpoint/ok'
openstack alarm-history show <alarm-id>   # expects the alarm's ID; get it from openstack alarm list

The alarm-history is the audit trail. When someone asks “did the autoscale actually trigger,” this is the answer, with timestamps and the exact transition that fired the action.
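
If you point an alarm action at your own webhook rather than at Heat, the receiving end has to make sense of the request body. The sketch below assumes the body is JSON with keys such as alarm_name, previous, current, and reason. That matches my understanding of Aodh’s webhook notifier, but treat the key names as an assumption and check a real request from your Aodh version. The describe_transition helper and the sample body are made up.

```python
import json


def describe_transition(body):
    """Summarise an alarm notification body, tolerating missing keys."""
    payload = json.loads(body)
    name = payload.get("alarm_name", "<unknown>")
    previous = payload.get("previous", "?")
    current = payload.get("current", "?")
    reason = payload.get("reason", "")
    return f"{name}: {previous} -> {current} ({reason})"


# Made-up body, shaped like the assumed notification payload.
sample_body = json.dumps({
    "alarm_name": "high-cpu",
    "alarm_id": "<alarm-id>",
    "previous": "ok",
    "current": "alarm",
    "reason": "made-up reason text",
})
print(describe_transition(sample_body))
```

Logging a line like this on the receiving side gives you a second record to compare against alarm-history when the two disagree.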

Conclusion

OpenStack telemetry only feels cursed until you internalize the three stages: Ceilometer measures, Gnocchi stores, Aodh evaluates. Debug them in that order, always confirm the metric has the granularity your alarm queries, and read state_reason and alarm-history instead of guessing. An AI assistant is a strong fast-junior partner for the granularity and config correlation that makes this tedious: give it sanitized config, keep prod creds out, verify its proposed alarm before you create it, and never let it pull the autoscaling trigger on your behalf.