Monitoring OpenStack with Prometheus and Grafana

OpenStack ships with almost no useful monitoring out of the box. You get logs, you get the API, and you get the dawning realization that a cloud with a dozen interdependent services and no observability is a cloud that fails silently. After years of running these clouds, I treat Prometheus monitoring as a day-one deliverable, not a day-ninety nice-to-have.

Here’s the monitoring stack I build and the alerts that have actually caught real incidents.

The exporters you actually need

Prometheus monitors OpenStack through a few exporters. Don’t try to instrument everything at once; start with these four groups:

node_exporter — host CPU, RAM, disk, network on every controller and compute. Non-negotiable baseline.
openstack-exporter (the openstack-exporter/openstack-exporter project) — scrapes the OpenStack APIs and exposes per-project quota usage, hypervisor capacity, service up/down, and agent state.
libvirt exporter — per-instance CPU, memory, and disk I/O from the hypervisor’s view.
rabbitmq_exporter and mysqld_exporter — the message bus and database are where OpenStack quietly dies. Watch them.

A minimal scrape config:

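scrape_configs:
  - job_name: 'openstack'
    static_configs:
      - targets: ['controller:9180']   # openstack-exporter
  - job_name: 'node'
    static_configs:
      - targets: ['controller:9100', 'compute01:9100']   # node_exporter
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['controller:9419']   # rabbitmq_exporter
  - job_name: 'mysql'
    static_configs:
      - targets: ['controller:9104']   # mysqld_exporter, needed by the MySQL alert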

The signal that matters: are the services up?

The openstack-exporter exposes service and agent state directly. The single most valuable alert in the whole stack:

- alert: OpenStackServiceDown
  expr: openstack_nova_agent_state == 0
  for: 5m
  labels: {severity: critical}
  annotations:
    summary: "Nova agent {{ $labels.hostname }} is down"

Add the equivalent for openstack_neutron_agent_state and the Cinder service state. A down L3 agent or a down nova-compute is the root cause of a huge fraction of “the cloud is broken” tickets, and this catches it before users do.
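The equivalent rules for Neutron and Cinder might look like the sketch below. It assumes the exporter exposes openstack_neutron_agent_state and openstack_cinder_agent_state with hostname and service labels; the group and alert names are illustrative, not taken from any upstream rule set. Confirm each metric name against your exporter's /metrics output before loading the file.

```yaml
groups:
  - name: openstack-agent-state        # hypothetical group name
    rules:
      - alert: NeutronAgentDown
        # Assumed metric: 1 when the agent reports alive, 0 when it does not.
        expr: openstack_neutron_agent_state == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Neutron {{ $labels.service }} on {{ $labels.hostname }} is down"
      - alert: CinderServiceDown
        # Assumed metric: same 1/0 convention for cinder-volume, cinder-scheduler, etc.
        expr: openstack_cinder_agent_state == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Cinder {{ $labels.service }} on {{ $labels.hostname }} is down"
```

If services you have deliberately disabled keep tripping these alerts, filtering on an admin-state label (assumed here to be adminState) is one way to exclude them.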

Capacity: the alerts that prevent NoValidHost

“No valid host was found” is the most user-visible OpenStack failure, and when the cause is exhausted capacity it’s preventable with capacity alerting. The exporter exposes per-hypervisor usage and capacity gauges to build on:

- alert: HypervisorVcpuExhaustion
  expr: |
    openstack_nova_vcpus_used / openstack_nova_vcpus_available > 0.85
  for: 15m
  annotations:
    summary: "vCPU usage above 85% — schedule capacity"

Do the same for RAM and disk. Alert at 85% so you have time to add capacity before boots start failing, not after. This is the difference between proactive capacity planning and a 2am pager.
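A sketch of the RAM and disk versions follows. The metric names are assumptions based on the openstack-exporter naming scheme, in which the "_available" gauges appear to report the hypervisor total rather than the free amount. The group and alert names are illustrative. Verify both the names and that reading of "_available" against your own /metrics output.

```yaml
groups:
  - name: openstack-capacity           # hypothetical group name
    rules:
      - alert: HypervisorMemoryExhaustion
        # Assumed metrics: used bytes divided by total bytes, per hypervisor.
        expr: |
          openstack_nova_memory_used_bytes / openstack_nova_memory_available_bytes > 0.85
        for: 15m
        annotations:
          summary: "RAM usage above 85% on {{ $labels.hostname }}"
      - alert: HypervisorDiskExhaustion
        # Assumed metrics: local (ephemeral) storage only, not Cinder or Ceph pools.
        expr: |
          openstack_nova_local_storage_used_bytes / openstack_nova_local_storage_available_bytes > 0.85
        for: 15m
        annotations:
          summary: "Local disk usage above 85% on {{ $labels.hostname }}"
```

These ratios compare usage against physical totals. If you overcommit, the scheduler's real ceiling is the total multiplied by the allocation ratio, so the denominator or the threshold should probably be scaled to match.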

The infrastructure layer: RabbitMQ and MySQL

OpenStack’s coupling means the message bus and database are shared failure domains. The alerts that have saved me:

- alert: RabbitMQQueueBacklog
  expr: rabbitmq_queue_messages_ready > 1000
  for: 5m
  annotations:
    summary: "RabbitMQ queue {{ $labels.queue }} backing up — a consumer is stuck"

- alert: MySQLConnectionsHigh
  expr: mysql_global_status_threads_connected / mysql_global_variables_max_connections > 0.8
  for: 10m
  annotations:
    summary: "MySQL connections above 80% of max_connections on {{ $labels.instance }}"

A growing RabbitMQ messages_ready count almost always means a service consumer is wedged — and it’ll surface downstream as failed instance builds or stuck volumes minutes later. Catching it at the queue is catching it at the source.
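Since the signal described above is growth rather than a fixed depth, a trend-based rule can complement the static threshold. This is a minimal sketch: the 15-minute window and the floor of 100 messages are illustrative values, not tuned recommendations, and the group and alert names are hypothetical.

```yaml
groups:
  - name: rabbitmq-queue-growth        # hypothetical group name
    rules:
      - alert: RabbitMQQueueGrowing
        # deriv() fits a linear regression over the window; a positive slope
        # means the backlog is still rising. The second clause ignores queues
        # that are growing but still nearly empty.
        expr: |
          deriv(rabbitmq_queue_messages_ready[15m]) > 0
          and rabbitmq_queue_messages_ready > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ queue {{ $labels.queue }} has been growing for 10 minutes"
```

Both sides of the `and` come from the same metric, so their label sets match without any extra `on()` clause.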

Grafana dashboards that earn their place

I keep three dashboards, no more:

Cloud health — service/agent up-down grid, RabbitMQ queue depth, DB connections. The “is the control plane okay” view.
Capacity — vCPU/RAM/disk used vs. total per host and per aggregate, with quota usage per project. The “can we still schedule” view.
Per-tenant — top projects by resource consumption. The “who’s eating the cloud” view for chargeback and noisy-neighbor hunts.

Resist the urge to build forty panels. Three focused dashboards get looked at; forty get ignored.
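One way to keep those few panels simple is to precompute the headline numbers as recording rules and point Grafana at the recorded series. The sketch below assumes the openstack-exporter and rabbitmq_exporter metric names used elsewhere in this post, plus an assumed openstack_nova_memory_available_bytes gauge for total RAM. The recorded series names are made up for illustration.

```yaml
groups:
  - name: openstack-dashboard-recordings   # hypothetical group name
    rules:
      # Cloud health: number of Nova agents currently reporting down.
      # "or vector(0)" keeps the series at 0 instead of disappearing when none are down.
      - record: cloud:openstack_nova_agents_down:count
        expr: count(openstack_nova_agent_state == 0) or vector(0)
      # Cloud health: total ready messages across all RabbitMQ queues.
      - record: cloud:rabbitmq_queue_messages_ready:sum
        expr: sum(rabbitmq_queue_messages_ready)
      # Capacity: cloud-wide RAM usage as a ratio (assumed metric names).
      - record: cloud:openstack_nova_memory_used:ratio
        expr: sum(openstack_nova_memory_used_bytes) / sum(openstack_nova_memory_available_bytes)
```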

Using AI to write and tune the rules

PromQL for OpenStack is fiddly because metric names are long and the ratios are non-obvious. I describe the intent and the available metrics to an LLM:

“Here are the OpenStack-exporter metric names for hypervisor vCPU, RAM, and disk usage. Write Prometheus alert rules that fire at 85% allocation with a 15-minute for, and explain the expression. Use only the metric names I gave you — do not invent any.”

That “do not invent metric names” line is load-bearing; without it, models cheerfully hallucinate openstack_nova_capacity_total and your rules silently never fire. Grounded in your real metrics, it’s a fast way to draft and explain rules. We keep a set of Prometheus alerting prompt templates tuned for exactly this, alongside the rest of our prompt library.
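A guard against rules that silently never fire is a companion alert built on absent(), which returns a value only when the named metric has no series at all. The sketch below covers one metric; the group and alert names are illustrative, and you would repeat the rule for each metric your alerts depend on.

```yaml
groups:
  - name: openstack-metric-presence    # hypothetical group name
    rules:
      - alert: OpenStackAgentMetricMissing
        # Fires when Prometheus has no openstack_nova_agent_state series, whether
        # the name is wrong, the exporter is down, or the scrape is failing.
        expr: absent(openstack_nova_agent_state)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "openstack_nova_agent_state has no series; alerts built on it cannot fire"
```

Syntax checkers such as promtool check rules validate the rule file's structure and expressions, but they do not know which metric names exist in your Prometheus, so a presence check like this covers a different failure.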

Treat monitoring as part of the cloud

The clouds I’ve run that didn’t page me at 2am were the ones where service-down, capacity-exhaustion, and queue-backlog alerts were wired up on day one. Those three categories cover the overwhelming majority of real OpenStack incidents.

Start with the exporters listed above, wire the service-state and capacity alerts first, keep three dashboards, and let AI draft the PromQL while you verify every metric name. A monitored OpenStack cloud is a calm one. For more monitoring prompts, browse our prompt library.

AI-generated alert rules are assistive, not authoritative. Validate every metric name and threshold against your own Prometheus before relying on them.