Root Cause Analysis with OpenStack Vitrage and AI

The worst on-call page I ever got was forty alarms in ninety seconds: instances down, network ports failing, a Cinder volume timing out. The actual problem was a single failed compute host. Everything else was downstream noise. Sorting signal from echo by hand, half-asleep, is exactly the failure mode OpenStack Vitrage is built to prevent. Vitrage is the root-cause-analysis project: it ingests events and alarms, builds a graph of how your resources relate, and deduces that “compute-07 died, therefore these twelve alarms are one incident.”

Vitrage is powerful but its template language is unforgiving, and reading the entity graph takes practice. So I do what I do everywhere on this cloud: sketch the relationships myself, let an AI assistant expand the template YAML, and review every rule before it shapes how my alarms get correlated. The model is a fast junior engineer — fast at YAML, blind to the consequences of a wrong correlation.

Confirming Vitrage Is Watching

Vitrage needs data sources (Nova, Neutron, Aodh, Prometheus) feeding it. Start by confirming the entity graph has resources:

openstack vitrage topology show
openstack vitrage resource list

If resource list is empty, your data sources are not configured and Vitrage has nothing to correlate. That is almost always a configuration problem, typically the types option in the [datasources] section of vitrage.conf, not a Vitrage bug.

Reading the Entity Graph

The topology output is a JSON graph of vertices (resources) and edges (relationships). It is dense. This is the first place AI earns its place in my workflow — I paste the topology JSON into a session and ask the model to summarize the graph: “Which compute hosts have the most dependent instances? Are there any orphaned vertices?” It turns a wall of JSON into a readable hierarchy in seconds.
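
The same two questions can also be answered deterministically before anything goes to a model. A minimal sketch, assuming the topology was saved as JSON in a node-link layout (a nodes list plus a links list whose source and target are indexes into nodes) and that vertices carry vitrage_type and name fields. Those field names and the sample data are assumptions here, so check them against your own output:

```python
from collections import Counter


def summarize_topology(topology):
    """Count instances per host and list vertices that have no edges at all."""
    nodes = topology.get("nodes", [])
    links = topology.get("links", [])

    instances_per_host = Counter()
    connected = set()

    for link in links:
        src, dst = link["source"], link["target"]
        connected.update((src, dst))
        # Assumed convention: a host vertex "contains" its instance vertices.
        if (link.get("relationship_type") == "contains"
                and nodes[src].get("vitrage_type") == "nova.host"
                and nodes[dst].get("vitrage_type") == "nova.instance"):
            instances_per_host[nodes[src].get("name", src)] += 1

    orphans = [n.get("name", i) for i, n in enumerate(nodes) if i not in connected]
    return instances_per_host.most_common(), orphans


# Hypothetical sample graph, not real Vitrage output.
sample = {
    "nodes": [
        {"name": "compute-07", "vitrage_type": "nova.host"},
        {"name": "vm-a", "vitrage_type": "nova.instance"},
        {"name": "vm-b", "vitrage_type": "nova.instance"},
        {"name": "stray-port", "vitrage_type": "neutron.port"},
    ],
    "links": [
        {"source": 0, "target": 1, "relationship_type": "contains"},
        {"source": 0, "target": 2, "relationship_type": "contains"},
    ],
}

hosts, orphans = summarize_topology(sample)
print(hosts)    # [('compute-07', 2)]
print(orphans)  # ['stray-port']
```

Running a check like this first gives me numbers to compare against whatever summary the model produces.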

To see active deduced alarms:

openstack vitrage alarm list
openstack vitrage alarm count

The magic is in the cause relationships. A deduced alarm points back to the underlying alarm Vitrage believes is the root. That pointer is what saves you during an alarm storm.
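
Following that pointer is simple enough to sketch. The example below assumes you have already extracted the causal edges into an effect-to-cause map and that each alarm has at most one direct cause, which is a simplification. The alarm names are hypothetical:

```python
def find_root_cause(alarm, caused_by):
    """Follow cause pointers from an alarm until reaching one with no known cause."""
    seen = {alarm}
    current = alarm
    while current in caused_by:
        current = caused_by[current]
        if current in seen:  # guard against a malformed, cyclic chain
            raise ValueError(f"cycle in cause chain at {current!r}")
        seen.add(current)
    return current


# Hypothetical effect -> cause map built from the causal edges you pulled.
caused_by = {
    "instance_down:vm-a": "host_down:compute-07",
    "port_failed:vm-a-eth0": "instance_down:vm-a",
    "volume_timeout:vol-3": "host_down:compute-07",
}

print(find_root_cause("port_failed:vm-a-eth0", caused_by))  # host_down:compute-07
```

An alarm with no entry in the map is treated as its own root, which is what you want for the underlying alarm itself.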

Writing Vitrage Templates

Templates define the correlation logic: “if a host is down AND an instance is on that host, deduce the instance alarm is caused by the host.” They have three sections — definitions, scenarios, and metadata. A skeleton:

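metadata:
  version: 2  # the definitions/scenarios layout used here is the version 2 format
  name: host_down_causes_instance_alarm
  type: standard
definitions:
  entities:
    - entity:
        category: ALARM
        type: nagios
        name: host_down
        template_id: host_alarm
    - entity:
        category: RESOURCE
        type: nova.host
        template_id: host
    - entity:
        category: RESOURCE
        type: nova.instance
        template_id: instance
  relationships:
    # the alarm sits on the host, not on the instance
    - relationship:
        source: host_alarm
        target: host
        relationship_type: on
        template_id: alarm_on_host
    # the host contains the instances that inherit the problem
    - relationship:
        source: host
        target: instance
        relationship_type: contains
        template_id: host_contains_instance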

The scenarios block then says what to deduce when that pattern matches. This YAML is exactly the tedious, easy-to-typo work I delegate. I describe the causal rule in English (“a down host should suppress its instance alarms and mark them as caused by the host”) and the model produces a syntactically valid template. Then I review it line by line, because a wrong relationship type does not raise an error: the scenario simply never matches and nothing gets correlated.

Validate, then add:

openstack vitrage template validate --path host_down_template.yaml
openstack vitrage template add --path host_down_template.yaml
openstack vitrage template list

Pro Tip: Always run template validate before template add, and ask your AI assistant to explain each template_id reference back to you in plain English. The single most common Vitrage mistake is a template_id that does not match between the definitions and scenarios sections — and a model is very good at catching that dangling reference.
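
The dangling-reference check is also easy to script as a second opinion. This is a rough sketch, not Vitrage's own validator: it assumes the template has already been parsed into a dict (for example with a YAML loader) in the definitions/scenarios layout, that conditions only use and, or, not and parentheses, and that action targets name entities. The sample template, including its set_state action, is hypothetical and contains a deliberate typo:

```python
import re

KEYWORDS = {"and", "or", "not"}


def dangling_references(template):
    """Return template_ids that are referenced but never defined."""
    definitions = template.get("definitions", {})
    relationships = [r["relationship"] for r in definitions.get("relationships", [])]
    entity_ids = {e["entity"]["template_id"] for e in definitions.get("entities", [])}
    defined = entity_ids | {r["template_id"] for r in relationships}

    missing = set()
    # Relationships must point at defined entities.
    for rel in relationships:
        missing |= {rel["source"], rel["target"]} - entity_ids

    for s in template.get("scenarios", []):
        scenario = s["scenario"]
        # Conditions are boolean expressions over template_ids.
        words = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", scenario["condition"]))
        missing |= words - KEYWORDS - defined
        # Action targets name entities.
        for a in scenario.get("actions", []):
            targets = a["action"].get("action_target", {})
            missing |= set(targets.values()) - entity_ids
    return sorted(missing)


template = {
    "definitions": {
        "entities": [
            {"entity": {"category": "ALARM", "type": "nagios",
                        "name": "host_down", "template_id": "host_alarm"}},
            {"entity": {"category": "RESOURCE", "type": "nova.host",
                        "template_id": "host"}},
            {"entity": {"category": "RESOURCE", "type": "nova.instance",
                        "template_id": "instance"}},
        ],
        "relationships": [
            {"relationship": {"source": "host_alarm", "target": "host",
                              "relationship_type": "on",
                              "template_id": "alarm_on_host"}},
            {"relationship": {"source": "host", "target": "instance",
                              "relationship_type": "contains",
                              "template_id": "host_contains_instance"}},
        ],
    },
    "scenarios": [
        {"scenario": {
            # Deliberate typo: "instnace"
            "condition": "alarm_on_host and host_contains_instnace",
            "actions": [
                {"action": {"action_type": "set_state",
                            "properties": {"state": "SUBOPTIMAL"},
                            "action_target": {"target": "instance"}}},
            ],
        }},
    ],
}

print(dangling_references(template))  # ['host_contains_instnace']
```

An empty list does not prove the template is correct; it only rules out this one class of mistake, so template validate still runs afterwards.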

Taming an Alarm Storm

Here is the payoff. During a real storm I pull the deduced alarms and their causes, then hand the list to Claude with the prompt: “Group these by root cause and tell me the one resource I should look at first.” Vitrage has already done the graph correlation; the AI just renders it into a human action plan. That combination took my forty-alarm page and turned it into “go restart compute-07” in under a minute.
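
The grouping step itself does not need a model. A minimal sketch, assuming each alarm has already been resolved to the resource at the root of its cause chain; the name and root_resource fields and the sample storm are hypothetical:

```python
from collections import defaultdict


def plan_from_storm(alarms):
    """Group alarms by root-cause resource, largest group first."""
    groups = defaultdict(list)
    for alarm in alarms:
        groups[alarm["root_resource"]].append(alarm["name"])
    # Biggest blast radius first; ties broken alphabetically for stable output.
    return sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Hypothetical storm, already annotated with root-cause resources.
storm = [
    {"name": "instance_down:vm-a", "root_resource": "compute-07"},
    {"name": "port_failed:vm-a-eth0", "root_resource": "compute-07"},
    {"name": "volume_timeout:vol-3", "root_resource": "compute-07"},
    {"name": "link_flap:uplink-1", "root_resource": "switch-2"},
]

for resource, names in plan_from_storm(storm):
    print(f"{resource}: {len(names)} alarm(s)")
```

The first entry is the resource to look at first; the model's job is then only to phrase the plan, not to decide it.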

I run these storms through my monitoring alerts dashboard so the correlation and the eventual fix are recorded, and genuine incidents get logged in the incident response dashboard for the postmortem.

Connecting Vitrage to Aodh and Prometheus

Vitrage is only as smart as the signals it ingests. Out of the box it watches Nova and Neutron state changes, but the real value comes from wiring in your alarm sources — Aodh for OpenStack-native alarms and Prometheus for everything else. Confirm what Vitrage is actually consuming:

openstack vitrage resource list --type aodh.alarm

If your Aodh alarms are not showing up as Vitrage resources, the datasource is not configured and Vitrage cannot correlate them — it will dutifully report a host-down event with no connection to the dozen alarms it caused. I have AI draft the datasource config block from a description of my alarm sources, then verify the datasource type strings against the Vitrage docs, because an invented datasource name fails silently. Getting these inputs right is the unglamorous prerequisite to every clever correlation: the entity graph can only relate things it can see.
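
One way to catch a missing datasource before an incident does is to compare the configured list against what you expect. A sketch, assuming the datasources are listed as a comma-separated types option under [datasources] in vitrage.conf and that the file parses with Python's configparser. The type strings in the example are illustrative and should be verified against the Vitrage docs:

```python
import configparser


def missing_datasources(conf_text, expected):
    """Return expected datasource types absent from [datasources] types."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("datasources", "types", fallback="")
    configured = {t.strip() for t in raw.split(",") if t.strip()}
    return sorted(set(expected) - configured)


# Hypothetical config excerpt.
sample_conf = """
[datasources]
types = nova.host,nova.instance,nova.zone,neutron.port
"""

print(missing_datasources(sample_conf, ["nova.host", "aodh", "prometheus"]))
# ['aodh', 'prometheus']
```

This only checks that a name is listed, not that the datasource is reachable or correctly spelled for your Vitrage release.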

Tuning Correlations Over Time

Vitrage is only as good as its templates, and the templates need to evolve as you learn your failure modes. After every incident I ask: did Vitrage correctly identify the root cause? If not, I add or fix a template. I keep a running notebook of “correlations I wish I had” and periodically batch them into new templates with AI help.

openstack vitrage template show <template-uuid>

template show prints the template as Vitrage loaded it. Comparing that, and the alarms it actually deduced during the incident, against what I intended is the feedback loop that makes Vitrage trustworthy.

Guardrails

Vitrage mostly observes, but its templates can drive actions (via Mistral) and definitely drive human decisions during incidents. So:

The AI drafts and explains templates; it never holds production credentials or pushes templates to the live cloud itself.
Every new template is validated and tested against a sandbox before going live.
I never let an AI conclusion be the only basis for a destructive remediation — Vitrage points at the host, I confirm the host is actually dead before I act.

My vetted Vitrage prompts live in the prompt workspace, and the reusable starters are in the OpenStack prompt pack. For editing template files inline I usually reach for Cursor.

The Takeaway

Vitrage does the graph math; an AI assistant translates that math into both correct templates and fast, human-readable incident summaries. Together they turned my worst on-call scenario — the alarm storm — into a solved problem. Keep the model in the junior-engineer seat, validate every template, and let the entity graph do what your half-asleep brain cannot.

The compounding benefit shows up over months. Every incident teaches you a new failure mode, every failure mode becomes a template, and the templates make the next storm quieter. A Vitrage deployment that started by collapsing forty alarms into one root cause eventually collapses most of your noise automatically, and your on-call engineers stop dreading the pager. That is the real goal — not fewer alarms, but alarms that already know what they mean.
