Validating OpenStack Clouds with Tempest and AI

After every OpenStack upgrade there is a moment of dread: the control plane is up, the dashboard loads, but does the cloud actually work? Can a tenant boot an instance, attach a volume, get a floating IP, and reach the internet? Clicking through that by hand is slow and you always miss something. OpenStack Tempest is the answer — the canonical integration test suite that exercises real API workflows against a live cloud and tells you, concretely, what is broken.

Tempest has hundreds of tests and a configuration file with a hundred knobs, and reading a failed test’s traceback is a skill in itself. So I run Tempest the obvious way these days: I let it produce the failures, then hand the tracebacks to an AI assistant to triage. The model is a fast junior engineer — brilliant at reading Python stack traces, with zero authority to decide whether a failure is acceptable to ship.

Setting Up Tempest

Tempest needs credentials with enough rights to create test resources. The modern entry point is the tempest CLI, which works out of a workspace directory:

tempest init my-cloud-tests
cd my-cloud-tests

Then configure etc/tempest.conf with your auth URL, a test account, and which services are enabled. Generating that config by hand is tedious, so I start from the discovery helper that ships with python-tempestconf:

openstack endpoint list
discover-tempest-config --create

The endpoint list matters because Tempest must know which services actually exist — running volume tests against a cloud with no Cinder is just noise. I paste the endpoint list into an AI session and ask it to tell me which Tempest [service_available] flags to set. That mapping from endpoints to feature flags is exactly the mechanical translation a model nails.
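The endpoint-to-flag translation is simple enough to sketch by hand, which also makes it easy to check what the model hands back. The snippet below is a minimal sketch, not the way Tempest or python-tempestconf does it. It assumes the input is the output of openstack endpoint list -f json with a "Service Type" column, and the mapping table is my own assumption that should be checked against your release's sample tempest.conf.

```python
import json

# Assumed mapping from Keystone service types to [service_available]
# option names. Verify against the sample tempest.conf for your release.
SERVICE_TYPE_TO_FLAG = {
    "compute": "nova",
    "network": "neutron",
    "image": "glance",
    "volumev3": "cinder",
    "block-storage": "cinder",
    "object-store": "swift",
}


def service_available_section(endpoints_json: str) -> str:
    """Render a [service_available] fragment from an endpoint listing."""
    endpoints = json.loads(endpoints_json)
    # Assumes each row carries a "Service Type" key, as in `-f json` output.
    present = {row.get("Service Type") for row in endpoints}

    # Start with every known flag off, then switch on what the cloud has.
    flags = {flag: False for flag in SERVICE_TYPE_TO_FLAG.values()}
    for service_type, flag in SERVICE_TYPE_TO_FLAG.items():
        if service_type in present:
            flags[flag] = True

    lines = ["[service_available]"]
    lines += [f"{name} = {str(on).lower()}" for name, on in sorted(flags.items())]
    return "\n".join(lines)
```

Treat the result as a draft to paste into etc/tempest.conf and review, the same way I treat the model's answer.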

Running a Targeted Subset

Do not run the whole suite first — that is a recipe for a four-hour, overwhelming wall of red. Start narrow:

tempest run --regex tempest.api.compute.servers
tempest run --regex tempest.api.volume

The --regex filter lets you validate one service at a time. I run the compute API tests after a Nova change, the volume tests after a Cinder change, and so on. This keeps the feedback loop tight and the failures interpretable.

Pro Tip: Use the smoke subset (tempest run --smoke) as your post-upgrade gate. It selects the tests tagged as smoke, a curated set of the most important cross-service workflows, and runs in minutes, so you get a fast “is the cloud fundamentally working” signal before committing to the full suite.

Reading the Failures

Here is where AI genuinely changes the job. Tempest failures come with a full Python traceback and the failing API response. I capture the whole run in a log file:

tempest run --regex tempest.api.network 2>&1 | tee results.log

Then I paste the relevant failure block into Claude and ask: “Is this a real cloud bug, a Tempest misconfiguration, or a test that does not apply to my setup?” That triage question is the whole game. A huge fraction of Tempest “failures” are actually config mismatches — a test expecting a feature you deliberately disabled. The model is excellent at recognizing those, which saves me from chasing phantom bugs after every upgrade.
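Before pasting anything, it helps to know which tests failed so you can find their blocks in the log. Here is a minimal sketch that pulls the failing test IDs out of results.log. The line shape it matches (a worker number in braces, the test ID, a duration, then "... FAILED") is my assumption about the default console output, so adjust the pattern if your log looks different.

```python
import re

# Assumed shape of a per-test result line in the console output, e.g.
#   {1} tempest.api.network.test_x.Foo.test_bar [1.20s] ... FAILED
RESULT_LINE = re.compile(
    r"^\{\d+\}\s+(?P<test>\S+).*\.\.\.\s+(?P<status>ok|FAILED|SKIPPED)"
)


def failed_tests(log_text: str) -> list[str]:
    """Return the IDs of tests whose result line ends in FAILED."""
    failed = []
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line)
        if match and match.group("status") == "FAILED":
            failed.append(match.group("test"))
    return failed
```

Calling failed_tests(open("results.log").read()) gives a short list to work through one failure at a time. Class-level setup failures are reported with a different ID shape, so this sketch will capture them only coarsely.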

When a Tempest run reveals a genuine regression after an upgrade, I log the investigation in my incident response dashboard so the postmortem has the failing test and the fix in one place.

Telling Real Bugs from Noise

The discipline is categorizing every failure into one of three buckets, and being honest about it:

Real bug — a workflow that should work, does not. This blocks the upgrade.
Config/feature mismatch — the test exercises something you do not run. Skip it explicitly.
Flaky/environmental — timing or resource exhaustion in the test environment. Re-run before trusting.

I never let the AI’s categorization be final — it suggests the bucket, I confirm. A model can misjudge a real bug as “probably config,” and shipping on that would be exactly the kind of unaccountable mistake you cannot let an LLM make.
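One way to make the "it suggests, I confirm" rule hard to skip is to encode it in the record you keep for each failure. The sketch below is illustrative only. The names Bucket, Triage, and upgrade_may_proceed are hypothetical, and it assumes that a human who confirms a failure as flaky has already re-run the test.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Bucket(Enum):
    REAL_BUG = "real bug"
    CONFIG_MISMATCH = "config/feature mismatch"
    FLAKY = "flaky/environmental"


@dataclass
class Triage:
    test_id: str
    suggested: Bucket                   # what the model proposed
    confirmed: Optional[Bucket] = None  # filled in only by a human reviewer


def upgrade_may_proceed(triages: list[Triage]) -> bool:
    """Go only if every failure is human-confirmed and none is a real bug."""
    return all(
        t.confirmed is not None and t.confirmed is not Bucket.REAL_BUG
        for t in triages
    )
```

The model's suggestion never feeds the decision directly. An unconfirmed entry blocks the upgrade exactly as a confirmed real bug does.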

Listing and Choosing Tests

Before you can run the right subset, you have to know what exists. Tempest can enumerate its tests, and the list is long enough that browsing it by hand is a poor use of time:

tempest run --list-tests | grep volume

I pipe that list into an AI session and describe what I changed — “I just patched Cinder’s backup driver” — and ask which test classes are most relevant. The model maps my change to the test paths that actually exercise it, so I run a focused, meaningful subset instead of either the whole suite or a guess. This is the same grounding principle as everywhere else: I give it the real list of available tests rather than letting it invent test names from memory, because Tempest’s test paths are specific and a hallucinated one just produces an empty run. Choosing the right tests to run is as important as reading their results — a passing run of irrelevant tests proves nothing about the change you actually made.
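The grounding check can be automated: whatever patterns the model suggests, match them against the real list before running anything. This is a minimal sketch under my own assumptions. The function names are hypothetical, and it treats each suggestion as a regular expression searched anywhere in the test ID, which is how I expect --regex to behave.

```python
import re


def check_suggestions(available: list[str], suggested: list[str]) -> dict[str, list[str]]:
    """Map each suggested regex to the real test IDs it matches."""
    return {
        pattern: [test for test in available if re.search(pattern, test)]
        for pattern in suggested
    }


def hallucinated(matches: dict[str, list[str]]) -> list[str]:
    """Return the suggested patterns that match no real test at all."""
    return [pattern for pattern, tests in matches.items() if not tests]
```

Feed it the lines from tempest run --list-tests as available. Any pattern that comes back from hallucinated() would have produced an empty run, so it goes back to the model with the real list attached.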

Building a Regression Baseline

Once you have a clean run, save it. The value of the next upgrade's run is in the diff: which tests newly fail. I have AI draft a small wrapper script that runs the smoke subset, compares per-test pass/fail results against the saved baseline, and reports only the deltas. That script is reviewable, version-controlled code, so I run it through my code review dashboard before it becomes part of my upgrade ritual.
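The comparison at the heart of such a wrapper is small. Below is a minimal sketch of it, not my actual script. It assumes results are kept as a JSON object mapping test ID to a status string ("ok", "FAILED", or "SKIPPED"), and the function names are hypothetical. It compares per-test outcomes so the report can name the tests that changed.

```python
import json
from pathlib import Path


def load_results(path: Path) -> dict[str, str]:
    """Load a saved {test_id: status} mapping from a JSON file."""
    return json.loads(path.read_text())


def diff_against_baseline(baseline: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Report only the deltas between a saved baseline and the current run."""
    return {
        # Failing now, and not already failing in the baseline
        # (includes tests the baseline never saw).
        "newly_failing": sorted(
            t for t, s in current.items()
            if s == "FAILED" and baseline.get(t) != "FAILED"
        ),
        # Passing now, but recorded as failing in the baseline.
        "newly_passing": sorted(
            t for t, s in current.items()
            if s == "ok" and baseline.get(t) == "FAILED"
        ),
        # In the baseline but absent from the current run.
        "missing": sorted(set(baseline) - set(current)),
    }
```

An empty newly_failing list is the signal I want after an upgrade. A non-empty missing list is worth a look too, since a test that silently stopped running proves nothing.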

Guardrails

Tempest creates and deletes real resources, so it is not zero-risk — point it at a sandbox or a maintenance window, never a busy production tenant mid-day. My rules:

The AI drafts config, triages tracebacks, and writes wrapper scripts; it never holds production credentials or runs the suite itself.
Tempest runs against a dedicated test account with its own quota, not an admin god-credential.
Every failure categorization the model suggests gets a human confirm before it affects a go/no-go upgrade decision.

My vetted Tempest prompts live in the prompt workspace, reusable templates are in the OpenStack prompt pack, and I usually edit the config and wrapper scripts in Cursor. For lighter-weight local triage I sometimes use Gemma.

The Takeaway

Tempest answers the post-upgrade question — “does the cloud actually work?” — with evidence instead of vibes. Pairing it with an AI assistant that reads the tracebacks and sorts real bugs from config noise turns its biggest weakness, the wall of red, into a fast, actionable report. Let the model triage, keep the go/no-go call human, and you will trust your upgrades for the first time.

The habit that pays off is running Tempest before you need it, not just after a crisis. A smoke run on a quiet Friday establishes that your baseline is genuinely green, so the next time the suite lights up red you know the change you just made caused it — not some pre-existing condition you never noticed. Combined with an AI assistant that makes triage fast enough to do routinely, Tempest shifts from a scary once-a-year ritual to a regular health check. That regularity is what turns “I hope the upgrade worked” into “I have proof it did.”
