Using AI to Debug a Nova Scheduler That Won't Place Instances

I have lost count of how many times a developer has pinged me at 2 a.m. with a screenshot of No valid host was found and the unspoken expectation that I will fix it before standup. After a decade of running Nova at scale, I have learned that NoValidHost is rarely one bug. It is a chain: a filter rejected every candidate, and the scheduler dutifully gave up. The trick is figuring out which filter, and why. These days I let an AI assistant help me move through that chain faster, but I treat it exactly like a sharp junior engineer who joined last week: quick, confident, and absolutely not allowed near my admin credentials.

This post is how I actually do it.

Read the scheduler log first, always

NoValidHost is a symptom. The cause is buried in nova-scheduler.log, and if you enable debug logging you get a per-filter breakdown of how many hosts survived each stage.

journalctl -u devstack@n-sch -f
# or on packaged deployments:
sudo tail -f /var/log/nova/nova-scheduler.log

You are hunting for lines like Filter ComputeFilter returned 0 hosts or Filtering removed all hosts for the request. That single line tells you the culprit filter. I will often paste a sanitized chunk of that log into Claude and ask it to summarize which filter killed the request and what that filter checks. It is genuinely good at this — faster than me grepping the Nova source. But I read the raw log myself too, because the AI will happily invent a plausible-sounding filter that does not exist in your release.
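When the debug log is long, I sometimes script the first pass before reading it by hand. The snippet below is a minimal sketch, not anything Nova ships: it assumes the wording of the sample line above (plus a "host(s)" variant that I am assuming debug lines may use), and the two log lines in `sample` are made up for illustration.

```python
import re

# Matches lines such as "Filter ComputeFilter returned 0 hosts".
# The "host(s)" spelling is an assumption about how debug lines are worded.
FILTER_LINE = re.compile(r"Filter (\w+) returned (\d+) host(?:s|\(s\))?")


def first_empty_filter(log_lines):
    """Return the name of the first filter that returned 0 hosts, or None."""
    for line in log_lines:
        match = FILTER_LINE.search(line)
        if match and int(match.group(2)) == 0:
            return match.group(1)
    return None


# Made-up sample lines, shaped like the ones described above.
sample = [
    "DEBUG nova.filters Filter AvailabilityZoneFilter returned 12 host(s)",
    "INFO nova.filters Filter ComputeFilter returned 0 hosts",
]
print(first_empty_filter(sample))
```

If this returns None on a failed request, that is itself a hint: no filter emptied the list, so the request may have died before filtering started.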

Pro Tip: Set debug = True in nova.conf on the scheduler only, reproduce once, then turn it back off. Leaving debug on in production fills disks and buries the signal you actually need.

Know your filters before you ask anyone — human or AI

The default enabled_filters list matters. The usual suspects:

ComputeFilter: rejects hosts whose nova-compute service is down or disabled.
RamFilter: host lacks free RAM after the allocation ratio is applied. Older releases only; newer releases dropped it because Placement's MEMORY_MB inventory does this check.
AggregateInstanceExtraSpecsFilter: flavor extra_specs do not match host aggregate metadata.
PciPassthroughFilter: no host has the requested PCI alias / device available.

When I describe a symptom to an AI, I give it my actual enabled_filters value. Without that context it guesses defaults, and modern deployments running placement-driven scheduling have a very different filter set than a Mitaka cloud did. Context in, accuracy out. I keep a few of these scoping prompts in my prompt workspace so I am not retyping them at 2 a.m.

Check the obvious: are your computes even alive?

Half of all NoValidHost tickets are a disabled or dead compute service. ComputeFilter silently removes those hosts, so you see zero candidates with no obvious reason.

openstack compute service list --service nova-compute
openstack compute service list --long

Look at the State (up/down) and Status (enabled/disabled) columns. Someone disabled a host for maintenance three weeks ago and never re-enabled it:

openstack compute service set --enable compute-04 nova-compute

This is the kind of dull, mechanical correlation an AI is great at: feed it the table output, ask “which hosts are both up and enabled,” and it will diff it against your hypervisor list instantly. Just verify the answer against the raw table — never act on the summary alone.
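The same check is easy to script so you have something to compare the AI's answer against. This is a rough sketch that assumes you captured the table with the `-f json` output formatter and that the JSON keys are `Binary`, `Host`, `State`, and `Status`, matching the table column names; confirm that against your client version. The host data is invented.

```python
import json


def schedulable_hosts(service_list_json):
    """Return sorted hosts whose nova-compute service is both up and enabled."""
    services = json.loads(service_list_json)
    return sorted(
        svc["Host"]
        for svc in services
        if svc.get("Binary") == "nova-compute"
        and svc.get("State") == "up"
        and svc.get("Status") == "enabled"
    )


# Invented output, shaped like `openstack compute service list -f json`.
raw = json.dumps([
    {"Binary": "nova-compute", "Host": "compute-01", "Status": "enabled", "State": "up"},
    {"Binary": "nova-compute", "Host": "compute-04", "Status": "disabled", "State": "up"},
    {"Binary": "nova-compute", "Host": "compute-07", "Status": "enabled", "State": "down"},
])
print(schedulable_hosts(raw))
```

Any host missing from that list is one ComputeFilter will drop.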

Confirm real capacity with hypervisor stats

If the services are healthy, the next question is whether you actually have resources. Aggregate stats lie less than gut feeling.

openstack hypervisor list
openstack hypervisor stats show
openstack hypervisor show compute-04

hypervisor stats show gives you cluster-wide vcpus_used, memory_mb_used, and free counts. If memory_mb times the RAM allocation ratio, minus memory_mb_used, is smaller than what your flavor needs, RAM is your answer: RamFilter on older releases, Placement's MEMORY_MB inventory on newer ones. Remember overcommit: a host with 256 GB and a 1.5 allocation ratio advertises 384 GB to the scheduler. The AI will remind you of this if you forget (that is the “fast junior” value), but the actual ratio lives in your nova.conf, so check it.
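The overcommit arithmetic is worth writing down once so nobody does it in their head at 2 a.m. A small sketch follows, with the function names being mine; the `reserved_mb` parameter is an assumption I added to mirror how reserved host memory is usually subtracted before the ratio is applied.

```python
def effective_capacity_mb(total_mb, ratio, reserved_mb=0):
    """Memory the scheduler treats as available in total, after overcommit."""
    return (total_mb - reserved_mb) * ratio


def fits_in_ram(total_mb, used_mb, ratio, flavor_mb, reserved_mb=0):
    """True if the flavor's RAM fits in what is left after current usage."""
    free_mb = effective_capacity_mb(total_mb, ratio, reserved_mb) - used_mb
    return free_mb >= flavor_mb


# The example from the text: 256 GB physical with a 1.5 ratio advertises 384 GB.
total_mb = 256 * 1024
print(effective_capacity_mb(total_mb, 1.5) / 1024)

# Hypothetical usage figure: 380 GB already consumed, 8 GB flavor requested.
print(fits_in_ram(total_mb, used_mb=380 * 1024, ratio=1.5, flavor_mb=8192))
```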

Move to placement: the real source of truth

Since Nova switched to the Placement service for resource tracking, a lot of NoValidHost cases are really “no allocation candidates.” Placement, not the legacy scheduler stats, decides what is available.

openstack resource provider list
openstack resource provider inventory list <provider-uuid>
openstack resource provider usage show <provider-uuid>

Then ask Placement directly whether any candidate can satisfy the request:

openstack allocation candidate list \
  --resource VCPU=4 \
  --resource MEMORY_MB=8192 \
  --resource DISK_GB=40

If that returns nothing, the scheduler never even got a list of hosts to filter — Placement already said no. This distinction trips up a lot of people, and it is where I have caught AI assistants being confidently wrong: they conflate the legacy filter scheduler with Placement. If your assistant insists the problem is RamFilter but allocation candidate list returns empty for raw resource classes, the AI is behind the times. Trust the API, not the chatbot.

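Pro Tip: Add --required <TRAIT> to your candidate query to reproduce trait-based scheduling; custom traits are named CUSTOM_<TRAIT>. Traits go in --required, not --resource, which only takes resource class amounts. If the client rejects the flag, you probably need a newer placement microversion, which you can request with --os-placement-api-version. If the unconstrained query returns candidates but the constrained one does not, your required traits are the problem, not capacity.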
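To keep the unconstrained and constrained queries consistent, I like to generate both from one flavor description. This is only a sketch that builds the command line; it does not run anything. The helper name and the trait `CUSTOM_GPU_A100` are hypothetical, and it assumes `--required` can be repeated once per trait.

```python
import shlex


def candidate_query(resources, required_traits=()):
    """Build the argv for an `openstack allocation candidate list` call."""
    argv = ["openstack", "allocation", "candidate", "list"]
    for resource_class, amount in resources.items():
        argv += ["--resource", f"{resource_class}={amount}"]
    # Assumption: one --required flag per trait is accepted by the client.
    for trait in required_traits:
        argv += ["--required", trait]
    return argv


flavor = {"VCPU": 4, "MEMORY_MB": 8192, "DISK_GB": 40}
baseline = candidate_query(flavor)
constrained = candidate_query(flavor, ["CUSTOM_GPU_A100"])  # hypothetical trait

# Print both so a human can read them before running either one.
print(shlex.join(baseline))
print(shlex.join(constrained))
```

Run the baseline first. If it is empty, adding traits will not change anything.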

The classic: flavor extra_specs vs. aggregate metadata

This is the bug that eats the most hours, because everything looks healthy. You have capacity, services are up, but AggregateInstanceExtraSpecsFilter quietly drops every host because a flavor demands metadata no aggregate provides.

openstack flavor show m1.gpu
openstack aggregate list
openstack aggregate show gpu-hosts

Compare the flavor’s properties (the extra_specs) against the aggregate metadata. The aggregate_instance_extra_specs: scope belongs on the flavor key only, so a flavor asking for aggregate_instance_extra_specs:gpu_type='a100' matches an aggregate tagged plain gpu_type=a100. It will match nothing if the scope prefix was copied onto the aggregate key as well, if the values differ, or if a typo crept in. I paste both blobs into an AI and ask it to diff the keys, and it catches the trailing-space and wrong-namespace mistakes faster than my tired eyes. That is a perfect use of the tool: a fast, tireless diff. It is also why I keep a small library of troubleshooting prompts tuned for exactly this comparison.
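For a deterministic second opinion, the comparison can be sketched in a few lines. This is my own simplification, not the filter's code: it only looks at keys carrying the aggregate_instance_extra_specs: scope and compares values as plain strings, so it ignores the operator syntax the real filter supports. The dictionaries stand in for the properties shown by flavor show and aggregate show.

```python
SCOPE = "aggregate_instance_extra_specs:"


def diff_specs(flavor_props, aggregate_meta):
    """List reasons the scoped flavor specs would not match the aggregate."""
    problems = []
    for key, wanted in flavor_props.items():
        if not key.startswith(SCOPE):
            continue  # unscoped keys are outside this sketch
        bare = key[len(SCOPE):]
        if bare not in aggregate_meta:
            if key in aggregate_meta:
                problems.append(f"{bare}: aggregate key carries the flavor-side scope prefix")
            else:
                problems.append(f"{bare}: missing from aggregate metadata")
            continue
        have = aggregate_meta[bare]
        if have == wanted:
            continue
        if have.strip() == wanted.strip():
            problems.append(f"{bare}: values differ only by whitespace")
        else:
            problems.append(f"{bare}: flavor wants {wanted!r}, aggregate has {have!r}")
    return problems


flavor_props = {SCOPE + "gpu_type": "a100"}
print(diff_specs(flavor_props, {"gpu_type": "a100 "}))  # trailing space on the aggregate
```

An empty list means the scoped keys line up under these simplified rules. It does not prove the filter will pass.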

For PCI passthrough, the same logic applies to pci_passthrough:alias. Confirm the alias is defined on a compute host and that PciPassthroughFilter is in enabled_filters, or the flavor will request a device the scheduler cannot honor.

Where AI helps, and the hard line I never cross

My honest assessment after running this workflow for a while: AI is a force multiplier on the interpretation layer. Pasting log lines, diffing extra_specs, recalling what an obscure filter checks, drafting the allocation candidate list query — all faster with a good model. Tools like Cursor or Warp make it convenient to keep that loop tight inside the terminal where I am already working.

But I never give an assistant my clouds.yaml, my admin token, or shell access to the control plane. It suggests commands; I run them, after reading them. When an incident escalates I route it through our incident response workflow where there is a human approval gate, not an autonomous agent firing openstack calls at production. The AI is a junior engineer pair, not an operator. It does not get the keys.

Conclusion

NoValidHost is a process of elimination: services up, capacity real, placement says yes, traits match, extra_specs align. Walk that chain every time and the answer falls out. Let AI compress the boring correlation work — log summaries, table diffs, query drafting — but keep your hands on the credentials and your eyes on the raw output. The fastest debugging sessions I have now are the ones where I treat the model as a quick assistant and myself as the one who is still accountable. For more on building that habit across your stack, the OpenStack category collects the rest of this series.
