Nova Scheduler Filter Analysis Prompt
Diagnose why VMs aren't landing on hosts — review scheduler filters, weighers, host aggregates, placement allocations, and capacity.
- Target user: OpenStack platform engineers and capacity managers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack compute engineer with deep experience tuning the Nova scheduler and the Placement API across hundreds of compute hosts.

I will provide:
- The symptom: VMs fail to schedule (`NoValidHost`), schedule slowly, all land on one host, fail anti-affinity, fail PCI/NUMA, etc.
- `nova-scheduler` log output for one or more failing requests (with request-id)
- The current `[filter_scheduler]` config from `nova.conf`
- Output from `openstack hypervisor list`, `openstack host show`, `openstack aggregate list`
- Placement state: `openstack resource provider list`, `openstack resource provider inventory list <rp>`, `openstack resource provider usage show <rp>`

Your job:
1. **Identify the failure stage** in the scheduling pipeline:
   - Pre-filter (cell or aggregate exclusion)
   - Placement (resource providers don't satisfy the claim: VCPU, MEMORY_MB, DISK_GB, custom resource classes)
   - Filter pass (one of the enabled filters rejected all candidates)
   - Weigher tie-break landing on a "worse" host than expected
   - Race / claim retry exhausted
2. **Review each candidate filter** (ComputeFilter, ComputeCapabilitiesFilter, ServerGroupAntiAffinityFilter, AggregateInstanceExtraSpecsFilter, etc.) and explain which one is most likely rejecting hosts, based on the log evidence. Note that RamFilter, CoreFilter, and DiskFilter no longer exist in the releases listed below; those checks are done by Placement.
3. **Check placement allocations**: are there allocations on dead hosts? (Common cause of phantom capacity.)
4. **Map host aggregates to flavor extra_specs**: does the requested flavor's `aggregate_instance_extra_specs:` match the metadata of an existing aggregate?
5. **Suggest the minimum diagnostic next step** (specific commands, specific host).
6. **Label DANGEROUS actions**: scheduler restart, disabling filters live, force-deleting placement allocations, `nova-manage placement heal_allocations` without dry-run, `nova-manage placement audit --delete` without reviewing the report first.

Common failure classes:
- **`NoValidHost` with hosts visibly free** → placement allocations leaked from deleted/failed instances; find them with `nova-manage placement audit` (`nova-manage placement heal_allocations` covers the opposite case, instances that are missing allocations)
- **All VMs land on one host** → weigher tuned heavily in one direction (RAMWeigher with an extreme `ram_weight_multiplier`), or aggregates restrict the others
- **Anti-affinity fails on second VM** → server group exists but `ServerGroupAntiAffinityFilter` not in `enabled_filters`
- **NUMA / PCI / hugepage VMs fail** → host trait/resource not modeled in placement, or numa_topology incorrect
- **Slow scheduling (>10s)** → filter querying placement N times per host; check `placement-api` logs
- **Race retries exhausted** → claim conflict on memory or vCPU; check `[scheduler] max_attempts` and investigate the conductor

OpenStack release: [yoga / zed / antelope / bobcat / caracal / dalmatian / epoxy]

Scheduler config:
```ini
[PASTE [filter_scheduler] and [placement] sections]
```

Failing request log:
```
[PASTE]
```

Hypervisor/placement state:
```
[PASTE]
```
Why this prompt works
Nova scheduling failures span three subsystems: nova-scheduler itself (filters + weighers), the Placement API (resource providers + allocations), and host aggregates (metadata-to-flavor matching). Left unguided, models tend to guess at a single cause; this prompt forces them to walk each layer in turn.
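One of the most common causes of “NoValidHost even though hosts are free” in production is leaked Placement allocations: instances deleted from the Nova DB whose Placement allocation rows were never cleaned up. Placement still counts those resources as used, so the host looks healthy from Nova's side while Placement refuses to offer it. Until you know to look there, the cluster looks healthy.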
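The leaked-allocation check described above boils down to a set difference. The following is a minimal sketch, assuming you have already collected two things by hand: the allocations held against one resource provider, shaped like the `allocations` mapping returned by the Placement API (consumer UUID to a `resources` dict), and the UUIDs Nova still knows about. The function names and the sample data shape are hypothetical, not part of Nova or Placement.

```python
from typing import Dict, Iterable, Set


def find_orphaned_consumers(
    provider_allocations: Dict[str, Dict[str, Dict[str, int]]],
    known_consumers: Iterable[str],
) -> Dict[str, Dict[str, int]]:
    """Return {consumer_uuid: resources} for consumers Nova no longer knows.

    provider_allocations: {consumer_uuid: {"resources": {"VCPU": 2, ...}}}
    known_consumers: UUIDs of live instances (and in-flight migrations).
    """
    # Compare case-insensitively; UUIDs may be copied from different tools.
    known: Set[str] = {uuid.lower() for uuid in known_consumers}
    return {
        consumer: alloc.get("resources", {})
        for consumer, alloc in provider_allocations.items()
        if consumer.lower() not in known
    }


def leaked_totals(orphans: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Sum the orphaned amounts per resource class."""
    totals: Dict[str, int] = {}
    for resources in orphans.values():
        for resource_class, amount in resources.items():
            totals[resource_class] = totals.get(resource_class, 0) + amount
    return totals
```

Treat the result as a list of suspects rather than a delete list: an in-progress migration holds allocations under its own migration UUID, so include those UUIDs in `known_consumers` before drawing conclusions.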
How to use it
- Find the failing request ID from the Nova API log first, then `grep` for it in the `nova-scheduler` and `nova-conductor` logs.
- Always include the relevant flavor’s extra_specs: `openstack flavor show <flavor>`. Aggregate-extra-specs matching is the second-most-common confusing failure.
- Include `openstack resource provider list` + `inventory list` for at least 3 candidate hosts. The contrast is informative.
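The flavor-to-aggregate matching mentioned in the list above can be checked by hand before involving a model. This is a simplified sketch, assuming you have copied the flavor's extra_specs and each aggregate's metadata into plain dictionaries; the function names are hypothetical. It only handles plain string equality on keys scoped with `aggregate_instance_extra_specs:`, whereas the real filter also understands comparison operators and evaluates the merged metadata of all aggregates a host belongs to.

```python
from typing import Dict, List

SCOPE = "aggregate_instance_extra_specs:"


def required_aggregate_metadata(extra_specs: Dict[str, str]) -> Dict[str, str]:
    """Keep only the scoped extra specs and strip the scope prefix."""
    return {
        key[len(SCOPE):]: value
        for key, value in extra_specs.items()
        if key.startswith(SCOPE)
    }


def matching_aggregates(
    extra_specs: Dict[str, str],
    aggregates: Dict[str, Dict[str, str]],
) -> List[str]:
    """Names of aggregates whose metadata satisfies every scoped extra spec.

    Simplification: exact string equality only, one aggregate at a time.
    A flavor with no scoped specs matches every aggregate.
    """
    required = required_aggregate_metadata(extra_specs)
    return [
        name
        for name, metadata in aggregates.items()
        if all(metadata.get(key) == value for key, value in required.items())
    ]
```

An empty result for a flavor that does carry scoped specs is the signature of the "filter rejects everything" case: the filter is doing its job, and the fix is in the aggregate metadata or the flavor.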
Useful commands
# Scheduler view
sudo journalctl -u nova-scheduler -n 200 --no-pager
sudo grep <request-id> /var/log/nova/*.log
# Placement view
openstack resource provider list
openstack resource provider inventory list <rp-uuid>
openstack resource provider usage show <rp-uuid>
openstack resource provider allocation show <consumer-uuid>
# Aggregates and flavors
openstack aggregate list --long
openstack aggregate show <agg>
openstack flavor show <flavor> # check extra_specs aggregate_instance_extra_specs:foo=bar
# Detect instances that are missing allocations (report only, no changes)
nova-manage placement heal_allocations --dry-run
# Detect orphaned (leaked) allocations with no matching instance or migration
nova-manage placement audit --verbose # release dependent
# Per-host investigation
openstack hypervisor show <hostname>
ssh <host> 'sudo systemctl status nova-compute && sudo journalctl -u nova-compute -n 100 --no-pager'
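The inventory and usage commands above return raw numbers; what Placement actually compares a request against is the derived capacity, `(total - reserved) * allocation_ratio`, minus what is already used. The sketch below applies that arithmetic, assuming you have reshaped the CLI output into dictionaries keyed by resource class. The function names and the dictionary layout are hypothetical, and other inventory constraints such as `min_unit` and `step_size` are ignored.

```python
from typing import Dict


def free_capacity(inventory: Dict[str, dict], usage: Dict[str, int]) -> Dict[str, int]:
    """Remaining schedulable amount per resource class.

    inventory: {"VCPU": {"total": 32, "reserved": 0, "allocation_ratio": 4.0}, ...}
    usage:     {"VCPU": 100, ...}
    """
    free = {}
    for resource_class, inv in inventory.items():
        capacity = int(
            (inv["total"] - inv.get("reserved", 0)) * inv.get("allocation_ratio", 1.0)
        )
        free[resource_class] = capacity - usage.get(resource_class, 0)
    return free


def fits(request: Dict[str, int], inventory: Dict[str, dict],
         usage: Dict[str, int]) -> Dict[str, bool]:
    """Per resource class: does the requested amount fit on this provider?"""
    free = free_capacity(inventory, usage)
    result = {}
    for resource_class, amount in request.items():
        inv = inventory.get(resource_class)
        result[resource_class] = (
            inv is not None                          # provider has this class
            and amount <= free[resource_class]       # enough capacity left
            and amount <= inv.get("max_unit", amount)  # single-consumer cap
        )
    return result
```

Running this for three or so candidate hosts makes the contrast concrete: a host that fails on `MEMORY_MB` alone points at memory usage or leaked allocations, while a resource class that is missing from the inventory entirely points at a modeling problem.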
Common findings this catches
- Phantom full hypervisor: `openstack hypervisor show` says 90% RAM used, but only 4 of 64 VMs are present. Placement has 50 leaked allocations from old instances.
- Anti-affinity silently behaves like “soft-anti-affinity” because `ServerGroupAntiAffinityFilter` is not enabled, so VMs co-locate.
- PCI passthrough VMs fail: the host has the PCI device but it’s not exposed as a custom resource class in Placement (`CUSTOM_PCI_NVIDIA_A100`).
- All new VMs land on host-01: `host_subset_size = 1` (deterministic placement) instead of randomized.
- Filter rejects “everything”: `AggregateInstanceExtraSpecsFilter` is on, but no aggregate has the metadata your flavor requires.
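The "all new VMs land on host-01" finding follows from how the final pick works: hosts are ranked by weight and one is chosen at random from the top `host_subset_size` entries, so a subset of one always returns the best-weighted host. The sketch below illustrates that selection step only. It is not Nova's code; `pick_host` and the `(hostname, weight)` tuples are hypothetical.

```python
import random
from typing import Optional, Sequence, Tuple


def pick_host(
    weighed_hosts: Sequence[Tuple[str, float]],
    host_subset_size: int = 1,
    rng: Optional[random.Random] = None,
) -> str:
    """Pick a hostname at random from the N best-weighted candidates."""
    if not weighed_hosts:
        raise ValueError("no candidate hosts")
    rng = rng or random.Random()
    # Highest weight first, as a weigher pass would rank them.
    ranked = sorted(weighed_hosts, key=lambda host: host[1], reverse=True)
    # Clamp the subset size to the range [1, number of candidates].
    subset_size = max(1, min(host_subset_size, len(ranked)))
    return rng.choice(ranked[:subset_size])[0]
```

With `host_subset_size=1` the choice is deterministic, which is fine for packing but concentrates load and increases claim races when several schedulers run in parallel; a larger value spreads picks across the top candidates at the cost of sometimes choosing a slightly worse-weighted host.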
When to escalate
- Anything involving `nova-manage placement heal_allocations` on a busy cluster without a maintenance window: discuss with team first.
- Changes to `enabled_filters`: coordinate with capacity & quota team; scheduling behavior changes can silently shift VM density.
- Direct Placement allocation deletion: confirm the corresponding instance is truly gone in the Nova DB.
Related prompts
- OpenStack Request-ID Log Trace Prompt: Correlate a single API request across services (nova-api → conductor → scheduler → compute → neutron → cinder) using OpenStack request IDs.
- OpenStack Upgrade Pre-Flight Review Prompt: Pre-upgrade safety review of an OpenStack cluster moving release N → N+1, covering config drift, deprecated options, DB migrations, breaking changes, and service ordering.
- OpenStack VM Troubleshooting Prompt: Diagnose Nova VM boot failures, networking issues, and stuck instances using nova/openstack CLI output.