AI for OpenStack

Nova Scheduler Filter Analysis Prompt

Diagnose why VMs aren't landing on hosts — review scheduler filters, weighers, host aggregates, placement allocations, and capacity.

Target user: OpenStack platform engineers and capacity managers
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior OpenStack compute engineer with deep experience tuning the Nova scheduler and the Placement API across hundreds of compute hosts.

I will provide:
- The symptom: VMs fail to schedule (`NoValidHost`), schedule slowly, all land on one host, fail anti-affinity, fail PCI/NUMA, etc.
- `nova-scheduler` log output for one or more failing requests (with request-id)
- The current `[filter_scheduler]` config from `nova.conf`
- Output from `openstack hypervisor list`, `openstack compute service list`, `openstack aggregate list`
- Placement state: `openstack resource provider list`, `openstack resource provider inventory list <rp>`, `openstack resource provider usage show <rp>`

Your job:

1. **Identify the failure stage** in the scheduling pipeline:
   - Pre-filter (cell or aggregate exclusion)
   - Placement (resource providers don't satisfy claim — VCPU, MEMORY_MB, DISK_GB, custom resource classes)
   - Filter pass (one of the enabled filters rejected all candidates)
   - Weigher tie-break landing on a "worse" host than expected
   - Race / claim retry exhausted
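2. **For each candidate filter** (ComputeFilter, ServerGroupAntiAffinityFilter, AggregateInstanceExtraSpecsFilter, NUMATopologyFilter, PciPassthroughFilter, etc.), explain which one is most likely rejecting based on the log evidence. (RamFilter, CoreFilter and DiskFilter were removed in Train; RAM, vCPU and disk fit is now decided by Placement.)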
3. **Check placement allocations** — are there allocations on dead hosts? (Common cause of phantom capacity.)
4. **Map host aggregates to flavor extra_specs**: do the requested flavor's `aggregate_instance_extra_specs:` keys match the metadata of an existing aggregate?
5. **Suggest the minimum diagnostic next step** (specific commands, specific host).
6. **Label DANGEROUS actions**: scheduler restart, disabling filters live, force-deleting placement allocations, `nova-manage placement heal_allocations` without `--dry-run`, `nova-manage placement audit --delete`.

Common failure classes:
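- **`NoValidHost` with hosts visibly free** → Placement allocations leaked from deleted/failed instances; find them with `nova-manage placement audit` (add `--delete` to remove them). `heal_allocations` addresses the opposite problem: instances that are missing allocations
- **All VMs land on one host** → weighers stacking instead of spreading (e.g. a negative `ram_weight_multiplier` on RAMWeigher, or one weigher's multiplier so large it dominates the rest), or aggregates restricting the other hosts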
- **Anti-affinity fails on second VM** → server group exists but `ServerGroupAntiAffinityFilter` not in `enabled_filters`
- **NUMA / PCI / hugepage VMs fail** → host trait/resource not modeled in placement, or numa_topology incorrect
- **Slow scheduling (>10s)** → very large allocation-candidate responses or slow Placement queries; check `placement-api` logs and `[scheduler] max_placement_results`
- **Race retries exhausted** → claim conflicts on memory or vCPU; check `[scheduler] max_attempts` (formerly `scheduler_max_attempts`) and investigate nova-conductor logs

OpenStack release: [yoga / zed / antelope / bobcat / caracal / dalmatian / epoxy]
Scheduler config:
```ini
[PASTE [scheduler], [filter_scheduler] and [placement] sections]
```
Failing request log:
```
[PASTE]
```
Hypervisor/placement state:
```
[PASTE]
```

Why this prompt works

Nova scheduling failures span three subsystems: Nova-scheduler itself (filters + weighers), the Placement API (resource providers + allocations), and host aggregates (metadata-to-flavor matching). Models guess; this prompt forces them to walk each layer.
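A toy model can make the filter-and-weigher half of that pipeline concrete. The sketch below assumes Placement has already returned the candidate hosts. It keeps only the hosts every filter accepts, scores them with a single RAM weigher (raw values min-max normalized to 0..1 before the multiplier is applied, in the spirit of Nova's weighers), and picks at random among the top `host_subset_size`. The `Host` type, the predicate-style filters and `select_host` are hypothetical simplifications, not Nova's classes.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Host:
    """Hypothetical stand-in for a candidate compute host."""
    name: str
    free_ram_mb: int
    aggregates: frozenset = frozenset()


# A filter here is just a predicate: keep the host (True) or drop it (False).
HostFilter = Callable[[Host], bool]


def normalize(values: List[float]) -> List[float]:
    """Min-max scale raw weigher values into 0..1 (all zeros when they are equal)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def select_host(
    hosts: List[Host],
    filters: List[HostFilter],
    ram_weight_multiplier: float = 1.0,
    host_subset_size: int = 1,
    rng: Optional[random.Random] = None,
) -> Optional[Host]:
    rng = rng or random.Random()
    # Filter stage: a host survives only if every enabled filter accepts it.
    survivors = [h for h in hosts if all(f(h) for f in filters)]
    if not survivors:
        return None  # the situation Nova reports as NoValidHost
    # Weigh stage: one RAM weigher. A positive multiplier prefers free RAM
    # (spreading); a negative one prefers the fullest host (stacking).
    weights = normalize([float(h.free_ram_mb) for h in survivors])
    scored = sorted(
        zip(survivors, weights),
        key=lambda pair: pair[1] * ram_weight_multiplier,
        reverse=True,
    )
    # Selection: choose at random among the best host_subset_size hosts.
    top = scored[: max(1, host_subset_size)]
    return rng.choice(top)[0]


if __name__ == "__main__":
    hosts = [
        Host("host-01", 60000),
        Host("host-02", 40000),
        Host("host-03", 20000, frozenset({"gpu"})),
    ]
    print(select_host(hosts, []).name)  # host-01: most free RAM wins
    print(select_host(hosts, [], ram_weight_multiplier=-1.0).name)  # host-03
    gpu_only = lambda h: "gpu" in h.aggregates
    big_enough = lambda h: h.free_ram_mb >= 32768
    print(select_host(hosts, [gpu_only, big_enough]))  # None
```

With the default `host_subset_size = 1` the same top-weighted host wins every time; a negative multiplier flips the choice to the fullest host, which is the stacking pattern behind "all VMs land on one host".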

The single biggest cause of “NoValidHost even though hosts are free” in production is leaked Placement allocations — instances deleted from the Nova DB whose Placement allocation rows were never cleaned up. Until you know to look there, the cluster looks healthy.
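To test that theory on one suspect host, you can diff the consumers Placement holds against the host's resource provider with the instances Nova still reports on that host. A minimal sketch, assuming you saved `openstack resource provider show <rp-uuid> --allocations -f json` and `openstack server list --all-projects --host <host> -f json` to files, and that the first file has an `allocations` mapping keyed by consumer UUID; the helper names are hypothetical. In-progress migrations hold allocations under a migration UUID, so every hit is a candidate to verify, not proof of a leak. `nova-manage placement audit` performs this check against the Nova database itself.

```python
import json
import sys
from typing import Iterable, Set


def candidate_orphans(allocation_consumers: Iterable[str], live_server_ids: Iterable[str]) -> Set[str]:
    """Consumers holding allocations on the provider with no live server on the host."""
    return set(allocation_consumers) - set(live_server_ids)


def load_consumers(path: str) -> Set[str]:
    # Assumption: the saved JSON has an "allocations" mapping keyed by consumer UUID.
    with open(path) as fh:
        return set(json.load(fh).get("allocations") or {})


def load_server_ids(path: str) -> Set[str]:
    # `openstack server list -f json` emits a list of rows with an "ID" field.
    with open(path) as fh:
        return {row["ID"] for row in json.load(fh)}


if __name__ == "__main__" and len(sys.argv) == 3:
    consumers = load_consumers(sys.argv[1])
    servers = load_server_ids(sys.argv[2])
    for consumer in sorted(candidate_orphans(consumers, servers)):
        # Verify each UUID by hand: migration records also show up here.
        print(consumer)
```

Look up each printed UUID with `openstack resource provider allocation show <consumer-uuid>` before touching anything.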

How to use it

  1. Find the failing request ID in the Nova API log first, then grep for it in the nova-scheduler and nova-conductor logs.
  2. Always include the relevant flavor’s extra_specs: `openstack flavor show <flavor>`. Aggregate extra-spec matching is the second-most-common source of confusing failures.
  3. Include `openstack resource provider list` plus `inventory list` and `usage show` for at least 3 candidate hosts. Comparing them side by side is informative.
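When comparing candidate hosts, the number Placement actually checks is usable capacity per resource class, `(total - reserved) * allocation_ratio`, minus current usage. A minimal sketch, assuming the JSON output of `openstack resource provider inventory list <rp-uuid> -f json` (rows with `resource_class`, `total`, `reserved`, `allocation_ratio`) and `openstack resource provider usage show <rp-uuid> -f json` (rows with `resource_class`, `usage`); `free_capacity` is a hypothetical helper. It ignores `min_unit`, `max_unit` and `step_size`, which can still reject a request that fits the free total.

```python
import json
import sys
from typing import Dict, List


def free_capacity(inventory: List[dict], usage: List[dict]) -> Dict[str, float]:
    """Free amount per resource class, counted the way Placement counts capacity."""
    used = {row["resource_class"]: row["usage"] for row in usage}
    free: Dict[str, float] = {}
    for row in inventory:
        rc = row["resource_class"]
        # Placement capacity: (total - reserved) * allocation_ratio
        capacity = (row["total"] - row["reserved"]) * row["allocation_ratio"]
        free[rc] = capacity - used.get(rc, 0)
    return free


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1]) as fh:
        inventory = json.load(fh)
    with open(sys.argv[2]) as fh:
        usage = json.load(fh)
    for rc, amount in sorted(free_capacity(inventory, usage).items()):
        print(f"{rc:<16} free={amount:g}")
```

Run it for three hosts side by side: a host that runs few VMs but shows little free MEMORY_MB here fits the leaked-allocation pattern.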

Useful commands

```bash
# Scheduler view (systemd unit names vary by distribution)
sudo journalctl -u nova-scheduler -n 200 --no-pager
sudo grep <request-id> /var/log/nova/*.log

# Placement view (requires the osc-placement client plugin)
openstack resource provider list
openstack resource provider inventory list <rp-uuid>
openstack resource provider usage show <rp-uuid>
openstack resource provider allocation show <consumer-uuid>

# Aggregates and flavors
openstack aggregate list --long
openstack aggregate show <agg>
openstack flavor show <flavor>  # check extra_specs aggregate_instance_extra_specs:foo=bar

# Detect leaked (orphaned) allocations; add --delete only after reviewing the output
nova-manage placement audit --verbose

# Detect instances that are missing allocations (read-only preview)
nova-manage placement heal_allocations --dry-run

# Per-host investigation
openstack hypervisor show <hostname>
ssh <host> 'sudo systemctl status nova-compute && sudo journalctl -u nova-compute -n 100 --no-pager'
```
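Once the scheduler log for a request is in hand, a small script can show which filter took the candidate count to zero. A sketch, assuming DEBUG logging on nova-scheduler and log lines of the form `Filter <Name> returned <N> host(s)` (or `... returned 0 hosts`); confirm the exact wording against your own log before trusting the regex. `filter_counts` and `first_zero` are hypothetical helpers.

```python
import re
import sys
from typing import Iterable, List, Optional, Tuple

# Assumed log wording; adjust the pattern if your release phrases it differently.
FILTER_LINE = re.compile(r"Filter (\w+) returned (\d+) host(?:s|\(s\))")


def filter_counts(lines: Iterable[str], request_id: str) -> List[Tuple[str, int]]:
    """(filter name, hosts remaining) in log order, for lines mentioning request_id."""
    results = []
    for line in lines:
        if request_id not in line:
            continue
        match = FILTER_LINE.search(line)
        if match:
            results.append((match.group(1), int(match.group(2))))
    return results


def first_zero(counts: List[Tuple[str, int]]) -> Optional[str]:
    """Name of the first filter that left no hosts, if any."""
    for name, remaining in counts:
        if remaining == 0:
            return name
    return None


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], errors="replace") as fh:
        counts = filter_counts(fh, sys.argv[2])
    for name, remaining in counts:
        print(f"{name:<40} {remaining}")
    print("first filter to reject all hosts:", first_zero(counts))
```

A request that was retried appears more than once; `first_zero` reports only the first pass that emptied the host list.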

Common findings this catches

  • Phantom full hypervisor: Placement usage (`openstack resource provider usage show <rp-uuid>`) shows ~90% of MEMORY_MB consumed, yet only 4 VMs are actually running there. Placement holds 50 leaked allocations from old instances.
  • Anti-affinity not enforced at scheduling time because ServerGroupAntiAffinityFilter is not in `enabled_filters`, so members of an anti-affinity group can be picked for the same host.
  • PCI passthrough VMs fail: the host has the device, but it is not listed in `[pci] device_spec` (named `passthrough_whitelist` before Zed) or, when PCI tracking in Placement is enabled, it is not modeled as a resource class such as CUSTOM_PCI_NVIDIA_A100.
  • All new VMs land on host-01: `host_subset_size = 1` (the default) always takes the single top-weighted host, so placement is deterministic rather than randomized among the top N.
  • Filter rejects “everything”: AggregateInstanceExtraSpecsFilter is on, but no aggregate has the metadata your flavor requires.
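That last finding comes down to a set comparison: every `aggregate_instance_extra_specs:` key on the flavor must be satisfied by the metadata of some aggregate the host belongs to. A deliberately simplified sketch using plain string equality, with hypothetical helper names, assuming each aggregate is given as a dict whose `properties` field is a key/value mapping (check how your client version renders `openstack aggregate show <agg> -f json`). The real AggregateInstanceExtraSpecsFilter also supports operator syntax such as `<in>` or `s==` and considers unscoped extra-spec keys, so use this only to spot obvious mismatches.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

SCOPE = "aggregate_instance_extra_specs:"


def host_metadata(aggregates: Iterable[dict]) -> Dict[str, Set[str]]:
    """Union the metadata of all aggregates a host is in; split comma-separated values."""
    merged: Dict[str, Set[str]] = defaultdict(set)
    for agg in aggregates:
        for key, value in (agg.get("properties") or {}).items():
            merged[key].update(v.strip() for v in str(value).split(","))
    return merged


def unmatched_specs(extra_specs: Dict[str, str], aggregates: Iterable[dict]) -> List[str]:
    """Scoped flavor extra specs that none of this host's aggregates satisfy."""
    metadata = host_metadata(aggregates)
    missing = []
    for key, wanted in extra_specs.items():
        if not key.startswith(SCOPE):
            continue  # simplification: unscoped and other-scope keys are skipped
        meta_key = key[len(SCOPE):]
        if wanted not in metadata.get(meta_key, set()):
            missing.append(f"{key}={wanted}")
    return missing
```

An empty list for a host means its aggregates carry every scoped key the flavor asks for; a non-empty list names the metadata that is missing.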

When to escalate

  • Running `nova-manage placement heal_allocations` without `--dry-run`, or `nova-manage placement audit --delete`, on a busy cluster without a maintenance window: discuss with the team first.
  • Changes to enabled_filters — coordinate with capacity & quota team; scheduling behavior changes can silently shift VM density.
  • Direct Placement allocation deletion: confirm the corresponding instance is truly gone in the Nova DB and that the consumer is not an in-progress migration.
