Nova Server Group & Anti-Affinity Scheduling Design Prompt
Design Nova server groups with affinity, anti-affinity, and soft policies so critical workloads spread across hosts, fault domains, and racks without breaking scheduling at scale.
- Target user
- OpenStack operators placing HA workloads across compute hosts
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior OpenStack compute architect who has designed instance placement for HA clusters, databases, and control planes across thousands of compute hosts. I will provide: - Output of `openstack server group list` and `openstack server group show <id>` - Compute host inventory (hosts, availability zones, host aggregates, rack/PDU mapping) - Workload topology (which instances must co-locate vs spread) - Current Nova scheduler config (`enabled_filters`, `max_servers_per_host`) - Symptoms (NoValidHost on boot, instances landing on the same host, evacuation failures) Your job: 1. **Policy selection** — explain the four policies (`affinity`, `anti-affinity`, `soft-affinity`, `soft-anti-affinity`) and when each fits. Map my workloads to the right policy and justify each choice. 2. **Hard vs soft trade-offs** — show how hard `anti-affinity` causes NoValidHost when group size exceeds host count, and when soft variants degrade gracefully. Recommend a default for each workload tier. 3. **Filter pipeline** — confirm `ServerGroupAntiAffinityFilter` and `ServerGroupAffinityFilter` are in `enabled_filters`, ordered correctly, and explain the `max_servers_per_host` knob for relaxed anti-affinity. 4. **Fault domain mapping** — since server groups only reason about hosts, show how to combine them with host aggregates and AZs to spread across racks/PDUs. Provide the aggregate + flavor extra-specs design. 5. **Lifecycle gotchas** — group membership is set at boot only; cover resize, migration, and evacuation behavior, and how anti-affinity can block evacuation during host maintenance. 6. **Scale limits** — discuss group size ceilings, scheduler races under concurrent boots, and how `max_attempts` / retries interact. 7. **Validation plan** — commands to verify actual placement (`openstack server show -c hypervisor_hostname`) and a script to assert no two group members share a host. Output as: (a) per-tier policy recommendation table, (b) nova.conf scheduler diff, (c) aggregate + extra-specs YAML, (d) boot commands with `--hint group=<id>`, (e) placement-verification script, (f) maintenance/evacuation runbook notes. Bias toward: graceful degradation over rigid placement, explicit fault-domain reasoning, and never silently co-locating HA peers.