OpenStack Error Guide: 'Exceeded maximum number of retries'

Overview

“Exceeded maximum number of retries” means Nova tried to build an instance, the spawn failed on a host, it rescheduled to another host, that one failed too, and after max_attempts tries it ran out of hosts to try. The instance goes to ERROR.

You will see this in the instance fault or nova-conductor log:

Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 7c9e2f1a-44bb-4c0d-9e21-aa1122334455.

The crucial point: this message is a symptom of repeated rescheduling, not a root cause. Unlike “Build of instance aborted” (a single terminal failure), this error means each attempt failed for a reason Nova considered retryable, and you must dig into the per-host failures to find the real problem. Often every retry fails for the same underlying reason on different hosts (RPC timeout, transient bind failure), which makes it look like a cluster-wide outage.

Symptoms

Instance ends in ERROR with fault “Exceeded maximum number of retries. Exhausted all hosts…”.
The scheduler picked and tried multiple hosts (you can see them in the retry list).
Each attempt produced a RescheduledException on a different compute, not a BuildAbortException.
Small clusters fail fast (few hosts to try); large clusters burn through max_attempts then stop.

openstack server show batch-09 -c fault -f value

{'code': 500, 'message': 'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 7c9e2f1a-...', ...}

# Kolla-Ansible (controller)
docker logs nova_conductor 2>&1 | grep "7c9e2f1a-" | tail -20
# Traditional
sudo journalctl -u nova-conductor --no-pager | grep "7c9e2f1a-" | tail -20

Common Root Causes

1. max_attempts is too low for a transient issue

[scheduler] max_attempts (default 3) caps how many hosts Nova will try. With a transient, fleet-wide hiccup, 3 attempts exhaust quickly. Raising it masks the real problem, but knowing the value frames the failure.

grep -E '^\s*max_attempts' /etc/nova/nova.conf
# Kolla-Ansible
docker exec nova_scheduler grep -E 'max_attempts' /etc/nova/nova.conf

max_attempts = 3

2. RetryFilter / per-host retry behavior

The scheduler tracks already-tried hosts in RequestSpec.retry and excludes them on the next pass (historically the RetryFilter; in newer releases this is built into the scheduler). If every remaining host also fails, the list of candidates empties and you exhaust retries.

grep -E '^\s*enabled_filters' /etc/nova/nova.conf

enabled_filters = AvailabilityZoneFilter,ComputeFilter,ComputeCapabilitiesFilter,ImagePropertiesFilter,ServerGroupAntiAffinityFilter,ServerGroupAffinityFilter
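The exclusion behaviour described above can be illustrated with a toy model. This is only a sketch, not Nova code: simulate_build is a hypothetical helper, and it assumes every candidate host fails with a retryable error.

```shell
# Toy model of the reschedule loop; this is not Nova code.
# Usage: simulate_build MAX_ATTEMPTS HOST...
# Assumption: every host fails with a retryable error.
simulate_build() {
  max_attempts=$1
  shift
  tried=0
  for host in "$@"; do
    # Stop once the attempt cap is reached, even if hosts remain.
    if [ "$tried" -ge "$max_attempts" ]; then
      break
    fi
    tried=$((tried + 1))
    echo "attempt $tried: $host failed, excluded from the next pass"
  done
  echo "exhausted after $tried attempt(s)"
}
```

Calling it with max_attempts 3 and five hosts stops after three attempts; calling it with only two hosts stops after two, which is why small clusters fail fast.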

3. Transient port binding failures on spawn

If Neutron port binding is flaky (an agent flapping, neutron-server overloaded), each host’s spawn can hit a binding_failed port or a VIF plugging timeout, raise RescheduledException, and move on until hosts run out. The retry message then hides a binding problem.

docker logs nova_compute 2>&1 | grep -iE "binding_failed|VirtualInterfaceCreateException" | tail -10

4. RPC timeouts during spawn

An overloaded message queue (RabbitMQ) or slow conductor causes MessagingTimeout during the build, which Nova treats as retryable. Every host times out the same way.

docker logs nova_compute 2>&1 | grep -iE "MessagingTimeout|Timed out waiting for a reply" | tail -10

5. Anti-affinity with too few hosts

An anti-affinity server group (enforced by ServerGroupAntiAffinityFilter) that needs N distinct hosts in a cluster with fewer than N available will reschedule until candidates are exhausted, with each placement rejected by the anti-affinity rule.

openstack server group show <GROUP_ID> -c policy -c members -f value

anti-affinity
['inst-a', 'inst-b']

6. All suitable computes are full or disabled

If every host that passes the flavor’s filters is at capacity or nova-compute is disabled/down, each retry either finds no host or fails the claim, and you exhaust attempts.

openstack compute service list --service nova-compute
openstack hypervisor list --long

| compute-01 | nova-compute | nova | enabled  | down |
| compute-02 | nova-compute | nova | disabled | up   |
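On a large fleet it can help to print only the computes that cannot take a build. The filter below is a minimal sketch: unusable_computes is a hypothetical helper name, and it assumes the three-column "host status state" layout produced by the -f value -c Host -c Status -c State output options.

```shell
# Print every compute that is not both enabled and up.
# Expects lines of the form: <host> <status> <state>
unusable_computes() {
  awk '$2 != "enabled" || $3 != "up" { print $1 " (" $2 "/" $3 ")" }'
}

# Example (not run here):
#   openstack compute service list --service nova-compute \
#     -f value -c Host -c Status -c State | unusable_computes
```

If the output lists most of the hosts that match the flavor, the candidate pool is the problem rather than any single spawn failure.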

Diagnostic Workflow

Step 1: Confirm exhaustion and grab the instance UUID

openstack server show <SERVER> -c fault -f value

The “Exhausted all hosts” wording confirms repeated rescheduling. Note the instance UUID.

Step 2: List the hosts that were tried

# Kolla-Ansible (controller)
docker logs nova_conductor 2>&1 | grep "<INSTANCE_UUID>" | grep -iE "retry|reschedul" | tail -20
# Traditional
sudo journalctl -u nova-conductor --no-pager | grep "<INSTANCE_UUID>" | grep -iE "retry|reschedul" | tail -20

WARNING nova.scheduler.utils Setting instance to ERROR state; ... [instance: 7c9e2f1a-...] retry: {'num_attempts': 3, 'hosts': [['compute-01', ...], ['compute-04', ...], ['compute-07', ...]]}

This gives you the exact hosts to inspect.
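To avoid copying host names by hand, the retry line can be reduced to a plain host list. This is a sketch that assumes the log line has the shape shown above, where each inner list starts with the quoted host name; tried_hosts is a hypothetical helper name.

```shell
# Extract the tried hosts, in attempt order, from conductor log lines on stdin.
# Assumes each entry looks like ['<host>', ...] inside the 'hosts' list.
tried_hosts() {
  grep -oE "\['[^']+'" |               # keep only "['<host>'" fragments
    sed -e "s/^\[//" -e "s/'//g" |     # strip the bracket and the quotes
    awk '!seen[$0]++'                  # drop repeats, keep first-seen order
}

# Example (not run here):
#   docker logs nova_conductor 2>&1 | grep "<INSTANCE_UUID>" | tried_hosts
```

The resulting list can feed the per-host loop in the next step directly.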

Step 3: Read the REAL per-host failure on each tried host

The retry message never tells you why. Pull nova-compute on the tried hosts and find the RescheduledException cause:

for h in compute-01 compute-04 compute-07; do
  echo "=== $h ==="
  ssh $h "docker logs nova_compute 2>&1 | grep '<INSTANCE_UUID>' | tail -15"
done

=== compute-01 ===
ERROR nova.compute.manager [instance: 7c9e2f1a-...] Instance failed to spawn: MessagingTimeout: Timed out waiting for a reply to message ID ...

If every host shows the same cause, that is your real problem.
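A quick tally makes the "same cause everywhere" pattern obvious. The following is a rough sketch that assumes the failure lines follow the "Instance failed to spawn: <Exception>: <message>" shape shown above; count_spawn_failures is a hypothetical helper name.

```shell
# Count spawn failures by exception class from nova-compute log lines on stdin.
count_spawn_failures() {
  sed -n 's/.*Instance failed to spawn: \([A-Za-z0-9_.]*\):.*/\1/p' |
    sed 's/.*\.//' |    # drop any module prefix such as oslo_messaging.exceptions.
    sort | uniq -c | sort -rn
}

# Example (not run here): pipe the output of the per-host loop into it.
#   for h in compute-01 compute-04 compute-07; do
#     ssh "$h" "docker logs nova_compute 2>&1 | grep '<INSTANCE_UUID>'"
#   done | count_spawn_failures
```

A single exception class at the top of the tally points at a shared dependency; a mix of classes suggests several unrelated host problems.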

Step 4: Check the suspected shared dependency

# Neutron agents (if binding-related)
openstack network agent list
# RabbitMQ / RPC health
docker exec rabbitmq rabbitmqctl list_queues name messages consumers | sort -k2 -n | tail
# Capacity / service state
openstack compute service list --service nova-compute
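When the queue list is long, a filter can surface only the suspicious queues. This sketch assumes the tab-separated name, messages, consumers columns requested above; backlogged_queues is a hypothetical helper, and the default threshold of 1000 messages is an arbitrary illustration, not a recommended value.

```shell
# Flag queues that exceed a backlog threshold, or that hold messages with no consumer.
# Usage: ... | backlogged_queues [THRESHOLD]
backlogged_queues() {
  threshold=${1:-1000}
  awk -F'\t' -v limit="$threshold" '
    # Skip banner and header lines: require 3 fields and a numeric count.
    NF == 3 && $2 ~ /^[0-9]+$/ &&
    ($2 + 0 > limit || ($2 + 0 > 0 && $3 + 0 == 0)) {
      print $1 ": " $2 " messages, " $3 " consumers"
    }'
}

# Example (not run here):
#   docker exec rabbitmq rabbitmqctl list_queues name messages consumers | backlogged_queues 5000
```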

Step 5: Fix the shared cause, then retry the build

Address the per-host failure (restart a flapping agent, drain RabbitMQ backlog, enable/free computes), then delete and recreate the instance:

openstack server delete <SERVER>
openstack server create --flavor <F> --image <I> --network <N> <SERVER>
# Only raise max_attempts if the failure is genuinely transient:
# [scheduler] max_attempts = 5

Example Root Cause Analysis

A nightly batch job fires 40 instances; about a third land in ERROR with:

Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 3b1c... .

The conductor log shows each failed instance tried three different hosts, so it is not one bad node. Pulling nova-compute on the tried hosts shows the same line everywhere:

ERROR nova.compute.manager [instance: 3b1c-...] Instance failed to spawn: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 9d2f...

Every host times out the same way during spawn, which points at the message bus, not the computes. Checking RabbitMQ during the batch window:

docker exec rabbitmq rabbitmqctl list_queues name messages | sort -k2 -n | tail -5

reply_8f3a...    14021

The reply queue backed up to more than 14,000 messages because the 40 simultaneous builds overwhelmed a single RabbitMQ node. Each spawn RPC timed out, Nova rescheduled, the next host timed out too, and attempts exhausted. Fix: stagger the batch and scale RabbitMQ:

# Throttle concurrency in the batch tool, and on the controllers:
# [DEFAULT] rpc_response_timeout = 120
docker restart rabbitmq    # after clustering / adding a node

With the bus no longer saturated, builds spawn on the first host and stop exhausting retries.
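Staggering the batch can be as simple as pausing between small groups of requests. The helper below is a sketch under stated assumptions: run_in_batches is a hypothetical name, the batch tool is assumed to be a shell script, and the group size and pause are placeholders to tune against your own message bus.

```shell
# Run a command once per name read from stdin, pausing after every SIZE names.
# Usage: run_in_batches SIZE PAUSE_SECONDS COMMAND [ARGS...] < names
run_in_batches() {
  size=$1
  pause=$2
  shift 2
  n=0
  while IFS= read -r name; do
    # Redirect stdin so the command cannot swallow the remaining names.
    "$@" "$name" </dev/null
    n=$((n + 1))
    if [ $((n % size)) -eq 0 ]; then
      sleep "$pause"
    fi
  done
}

# Example (not run here): 5 builds at a time, 30 seconds apart.
#   run_in_batches 5 30 openstack server create \
#     --flavor "$FLAVOR" --image "$IMAGE" --network "$NETWORK" < instance-names.txt
```

Without --wait, openstack server create returns once the API accepts the request, so the pause is what spreads the spawn RPCs over time.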

Prevention Best Practices

Treat this error as “find the per-host cause”, never as “raise max_attempts” — bumping retries just delays the same exhaustion.
Monitor RabbitMQ queue depth and size rpc_response_timeout for your peak load; bursty parallel builds are the classic trigger for spawn-time RPC timeouts.
Alert on Neutron agent liveness so flapping agents do not turn into fleet-wide reschedule storms.
Right-size anti-affinity groups: never request more distinct hosts than the AZ actually has available.
Keep enough headroom and watch for disabled/down nova-compute services so the candidate pool never collapses under load.

Quick Command Reference

# Confirm exhaustion
openstack server show <SERVER> -c fault -f value

# Which hosts were tried?
docker logs nova_conductor 2>&1 | grep "<INSTANCE_UUID>" | grep -iE "retry|reschedul" | tail -20
sudo journalctl -u nova-conductor --no-pager | grep "<INSTANCE_UUID>" | grep -iE "retry|reschedul" | tail -20

# The REAL per-host failure
docker logs nova_compute 2>&1 | grep "<INSTANCE_UUID>" | tail -15

# Shared-dependency health
docker exec rabbitmq rabbitmqctl list_queues name messages consumers | sort -k2 -n | tail
openstack network agent list
openstack compute service list --service nova-compute

# Config that frames the failure
grep -E 'max_attempts|enabled_filters|rpc_response_timeout' /etc/nova/nova.conf

# Retry the build after fixing the cause
openstack server delete <SERVER> && openstack server create ...

Conclusion

“Exceeded maximum number of retries. Exhausted all hosts” is a summary, not a root cause: Nova rescheduled across multiple hosts and every attempt failed for a reason it considered retryable.

max_attempts caps how many hosts are tried before exhaustion.
The scheduler excludes already-tried hosts, so the candidate list can empty out.
Transient port binding failures make each host fail the same way.
RPC/MessagingTimeout during spawn (overloaded RabbitMQ) is a frequent cluster-wide cause.
Anti-affinity groups larger than the available host count run out of candidates.
All filter-passing computes may be full, disabled, or down.

Pull the list of tried hosts from the conductor log, read the actual per-host RescheduledException cause, and fix the shared dependency — not the retry count.
