Cinder Scheduler Timeout: Diagnose & Fix

A cinder-scheduler MessagingTimeout means an RPC hop in the Cinder volume path never got its reply — a dead cinder-volume, a backed-up RabbitMQ, or a backend that can't report capabilities. This runbook walks the cinder-api → scheduler → volume path with copy-paste commands so you can tell a real timeout apart from a 'filtering removed all hosts' capacity problem and fix the right hop.

Updated July 3, 2026 · 11 min read · Runbook-style guide · copy/paste commands

Free runbook · PDF

Download the free RabbitMQ RPC Timeout Runbook Pack

A copy/paste runbook for oslo.messaging timeouts and missed heartbeats — cluster health, queue depth, and a service restart decision tree.

  • OpenStack RPC timeout checklist
  • RabbitMQ cluster + queue depth commands
  • Consumer / publisher + heartbeat checks
  • oslo.messaging config review
  • Nova/Cinder/Neutron/Heat symptom matrix
  • Service restart decision tree + notes template

No account needed · single opt-in · we never share your email.

When you create a volume, cinder-api validates the request and RPC-casts it onto RabbitMQ for cinder-scheduler. The scheduler runs its filters and weighers over every backend it knows about, picks the best cinder-volume host, and RPC-casts the create down to that host, which drives the actual storage backend (LVM, Ceph, NetApp, and friends). Casts are fire-and-forget, so a stalled cast tends to leave the volume sitting in creating rather than raising an error. A cinder scheduler timeout (an oslo.messaging MessagingTimeout raised in the scheduler or API) comes from an RPC on the same path that does wait for a reply, such as the call behind cinder get-pools, and never got it in time. The usual culprits are a cinder-volume that is down or disconnected, a RabbitMQ queue that is backed up or has no consumer, or a backend array so slow that cinder-volume blocks and never answers.

Before you start, separate two failure modes that look similar in the CLI but have opposite root causes. A MessagingTimeout is a messaging/availability failure: the request could not reach a healthy service to be scheduled at all. "No valid backend found" / "filtering removed all hosts" is the opposite — the scheduler did run and every candidate backend was rejected by a filter, so it is a capacity or capabilities problem. Chasing RabbitMQ when the real issue is an exhausted backend (or vice versa) is how a 20-minute incident becomes a two-hour one. This guide is the runbook I use to tell them apart fast; want it in your hand during the incident? Grab the free runbook pack above.

Symptoms

You are probably here because you are seeing one or more of these:

  • New volumes hang in creating (and never reach available), or land in error after a long wait.
  • cinder-scheduler logs MessagingTimeout: Timed out waiting for a reply to message ID ....
  • openstack volume service list shows a cinder-volume as down with a stale updated_at.
  • cinder get-pools --detail hangs and eventually times out instead of returning your pools.
  • The API returns No valid backend found, and the scheduler log reads Filter ... returned 0 hosts / filtering removed all hosts.

Likely causes

In production OpenStack, a Cinder scheduler timeout almost always traces back to one of these:

  • cinder-volume down or disconnected. If the volume service isn’t running (or lost RabbitMQ), it stops sending capability reports, the scheduler’s view of that backend goes stale, and RPC casts to it time out.
  • RabbitMQ RPC backlog or heartbeats. A backed-up cinder-scheduler/cinder-volume queue, a dead consumer, or missed heartbeats stalls the reply — the classic cinder scheduler rpc timeout.
  • Backend storage slow or unreachable. A degraded Ceph cluster, a full LVM VG, or a NetApp/SAN that is not answering makes cinder-volume block on the driver, so it can’t reply to the scheduler or publish capabilities.
  • Scheduler filters too strict / capacity exhausted. When every backend fails a filter you get filtering removed all hosts — a capacity or extra-specs mismatch, not a timeout.
  • Capabilities not published. If report_interval is too long relative to service_down_time, or reports are dropped, the scheduler treats live backends as down.
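
The last cause above is plain arithmetic between two cinder.conf options. A minimal sketch, assuming the stock values of report_interval = 10 and service_down_time = 60 seconds; the function name and the "at least three reports must fit" threshold are an illustrative rule of thumb, not Cinder settings:

```shell
#!/usr/bin/env bash
# Hypothetical helper: how many capability reports fit inside service_down_time?
# min_fit (default 3) is an assumed rule of thumb, not a Cinder option.
check_report_margin() {
  local report_interval=$1 service_down_time=$2 min_fit=${3:-3}
  local fits=$(( service_down_time / report_interval ))
  if (( fits < min_fit )); then
    echo "tight: only ${fits} report(s) fit in service_down_time"
    return 1
  fi
  echo "ok: ${fits} reports fit in service_down_time"
}

check_report_margin 10 60   # stock values
```

With a margin this small or smaller, one or two dropped reports are enough for the scheduler to mark a live backend as down.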

Immediate checks

Sixty seconds of triage tells you which of the two failure modes you’re in. Start with the service list — is the volume backend even up, and how fresh is its heartbeat?

Is cinder-volume up, and how stale is its heartbeat?
# State + Updated At: a 'down' state or an old timestamp is your lead
openstack volume service list

# Same view from the cinder CLI (shows the host@backend and State/Updated_At)
cinder service-list

A cinder-volume in state down with a stale Updated At means it stopped reporting — the scheduler has no fresh backend state and will time out. If every service is up and recent, suspect a filter/capacity problem instead.
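
To turn "stale" into a number, the Updated At value can be compared against service_down_time. A rough sketch, assuming the timestamp is UTC, that GNU date is available, and that the threshold is the stock 60 seconds; the helper name is made up for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: is an "Updated At" value older than service_down_time?
# Args: timestamp (UTC), current time (epoch seconds), limit in seconds (default 60)
heartbeat_state() {
  local ts=$1 now=$2 limit=${3:-60}
  ts=${ts%%.*}    # drop fractional seconds
  ts=${ts/T/ }    # 2026-07-03T10:00:00 -> 2026-07-03 10:00:00
  local seen age
  seen=$(date -u -d "$ts" +%s) || return 2   # GNU date syntax (assumption)
  age=$(( now - seen ))
  if (( age > limit )); then
    echo "stale ${age}s"
  else
    echo "fresh ${age}s"
  fi
}

# Usage (the timestamp is a made-up example):
# heartbeat_state "2026-07-03T10:00:00.000000" "$(date -u +%s)" 60
```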

Now read what the scheduler and volume services are actually logging. The distinction between a MessagingTimeout and a filtering removed all hosts line is the whole ballgame:

Tail the scheduler and volume logs for the deciding line
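# MessagingTimeout => availability/RPC problem;  'returned 0 hosts' => capacity/filter problem
# If docker logs is nearly empty, the service is likely logging to a file instead
# (on Kolla-Ansible usually /var/log/kolla/cinder/cinder-scheduler.log on the host)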
docker logs --tail=120 cinder_scheduler 2>&1 \
  | grep -Ei "MessagingTimeout|No valid backend|returned 0 hosts|filtering removed"

# Driver/backend errors on the volume side (blocked on the array => can't reply)
docker logs --tail=120 cinder_volume 2>&1 \
  | grep -Ei "error|timeout|traceback|unable to|refused"

Let the log pick your path: a MessagingTimeout sends you to the RabbitMQ + volume-service checks; a 'returned 0 hosts' line sends you to the capacity/capabilities checks.
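
When the log is noisy, counting both signals side by side makes the branch obvious. A small sketch that reuses the strings grepped for above; the function name and output format are assumptions for illustration:

```shell
#!/usr/bin/env bash
# Hypothetical helper: count RPC-timeout lines vs filter/capacity lines on stdin.
tally_cinder_signals() {
  awk '
    /MessagingTimeout/ { rpc++ }
    /returned 0 hosts|filtering removed all hosts|No valid backend/ { cap++ }
    END { printf "rpc=%d capacity=%d\n", rpc, cap }
  '
}

# Usage:
# docker logs --tail=500 cinder_scheduler 2>&1 | tally_cinder_signals
```

A non-zero rpc count points at the RabbitMQ and volume-service checks; a non-zero capacity count points at the filter and capacity checks.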

Diagnostic commands

RabbitMQ RPC timeout checks

Cinder queues, consumers, and heartbeats
# Scheduler/volume queues: messages climbing with consumers=0 means nobody is draining them
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
  | awk '$1 == "name" || $1 ~ /cinder-scheduler|cinder-volume/ {print}'

# Blocked connections (memory/disk watermark backpressure) and heartbeat drops
docker exec rabbitmq rabbitmqctl list_connections state | grep -c blocked
docker logs --tail=100 rabbitmq 2>&1 | grep -Ei "missed heartbeats|partition|closing"

consumers=0 on a cinder-volume queue means the volume service isn't subscribed — restart it so it re-consumes. Blocked connections mean RabbitMQ is applying backpressure. See the RabbitMQ RPC timeout guide for the full broker workup.
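
The "messages climbing with consumers=0" condition can be filtered mechanically. A sketch, assuming the three-column name/messages/consumers output requested above; the function name is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "name messages consumers" rows on stdin and print
# Cinder queues that hold messages but have nobody consuming them.
undrained_cinder_queues() {
  awk 'NF >= 3 && $1 ~ /^cinder-(scheduler|volume)/ && $2 ~ /^[0-9]+$/ && $2 > 0 && $3 == 0 {
         print $1, "messages=" $2
       }'
}

# Usage:
# docker exec rabbitmq rabbitmqctl list_queues name messages consumers | undrained_cinder_queues
```

Empty output means every Cinder queue with pending messages still has a consumer attached.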

Volume backend health

Is the backend driver actually reachable?
# What backend did cinder-volume try to talk to, and did the driver error?
docker logs --tail=200 cinder_volume 2>&1 \
  | grep -Ei "backend|driver|ceph|rbd|lvm|netapp|failed|timed out"

# Ceph backend: is the cluster healthy and responsive?
docker exec ceph_mon ceph -s 2>/dev/null || ceph -s

# LVM backend: does the volume group exist and have free extents?
vgs; vgdisplay cinder-volumes 2>/dev/null | grep -Ei "free|VG Size"

A cinder-volume that is 'up' but blocked on a slow or unreachable array will still fail to reply to the scheduler. Fix the storage layer first — restarting cinder-volume against a dead backend just moves the hang.
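
For the LVM case, the free-space check can be scripted against vgs output. A minimal sketch, assuming the VG is called cinder-volumes as in the command above and that sizes are reported in GiB without a unit suffix; the helper name is made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "vg_name vg_free" rows (GiB, no suffix) on stdin and
# report whether the named VG can hold a volume of the requested size.
vg_has_room() {
  local vg=$1 need_gib=$2
  awk -v vg="$vg" -v need="$need_gib" '
    $1 == vg { found = 1; if ($2 + 0 >= need + 0) ok = 1 }
    END {
      if (!found) { print vg ": not found"; exit 2 }
      if (ok)     { print vg ": ok" }
      else        { print vg ": too full"; exit 1 }
    }'
}

# Usage:
# vgs --noheadings --units g --nosuffix -o vg_name,vg_free | vg_has_room cinder-volumes 10
```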

cinder get-pools troubleshooting

What a good vs timing-out get-pools looks like
# Healthy: returns pools with capacity fields promptly (a second or two)
cinder get-pools --detail

# Time it — a hang here means the scheduler can't assemble fresh capability data
time cinder get-pools --detail

# Look for name, free_capacity_gb, total_capacity_gb, allocated_capacity_gb per pool

get-pools queries the scheduler, which needs fresh capability reports from every cinder-volume. A prompt reply lists your pools with capacity fields; a hang/timeout means a backend is unreachable or the scheduler is blocked on RPC — go back to the service list and RabbitMQ checks.
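
Rather than watching a hung terminal, the probe can run under a deadline. A sketch using coreutils timeout, which exits with 124 when the deadline is hit; the 15-second figure in the usage line and the helper name are assumptions:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command under a deadline and say whether it hung.
probe_with_deadline() {
  local secs=$1; shift
  local rc=0
  timeout "$secs" "$@" >/dev/null 2>&1 || rc=$?
  case $rc in
    0)   echo "replied" ;;
    124) echo "hung (> ${secs}s)" ;;
    *)   echo "failed (exit ${rc})" ;;
  esac
}

# Usage:
# probe_with_deadline 15 cinder get-pools --detail
```

A "hung" result is the RPC symptom described above; a "failed" result is usually an auth or CLI problem and says nothing about the scheduler.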

Scheduler filters

Which filters are active, and which one removed the hosts?
# Default filters if unset: AvailabilityZoneFilter, CapacityFilter, CapabilitiesFilter
docker exec cinder_scheduler grep -E "scheduler_default_filters|scheduler_default_weighers" \
  /etc/cinder/cinder.conf

# The scheduler names the filter that zeroed out the candidates
docker logs --tail=200 cinder_scheduler 2>&1 \
  | grep -Ei "Filter .* returned 0 hosts|filtering removed all hosts|passing candidates"

If CapacityFilter returned 0 hosts you're out of space; if CapabilitiesFilter did, a volume type's extra-specs don't match any backend; if AvailabilityZoneFilter did, no backend serves the requested AZ.
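
The mapping from filter name to next step is simple enough to encode. A sketch, assuming the log line has the "Filter X returned 0 hosts" shape quoted above; the helper name and hint wording are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical helper: extract the filter name from a "Filter X returned 0 hosts"
# line and map it to the next thing to check.
explain_filter() {
  local name
  name=$(sed -n 's/.*Filter \([A-Za-z]*\) returned 0 hosts.*/\1/p' <<<"$1")
  case "$name" in
    CapacityFilter)         echo "capacity: out of space" ;;
    CapabilitiesFilter)     echo "capabilities: extra-specs match no backend" ;;
    AvailabilityZoneFilter) echo "az: no backend serves the requested AZ" ;;
    "")                     echo "no filter line found" ;;
    *)                      echo "other: inspect ${name}" ;;
  esac
}

# Usage:
# explain_filter "$(docker logs --tail=200 cinder_scheduler 2>&1 | grep 'returned 0 hosts' | tail -n 1)"
```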

Backend capacity & capabilities

Free capacity, reservations, and over-subscription
# Per-pool capacity the scheduler is weighing on
cinder get-pools --detail \
  | grep -Ei "name|free_capacity_gb|total_capacity_gb|reserved_percentage|max_over_subscription"

# Volume type extra-specs that must match a backend's reported capabilities
openstack volume type list
openstack volume type show <type> -f value -c properties

'filtering removed all hosts' with healthy services almost always means free_capacity_gb is exhausted, reserved_percentage is eating the headroom, thin over-subscription is maxed, or the requested volume-type extra-specs match no backend.
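
How reserved_percentage eats headroom is easiest to see with numbers. The sketch below uses a deliberately simplified thick-provisioning model (usable = free minus the reserved share of total, whole GiB only); it is an assumption for illustration and ignores thin provisioning and over-subscription, which the real scheduler also weighs:

```shell
#!/usr/bin/env bash
# Simplified model (assumption): usable = free - floor(total * reserved% / 100).
# Integer GiB only; thin provisioning and over-subscription are not modelled.
usable_gb() {
  local free=$1 total=$2 reserved_pct=$3
  echo $(( free - total * reserved_pct / 100 ))
}

# Would a volume of SIZE GiB fit? Args: size free total reserved_pct
fits() {
  local size=$1; shift
  if (( $(usable_gb "$@") >= size )); then echo "fits"; else echo "rejected"; fi
}

usable_gb 100 1000 10   # 100 GiB free, 10% of 1000 GiB reserved -> prints 0
```

In this example the pool still reports 100 GiB free, yet a 1 GiB request is rejected because the reservation consumes all of it.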

Kolla-Ansible container checks

Are the Cinder containers actually running?
# Are cinder_scheduler / cinder_volume / cinder_api up, or restarting?
docker ps --filter "name=cinder" --format "table {{.Names}}\t{{.Status}}"

# Recent restarts or crash loops leave a trail in each container's log
docker logs --tail=60 cinder_scheduler 2>&1 | tail -n 20
docker logs --tail=60 cinder_volume 2>&1 | tail -n 20

A cinder_volume container stuck in a restart loop (short, resetting uptime in Status) explains a scheduler timeout on its own — the service never stays up long enough to publish capabilities.
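
Spotting a restart loop by eye is easy to get wrong at 3 a.m. A sketch that flags suspicious Status values; treating "up for only seconds" as a crash-loop hint is a heuristic assumption, and the helper name is made up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: read "name<TAB>status" rows on stdin and print containers
# whose Status suggests a crash loop (restarting, or up for only seconds).
flapping_containers() {
  awk -F '\t' '$2 ~ /^Restarting/ || $2 ~ /^Up (Less than a second|[0-9]+ seconds?)/ { print $1 }'
}

# Usage (no "table" prefix, so there is no header row to skip):
# docker ps -a --filter "name=cinder" --format '{{.Names}}\t{{.Status}}' | flapping_containers
```

Run it twice a minute apart: a container that shows up both times is not staying up.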

Fix & remediation steps

Map the deciding log line to the smallest safe remediation:

  • Scheduler can’t see the backend (MessagingTimeout, service down) → restart the consumer (cinder-volume) first so it re-registers and re-publishes capabilities.
  • Scheduler holding stale state after the volume recovers → restart cinder-scheduler so it rebuilds its host/pool view.
  • RabbitMQ consumer dead (consumers=0) → restarting cinder-volume re-subscribes it; treat the broker itself as a last resort.
  • "filtering removed all hosts" / capacity → do not restart anything. Add capacity, lower reserved_percentage, fix the volume-type extra-specs, or relax the offending filter.

Least-blast-radius restart (Kolla-Ansible)
# 1) Restart the consumer FIRST so it re-registers + re-publishes capabilities
docker restart cinder_volume

# 2) Only if the scheduler still holds stale backend state, restart it
docker restart cinder_scheduler

# 3) Confirm the service came back up and is heart-beating again
openstack volume service list

Restart cinder-volume before cinder-scheduler when the scheduler can't see the backend. If it's a filter/capacity problem, fix capacity or the filter instead — a restart won't conjure free space.
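
After the first restart, it helps to wait for the service to report up before deciding whether the scheduler needs a restart too. A polling sketch; the helper names, the 12 tries, and the 5-second delay are assumptions, and with several cinder-volume services the check passes as soon as any one of them is up:

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry a check until it passes or the attempts run out.
wait_until() {
  local tries=$1 delay=$2; shift 2
  local i
  for (( i = 1; i <= tries; i++ )); do
    if "$@"; then
      echo "ok after ${i} attempt(s)"
      return 0
    fi
    if (( i < tries )); then sleep "$delay"; fi
  done
  echo "gave up after ${tries} attempt(s)"
  return 1
}

# Hypothetical check: at least one cinder-volume service reports State "up".
volume_is_up() {
  openstack volume service list --service cinder-volume -f value -c State | grep -qx up
}

# Usage:
# docker restart cinder_volume && wait_until 12 5 volume_is_up
```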

Validation steps

Don’t declare victory on the restart alone — confirm the whole path is healthy again:

  • openstack volume service list shows the cinder-volume as up with a fresh updated_at.
  • cinder get-pools --detail returns promptly with real capacity fields (no hang).
  • The scheduler log no longer prints MessagingTimeout or returned 0 hosts on new requests.
  • An end-to-end create+delete of a small test volume succeeds and the scheduler picks a backend.

Post-fix validation
openstack volume service list -f value -c Binary -c State | sort | uniq -c
time cinder get-pools --detail >/dev/null && echo "get-pools OK"

# Prove the full API -> scheduler -> volume path end to end
openstack volume create --size 1 sched-check
# Creation is asynchronous: re-run this until status reads available (or error)
openstack volume show sched-check -f value -c status -c os-vol-host-attr:host
openstack volume delete sched-check

A 1 GiB volume that reaches 'available' with a host@backend#pool assigned confirms the scheduler picked a live backend over the RPC path. Clean it up so you don't leave test volumes behind.
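
The pass condition above (status available plus a host in host@backend#pool form) can be checked mechanically. A sketch; the helper name and verdict wording are illustrative, and the host value in the usage line is a made-up example:

```shell
#!/usr/bin/env bash
# Hypothetical helper: judge the test volume from its status and host fields.
verdict_for_volume() {
  local status=$1 host=$2
  case "$status" in
    available)
      case "$host" in
        ?*@?*#?*) echo "pass: scheduled to ${host}" ;;
        *)        echo "check: available but host is '${host}'" ;;
      esac ;;
    creating) echo "wait: still creating" ;;
    error)    echo "fail: volume went to error" ;;
    *)        echo "unknown status: ${status}" ;;
  esac
}

# Usage (example values only):
# verdict_for_volume available 'ctl1@lvm-1#lvm-1'
```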

Frequently asked questions

What causes a cinder scheduler timeout?
An oslo.messaging MessagingTimeout in cinder-scheduler means an RPC hop didn’t get a reply in time. In practice that is almost always a cinder-volume that is down or disconnected (so it stops publishing capabilities and the scheduler has no fresh backend state), a backed-up RabbitMQ queue or missed heartbeats, or a backend storage array that is so slow it blocks cinder-volume from answering. It is a messaging/availability problem, not usually a bug in the scheduler itself.
What does "filtering removed all hosts" mean?
It means the scheduler did run, evaluated every backend, and each one was rejected by a filter — so you end up with No valid backend found. This is a capacity or capabilities problem, not a pure timeout: a backend may be out of free_capacity_gb, fail the CapabilitiesFilter (volume type extra-specs don’t match), or be in the wrong availability zone. Read the Filter ... returned 0 hosts line to see which filter did the removing.
Why does cinder get-pools time out?
cinder get-pools --detail asks the scheduler to report the pool/capability data it has collected from every cinder-volume. If a backend is unreachable or the scheduler is blocked on RPC, the call hangs and eventually times out instead of returning promptly. A timing-out get-pools is a strong signal that the scheduler cannot get fresh capability reports — chase the cinder-volume service and RabbitMQ, not the CLI.
Is a cinder scheduler timeout a RabbitMQ problem?
Often, yes — the scheduler and volume services talk to each other over RabbitMQ, so a queue backlog, a dead consumer (consumers=0), or missed heartbeats will surface as a MessagingTimeout in Cinder. But not always: a healthy broker with a cinder-volume that is blocked on a slow backend produces the same error. Confirm queue depth and consumers first — our RabbitMQ RPC timeout guide walks the broker side in detail.
How do I safely restart cinder-volume and cinder-scheduler?
Verify there are no in-flight critical operations (a volume mid-migration, a large clone, an active backup) before restarting cinder-volume, because a restart mid-operation can leave a volume in a transitional state. When the scheduler cannot see a backend, restart the consumer (cinder-volume) first so it re-registers and re-publishes capabilities, then restart cinder-scheduler only if it still holds stale state. Restart the narrowest container that explains the symptom.
How do I fix cinder-volume showing down?
Run openstack volume service list and check updated_at — if it is older than service_down_time the service stopped heart-beating. First confirm the backend it drives (Ceph, LVM, NetApp) is reachable and healthy, then check its RabbitMQ connection and the cinder-volume log for driver errors. If the backend is fine and the service is simply wedged, docker restart cinder_volume makes it re-register and re-publish capabilities, which usually clears the scheduler timeout.