When you create a volume, cinder-api validates the request and RPC-casts it onto RabbitMQ for
cinder-scheduler. The scheduler runs its filters and weighers over every backend it knows about,
picks the best cinder-volume host, and RPC-casts the create down to that host, which drives the
actual storage backend (LVM, Ceph, NetApp, and friends). A cinder scheduler timeout — an
oslo.messaging MessagingTimeout raised in the scheduler or API — means one of those RPC hops asked
for a reply and never got it in time. The usual culprits are a cinder-volume that is down or
disconnected, a RabbitMQ queue that is backed up or has no consumer, or a backend array so slow that
cinder-volume blocks and never answers.
Before you start, separate two failure modes that look similar in the CLI but have opposite root causes. A MessagingTimeout is a messaging/availability failure: the request could not reach a healthy service to be scheduled at all. "No valid backend found" / "filtering removed all hosts" is the opposite: the scheduler did run and every candidate backend was rejected by a filter, so it is a capacity or capabilities problem. Chasing RabbitMQ when the real issue is an exhausted backend (or vice versa) is how a 20-minute incident becomes a two-hour one. This guide is the runbook I use to tell them apart fast.
Symptoms
You are probably here because you are seeing one or more of these:
- New volumes hang in creating (and never reach available), or land in error after a long wait.
- cinder-scheduler logs MessagingTimeout: Timed out waiting for a reply to message ID ....
- openstack volume service list shows a cinder-volume as down with a stale updated_at.
- cinder get-pools --detail hangs and eventually times out instead of returning your pools.
- The API returns No valid backend found, and the scheduler log reads Filter ... returned 0 hosts / filtering removed all hosts.
Likely causes
In production OpenStack, a Cinder scheduler timeout almost always traces back to one of these:
- cinder-volume down or disconnected. If the volume service isn’t running (or lost RabbitMQ), it stops sending capability reports, the scheduler’s view of that backend goes stale, and RPC casts to it time out.
- RabbitMQ RPC backlog or heartbeats. A backed-up cinder-scheduler / cinder-volume queue, a dead consumer, or missed heartbeats stalls the reply: the classic Cinder scheduler RPC timeout.
- Backend storage slow or unreachable. A degraded Ceph cluster, a full LVM VG, or a NetApp/SAN that is not answering makes cinder-volume block on the driver, so it can’t reply to the scheduler or publish capabilities.
- Scheduler filters too strict / capacity exhausted. When every backend fails a filter you get filtering removed all hosts, which is a capacity or extra-specs mismatch, not a timeout.
- Capabilities not published. If report_interval is too long relative to service_down_time, or reports are dropped, the scheduler treats live backends as down.
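The last cause comes down to arithmetic between two cinder.conf options. Here is a minimal sketch of that check, assuming the upstream defaults (report_interval = 10, service_down_time = 60) apply when an option is unset. The helper names conf_value and check_intervals are hypothetical, and the "at least three report intervals" threshold is an illustrative rule of thumb rather than a Cinder requirement.

```shell
#!/usr/bin/env bash
# Sketch: sanity-check report_interval against service_down_time.
# conf_value and check_intervals are hypothetical helpers, not Cinder tooling.

# Print the value of KEY from an INI-style file, or DEFAULT if it is unset.
# Commented-out lines ("#key = value") are ignored; sections are not tracked.
conf_value() {
  local file=$1 key=$2 default=$3 val
  val=$(awk -F= -v k="$key" '
    { name = $1; gsub(/[ \t]/, "", name) }
    name == k { v = $2; gsub(/[ \t]/, "", v); found = v }
    END { print found }' "$file")
  echo "${val:-$default}"
}

# Warn (exit 1) when service_down_time covers fewer than three report intervals.
# The factor 3 is an assumption chosen for illustration.
check_intervals() {
  local file=$1 report down
  report=$(conf_value "$file" report_interval 10)
  down=$(conf_value "$file" service_down_time 60)
  if [ "$down" -lt $((report * 3)) ]; then
    echo "WARN report_interval=$report service_down_time=$down"
    return 1
  fi
  echo "OK report_interval=$report service_down_time=$down"
}
```

Point check_intervals at the cinder.conf that your cinder-volume service actually reads. A WARN means one or two dropped reports are enough for the scheduler to mark a live backend as down.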
Immediate checks
Sixty seconds of triage tells you which of the two failure modes you’re in. Start with the service list — is the volume backend even up, and how fresh is its heartbeat?
# State + Updated At: a 'down' state or an old timestamp is your lead
openstack volume service list
# Same view from the cinder CLI (shows the host@backend and State/Updated_At)
cinder service-list
A cinder-volume in state down with a stale Updated At means it stopped reporting: the scheduler has no fresh backend state and will time out. If every service is up and recent, suspect a filter/capacity problem instead.
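To make "stale" concrete, the sketch below compares a heartbeat timestamp against a service_down_time threshold. It assumes the Updated At value is UTC in the form 2024-01-01T00:00:00.000000 and uses 60 seconds as the default threshold. The helper names to_epoch and heartbeat_stale are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: decide whether a service heartbeat is older than service_down_time.
# to_epoch and heartbeat_stale are hypothetical helper names.

# Convert "2024-01-01T00:00:00.000000" (assumed UTC) to epoch seconds.
to_epoch() {
  local ts
  # Drop fractional seconds and turn the "T" separator into a space.
  ts=$(printf '%s' "${1%.*}" | tr 'T' ' ')
  date -u -d "$ts" +%s
}

# Usage: heartbeat_stale UPDATED_AT NOW_EPOCH [SERVICE_DOWN_TIME]
# Prints the computed age; exits 0 when the heartbeat is stale, 1 when fresh.
heartbeat_stale() {
  local updated now=$2 limit=${3:-60} age
  updated=$(to_epoch "$1") || return 2
  age=$((now - updated))
  echo "age=${age}s limit=${limit}s"
  [ "$age" -gt "$limit" ]
}
```

In practice you would pass the Updated At value from the service list and "$(date -u +%s)" as the current time. Taking the current time as an argument keeps the function easy to test.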
Now read what the scheduler and volume services are actually logging. The distinction between a
MessagingTimeout and a filtering removed all hosts line is the whole ballgame:
# MessagingTimeout => availability/RPC problem; 'returned 0 hosts' => capacity/filter problem
docker logs --tail=120 cinder_scheduler 2>&1 \
| grep -Ei "MessagingTimeout|No valid backend|returned 0 hosts|filtering removed"
# Driver/backend errors on the volume side (blocked on the array => can't reply)
docker logs --tail=120 cinder_volume 2>&1 \
| grep -Ei "error|timeout|traceback|unable to|refused"
Let the log pick your path: a MessagingTimeout sends you to the RabbitMQ + volume-service checks; a 'returned 0 hosts' line sends you to the capacity/capabilities checks.
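The same decision can be sketched as a small filter that reads scheduler log lines and names the path to follow. The function name classify_cinder_log is hypothetical, and the matched strings are only the ones quoted in this guide; your release may word them differently.

```shell
#!/usr/bin/env bash
# Sketch: classify scheduler log output into the two failure modes.
# classify_cinder_log is a hypothetical helper; patterns come from this guide.

# Usage: <log-producing command> | classify_cinder_log
# Prints one of: rpc, filter, both, unknown.
classify_cinder_log() {
  awk '
    { line = tolower($0) }
    line ~ /messagingtimeout/ { rpc++ }
    line ~ /no valid backend|returned 0 hosts|filtering removed all hosts/ { filt++ }
    END {
      if (rpc > 0 && filt > 0) print "both"
      else if (rpc > 0)        print "rpc"
      else if (filt > 0)       print "filter"
      else                     print "unknown"
    }'
}
```

For example, piping the scheduler log command shown earlier into classify_cinder_log prints rpc when only timeouts are present and filter when only filter rejections are present. A result of both usually means two separate problems and deserves a closer read of the timestamps.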
Diagnostic commands
RabbitMQ RPC timeout checks
# Scheduler/volume queues: messages climbing with consumers=0 means nobody is draining them
docker exec rabbitmq rabbitmqctl list_queues name messages consumers \
| awk 'NR==1 || $1 ~ /cinder-scheduler|cinder-volume/ {print}'
# Blocked connections (memory/disk watermark backpressure) and heartbeat drops
docker exec rabbitmq rabbitmqctl list_connections state | grep -c blocked
docker logs --tail=100 rabbitmq 2>&1 | grep -Ei "missed heartbeats|partition|closing"
consumers=0 on a cinder-volume queue means the volume service isn't subscribed; restart it so it re-consumes. Blocked connections mean RabbitMQ is applying backpressure. See the RabbitMQ RPC timeout guide for the full broker workup.
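If you would rather have the consumers=0 rows picked out for you, the sketch below filters the three-column output (name, messages, consumers) produced by the list_queues command above. The function name stuck_cinder_queues is hypothetical, and it assumes whitespace-separated columns in exactly that order.

```shell
#!/usr/bin/env bash
# Sketch: flag Cinder queues that have no consumer attached.
# stuck_cinder_queues is a hypothetical helper; it expects rows of
# "name messages consumers" on stdin, in that column order.

stuck_cinder_queues() {
  awk '$1 ~ /cinder-scheduler|cinder-volume/ && $2 ~ /^[0-9]+$/ && $3 == 0 {
         printf "%s messages=%d consumers=0\n", $1, $2
       }'
}
```

Pipe the rabbitmqctl list_queues name messages consumers output into it. No output means every Cinder queue has at least one consumer.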
Volume backend health
# What backend did cinder-volume try to talk to, and did the driver error?
docker logs --tail=200 cinder_volume 2>&1 \
| grep -Ei "backend|driver|ceph|rbd|lvm|netapp|failed|timed out"
# Ceph backend: is the cluster healthy and responsive?
docker exec ceph_mon ceph -s 2>/dev/null || ceph -s
# LVM backend: does the volume group exist and have free extents?
vgs; vgdisplay cinder-volumes 2>/dev/null | grep -Ei "free|VG Size"
A cinder-volume that is 'up' but blocked on a slow or unreachable array will still fail to reply to the scheduler. Fix the storage layer first: restarting cinder-volume against a dead backend just moves the hang.
cinder get-pools troubleshooting
# Healthy: returns pools with capacity fields promptly (a second or two)
cinder get-pools --detail
# Time it — a hang here means the scheduler can't assemble fresh capability data
time cinder get-pools --detail
# Look for name, free_capacity_gb, total_capacity_gb, allocated_capacity_gb per pool
get-pools queries the scheduler, which needs fresh capability reports from every cinder-volume. A prompt reply lists your pools with capacity fields; a hang/timeout means a backend is unreachable or the scheduler is blocked on RPC, so go back to the service list and RabbitMQ checks.
Scheduler filters
# Default filters if unset: AvailabilityZoneFilter, CapacityFilter, CapabilitiesFilter
docker exec cinder_scheduler grep -E "scheduler_default_filters|scheduler_default_weighers" \
/etc/cinder/cinder.conf
# The scheduler names the filter that zeroed out the candidates
docker logs --tail=200 cinder_scheduler 2>&1 \
| grep -Ei "Filter .* returned 0 hosts|filtering removed all hosts|passing candidates"
If CapacityFilter returned 0 hosts you're out of space; if CapabilitiesFilter did, a volume type's extra-specs don't match any backend; if AvailabilityZoneFilter did, no backend serves the requested AZ.
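That mapping from filter name to cause is mechanical enough to script. The sketch below assumes the scheduler log contains the literal phrase "Filter <Name> returned 0 hosts" as quoted in this guide. The helper names explain_filter and explain_zero_hosts are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: name the likely cause for each filter that zeroed out the candidates.
# explain_filter and explain_zero_hosts are hypothetical helpers; the log
# phrase "Filter <Name> returned 0 hosts" is assumed from this guide.

explain_filter() {
  case "$1" in
    CapacityFilter)         echo "out of space on every backend" ;;
    CapabilitiesFilter)     echo "volume-type extra-specs match no backend" ;;
    AvailabilityZoneFilter) echo "no backend serves the requested AZ" ;;
    *)                      echo "non-default filter, check its configuration" ;;
  esac
}

# Reads scheduler log lines on stdin; prints "<filter>: <cause>" once per filter.
explain_zero_hosts() {
  sed -n 's/.*Filter \([A-Za-z]*\) returned 0 hosts.*/\1/p' | sort -u |
  while read -r name; do
    echo "$name: $(explain_filter "$name")"
  done
}
```

Feed it the scheduler log output from the command above. Each distinct filter is reported once, however many requests it rejected.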
Backend capacity & capabilities
# Per-pool capacity the scheduler is weighing on
cinder get-pools --detail \
| grep -Ei "name|free_capacity_gb|total_capacity_gb|reserved_percentage|max_over_subscription"
# Volume type extra-specs that must match a backend's reported capabilities
openstack volume type list
openstack volume type show <type> -f value -c properties
'filtering removed all hosts' with healthy services almost always means free_capacity_gb is exhausted, reserved_percentage is eating the headroom, thin over-subscription is maxed, or the requested volume-type extra-specs match no backend.
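A worked example shows how reserved_percentage can eat the headroom even when free_capacity_gb looks healthy. This is a simplified sketch for thick provisioning only: it treats usable space as free capacity minus the reserved share of total capacity. It ignores thin provisioning and max_over_subscription_ratio, which the real CapacityFilter also considers, and it takes whole-GB integers. The function name pool_fits is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: simplified thick-provisioning headroom check (integers, in GB).
# pool_fits is a hypothetical helper; this is an approximation and does not
# model thin provisioning or over-subscription.

# Usage: pool_fits TOTAL_GB FREE_GB RESERVED_PCT REQUEST_GB
# Exits 0 when the request fits in the usable headroom, 1 otherwise.
pool_fits() {
  local total=$1 free=$2 reserved_pct=$3 request=$4 reserved usable
  reserved=$(( total * reserved_pct / 100 ))
  usable=$(( free - reserved ))
  echo "reserved=${reserved}GB usable=${usable}GB request=${request}GB"
  [ "$usable" -ge "$request" ]
}
```

With a 1000 GB pool, 120 GB free and reserved_percentage at 10, the reserve is 100 GB, so only 20 GB is usable and a 50 GB request is rejected even though 120 GB appears free.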
Kolla-Ansible container checks
# Are cinder_scheduler / cinder_volume / cinder_api up, or restarting?
docker ps --filter "name=cinder" --format "table {{.Names}}\t{{.Status}}"
# Recent restarts or crash loops leave a trail in each container's log
docker logs --tail=60 cinder_scheduler 2>&1 | tail -n 20
docker logs --tail=60 cinder_volume 2>&1 | tail -n 20
A cinder_volume container stuck in a restart loop (short, resetting uptime in Status) explains a scheduler timeout on its own: the service never stays up long enough to publish capabilities.
Fix & remediation steps
Map the deciding log line to the smallest safe remediation:
- Scheduler can’t see the backend (MessagingTimeout, service down) → restart the consumer, cinder-volume, first so it re-registers and re-publishes capabilities.
- Scheduler holding stale state after the volume recovers → restart cinder-scheduler so it rebuilds its host/pool view.
- RabbitMQ consumer dead (consumers=0) → restarting cinder-volume re-subscribes it; treat the broker itself as a last resort.
- "filtering removed all hosts" / capacity → do not restart anything. Add capacity, lower reserved_percentage, fix the volume-type extra-specs, or relax the offending filter.
# 1) Restart the consumer FIRST so it re-registers + re-publishes capabilities
docker restart cinder_volume
# 2) Only if the scheduler still holds stale backend state, restart it
docker restart cinder_scheduler
# 3) Confirm the service came back up and is heart-beating again
openstack volume service list
Restart cinder-volume before cinder-scheduler when the scheduler can't see the backend. If it's a filter/capacity problem, fix capacity or the filter instead; a restart won't conjure free space.
Validation steps
Don’t declare victory on the restart alone — confirm the whole path is healthy again:
- openstack volume service list shows the cinder-volume as up with a fresh updated_at.
- cinder get-pools --detail returns promptly with real capacity fields (no hang).
- The scheduler log no longer prints MessagingTimeout or returned 0 hosts on new requests.
- An end-to-end create+delete of a small test volume succeeds and the scheduler picks a backend.
openstack volume service list -f value -c Binary -c State | sort | uniq -c
time cinder get-pools --detail >/dev/null && echo "get-pools OK"
# Prove the full API -> scheduler -> volume path end to end
openstack volume create --size 1 sched-check
openstack volume show sched-check -f value -c status -c os-vol-host-attr:host
openstack volume delete sched-check
A 1 GiB volume that reaches 'available' with a host@backend#pool assigned confirms the scheduler picked a live backend over the RPC path. Clean it up so you don't leave test volumes behind.
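A create is asynchronous, so a show issued immediately afterwards will usually still report creating. Below is a minimal polling sketch that waits for the status to settle before you judge the result. The function name wait_for_status is hypothetical, and the try count and delay are illustrative values to tune for your backend.

```shell
#!/usr/bin/env bash
# Sketch: poll a status command until it reports the wanted state.
# wait_for_status is a hypothetical helper, not part of any OpenStack CLI.

# Usage: wait_for_status WANT TRIES DELAY CMD [ARGS...]
# Runs CMD until its first output line equals WANT (exit 0), equals
# "error" (exit 1), or TRIES attempts have been used up (exit 1).
wait_for_status() {
  local want=$1 tries=$2 delay=$3 status i=0
  shift 3
  while [ "$i" -lt "$tries" ]; do
    status=$("$@" 2>/dev/null | head -n 1)
    [ "$status" = "$want" ] && { echo "$status"; return 0; }
    [ "$status" = "error" ] && { echo "$status"; return 1; }
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timeout (last status: ${status:-none})"
  return 1
}
```

For the test volume above, one possible call is wait_for_status available 30 2 openstack volume show sched-check -f value -c status, run between the create and the delete.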
Prevention
- Alert on cinder-volume liveness: page when updated_at ages past service_down_time, not when a user reports a stuck volume. A stuck cinder volume should never be your first signal.
- Monitor RabbitMQ queue depth for the Cinder queues as a leading indicator of RPC timeouts; see the RabbitMQ RPC timeout guide.
- Watch backend free capacity and over-subscription so you catch "filtering removed all hosts" and "No weighed backends available" conditions before a create fails.
- Review scheduler filters against your volume types — an extra-specs mismatch quietly removes every host; our Cinder block storage troubleshooting walkthrough covers the common traps.
- Turn recurring incidents into reusable runbooks, and use the free AI Incident Response assistant to draft triage steps fast — including for a volume stuck in creating.