You are a senior storage engineer who has run Ceph as the unified backend for OpenStack (Glance images, Cinder volumes, Nova ephemeral disks, sometimes Swift compatibility) at scale.

I will provide:
- Ceph cluster size (OSDs, pools, devices)
- OpenStack integration pattern (separate pools per service or shared)
- Workload mix
- Symptom (slow boot, slow snapshot, capacity issue, PG inactive)

Your job:

1. **Recommend pool design**:
   - **`images`** (Glance): replication 3, raw images, usually no snapshots
   - **`volumes`** (Cinder): replication 3, RBD with snapshots
   - **`vms`** (Nova ephemeral): replication 3, RBD
   - **`backups`** (Cinder backup): replication 2 or EC, cheaper
   - **`pg_num`**: size each pool from `(OSDs × 100) / replicas / pool_count`, rounded to a power of two

2. **For Glance raw conversion**:
   - Upload as raw or convert (`enable_image_conversion = True`)
   - Raw on Ceph = instant clone for boot; qcow2 requires a full copy
   - Huge speedup for boot times

3. **For Cinder + Nova clone**:
   - On boot-from-volume, Cinder creates the volume as an RBD clone of the Glance image; Nova does the same for ephemeral disks
   - RBD clone is copy-on-write, so it is near-instant
   - Requires the Glance image to be in the same Ceph cluster

4. **For performance**:
   - **Replication 3** = standard; 2 is risky; EC for cold data
   - **BlueStore** with DB/WAL on a separate NVMe device
   - **PG count** per pool: too low = imbalanced; too high = OSD CPU pressure
   - **Cache tier** for hot data (mostly deprecated; use BlueStore caching)

5. **For capacity**:
   - Ceph defaults: **`nearfull`** warning at 85%, **`full`** (writes blocked) at 95%; alert at 80% to keep headroom
   - Failed OSDs are marked out and their data backfills onto the remaining OSDs; free capacity must absorb it
   - Plan for at least N+2 (2 spare OSDs' worth)

6. **For snapshots**:
   - RBD snapshots are cheap (CoW)
   - But many snapshots slow performance; clean up
   - Cinder snapshot vs Glance image: similar mechanism

7. **For OSD failures**:
   - Recovery uses cluster bandwidth
   - Tune with `osd_recovery_max_active` and `osd_recovery_op_priority`
   - During recovery, client I/O may slow

8. **For inactive PGs**:
   - A PG goes inactive when too few OSDs are up to meet `min_size` (e.g. size 3, min_size 2, only 1 up)
   - `ceph health detail` shows specifics

Mark DESTRUCTIVE: deleting a pool with data, reweighting OSDs aggressively (cluster-wide recovery storm), enabling pool snapshots without cleanup.

---

Ceph topology: [DESCRIBE]
Integration pattern: [pools per service, shared, etc.]
Workload mix: [DESCRIBE]
Symptom: [DESCRIBE]

Why this prompt works

Ceph + OpenStack is the dominant pattern at scale, but misconfigured pools or PG counts cause subtle issues such as slow boots, unbalanced OSDs, and stuck PGs. This prompt walks through the design decisions that prevent them.

How to use it

Design pools per service with appropriate replication.
Convert images to raw for clone speedup.
Plan capacity for failure.
Monitor PG state.
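
The raw-conversion step can be sketched as below. This is a minimal illustration, assuming the source image is a local qcow2 file and that the `qemu-img` and `openstack` CLIs are installed; the function name, the file name, and the image name are placeholders, not part of the original workflow. It prints the commands by default instead of running them.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
# Export RUN= (empty) to execute them for real.
RUN=${RUN-echo}

# convert_and_upload <src.qcow2> <image-name>   (hypothetical helper)
convert_and_upload() {
    local src=$1 name=$2
    local raw="${src%.qcow2}.raw"

    # Convert qcow2 to raw so Ceph can serve copy-on-write clones of it
    $RUN qemu-img convert -f qcow2 -O raw "$src" "$raw"

    # Upload the raw file to Glance
    $RUN openstack image create --disk-format raw --container-format bare \
        --file "$raw" "$name"
}

# Example (placeholder names):
convert_and_upload ubuntu.qcow2 ubuntu-raw
```

Raw images are larger to upload than qcow2, so the conversion trades upload time for faster boots.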

Useful commands

# Cluster health
ceph status
ceph health detail
ceph df

# Pools
ceph osd pool ls
ceph osd pool get <pool> all

# Create pool
ceph osd pool create images 256 256 replicated
ceph osd pool application enable images rbd
rbd pool init images

# Images in a pool and snapshots per image
rbd ls <pool>
rbd snap ls <pool>/<image>

# PG state
ceph pg dump
ceph pg ls inactive

# OSD perf
ceph osd perf

# Capacity per pool
ceph df detail

# Cinder config sample
sudo grep -A10 rbd /etc/cinder/cinder.conf

# Glance config sample
sudo grep -A5 rbd /etc/glance/glance-api.conf

# Test boot time
time openstack server create --boot-from-volume 50 \
    --image <image-id> --flavor <flavor> --network <net> testvm
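
To turn the `rbd ls` and `rbd snap ls` commands above into a per-image snapshot count, a loop along these lines may help. It is a sketch only: `snap_counts` is a hypothetical helper name, and it assumes the plain-text output of `rbd snap ls` starts with one header row when snapshots exist and is empty otherwise.

```shell
#!/usr/bin/env bash
# snap_counts <pool>: print "<image> <snapshot-count>" for every RBD image
# in the pool. Hypothetical helper, not part of the rbd CLI.
snap_counts() {
    local pool=$1 img count
    rbd ls "$pool" | while IFS= read -r img; do
        # Assumption: the first output line is a header row, so skip it;
        # an image without snapshots prints nothing at all.
        count=$(rbd snap ls "$pool/$img" | tail -n +2 | wc -l)
        echo "$img $(( count ))"
    done
}

# Usage, most-snapshotted images first:
#   snap_counts volumes | sort -k2 -n -r | head
```

Images at the top of that list are the first candidates for a retention policy.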

Pool design patterns

# Production
ceph osd pool create images 512 512 replicated
ceph osd pool create volumes 1024 1024 replicated
ceph osd pool create vms 1024 1024 replicated
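# Assumes an erasure-code profile named ec-profile already exists.
# RBD on an EC pool also needs allow_ec_overwrites and a replicated pool for image metadata.
ceph osd pool create backups 512 512 erasure ec-profile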

# Set replication
ceph osd pool set images size 3
ceph osd pool set images min_size 2

# Application tags
ceph osd pool application enable images rbd
ceph osd pool application enable volumes rbd
ceph osd pool application enable vms rbd
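
The PG counts above come from the rule of thumb in the prompt, `(OSDs × 100) / replicas / pool_count`, rounded to a power of two. The arithmetic can be sketched as follows; `suggest_pg_num` is a hypothetical helper, and it splits PGs evenly across pools, which is a simplification since pools that hold more data normally get a larger share.

```shell
#!/usr/bin/env bash
# suggest_pg_num <osds> <replicas> <pool_count>
# Rule of thumb: (OSDs x 100) / replicas / pool_count,
# rounded to the nearest power of two. Hypothetical helper.
suggest_pg_num() {
    local osds=$1 replicas=$2 pools=$3
    local raw=$(( osds * 100 / replicas / pools ))
    local pow=1

    # Largest power of two that does not exceed the raw value
    while (( pow * 2 <= raw )); do
        pow=$(( pow * 2 ))
    done

    # Step up if the raw value is closer to the next power of two
    if (( raw - pow > pow * 2 - raw )); then
        pow=$(( pow * 2 ))
    fi
    echo "$pow"
}

suggest_pg_num 96 3 4   # prints 1024
```

With 96 OSDs, 3 replicas, and 4 pools the raw value is 800, which rounds up to 1024.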

Common findings this catches

Slow boot times with qcow2 in Glance → switch to raw + RBD clone.
Cinder volumes slow → check RBD pool, network, OSD health.
PG count too low → raise `pg_num` in steps (or enable the PG autoscaler) rather than re-creating the pool.
Snapshots consuming 30%+ of pool → set retention.
Cluster 90%+ full → emergency expansion.
OSD down causing slow recovery → tune recovery params; add capacity.
EC pool used for high-IOPS workload → re-pool to replicated.
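
For the capacity findings, a rough projection of utilisation after OSD failures can show whether the cluster has enough headroom. This is a sketch under stated assumptions: equal-sized OSDs, evenly distributed data, and a default limit of 85% chosen to mirror Ceph's default nearfull ratio. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# projected_used_pct <total_osds> <failed_osds> <used_pct>
# Utilisation (%) once data from the failed OSDs has been re-replicated
# onto the survivors. Assumes equal-sized OSDs and even data distribution.
projected_used_pct() {
    local total=$1 failed=$2 used_pct=$3
    local survivors=$(( total - failed ))
    (( survivors > 0 )) || { echo "no surviving OSDs" >&2; return 1; }
    # Ceiling division, to err on the pessimistic side
    echo $(( (used_pct * total + survivors - 1) / survivors ))
}

# survives_failures <total_osds> <failed_osds> <used_pct> [limit_pct]
# Succeeds if the projected utilisation stays below limit_pct (default 85).
survives_failures() {
    local limit=${4:-85} projected
    projected=$(projected_used_pct "$1" "$2" "$3") || return 1
    (( projected < limit ))
}

projected_used_pct 20 2 70   # prints 78
projected_used_pct 10 2 70   # prints 88
```

At 70% used, a 20-OSD cluster absorbs two failures, while a 10-OSD cluster would cross the nearfull threshold.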

When to escalate

Major Ceph outage — storage team.
Cluster expansion — coordinated plan.
Performance tuning at scale — Ceph experts.


Ceph + OpenStack Integration Tuning Prompt
