
Ceph + OpenStack Integration Tuning Prompt

Tune Ceph as storage backend for OpenStack — Glance, Cinder, Nova ephemeral pools; performance tuning, capacity planning, snapshot/clone semantics.

Target user
Storage engineers integrating Ceph with OpenStack
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior storage engineer who has run Ceph as the unified backend for OpenStack — Glance images, Cinder volumes, Nova ephemeral, sometimes Swift compatibility — at scale.

I will provide:
- Ceph cluster size (OSDs, pools, devices)
- OpenStack integration pattern (separate pools per service or shared)
- Workload mix
- Symptom (slow boot, slow snapshot, capacity issue, PG inactive)

Your job:

1. **Recommend pool design**:
   - **`images`** (Glance) — replication 3, raw images, no snapshots usually
   - **`volumes`** (Cinder) — replication 3, RBD with snapshots
   - **`vms`** (Nova ephemeral) — replication 3, RBD
   - **`backups`** (Cinder backup) — replication 2 or EC, cheaper
   - **`pg_num`**: size per pool from `(OSDs × 100) / replicas / pool_count`, rounded to a power of two
2. **For Glance raw conversion**:
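   - Upload images as raw, or convert on import with Glance's `image_conversion` plugin (`output_format = raw`)
   - Raw on Ceph = instant copy-on-write clone at boot; qcow2 forces a full download and conversion
   - Large boot-time speedup, because no image data is copied up front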
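3. **For Cinder + Nova clone**:
   - Boot-from-volume: Cinder clones the volume from the Glance image; ephemeral boot: Nova clones the disk into `vms`
   - RBD clone is copy-on-write, so it is near-instant regardless of image size
   - Requires a raw Glance image in the same Ceph cluster and `show_image_direct_url = True` in Glance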
4. **For performance**:
   - **Replication 3** = standard; 2 risky, EC for cold data
   - **BlueStore** with NVMe DB/WAL on separate device
   - **PG count** per pool: too low = imbalanced; too high = OSD CPU pressure
   - **Cache tier** for hot data (mostly deprecated; use BlueStore caching)
5. **For capacity**:
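   - Alert at 80% used; Ceph warns at **`nearfull_ratio`** (default 85%), stops backfill at **`backfillfull_ratio`** (90%), and blocks writes at **`full_ratio`** (95%)
   - A failed OSD is marked out and its data backfills onto the survivors; free capacity must absorb that
   - Plan for at least N+2 (2 spare OSDs' worth of free space)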
6. **For snapshots**:
   - RBD snapshots cheap (CoW)
   - But many snapshots slow I/O and hold space; set retention and clean up
   - Cinder snapshot vs Glance image: both are RBD snapshots under the hood, and volumes cloned from them depend on the parent
7. **For OSD failures**:
   - Recovery uses cluster bandwidth
   - Throttle with `osd_max_backfills`, `osd_recovery_max_active`, and `osd_recovery_op_priority`
   - During recovery, client I/O may slow
8. **For inactive PGs**:
   - A PG goes inactive when fewer than `min_size` replicas are up (e.g. size 3, `min_size` 2, only 1 up); with 2 of 3 up it is degraded but still active
   - `ceph health detail` shows specifics

Mark DESTRUCTIVE: deleting pool with data, reweighting OSDs aggressively (cluster-wide recovery storm), enabling pool snapshots without cleanup.

---

Ceph topology: [DESCRIBE]
Integration pattern: [pools per service, shared, etc.]
Workload mix: [DESCRIBE]
Symptom: [DESCRIBE]

Why this prompt works

Ceph + OpenStack is the dominant pattern at scale, but misconfigured pools or PG counts cause subtle issues. This prompt walks through the design.

How to use it

  1. Design pools per service with appropriate replication.
  2. Convert images to raw for clone speedup.
  3. Plan capacity for failure.
  4. Monitor PG state.
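
Step 2 can be sketched as a small shell helper. This is a minimal sketch and not a definitive procedure: it assumes the source image is qcow2, that `qemu-img` and the `openstack` CLI are installed, and that there is enough local disk for the raw file. The `run` wrapper, the `DRY_RUN` variable, and the `upload_raw` function name are hypothetical and exist only for illustration.

```shell
# Print the command when DRY_RUN=1 (the default in this sketch), else run it.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Convert a qcow2 file to raw and upload it to Glance as a raw image.
# $1 = path to the qcow2 file, $2 = image name to create in Glance.
upload_raw() {
    src=$1
    name=$2
    raw="${src%.*}.raw"   # same path, with the extension swapped to .raw
    run qemu-img convert -f qcow2 -O raw "$src" "$raw"
    run openstack image create --disk-format raw --container-format bare \
        --file "$raw" "$name"
}
```

Calling `upload_raw ./cirros.qcow2 cirros-raw` prints the two commands; setting `DRY_RUN=0` executes them. Raw files are larger than qcow2, so check local disk space first.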

Useful commands

# Cluster health
ceph status
ceph health detail
ceph df

# Pools
ceph osd pool ls
ceph osd pool get <pool> all

# Create pool
ceph osd pool create images 256 256 replicated
ceph osd pool application enable images rbd
rbd pool init images

# RBD images and per-image snapshots
rbd ls <pool>
rbd snap ls <pool>/<image>

# PG state
ceph pg dump
ceph pg dump_stuck inactive

# OSD perf
ceph osd perf

# Capacity per pool
ceph df detail

# Cinder config sample
sudo grep -A10 rbd /etc/cinder/cinder.conf

# Glance config sample
sudo grep -A5 rbd /etc/glance/glance-api.conf

# Test boot time (--wait blocks until the server is ACTIVE, so the timing covers the build)
time openstack server create --boot-from-volume 50 --wait \
    --image <image-id> --flavor <flavor> --network <net> testvm
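
The two config greps above only show what is present. A rough sketch of the reverse check, listing expected RBD options that are missing, might look like the following. The `missing_opts` helper is hypothetical, and the option lists in the usage note are assumptions based on commonly documented Cinder and Glance RBD settings; adjust them to your release.

```shell
# Print every expected option name that has no "key = value" line in the file.
# $1 = config file, remaining arguments = option names to look for.
missing_opts() {
    conf=$1
    shift
    for key in "$@"; do
        # Match "key = ..." at the start of a line, ignoring leading spaces.
        grep -Eq "^[[:space:]]*${key}[[:space:]]*=" "$conf" || echo "$key"
    done
}
```

For example, `missing_opts /etc/cinder/cinder.conf volume_driver rbd_pool rbd_user rbd_secret_uuid rbd_ceph_conf` or `missing_opts /etc/glance/glance-api.conf rbd_store_pool rbd_store_user rbd_store_ceph_conf show_image_direct_url`. The check ignores section names, so it cannot tell whether an option sits in the right section.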

Pool design patterns

# Production
ceph osd pool create images 512 512 replicated
ceph osd pool create volumes 1024 1024 replicated
ceph osd pool create vms 1024 1024 replicated
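# Assumes an erasure-code profile named ec-profile already exists
ceph osd pool create backups 512 512 erasure ec-profile
# RBD on an EC pool needs overwrites enabled, plus a replicated pool for image metadata
ceph osd pool set backups allow_ec_overwrites true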

# Set replication
ceph osd pool set images size 3
ceph osd pool set images min_size 2

# Application tags
ceph osd pool application enable images rbd
ceph osd pool application enable volumes rbd
ceph osd pool application enable vms rbd
ceph osd pool application enable backups rbd
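
The PG counts above are examples. As a rough sketch, the prompt's sizing rule `(OSDs × 100) / replicas / pool_count` can be turned into a shell function. This version assumes equal data share per pool and rounds up to the next power of two, which is one common convention and not the only one. The `pg_count` name is hypothetical.

```shell
# Suggest a per-pool pg_num from OSD count, replica count, and pool count.
# Uses (OSDs * 100) / replicas / pools, rounded up to a power of two.
pg_count() {
    osds=$1
    replicas=$2
    pools=$3
    target=$(( osds * 100 / (replicas * pools) ))
    pg=1
    while [ "$pg" -lt "$target" ]; do
        pg=$(( pg * 2 ))
    done
    echo "$pg"
}

pg_count 40 3 4   # prints 512 (target 333, rounded up)
```

Pools holding most of the data (usually `volumes` and `vms`) generally deserve a larger share than an even split gives them, so treat the result as a starting point.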

Common findings this catches

  • Slow boot times with qcow2 in Glance → switch to raw + RBD clone.
  • Cinder volumes slow → check RBD pool, network, OSD health.
  • PG count too low → raise `pg_num` in steps or enable the PG autoscaler.
  • Snapshots consuming 30%+ of pool → set retention.
  • Cluster 90%+ full → emergency expansion.
  • OSD down causing slow recovery → tune recovery params; add capacity.
  • EC pool used for high-IOPS workload → re-pool to replicated.
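
For the capacity findings, a back-of-the-envelope check of failure headroom can be sketched in shell. It assumes equally sized OSDs and that data rebalances evenly over the survivors, which real clusters only approximate. The function names are hypothetical, and the 85% limit is an assumption matching Ceph's default `nearfull_ratio`.

```shell
# Projected utilisation (%) after losing some OSDs.
# $1 = current used %, $2 = total OSDs, $3 = OSDs lost (must be < total).
projected_util() {
    awk -v u="$1" -v n="$2" -v l="$3" \
        'BEGIN { printf "%.1f\n", u * n / (n - l) }'
}

# Succeeds (exit 0) if the projected utilisation stays below the limit.
# Optional $4 = limit in percent, default 85.
survives_loss() {
    limit=${4:-85}
    awk -v p="$(projected_util "$1" "$2" "$3")" -v lim="$limit" \
        'BEGIN { exit !(p < lim) }'
}

projected_util 70 20 2   # prints 77.8
```

For example, `survives_loss 80 20 2 || echo "add capacity"` flags a cluster at 80% that would land near 89% after losing two of its 20 OSDs.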

When to escalate

  • Major Ceph outage — storage team.
  • Cluster expansion — coordinated plan.
  • Performance tuning at scale — Ceph experts.
