Masakari Instance HA Debug Prompt
Diagnose Masakari instance HA — host monitor not triggering, instance evacuation not happening, process monitor issues, reserved host strategy.
- Target user
- OpenStack engineers running instance-level HA
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior OpenStack engineer who has run Masakari (instance HA) in production. You know the failover model — monitors detect failures, engine takes recovery actions — and how to debug each step. I will provide: - The symptom (host died but no recovery, evacuation failed, process monitor not detecting, reserved host not used) - Masakari configuration (segments, hosts, recovery_method) - Monitor logs (`masakari-hostmonitor`, `masakari-processmonitor`, `masakari-instancemonitor`) - Engine logs Your job: 1. **Understand the components**: - **API** — receives notifications - **Engine** — orchestrates recovery - **Host monitor** — detects host failures (pacemaker, custom) - **Process monitor** — detects process failures on hosts - **Instance monitor** — detects instance failures (via guest agent) 2. **For host failure recovery**: - Host monitor reports host down via API - Engine fences (if configured) and evacuates instances - Evacuation = nova rebuild on different host with shared storage 3. **For "host died but no recovery"**: - Was host detected as failed? - Notification sent to API? - Engine processed notification? 4. **For evacuation failures**: - Nova-side issues (scheduler, disk, network) - Reserved host: HA-targeted host - `recovery_method`: `auto`, `reserved_host`, `auto_priority`, `rh_priority` 5. **For segments**: - Group hosts into HA segments - Each segment has its own recovery policy - Reserved host belongs to segment 6. **For process monitor (Pacemaker)**: - Monitor cluster manager - Detects compute service failures - Configure in `masakari-monitors.conf` 7. **For instance monitor**: - Guest agent reports health - Detects in-instance failures - Limited use; usually host failures dominate Mark DESTRUCTIVE: forcing recovery on a host that's actually healthy (data loss from race), fencing without verified failure (false positive), changing segment config during active failover. --- Symptom: [DESCRIBE] Segment config: [DESCRIBE] Monitor type: [pacemaker / custom] Logs: ``` [PASTE relevant from each monitor and engine] ```
Why this prompt works
Masakari adds complexity in exchange for HA; misconfig produces worse outcomes than no HA. This prompt walks the components.
How to use it
- Understand the model before debugging.
- For host failures, verify monitor signaled.
- For evacuation, check Nova side.
- Test failover in non-prod.
Useful commands
# Segments
openstack masakari segment list
openstack masakari segment show <id>
# Hosts in segments
openstack masakari host list --segment <id>
# Notifications history
openstack masakari notification list
# Logs
sudo journalctl -u masakari-engine -n 100 --no-pager
sudo journalctl -u masakari-hostmonitor -n 100 --no-pager
sudo journalctl -u masakari-processmonitor -n 100 --no-pager
# Manually trigger recovery (for testing)
openstack masakari notification create COMPUTE_HOST <hostname> stopped event
# Nova-side: verify evacuation worked
openstack server show <instance> # check host
Common findings this catches
- Host failed, no notification → host monitor not running or misconfigured.
- Notification received, no action → engine not processing; check logs.
- Evacuation triggers but fails → Nova scheduler can’t find host; check.
- Reserved host not used → segment config doesn’t include it.
- False positives → monitor sensitivity too high; tune.
- Instance not coming back → shared storage issue (volumes don’t attach to new host).
- Multiple monitor types conflict — choose one model.
When to escalate
- Major HA event with multiple instances affected — engage all teams.
- Tuning monitor thresholds — observation period needed.
- Reserved host capacity planning — strategic.
Related prompts
-
Nova Live Migration Troubleshooting Prompt
Diagnose Nova live migration failures — shared storage requirements, block migration, network bandwidth, CPU compatibility, error 'migration aborted'.
-
OpenStack Capacity Planning Prompt
Plan OpenStack capacity — CPU/RAM/disk oversubscription, growth modeling, hypervisor sizing, Cinder backend planning, network bandwidth.
-
OpenStack VM Troubleshooting Prompt
Diagnose Nova VM boot failures, networking issues, and stuck instances using nova/openstack CLI output.