Skip to content
CloudOps
Newsletter
All prompts
AI for OpenStack Difficulty: Advanced ClaudeChatGPT

Masakari Instance HA Debug Prompt

Diagnose Masakari instance HA — host monitor not triggering, instance evacuation not happening, process monitor issues, reserved host strategy.

Target user
OpenStack engineers running instance-level HA
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior OpenStack engineer who has run Masakari (instance HA) in production. You know the failover model — monitors detect failures, engine takes recovery actions — and how to debug each step.

I will provide:
- The symptom (host died but no recovery, evacuation failed, process monitor not detecting, reserved host not used)
- Masakari configuration (segments, hosts, recovery_method)
- Monitor logs (`masakari-hostmonitor`, `masakari-processmonitor`, `masakari-instancemonitor`)
- Engine logs

Your job:

1. **Understand the components**:
   - **API** — receives notifications
   - **Engine** — orchestrates recovery
   - **Host monitor** — detects host failures (pacemaker, custom)
   - **Process monitor** — detects process failures on hosts
   - **Instance monitor** — detects instance failures (via guest agent)
2. **For host failure recovery**:
   - Host monitor reports host down via API
   - Engine fences (if configured) and evacuates instances
   - Evacuation = nova rebuild on different host with shared storage
3. **For "host died but no recovery"**:
   - Was host detected as failed?
   - Notification sent to API?
   - Engine processed notification?
4. **For evacuation failures**:
   - Nova-side issues (scheduler, disk, network)
   - Reserved host: HA-targeted host
   - `recovery_method`: `auto`, `reserved_host`, `auto_priority`, `rh_priority`
5. **For segments**:
   - Group hosts into HA segments
   - Each segment has its own recovery policy
   - Reserved host belongs to segment
6. **For process monitor (Pacemaker)**:
   - Monitor cluster manager
   - Detects compute service failures
   - Configure in `masakari-monitors.conf`
7. **For instance monitor**:
   - Guest agent reports health
   - Detects in-instance failures
   - Limited use; usually host failures dominate

Mark DESTRUCTIVE: forcing recovery on a host that's actually healthy (data loss from race), fencing without verified failure (false positive), changing segment config during active failover.

---

Symptom: [DESCRIBE]
Segment config: [DESCRIBE]
Monitor type: [pacemaker / custom]
Logs:
```
[PASTE relevant from each monitor and engine]
```

Why this prompt works

Masakari adds complexity in exchange for HA; misconfig produces worse outcomes than no HA. This prompt walks the components.

How to use it

  1. Understand the model before debugging.
  2. For host failures, verify monitor signaled.
  3. For evacuation, check Nova side.
  4. Test failover in non-prod.

Useful commands

# Segments
openstack masakari segment list
openstack masakari segment show <id>

# Hosts in segments
openstack masakari host list --segment <id>

# Notifications history
openstack masakari notification list

# Logs
sudo journalctl -u masakari-engine -n 100 --no-pager
sudo journalctl -u masakari-hostmonitor -n 100 --no-pager
sudo journalctl -u masakari-processmonitor -n 100 --no-pager

# Manually trigger recovery (for testing)
openstack masakari notification create COMPUTE_HOST <hostname> stopped event

# Nova-side: verify evacuation worked
openstack server show <instance>     # check host

Common findings this catches

  • Host failed, no notification → host monitor not running or misconfigured.
  • Notification received, no action → engine not processing; check logs.
  • Evacuation triggers but fails → Nova scheduler can’t find host; check.
  • Reserved host not used → segment config doesn’t include it.
  • False positives → monitor sensitivity too high; tune.
  • Instance not coming back → shared storage issue (volumes don’t attach to new host).
  • Multiple monitor types conflict — choose one model.

When to escalate

  • Major HA event with multiple instances affected — engage all teams.
  • Tuning monitor thresholds — observation period needed.
  • Reserved host capacity planning — strategic.

Related prompts

Newsletter

Free: the DevOps AI Incident-Triage Cheat Sheet

Subscribe and we’ll send you the one-page cheat sheet — plus weekly AI prompts, automation ideas, and tool reviews for infrastructure engineers. One email a week. No spam, unsubscribe anytime.

  • AI Incident-Triage Cheat Sheet (PDF)
  • Access to 1,603 DevOps AI prompts
  • One practical workflow email per week