Octavia Health Monitor & Connection Draining Tuning Prompt
Tune Octavia health monitors, timeouts, and member draining so load balancers fail over fast without flapping or dropping in-flight connections during deploys.
- Target user: Platform engineers tuning OpenStack Octavia load balancer reliability
- Difficulty: Intermediate
- Tools: Claude, ChatGPT
The prompt
You are a senior load-balancing engineer who has tuned Octavia health monitors for high-churn services without causing flapping or dropped requests.

I will provide:
- Listener/pool setup: protocol (HTTP/HTTPS/TCP), member count, lb_algorithm
- Current health monitor: type, delay, timeout, max_retries, HTTP path and expected codes
- Symptoms: members flapping between ONLINE and ERROR, slow failover, connections dropped during rolling deploys
- Backend behavior: cold-start latency, graceful shutdown support
- Octavia version and amphora driver

Your job:
1. **Diagnose flapping vs slow failover**: explain the delay/timeout/max_retries math (and max_retries_down, where supported): how long until a dead member is ejected, and how long until a recovered one returns. Show why aggressive settings cause flapping and lax settings cause slow failover.
2. **Pick the right monitor type**: when HTTP(S) with a real `/healthz` and expected codes beats a bare TCP check, and how to avoid checking a path that returns 200 even when the app is broken.
3. **Recommend concrete values**: propose delay, timeout, max_retries, and (where supported) max_retries_down for the workload, with the reasoning and the `openstack loadbalancer healthmonitor set` commands.
4. **Connection draining on deploy**: how to take a member out gracefully: set the member's `admin_state_up` to false or its weight to 0, wait for in-flight connections to drain, then deploy. Explain how HAProxy inside the amphora behaves in each case.
5. **Rolling-deploy choreography**: a step order that drains, deploys, health-checks, and re-enables one member at a time so capacity and in-flight requests are preserved.
6. **Avoid thundering-herd recovery**: how staggered re-enablement and slow-start (where available) prevent a cold backend from being slammed the instant it returns.
7. **Verify**: load-test the failover: kill a member under traffic and measure dropped requests and failover time before and after tuning.

Output as: (a) a tuning table with recommended monitor values and rationale, (b) the healthmonitor + member CLI commands, (c) a graceful drain-and-deploy runbook, (d) a failover load-test plan with pass/fail thresholds, (e) anti-patterns (checking `/`, timeouts longer than delay, draining all members at once). Make every recommended number justified by the workload, not a copied default.
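
To sanity-check the timing figures the model returns for step 1, the arithmetic can be sketched in a few lines of Python. This is a rough model, not Octavia's implementation. It assumes checks fire every `delay` seconds, a hung check fails after `timeout` seconds, a member goes to ERROR after `max_retries_down` consecutive failures, and it returns to ONLINE after `max_retries` consecutive successes. The class and function names are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorSettings:
    """Health monitor values, all times in seconds (hypothetical helper type)."""
    delay: float
    timeout: float
    max_retries: int        # consecutive successes before ONLINE
    max_retries_down: int   # consecutive failures before ERROR


def worst_case_eject_seconds(m: MonitorSettings) -> float:
    # Assumed worst case: the member dies right after a passing check,
    # so each failing check is up to `delay` apart and the last one
    # hangs for the full `timeout` before it counts as a failure.
    return m.max_retries_down * m.delay + m.timeout


def worst_case_return_seconds(m: MonitorSettings) -> float:
    # Assumed worst case: the member recovers right after a failed check;
    # passing checks answer quickly, so only the spacing matters.
    return m.max_retries * m.delay


def warnings_for(m: MonitorSettings) -> list[str]:
    """Flag the anti-patterns the prompt names that are visible from the numbers alone."""
    found = []
    if m.timeout >= m.delay:
        found.append("timeout is not shorter than delay")
    if m.max_retries_down == 1:
        found.append("one failed check ejects the member (flapping risk)")
    return found


if __name__ == "__main__":
    relaxed = MonitorSettings(delay=5, timeout=3, max_retries=3, max_retries_down=3)
    aggressive = MonitorSettings(delay=2, timeout=1, max_retries=1, max_retries_down=1)
    for name, m in (("relaxed", relaxed), ("aggressive", aggressive)):
        print(name, worst_case_eject_seconds(m), worst_case_return_seconds(m), warnings_for(m))
```

Under these assumptions the relaxed settings take up to 18 s to eject a dead member and 15 s to readmit it, while the aggressive settings eject in about 3 s but do so on a single failed check, which is the flapping trade-off the prompt asks the model to explain.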