Heat Autoscaling Group & Aodh Alarm Design Prompt
Design a Heat `OS::Heat::AutoScalingGroup` with scale-up/down policies driven by Aodh alarms on Ceilometer/Gnocchi metrics, including cooldowns, signaling, and safe stack updates.
- Target user: Cloud engineers building elastic workloads on OpenStack Heat orchestration
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior OpenStack orchestration engineer who has shipped production autoscaling stacks driven by Aodh alarms.

I will provide:
- The workload to scale (stateless web tier, worker pool) and its scaling signal (CPU, queue depth, request rate)
- Telemetry availability: whether Ceilometer/Gnocchi and Aodh are installed, and which meters are published
- The current Heat template (if any) and the Octavia load balancer setup for the members
- Scaling bounds: minimum/maximum instance count and acceptable scaling latency
- Pain points: oscillation, alarms that never fire, stuck stack updates

Your job:
1. **Lay out the resources**: explain how `OS::Heat::AutoScalingGroup`, `OS::Heat::ScalingPolicy` (scale up and scale down), and `OS::Aodh::GnocchiAggregationByResourcesAlarm` connect via signal URLs, and where the LB pool member wiring lives.
2. **Author the template**: provide a working HOT skeleton with the scaling group and its nested member template, scale-up and scale-down policies with `adjustment_type`/`scaling_adjustment`, and two Aodh alarms whose `alarm_actions` point at the policies' signal URLs (their `alarm_url` or `signal_url` attribute).
3. **Pick the metric & thresholds**: choose the right Gnocchi `aggregation_method` and `comparison_operator`, set `evaluation_periods` and `granularity` so that a brief spike doesn't trigger scaling, and use distinct up/down thresholds to avoid flapping.
4. **Cooldowns & step scaling**: set `cooldown` on each policy so the group stabilizes between actions; explain step vs. simple scaling and when to add more than one threshold band.
5. **LB integration**: ensure new members auto-register with the Octavia pool and drain on scale-down so in-flight requests survive.
6. **Safe stack updates**: explain how `openstack stack update` interacts with a group that has live instances, how to avoid replacing every member at once, and how the group's `rolling_updates` property (`min_in_service`, `max_batch_size`, `pause_time`) protects availability.
7. **Validate**: drive load to trip the scale-up alarm and watch the group grow, then idle to trip scale-down; show the `openstack stack resource list` output and the Aodh alarm-state checks (for example `openstack alarm state get`) that prove the loop works end to end.
Format the output as: (a) a text diagram of the resources and signals, (b) the full HOT template, (c) the Aodh alarm definitions with the chosen thresholds and rationale, (d) a load-test plan that validates both scaling directions, (e) a stack-update safety checklist. Call out every place where a wrong granularity or a missing meter leaves an alarm in `insufficient data`, so that it silently never fires.
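Before running the prompt against a real cloud, it can help to reason about the alarm-to-policy loop with a toy simulation. The Python sketch below is a simplified model, not Aodh's or Heat's actual code, and every name in it (`AlarmRule`, `Policy`, `evaluate`, `signal`, `run`) is hypothetical. It assumes that an alarm breaches only when its last `evaluation_periods` aggregates all cross the threshold (a mixed window is treated as `ok`, which is simpler than Aodh's real handling), that a hypothetical archive policy stores only 60-second aggregates, that a policy ignores signals during its `cooldown`, and that actions repeat on every evaluation while an alarm stays breached. The CPU series is fixed input, so the sketch does not model load dropping as members join.

```python
"""Toy model of the Aodh alarm -> Heat scaling policy loop (illustrative only)."""
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AlarmRule:
    threshold: float
    comparison: str          # "gt" or "lt", mirroring comparison_operator
    granularity: int         # seconds; only useful if the archive policy stores it
    evaluation_periods: int  # consecutive aggregates that must all breach


@dataclass
class Policy:
    adjustment: int          # like scaling_adjustment with change_in_capacity
    cooldown: int            # seconds during which further signals are ignored
    last_action: Optional[int] = None


def evaluate(rule: AlarmRule, series: Dict[int, List[float]]) -> str:
    """Return a simplified alarm state from aggregates keyed by granularity."""
    points = series.get(rule.granularity, [])
    if len(points) < rule.evaluation_periods:
        # Wrong granularity or a missing meter yields no points: the alarm never fires.
        return "insufficient data"
    recent = points[-rule.evaluation_periods:]
    if rule.comparison == "gt":
        breached = all(p > rule.threshold for p in recent)
    else:
        breached = all(p < rule.threshold for p in recent)
    # Simplification: a mixed window counts as "ok".
    return "alarm" if breached else "ok"


def signal(policy: Policy, now: int, size: int, min_size: int, max_size: int) -> int:
    """Apply the policy unless it is cooling down, clamped to the group bounds."""
    if policy.last_action is not None and now - policy.last_action < policy.cooldown:
        return size
    policy.last_action = now
    return max(min_size, min(max_size, size + policy.adjustment))


def run(cpu_per_minute: List[float], up: AlarmRule, down: AlarmRule,
        up_policy: Policy, down_policy: Policy,
        min_size: int = 1, max_size: int = 4) -> List[int]:
    """Evaluate both alarms once a minute; return the group size after each tick."""
    size, sizes = min_size, []
    for minute in range(len(cpu_per_minute)):
        # Hypothetical archive policy that only keeps 60-second aggregates.
        series = {60: cpu_per_minute[: minute + 1]}
        now = minute * 60
        if evaluate(up, series) == "alarm":
            size = signal(up_policy, now, size, min_size, max_size)
        elif evaluate(down, series) == "alarm":
            size = signal(down_policy, now, size, min_size, max_size)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    # Distinct thresholds (70 up, 30 down) leave a dead band that damps flapping.
    up = AlarmRule(threshold=70.0, comparison="gt", granularity=60, evaluation_periods=3)
    down = AlarmRule(threshold=30.0, comparison="lt", granularity=60, evaluation_periods=5)
    print(run([90.0] * 12, up, down, Policy(1, 300), Policy(-1, 300)))
    # [1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
```

In the sustained-load run, the first scale-up waits for three breached minutes and the second waits out the 300-second cooldown, even though the alarm stays breached throughout. Changing `granularity` to 300 in either rule returns `insufficient data` on every tick, because this hypothetical archive only holds 60-second points.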