Autoscaling Clusters with OpenStack Senlin

People come to OpenStack from AWS expecting an Auto Scaling Group and a target-tracking policy, and they’re surprised when Nova doesn’t have one. Heat can scale a stack, but its scaling is tied to stack state: every resize is a stack update, which gets painful on a fleet that changes size often. The service actually built for “keep N healthy nodes and scale them on a signal” is Senlin.

I’ve used Senlin to run autoscaling fleets on OpenStack for years. It’s underrated and underused, partly because people reach for Heat autoscaling first and get burned. Here’s how I run Senlin properly.

The Senlin model: profiles, clusters, policies

Three concepts and you understand Senlin:

A profile describes a single node — usually a Nova server spec (image, flavor, networks, key). It’s the template every node is stamped from.
A cluster is a managed group of nodes built from one profile, with a desired/min/max capacity.
A policy attaches a behavior to a cluster — scaling, health, load-balancing, deletion order, affinity.

The power is in policies. A cluster with a health policy self-heals dead nodes. A cluster with a scaling policy responds to signals. A cluster with an lb policy registers and deregisters nodes from an Octavia pool automatically. You compose these.

Building a cluster

Define a profile, then a cluster from it:

# web-profile.yaml
type: os.nova.server
version: 1.0
properties:
  flavor: m1.small
  image: ubuntu-22.04
  key_name: ops-key
  networks:
    - network: tenant-net

# Create the profile
openstack cluster profile create --spec-file web-profile.yaml web-profile

# Create a cluster with desired/min/max
openstack cluster create \
  --profile web-profile \
  --desired-capacity 3 \
  --min-size 2 \
  --max-size 10 \
  web-cluster

openstack cluster show web-cluster

Senlin immediately reconciles to the desired capacity by booting nodes. From here, every change is a policy or an explicit resize.

Health policy: self-healing for free

The first policy I attach to any production cluster is health. It polls node status and rebuilds or recreates nodes that go unhealthy:

# health-policy.yaml
type: senlin.policy.health
version: 1.1
properties:
  detection:
    interval: 60
    detection_modes:
      - type: NODE_STATUS_POLLING
  recovery:
    actions:
      - name: RECREATE

openstack cluster policy create --spec-file health-policy.yaml health-pol
openstack cluster policy attach --policy health-pol web-cluster

Now a node that Nova reports as ERROR or that vanishes gets recreated automatically. This alone makes Senlin worth running — it’s the self-healing Nova doesn’t give you natively.
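
As a rough mental model of one polling pass, here is a toy sketch. It is not Senlin's implementation: the function name and the plain dict standing in for Nova's answers are assumptions made for illustration.

```python
def nodes_to_recreate(cluster_nodes, nova_status):
    """Return the cluster nodes a RECREATE recovery would target.

    cluster_nodes: node names the cluster believes it owns.
    nova_status:   hypothetical dict of node name -> status from Nova;
                   a missing key models a server that has vanished.
    """
    unhealthy = []
    for name in cluster_nodes:
        status = nova_status.get(name)  # None when Nova no longer knows the server
        if status is None or status == "ERROR":
            unhealthy.append(name)
    return unhealthy


# Example pass: web-1 is in ERROR and web-2 has disappeared from Nova.
targets = nodes_to_recreate(
    ["web-0", "web-1", "web-2"],
    {"web-0": "ACTIVE", "web-1": "ERROR"},
)
```

With interval: 60 in the spec above, a pass like this would run roughly once a minute, so a dead node can sit unnoticed for up to that long before recovery starts.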

Scaling policies and real autoscaling

A scaling policy defines how much to scale per signal:

type: senlin.policy.scaling
version: 1.0
properties:
  event: CLUSTER_SCALE_OUT
  adjustment:
    type: CHANGE_IN_CAPACITY
    number: 2
    min_step: 1
    cooldown: 120
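
To make the adjustment arithmetic concrete, here is a minimal sketch of how I read a CHANGE_IN_CAPACITY scale-out against the cluster's max size. The function is hypothetical, and the handling of best_effort (reject an over-limit request by default, clamp to the limit when it is enabled) reflects my understanding of the policy rather than Senlin's actual code.

```python
def scale_out_target(current, number, max_size, best_effort=False):
    """Return the new desired capacity, or None if the request is rejected.

    current:     the cluster's current desired capacity
    number:      the 'number' field of a CHANGE_IN_CAPACITY adjustment
    max_size:    the cluster's max size
    best_effort: assumed semantics: clamp to max_size instead of rejecting
    """
    target = current + number
    if target <= max_size:
        return target
    if best_effort and current < max_size:
        return max_size  # grow as far as the limit allows
    return None  # request would break the size limit


# With the cluster above (max 10) and number: 2
normal = scale_out_target(3, 2, 10)                       # 3 -> 5
rejected = scale_out_target(9, 2, 10)                     # 11 > 10, rejected
clamped = scale_out_target(9, 2, 10, best_effort=True)    # clamped to 10
```

The practical point: near the ceiling, a step of 2 without best effort can mean no scaling at all, which is easy to misread as a stuck cluster.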

The signal that triggers scaling comes from outside Senlin — typically Aodh alarms on Ceilometer/Gnocchi metrics calling Senlin’s webhook (a receiver):

# Create a webhook receiver that triggers scale-out
openstack cluster receiver create \
  --type webhook \
  --cluster web-cluster \
  --action CLUSTER_SCALE_OUT \
  scale-out-hook
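
Before wiring up an alarm, it helps to fire the receiver by hand. The sketch below assumes the receiver is triggered by a plain POST to the URL shown in its channel field (alarm_url in the output of openstack cluster receiver show scale-out-hook). The URL in the code is a made-up placeholder and both function names are hypothetical.

```python
import urllib.request

# Hypothetical placeholder: copy the real value from the receiver's
# channel.alarm_url field instead of using this one.
ALARM_URL = "http://senlin.example.com:8778/v1/webhooks/RECEIVER_ID/trigger?V=2"


def build_trigger_request(alarm_url):
    """Build the empty-bodied POST request that fires a webhook receiver."""
    return urllib.request.Request(alarm_url, data=b"", method="POST")


def trigger(alarm_url, timeout=10):
    """Send the POST and return the HTTP status code of the response."""
    request = build_trigger_request(alarm_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Nothing is sent until trigger() is called, e.g. trigger(ALARM_URL)
request = build_trigger_request(ALARM_URL)
```

If a manual trigger produces a CLUSTER_SCALE_OUT entry in the action list, the Senlin half of the loop works and any remaining problem is on the alarm side.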

Point an Aodh alarm at the receiver’s URL when CPU crosses a threshold, and you have closed-loop autoscaling: metric -> alarm -> webhook -> Senlin scales. The cooldown is the single most important knob — too short and you flap, adding and removing nodes faster than they can warm up.
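
A small simulation shows what the cooldown buys you. This is a simplified model, not Senlin's code: it assumes a signal is simply dropped if it arrives less than cooldown seconds after the last accepted one, and the function name is hypothetical.

```python
def accepted_signals(signal_times, cooldown):
    """Return the signal timestamps (in seconds) that pass the cooldown gate."""
    accepted = []
    last_accepted = None
    for t in signal_times:
        if last_accepted is None or t - last_accepted >= cooldown:
            accepted.append(t)
            last_accepted = t
    return accepted


# An alarm that re-fires every 60 s while CPU stays high.
alarm_fires = [0, 60, 120, 180, 240]

with_cooldown = accepted_signals(alarm_fires, cooldown=120)  # every other signal
no_cooldown = accepted_signals(alarm_fires, cooldown=0)      # every signal
```

With the 120-second cooldown from the policy above, only every other alarm evaluation results in a resize. With no cooldown, the cluster would grow on every evaluation, before the previous batch of nodes had a chance to take load.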

Wiring in Octavia

For a web fleet, attach an lb policy so new nodes auto-register in your load balancer pool and removed nodes deregister cleanly:

type: senlin.policy.loadbalance
version: 1.1
properties:
  # Trimmed for brevity: a full spec also sets pool.subnet and vip.subnet.
  pool:
    protocol: HTTP
    protocol_port: 80
  health_monitor:
    type: HTTP
    url_path: /healthz

This closes the last gap — scaling that doesn’t update the load balancer is useless. With the lb policy, scale-out adds a backend and scale-in drains and removes one.

The failure modes I watch for

Flapping. Cooldowns too short, or scale-out and scale-in alarms with overlapping thresholds. Leave a dead band between scale-out and scale-in triggers.
Scale-in killing the wrong node. Without a deletion policy, scale-in picks its victims arbitrarily. Attach a deletion policy so scale-in removes nodes in a defined order, such as oldest first, rather than a random one mid-request.
Stuck actions. Senlin actions queue; a wedged action blocks the cluster. Check openstack cluster action list when a cluster stops responding to resizes.
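
The first two items can be sketched in a few lines. Everything here is illustrative: the 70% and 30% CPU thresholds are hypothetical values rather than recommendations, and the functions model the idea of a dead band and oldest-first deletion, not Senlin's or Aodh's internals.

```python
from datetime import datetime

SCALE_OUT_ABOVE = 70.0  # hypothetical CPU % that fires the scale-out alarm
SCALE_IN_BELOW = 30.0   # hypothetical CPU % that fires the scale-in alarm


def decide(cpu_percent):
    """Return 'scale_out', 'scale_in', or None inside the dead band."""
    if cpu_percent > SCALE_OUT_ABOVE:
        return "scale_out"
    if cpu_percent < SCALE_IN_BELOW:
        return "scale_in"
    return None  # between the thresholds: leave the cluster alone


def oldest_first(nodes, count):
    """Pick `count` node names to delete, oldest creation time first.

    nodes: list of (name, created_at) tuples.
    """
    ordered = sorted(nodes, key=lambda node: node[1])
    return [name for name, _ in ordered[:count]]


fleet = [
    ("web-2", datetime(2024, 1, 3)),
    ("web-0", datetime(2024, 1, 1)),
    ("web-1", datetime(2024, 1, 2)),
]
victims = oldest_first(fleet, 2)
```

If the two thresholds were both set near 50%, ordinary load jitter would alternate between the two alarms. The gap between them is what lets the cluster settle after a resize.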

I keep an AI prompt that takes a cluster’s action history plus the attached policies and tells me whether a flapping cluster is a cooldown problem or an overlapping-threshold problem — it disentangles the two faster than I do by eye. A few of these are in our prompt library.

Senlin vs Heat autoscaling

My rule: use Senlin for anything that needs to scale on load, self-heal, and integrate with a load balancer. Use Heat for orchestrating the surrounding infrastructure and let Heat reference the Senlin cluster as a resource. Heat’s own AutoScalingGroup works but couples scaling to stack updates, which gets painful. Senlin keeps the cluster lifecycle independent.

Where to go next

Senlin is the autoscaling primitive OpenStack should be famous for. Start with a health policy for self-healing, add a scaling policy driven by Aodh webhooks, and wire in an lb policy so your load balancer stays accurate. Mind the cooldowns and the deletion order and it runs itself. For the Octavia and Heat services it integrates with, see the OpenStack category.

Autoscaling configurations can scale costs as fast as capacity. Validate cooldowns and min/max bounds against your own load patterns before going live.