Optimizing Resource Usage with OpenStack Watcher

Most OpenStack clouds drift toward waste. The scheduler places instances when they’re created and then forgets about them, so over months you end up with hosts at 90% utilization on one rack and 20% on another, fragmented memory, and a power bill for compute nodes running three idle VMs each. Watcher is the service that fixes that drift on a schedule, automatically.

I’ve used Watcher to claw back real capacity on production clouds, and I’ve also watched it trigger a live-migration storm that made everything worse. The difference is entirely in how conservatively you configure it. Here’s how I run it.

The Watcher mental model

Watcher works in three nouns:

A goal is what you want — server_consolidation, workload_balancing, airflow_optimization, saving_energy.
A strategy is the algorithm that achieves the goal — e.g. vm_workload_consolidation or workload_stabilization.
An audit runs a strategy against current cluster state and produces an action plan — an ordered list of migrations and host power changes.

The crucial part: an audit produces a plan, and the plan does nothing until you (or a policy) approve it. That gap is your safety valve. Never wire audits straight to auto-execute until you’ve watched plans for a few weeks.

Setting up your first audit

Start with a one-shot audit so you can see what Watcher wants to do before it does anything:

# List available goals and strategies
openstack optimize goal list
openstack optimize strategy list

# Create a one-shot consolidation audit
openstack optimize audit create \
  -g server_consolidation \
  --name nightly-consolidation \
  -t ONESHOT

# Inspect the resulting action plan
openstack optimize actionplan list
openstack optimize action list --action-plan <plan-id>

Read that action list carefully. It will show you exactly which instances move where and which hosts it intends to power down. For consolidation on a busy cloud, the first plan is often alarmingly aggressive — that’s your signal to tune thresholds, not to execute.
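When a plan runs to dozens of actions, I find it easier to review a summary first. Here is a minimal Python sketch of that idea. It assumes you have already exported the plan's actions as a list of dicts carrying action_type and input_parameters, and that migrations use source_node, destination_node and resource_id; those field names, the two host-state action types, and the sample hosts are assumptions for illustration, so check them against what your Watcher release actually returns.

```python
from collections import Counter

# Assumed action types that change a host's state; adjust for your release.
HOST_STATE_ACTIONS = {"change_nova_service_state", "change_node_power_state"}


def summarize_plan(actions):
    """Count migrations per host and collect requested host state changes."""
    inbound = Counter()   # destination host -> number of instances arriving
    outbound = Counter()  # source host -> number of instances leaving
    host_changes = []     # (host, requested state) pairs
    for action in actions:
        kind = action.get("action_type")
        params = action.get("input_parameters") or {}
        if kind == "migrate":
            inbound[params.get("destination_node")] += 1
            outbound[params.get("source_node")] += 1
        elif kind in HOST_STATE_ACTIONS:
            host_changes.append((params.get("resource_id"), params.get("state")))
    return {
        "migrations": sum(inbound.values()),
        "inbound": dict(inbound),
        "outbound": dict(outbound),
        "host_changes": host_changes,
    }


# Hypothetical plan: three instances leave cmp-07, which is then disabled.
sample_actions = [
    {"action_type": "migrate",
     "input_parameters": {"resource_id": "vm-a", "source_node": "cmp-07",
                          "destination_node": "cmp-01"}},
    {"action_type": "migrate",
     "input_parameters": {"resource_id": "vm-b", "source_node": "cmp-07",
                          "destination_node": "cmp-01"}},
    {"action_type": "migrate",
     "input_parameters": {"resource_id": "vm-c", "source_node": "cmp-07",
                          "destination_node": "cmp-02"}},
    {"action_type": "change_nova_service_state",
     "input_parameters": {"resource_id": "cmp-07", "state": "disabled"}},
]

summary = summarize_plan(sample_actions)
print(summary)
```

A host that shows up with a large inbound count is the first thing I would look at before approving anything.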

Tuning the consolidation strategy

The vm_workload_consolidation strategy packs VMs onto fewer hosts so empties can sleep. The knobs that matter live in the strategy parameters:

# Continuous audit: re-run hourly, reading the last hour of metrics at 5-minute granularity
openstack optimize audit create \
  -g server_consolidation \
  --strategy vm_workload_consolidation \
  -t CONTINUOUS \
  --interval 3600 \
  -p period=3600 \
  -p granularity=300

Two failure modes to design against:

Migration storms. If too many migrations run at once, you saturate your storage and management networks and live-migration success rates drop. Cap how many actions run in parallel (the weight planner's per-action-type parallelization setting is one such knob; verify the option name against your release) and prefer a longer audit interval so plans stay small and incremental.
Thrashing. Consolidation packs hosts at night, a workload spikes, the balancer spreads it back out, and consolidation packs it again. Use a workload_stabilization strategy with sane thresholds rather than chasing every fluctuation. Watcher needs a metrics source (Gnocchi or Prometheus, set through the datasource config), and bad metrics produce bad plans.
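To make the thrashing point concrete, here is a small Python sketch of hysteresis: a gap between the "pack" and "spread" thresholds so that load hovering around a single cutoff does not flip the decision on every sample. This is not how Watcher's strategies are implemented; the function names, the mode labels, the thresholds, and the sample series are all illustrative assumptions.

```python
def next_mode(current_mode, cpu_util, low=0.30, high=0.75):
    """Return 'pack' below `low`, 'spread' above `high`, else keep the current mode."""
    if cpu_util < low:
        return "pack"
    if cpu_util > high:
        return "spread"
    return current_mode  # inside the dead band: do not change anything


def count_flips(samples, low, high, start="spread"):
    """Count how often the decision changes over a series of utilization samples."""
    mode, flips = start, 0
    for util in samples:
        new_mode = next_mode(mode, util, low, high)
        if new_mode != mode:
            flips += 1
        mode = new_mode
    return flips


# Hypothetical utilization series that hovers around 50% before a real spike and drop.
samples = [0.45, 0.55, 0.48, 0.52, 0.47, 0.80, 0.50, 0.25]

naive_flips = count_flips(samples, low=0.50, high=0.50)   # single cutoff
damped_flips = count_flips(samples, low=0.30, high=0.75)  # wide dead band
print(naive_flips, damped_flips)
```

With a single cutoff the decision changes on nearly every sample; with a dead band it changes only when load has genuinely moved. That is the behavior to aim for when you pick stabilization thresholds.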

Executing and watching

When a plan looks right, execute it and watch:

openstack optimize actionplan start <plan-id>

# Track progress
watch openstack optimize action list --action-plan <plan-id>

Each action transitions PENDING -> ONGOING -> SUCCEEDED (or FAILED). A failed migration leaves the instance where it was — Watcher doesn’t unwind the whole plan, it just stops. That’s mostly fine, but it means a partial plan can leave the cluster in a half-balanced state, so re-audit after failures rather than assuming the goal was met.
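That re-audit rule is easy to encode. The Python sketch below assumes you have the plan's actions as a list of dicts with a state field holding the values named above; the function name, the verdict labels, and the sample data are hypothetical.

```python
IN_FLIGHT = {"PENDING", "ONGOING"}


def plan_verdict(actions):
    """Classify a plan as 're-audit', 'running', or 'done' from its action states."""
    states = [action["state"] for action in actions]
    # Anything other than in-flight or SUCCEEDED (for example FAILED) means
    # the plan did not complete cleanly, so the goal cannot be assumed met.
    if any(state not in IN_FLIGHT and state != "SUCCEEDED" for state in states):
        return "re-audit"
    if any(state in IN_FLIGHT for state in states):
        return "running"
    return "done"


# Hypothetical plan that stopped after its second migration failed.
stalled_plan = [
    {"uuid": "a1", "state": "SUCCEEDED"},
    {"uuid": "a2", "state": "FAILED"},
    {"uuid": "a3", "state": "PENDING"},
]

verdict = plan_verdict(stalled_plan)
print(verdict)
```

I treat anything other than "done" as a reason not to report the consolidation as finished.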

The datasource is the whole game

I cannot stress this enough: Watcher is only as smart as its metrics. Point it at a datasource that actually reflects load:

[watcher_datasources]
datasources = gnocchi

[watcher_cluster_data_model_collectors.compute]
period = 3600

If Gnocchi is sampling sparsely or your Ceilometer pipeline is dropping metrics, Watcher sees flat CPU everywhere and either does nothing or consolidates based on allocation rather than real usage. Before you trust a single plan, confirm the metrics it’s reading match what your monitoring shows. I keep an AI prompt that takes a Watcher action plan plus the current openstack hypervisor list output and flags any migration that targets an already-hot host, which is a cheap sanity check before I hit start. There are a few of these in our prompt library.
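The same hot-host check can be done deterministically. Here is a minimal Python sketch, assuming you have reduced the plan to a list of migrations and your monitoring to a host-to-utilization map expressed as fractions. The field names, the 0.80 threshold, and the sample hosts are assumptions, not anything Watcher defines.

```python
def flag_hot_targets(migrations, host_util, threshold=0.80):
    """Return migrations whose destination is at or above threshold, or unknown."""
    flagged = []
    for migration in migrations:
        dest = migration["destination_node"]
        util = host_util.get(dest)
        # A destination with no utilization data is flagged too: no data is
        # not the same thing as no load.
        if util is None or util >= threshold:
            flagged.append((migration["resource_id"], dest, util))
    return flagged


# Hypothetical inputs: planned moves and current CPU utilization per host.
planned = [
    {"resource_id": "vm-a", "destination_node": "cmp-01"},
    {"resource_id": "vm-b", "destination_node": "cmp-02"},
    {"resource_id": "vm-c", "destination_node": "cmp-09"},
]
utilization = {"cmp-01": 0.35, "cmp-02": 0.86}

hot = flag_hot_targets(planned, utilization)
print(hot)
```

Take the utilization figures from your own monitoring rather than from the datasource Watcher reads, so the check also catches cases where the two disagree.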

How I actually deploy it

My production pattern:

Weeks 1–3: ONESHOT audits only, reviewed by a human each morning. Build trust, tune thresholds.
Weeks 4+: CONTINUOUS audits with a long interval, plans auto-generated but execution gated by a low concurrency cap.
Always: consolidation runs in the maintenance window, never during peak, and never the night before a known traffic event.
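The "Always" rule is the one I would enforce in code rather than by memory. Below is a minimal Python sketch of such a gate. The 01:00 to 05:00 window, the function name, and the event calendar are hypothetical, the window is assumed not to cross midnight, and "the night before" is read conservatively as any event falling today or tomorrow.

```python
from datetime import date, datetime, time, timedelta


def may_execute(now, window_start=time(1, 0), window_end=time(5, 0),
                event_dates=frozenset()):
    """Allow execution only inside the window and away from known traffic events."""
    in_window = window_start <= now.time() < window_end
    today = now.date()
    # Conservative reading of "never the night before": block if an event
    # is scheduled for today or tomorrow.
    near_event = today in event_dates or (today + timedelta(days=1)) in event_dates
    return in_window and not near_event


# Hypothetical calendar with one known traffic event.
events = {date(2024, 3, 6)}

quiet_night = may_execute(datetime(2024, 3, 1, 2, 0), event_dates=events)
peak_hours = may_execute(datetime(2024, 3, 1, 14, 0), event_dates=events)
before_event = may_execute(datetime(2024, 3, 5, 2, 0), event_dates=events)
print(quiet_night, peak_hours, before_event)
```

A wrapper around the plan start command can call a check like this and refuse to proceed when it returns False.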

That cadence has let me recover meaningful host capacity — fewer powered compute nodes, lower draw — without a single Watcher-induced incident.

Where to go next

Watcher pays for itself on any cloud large enough to drift, but it rewards caution: audit before you execute, fix your datasource before you trust a plan, and cap migration concurrency hard. Treat the first few weeks as observation, not automation. For the rest of the OpenStack operations stack — Nova, live migration, monitoring — browse the OpenStack category.

Action plans are recommendations. Always review which instances and hosts a plan touches before executing it against production.