Event-Driven Automation with StackStorm and Rundeck
How to build event-driven ops automation with StackStorm and Rundeck — sensors, rules, workflows, and AI-assisted triggers that act on events safely.
- #automation
- #stackstorm
- #rundeck
- #event-driven
- #sre
- #orchestration
Most ops automation is still cron jobs and humans clicking buttons after an alert fires. Event-driven automation flips that: something happens in your infrastructure, and the right workflow runs in response — no human relay in the middle. I’ve run both StackStorm and Rundeck in anger, and they solve overlapping problems from opposite ends. This is how to pick, wire, and guardrail them.
The event-driven model in one diagram
Every event-driven system has the same four pieces:
- Sensor / trigger — watches a source (webhook, queue, log stream, API poll) and emits structured events.
- Rule — matches events against criteria and decides what to run.
- Workflow / action — the actual work: restart, scale, notify, remediate.
- History — an audit trail of what fired, why, and what happened.
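To make the four pieces concrete, here is a minimal in-process sketch in plain Python. It is not how StackStorm or Rundeck are implemented; the names `Rule`, `Engine`, and `disk_sensor`, and the 90% threshold, are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rule:
    name: str
    matches: Callable[[dict], bool]  # criteria: does this event qualify?
    action: Callable[[dict], str]    # the work to run; returns a result summary


@dataclass
class Engine:
    rules: list
    history: list = field(default_factory=list)  # audit trail of what fired

    def handle(self, event: dict) -> None:
        """Run every rule whose criteria match the event and record the outcome."""
        for rule in self.rules:
            if rule.matches(event):
                result = rule.action(event)
                self.history.append(
                    {"rule": rule.name, "event": event, "result": result}
                )


def disk_sensor(readings):
    """Sensor: turn raw (host, percent_used) readings into structured events."""
    for host, used_pct in readings:
        if used_pct >= 90:  # hypothetical threshold
            yield {"type": "disk_almost_full", "host": host, "used_pct": used_pct}


engine = Engine(
    rules=[
        Rule(
            name="rotate_logs_on_disk_almost_full",
            matches=lambda e: e["type"] == "disk_almost_full",
            action=lambda e: f"would rotate logs on {e['host']}",
        )
    ]
)

for event in disk_sensor([("log01", 93), ("web01", 41)]):
    engine.handle(event)

for entry in engine.history:
    print(entry["rule"], "->", entry["result"])
```

The history list is the part people skip and regret: without it you cannot answer "why did that run?" after the fact.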
The whole value is collapsing the time between “event happened” and “correct response ran” from minutes-of-human to milliseconds-of-machine. The whole risk is that a bad rule means a wrong response runs at machine speed. Guardrails are not optional.
StackStorm: when events are messy and many
StackStorm (“IFTTT for ops”) is built around sensors and rules. A sensor watches something and emits a trigger; a rule maps that trigger to an action or workflow. It shines when events come from many heterogeneous sources.
A rule looks like this:
---
name: "rotate_logs_on_disk_almost_full"
pack: "ops"
trigger:
  type: "prometheus.alert"
criteria:
  trigger.labels.alertname:
    type: "equals"
    pattern: "DiskAlmostFull"
  trigger.labels.severity:
    type: "equals"
    pattern: "warning"
action:
  ref: "ops.rotate_logs"
  parameters:
    host: "{{ trigger.labels.instance }}"
    dry_run: true
Two things to notice. The criteria are strict: only severity warning, only alertname DiskAlmostFull. And dry_run: true is passed to the action as a parameter, so as long as ops.rotate_logs honors it, the rule only logs what it would do until you trust it. StackStorm’s pack ecosystem gives you pre-built sensors for Prometheus, GitHub, AWS, and dozens more, so you’re rarely writing sensor code from scratch.
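A dry_run parameter only protects you if the action checks it before doing anything. The sketch below shows that shape as a plain Python function. It is not a real StackStorm action class (those subclass the runner's base `Action`), and the `runner` callable and the `logrotate` command line are assumptions for illustration.

```python
import logging

logger = logging.getLogger("ops.rotate_logs")


def rotate_logs(host: str, dry_run: bool = True, runner=None) -> dict:
    """Rotate logs on host, or only report the plan when dry_run is true."""
    # Hypothetical command; adjust to whatever your hosts actually use.
    command = ["logrotate", "--force", "/etc/logrotate.conf"]

    if dry_run:
        # Log-only path: nothing is executed on the target host.
        logger.info("DRY RUN: would run %s on %s", " ".join(command), host)
        return {"host": host, "executed": False, "command": command}

    if runner is None:
        # Fail closed rather than guessing how to reach the host.
        raise RuntimeError("no runner configured for live execution")

    runner(host, command)  # assumed to run the command remotely
    return {"host": host, "executed": True, "command": command}


calls = []
plan = rotate_logs("log01")  # defaults to dry run
live = rotate_logs(
    "log01", dry_run=False, runner=lambda h, c: calls.append((h, c))
)
print(plan["executed"], live["executed"], len(calls))
```

Defaulting `dry_run` to true means a rule that forgets the parameter stays harmless, which is the failure mode you want.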
The strength is composability: chain actions into workflows with Orquesta, StackStorm’s workflow engine, including branching and error handling. The cost is operational weight — StackStorm is a real distributed system (MongoDB, RabbitMQ, multiple services). Don’t run it to schedule three jobs.
Rundeck: when events are few and humans are involved
Rundeck comes at it from the job-runner side. Its core is well-defined jobs with parameters, access control, and a clean API. Events trigger jobs via webhooks rather than a sensor framework.
A Rundeck webhook endpoint maps an incoming POST to a job run:
# webhook -> job mapping (rundeck)
webhook:
  name: pagerduty-incident
  eventPlugin: webhook-run-job
  config:
    jobId: a1b2c3-restart-service
    argString: "-service ${data.service} -reason ${data.summary}"
Where Rundeck wins is human-in-the-loop. Approval gates, role-based access on a per-job basis, and a UI non-engineers can use make it the better fit when the response needs sign-off or when on-call wants a big obvious “run the runbook” button. It’s lighter to operate and its access control is more mature out of the box.
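On the sending side, the caller just POSTs JSON whose fields line up with the `${data.service}` and `${data.summary}` references in the argString. Here is a rough sketch using only the standard library. The host, token, and API version are placeholders I made up; check the webhook URL Rundeck shows you for your own instance rather than trusting this path.

```python
import json
import urllib.request

API_VERSION = "40"  # assumption: use the version your Rundeck instance reports


def build_webhook_request(
    base_url: str, token: str, service: str, summary: str
) -> urllib.request.Request:
    """Build (but do not send) the POST that would trigger the webhook."""
    # Field names match the ${data.*} references in the argString.
    payload = {"service": service, "summary": summary}
    return urllib.request.Request(
        url=f"{base_url}/api/{API_VERSION}/webhook/{token}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_webhook_request(
    "https://rundeck.example.com",  # placeholder host
    "EXAMPLE-TOKEN",                # placeholder webhook token
    service="checkout-api",
    summary="p95 latency high",
)
print(req.get_method(), req.full_url)

# To actually send it:
# urllib.request.urlopen(req, timeout=10)
```

Treat the webhook token like a credential: anyone who has the URL can trigger the job with whatever arguments they like.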
Adding AI to the trigger, not the action
The temptation is to let a model decide what to run. Resist it at the action layer. The safe place for AI in event-driven automation is classification and enrichment of the event, before the rule matches.
Insert an AI classification step that turns a noisy event into a clean, labeled one:
def enrich(event):
    classification = classify_event(event)  # returns label + confidence
    event["ai_class"] = classification["label"]
    event["ai_confidence"] = classification["confidence"]
    return event
Now your StackStorm rule or Rundeck webhook matches on ai_class and gates on ai_confidence. The AI never picks the action — it produces a label that a human-written rule maps to an action. If confidence is low, the rule routes to a notify-only action and a human decides. This keeps the deterministic rule as the source of truth and uses AI only to make messy events legible.
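The gating logic described above is small enough to write out. This is a sketch of the decision only, not of either tool's rule syntax; the 0.85 floor, the `disk_pressure` label, and the `ops.notify_oncall` action name are assumptions.

```python
CONFIDENCE_FLOOR = 0.85  # hypothetical threshold; tune against your own history

# Human-written mapping: the model supplies the label, never the action.
ACTION_BY_CLASS = {
    "disk_pressure": "ops.rotate_logs",
}
NOTIFY_ONLY = "ops.notify_oncall"  # hypothetical notify-only action


def choose_action(event: dict, floor: float = CONFIDENCE_FLOOR) -> str:
    """Map an enriched event to an action, falling back to notify-only."""
    label = event.get("ai_class")
    confidence = event.get("ai_confidence", 0.0)

    # Low confidence or an unknown label both route to a human.
    if confidence < floor or label not in ACTION_BY_CLASS:
        return NOTIFY_ONLY
    return ACTION_BY_CLASS[label]


print(choose_action({"ai_class": "disk_pressure", "ai_confidence": 0.97}))
print(choose_action({"ai_class": "disk_pressure", "ai_confidence": 0.40}))
print(choose_action({"ai_class": "something_new", "ai_confidence": 0.99}))
```

Note that an unrecognized label falls through to notify-only even at high confidence, so a model that invents a new class cannot trigger anything.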
Guardrails for anything that runs unattended
Event-driven means unattended, so the guardrails carry the safety:
- Confidence floors on AI-enriched fields. Never auto-act below threshold; route to notify-only.
- Strict criteria. Match on severity, environment, and label — not just alert name. A rule that fires on any severity will eventually fire in production at the worst time.
- Idempotent actions. Events can fire twice (retries, flapping). Actions must be safe to run twice.
- Rate limits / circuit breakers. If a rule fires N times in a window, stop and escalate. Flapping events should not produce flapping remediations.
- Dry-run rollout. New rules log-only for two weeks. Promote to live only after the log shows it would have done the right thing.
- Environment scoping. Test rules can only target non-prod. Production targeting requires explicit, reviewed config.
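Two of the guardrails above, duplicate suppression and a circuit breaker, fit in one small guard that sits in front of the action. This is a sketch under simple assumptions: a single process, in-memory state, and events that carry a stable id. The class name and return values are invented for illustration.

```python
import time
from collections import deque


class RemediationGuard:
    """Decide whether a remediation may run, given recent activity."""

    def __init__(self, max_runs: int, window_s: float, clock=time.monotonic):
        self.max_runs = max_runs
        self.window_s = window_s
        self.clock = clock
        self.recent = deque()  # timestamps of runs allowed in the window
        self.seen_ids = set()  # event ids already handled (unbounded here)
        self.tripped = False   # once tripped, stays open until a human resets it

    def decide(self, event_id: str) -> str:
        if self.tripped:
            return "escalate"
        if event_id in self.seen_ids:
            return "skip_duplicate"  # retry or redelivery of the same event

        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.recent and now - self.recent[0] > self.window_s:
            self.recent.popleft()

        if len(self.recent) >= self.max_runs:
            self.tripped = True  # flapping: stop remediating, page someone
            return "escalate"

        self.seen_ids.add(event_id)
        self.recent.append(now)
        return "run"

    def reset(self) -> None:
        """Manual reset after a human has looked at the flapping source."""
        self.tripped = False
        self.recent.clear()


fake_now = [0.0]
guard = RemediationGuard(max_runs=2, window_s=60, clock=lambda: fake_now[0])

decisions = [guard.decide("evt-a"), guard.decide("evt-a")]
fake_now[0] = 10.0
decisions += [guard.decide("evt-b"), guard.decide("evt-c"), guard.decide("evt-d")]
print(decisions)
```

Deduplicating by event id does not replace idempotent actions. It only cuts down how often you rely on them.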
Choosing between them
A rough decision guide from running both:
| Need | Lean toward |
|---|---|
| Many heterogeneous event sources | StackStorm |
| Complex multi-step workflows with branching | StackStorm |
| Human approval gates and RBAC per job | Rundeck |
| Non-engineers running ops jobs from a UI | Rundeck |
| Lightweight, few jobs, easy to operate | Rundeck |
| ChatOps + sensor-driven autonomous response | StackStorm |
Plenty of teams run both: Rundeck for human-initiated and approval-gated jobs, StackStorm for autonomous sensor-driven response. They coexist fine.
Where to start
Pick one repetitive, low-risk response you do by hand today — “disk warning on the log host means rotate logs” — and automate exactly that, in dry-run, with strict criteria. Watch the history for two weeks. Then enable it live and add the next one. Event-driven automation compounds: each rule you trust frees attention for the next.
For the events that still need a human, keep a fast triage path — our AI Incident Response Assistant turns symptoms into a safest-first plan — and browse more patterns under AI for Automation.
Event-triggered actions run at machine speed. Roll out every new rule in dry-run and verify against your own systems before going live.