Automation Error Guide: 'Action timed out' StackStorm/Rundeck Job Failed
Fix StackStorm and Rundeck job failures, sensor errors, and action timeouts: diagnose runner timeouts, dead sensors, SSH/node failures, missing config, and pack issues.
Overview
A StackStorm action timeout or Rundeck job failure means a runner started executing a step but did not complete within the configured timeout, or a sensor/trigger that should have fired an execution errored and stopped emitting. StackStorm runs actions via runners (local-shell, remote-shell over SSH/Paramiko, python, http) and watches for events via sensors; Rundeck runs job steps across nodes over SSH/WinRM. When a runner hangs past its timeout it’s killed and the action is marked timeout/failed; when a sensor crashes, the automations it was supposed to trigger simply never run.
You will see the action timeout in the StackStorm execution:
st2.actionrunner Action execution 6601...f timed out after 60s status=timeout
result: {"failed": true, "succeeded": false, "error": "Action timed out"}
Or a Rundeck step failure across nodes:
[node02] FAILED: Failed: SSHProtocolFailure: Connection timed out
Execution #4821 failed. 1/3 nodes failed.
Or a dead sensor in the StackStorm sensorcontainer log:
ERROR st2reactor.sensor Sensor 'mypack.FileWatchSensor' crashed: FileNotFoundError: '/var/spool/in'
WARN respawning sensor mypack.FileWatchSensor (attempt 5)
These failures surface when an action runs (manually, on a schedule, or trigger-driven) or when a sensor polls. A previously working workflow can fail the moment a target node becomes slow or unreachable, a credential expires, a watched path disappears, or a runner timeout turns out to be too tight for the work.
Symptoms
- StackStorm execution ends in timeout or failed with “Action timed out”.
- Rundeck job shows FAILED on one or more nodes, often SSHProtocolFailure / Connection timed out.
- A sensor logs a crash and respawns repeatedly; expected trigger-driven executions stop appearing.
- Manual re-run of the same action also hangs or fails on the same node.
# StackStorm: inspect the failed execution
st2 execution get 6601f --detail | grep -iE "status|timeout|error|stderr" | head
status: timeout
result.error: Action timed out
# Rundeck: list recent failed executions for a job
rd executions query --project ops --recent 1h --status failed --max 5
4821 job=restart-app status=failed node02 duration=120s
Common Root Causes
1. Runner timeout shorter than the work
The action legitimately takes longer than its timeout parameter, so it’s killed mid-run.
st2 action get mypack.long_backup | grep -iE "timeout|runner"
runner_type: remote-shell-script
timeout: 60
A 60s timeout on a backup that takes minutes guarantees a timeout status. Raise the action/runner timeout.
2. Target node unreachable / SSH failure
The remote-shell runner (StackStorm) or node executor (Rundeck) can’t reach the host or authenticate, so the step hangs then fails.
# Reproduce the SSH path the runner uses
ssh -o ConnectTimeout=5 -i /home/stanley/.ssh/stanley_rsa stanley@node02 'echo ok'
ssh: connect to host node02 port 22: Connection timed out
If SSH fails here, that failure is the real cause; the runner only surfaces it as a timeout or failed step.
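To make the manual SSH check repeatable, the error text can be sorted into a few coarse buckets. The following is a minimal sketch: the function name and the bucket labels (dns, network, sshd-down, host-key, auth) are hypothetical choices for this guide, and the matched strings are common OpenSSH client messages rather than an exhaustive list.

```shell
# Classify one line of ssh stderr into a coarse bucket.
# Bucket names are illustrative labels, not output of any tool.
classify_ssh_failure() {
  case "$1" in
    *"Could not resolve hostname"*)                 echo dns ;;
    *"Connection timed out"*|*"No route to host"*)  echo network ;;
    *"Connection refused"*)                         echo sshd-down ;;
    *"Host key verification failed"*)               echo host-key ;;
    *"Permission denied"*)                          echo auth ;;
    *)                                              echo unknown ;;
  esac
}

# Hypothetical usage against the example node (not executed here):
#   err=$(ssh -o BatchMode=yes -o ConnectTimeout=5 stanley@node02 true 2>&1)
#   classify_ssh_failure "$err"
```

A network or dns result points at the host or its routing; auth or host-key points at credentials or a rebuilt node.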
3. Sensor crashed / not registered
A sensor referencing a missing path, bad config, or unhandled exception crashes and the sensorcontainer respawns it in a loop — no triggers fire.
st2 sensor list --pack mypack
journalctl -u st2sensorcontainer --no-pager | grep -iE "crash|respawn|Traceback" | tail
ERROR Sensor mypack.FileWatchSensor crashed: FileNotFoundError '/var/spool/in'
A repeatedly crashing sensor means its trigger-driven workflows are effectively dead.
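One way to see which sensors are crash-looping is to count crash lines per sensor. This sketch assumes the log lines look like the examples in this guide ("Sensor <name> crashed", with or without quotes around the name); real sensorcontainer output may differ, so treat the pattern as a starting point. The function name is hypothetical.

```shell
# Read log lines on stdin and print "<sensor> <crash count>", highest count first.
# Only lines containing "Sensor <name> crashed" are counted; the name may be quoted.
sensor_crash_counts() {
  sed -n "s/.*Sensor '\{0,1\}\([A-Za-z0-9_.]*\)'\{0,1\} crashed.*/\1/p" \
    | sort | uniq -c | sort -rn | awk '{print $2, $1}'
}

# Hypothetical usage (not executed here):
#   journalctl -u st2sensorcontainer --no-pager | sensor_crash_counts
```

A sensor with a count that keeps growing between runs is the one whose triggers have stopped firing.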
4. Missing pack config / datastore key
The action needs a config value or datastore key (API token, host) that isn’t set, so it errors immediately or hangs waiting.
st2 key list --scope system | grep -i mypack
cat /opt/stackstorm/configs/mypack.yaml 2>/dev/null | grep -iE "api_url|token" | head
(no matching key)
A missing mypack.api_token or empty config makes the action fail on first use.
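The config check above can be turned into a small pre-flight test. This is a sketch under stated assumptions: the pack config is a flat YAML file with top-level `key: value` lines, the key names (api_url, token) are taken from the grep in this guide, and the function name is hypothetical. It does not parse nested YAML or quoted empty strings.

```shell
# Print every required key that is absent or has an empty value in a flat YAML file.
# Usage: missing_config_keys <file> <key>...
missing_config_keys() (
  file="$1"
  shift
  for key in "$@"; do
    # A key counts as set only if a non-blank, non-comment value follows the colon.
    grep -Eq "^${key}:[[:space:]]*[^[:space:]#]" "$file" 2>/dev/null || echo "$key"
  done
)

# Hypothetical usage (not executed here):
#   missing_config_keys /opt/stackstorm/configs/mypack.yaml api_url token
```

Empty output means every listed key has a value; an unreadable or missing file reports every key as missing.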
5. Command itself hangs (no TTY / interactive prompt)
A remote command waits on stdin or a prompt (e.g., sudo asking for a password, an interactive installer) and never returns, hitting the timeout.
ssh -i /home/stanley/.ssh/stanley_rsa stanley@node02 'sudo systemctl restart app' 2>&1 | head
sudo: a terminal is required to read the password
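Without a NOPASSWD rule, sudo needs a password it can never receive. Run by hand without a TTY it fails immediately, as shown; when a terminal is allocated it waits at the prompt, and the runner hangs until the timeout.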
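A command that might prompt can be forced to fail fast instead of hanging. The sketch below assumes GNU coreutils `timeout` is available; the wrapper name is hypothetical. It redirects stdin from /dev/null so nothing can wait on input, and applies a hard time limit as a backstop.

```shell
# Run a command with stdin redirected from /dev/null and a hard time limit.
# Assumes GNU coreutils `timeout`; exit status 124 means the limit was hit.
run_noninteractive() (
  limit="$1"
  shift
  timeout "$limit" "$@" </dev/null
)

# Hypothetical usage against the example node (not executed here):
#   run_noninteractive 10 ssh -o BatchMode=yes stanley@node02 'sudo -n systemctl restart app'
# `sudo -n` never prompts: it exits with an error if a password would be needed.
```

With `sudo -n` and `BatchMode=yes`, a missing sudoers rule or key shows up as an immediate error rather than a timeout.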
6. Rundeck node filter / thread pool / dispatch issue
A job dispatched to many nodes with a low thread count or a bad node filter stalls or partially fails.
rd jobs info --id <jobid> --project ops | grep -iE "nodefilter|threadcount|node"
nodefilter: tags: web
threadcount: 1
A threadcount: 1 across many nodes serializes execution; a stale tag filter can target dead nodes.
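The cost of a low thread count is easy to estimate. This sketch assumes every node takes roughly the same time and that nodes are processed in waves of `threadcount` at once, which is a simplification of real dispatch; the function name and the sample numbers are hypothetical.

```shell
# Estimate wall-clock seconds for a fan-out job.
# Usage: fanout_seconds <nodes> <seconds per node> <threadcount>
fanout_seconds() (
  nodes="$1"
  per_node="$2"
  threads="$3"
  # Number of waves is ceil(nodes / threads), computed with integer arithmetic.
  waves=$(( (nodes + threads - 1) / threads ))
  echo $(( waves * per_node ))
)

fanout_seconds 30 40 1    # 30 nodes, 40s each, serialized
fanout_seconds 30 40 10   # same job with 10 threads
```

Under these assumptions the serialized run takes 1200 seconds and the 10-thread run takes 120, which is often the difference between finishing and hitting a job-level timeout.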
Diagnostic Workflow
Step 1: Read the execution detail and classify
# StackStorm
st2 execution get <id> --detail | grep -iE "status|error|stderr|timeout"
# Rundeck
rd executions query --project <proj> --recent 1h --status failed --max 5
- timeout → runner timeout or a hanging command.
- failed with SSH errors → node or credential problem.
- Sensor missing or crash-looping → the trigger never fired.
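The classification above can be written down as a small lookup. This is only a sketch of the decision table: the function name is hypothetical, the "action error" bucket for non-SSH failures is an addition for completeness, and the matched strings are the ones shown in this guide.

```shell
# Map an execution status plus error text to a likely cause.
# Usage: classify_execution <status> <error or stderr text>
classify_execution() (
  status="$1"
  detail="$2"
  case "$status" in
    timeout)
      echo "runner timeout or hanging command" ;;
    failed)
      case "$detail" in
        *SSH*|*"Connection timed out"*|*"Permission denied"*)
          echo "node or credential" ;;
        *)
          echo "action error" ;;
      esac ;;
    succeeded)
      echo "ok" ;;
    *)
      echo "unknown" ;;
  esac
)
```

For example, `classify_execution failed "SSHProtocolFailure: Connection timed out"` lands in the node or credential bucket, which sends you to Step 2.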
Step 2: Reproduce the runner’s remote command directly
ssh -o ConnectTimeout=5 -i <key> <user>@<node> '<the action command>'
If the raw SSH/command fails or hangs, you’ve found the cause outside the automation layer.
Step 3: Check the action/runner timeout vs actual duration
st2 action get <pack.action> | grep -iE "timeout|runner_type"
If the command’s real duration exceeds timeout, raise it (and prefer async patterns for long jobs).
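The comparison between configured timeout and measured duration can be sketched as simple arithmetic. The 50% default headroom and both function names are assumptions for illustration; pick headroom from your own duration data.

```shell
# Suggest a timeout: measured seconds plus a headroom percentage, rounded up.
# Usage: suggest_timeout <measured seconds> [headroom percent, default 50]
suggest_timeout() (
  measured="$1"
  headroom_pct="${2:-50}"
  echo $(( (measured * (100 + headroom_pct) + 99) / 100 ))
)

# Compare the configured timeout with the suggested one.
# Usage: timeout_verdict <configured seconds> <measured seconds> [headroom percent]
timeout_verdict() (
  configured="$1"
  measured="$2"
  wanted=$(suggest_timeout "$measured" "${3:-50}")
  if [ "$configured" -lt "$wanted" ]; then
    echo "raise timeout to ${wanted}s"
  else
    echo "ok"
  fi
)

timeout_verdict 60 180   # a 60s timeout against a 180s backup
```

With the default headroom, a 180-second job against a 60-second timeout yields a suggestion of 270 seconds.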
Step 4: Verify sensors are alive and registered
st2 sensor list --pack <pack>
journalctl -u st2sensorcontainer --no-pager | grep -iE "crash|respawn|Traceback" | tail -20
A crash-looping sensor must be fixed (missing path/config) before its triggers will fire again.
Step 5: Confirm config, datastore keys, and node filters
st2 key list --scope system | grep -i <pack>
# Rundeck
rd jobs info --id <jobid> --project <proj> | grep -iE "nodefilter|threadcount"
Set missing keys/config; correct node filters and raise thread count for fan-out jobs.
Example Root Cause Analysis
A StackStorm action infra.restart_app that restarts a service on node02 starts ending in timeout after exactly 60s. The same action on node01 and node03 succeeds.
The execution detail shows nothing but the timeout:
status: timeout
result.error: Action timed out
Since two of three nodes work, this is node-specific. Reproducing the exact remote command the runner uses:
ssh -i /home/stanley/.ssh/stanley_rsa stanley@node02 'sudo systemctl restart app'
sudo: a terminal is required to read the password; a password is required
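node02 is missing the NOPASSWD sudoers entry that the other nodes have (it was rebuilt from an older image). Run by hand without a TTY, sudo refuses to prompt and exits with the error above; under the runner it waits for a password that never arrives, so the action hangs until the 60s timeout kills it.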
Fix: restore the NOPASSWD rule for the stanley user on node02 (matching the other nodes) and re-run:
# On node02: /etc/sudoers.d/stanley -> stanley ALL=(ALL) NOPASSWD: ALL
st2 run infra.restart_app hosts=node02
status: succeeded
With no password prompt, the command returns immediately and the action succeeds.
Prevention Best Practices
- Set action/runner timeouts to the work’s real p99 duration with headroom, and use async/polling patterns for genuinely long jobs instead of one giant blocking action.
- Standardize node access (keys, NOPASSWD sudoers, WinRM) across the whole fleet via config management so one rebuilt node doesn’t hang on a prompt.
- Monitor the sensor container: alert when a sensor crash-loops, because its trigger-driven workflows fail silently while the rest of the system looks healthy.
- Validate that every pack’s required config and datastore keys exist before enabling its actions; treat missing keys as a deploy-blocking check.
- Keep node filters and thread counts sane for Rundeck fan-out jobs so a stale tag or threadcount: 1 doesn’t stall or mis-target.
- Alert on execution failure/timeout rates per action.
Quick Command Reference
# StackStorm: inspect a failed/timed-out execution
st2 execution get <id> --detail | grep -iE "status|error|stderr|timeout"
st2 action get <pack.action> | grep -iE "timeout|runner_type"
# Reproduce the runner's remote command
ssh -o ConnectTimeout=5 -i <key> <user>@<node> '<command>'
# Sensor health
st2 sensor list --pack <pack>
journalctl -u st2sensorcontainer | grep -iE "crash|respawn|Traceback" | tail
# Config and datastore keys
st2 key list --scope system | grep -i <pack>
# Rundeck: failed executions and job config
rd executions query --project <proj> --recent 1h --status failed --max 5
rd jobs info --id <jobid> --project <proj> | grep -iE "nodefilter|threadcount"
Conclusion
A StackStorm/Rundeck action timeout or job failure means a runner hung or a step/sensor errored. The usual root causes:
- A runner timeout shorter than the work’s real duration.
- The target node is unreachable or SSH/auth fails.
- A sensor crashed and crash-loops, so its triggers never fire.
- Missing pack config or datastore keys the action needs.
- The remote command hangs on an interactive prompt (e.g., sudo without NOPASSWD).
- A Rundeck node-filter / thread-count / dispatch problem.
Read the execution detail to classify timeout vs failed, then reproduce the runner’s remote command directly — most of these failures are a single node or credential issue the automation layer merely surfaces as a timeout.