AWS Step Functions Execution Debugging Prompt

Diagnose a failed or stuck Step Functions execution — retries, Catch handlers, IAM denials, timeouts, and payload-size limits — from the execution history, then fix the state machine without masking the real error.

Target user

Serverless, platform, and integration engineers running AWS Step Functions workflows

Difficulty

Intermediate

Tools

Claude, ChatGPT, Cursor

You are a senior cloud engineer who debugs Step Functions executions from the execution history outward. You read the event timeline to find the first thing that actually broke, you distinguish a transient failure that should be retried from a logic error that should not, and you never paper over a real fault with a Catch that swallows it.

I will provide:
- The execution history or the failed event sequence (state entered/exited, TaskFailed, ExecutionFailed events with Error and Cause): [EXECUTION_HISTORY]
- The state machine definition (ASL JSON) or the relevant states: [STATE_MACHINE_DEFINITION]
- The execution role and any resource policies for the integrated services: [IAM_CONFIG]
- The input payload and the integration pattern used (Request/Response, .sync, or waitForTaskToken): [INPUT_AND_PATTERN]

Do the following, numbered:
1. Locate the first real failure in the history: walk the event timeline to the earliest TaskFailed / ExecutionFailed / ExecutionAborted and read its `Error` and `Cause` fields verbatim. Distinguish the originating error from downstream noise: later states often fail only because an upstream one did.
2. Classify the error type precisely: separate AWS-thrown errors (`States.TaskFailed`, `States.Timeout`, `States.Permissions`, `States.DataLimitExceeded`, `Lambda.TooManyRequestsException`, `Lambda.ServiceException`) from application errors the task returned, because the right fix differs for each.
3. Audit the Retry configuration against the error: confirm the failing error name is actually matched by a Retry `ErrorEquals`, check `MaxAttempts`, `IntervalSeconds`, `BackoffRate`, and `MaxDelaySeconds`, and flag both missing retries on transient errors (throttling, `Lambda.ServiceException`) and harmful retries on non-transient logic errors (which only burn time and money).
4. Audit the Catch configuration: verify Catch blocks match the right `ErrorEquals`, check `ResultPath` so the error object doesn't overwrite state needed downstream, and, critically, flag any Catch that routes a genuine failure to a "success" path and silently hides it. A Catch should handle a failure, not disguise it.
5. Check ordering of Retry vs. Catch: confirm Retry is exhausted before Catch fires (Retry runs first), and that the error reaches Catch only after retries are spent. This explains a common surprise: an execution "fails fast" because no matching Retry exists.
6. Diagnose `States.Permissions` and access-denied causes: map the failing integration to the exact IAM action the execution role needs (e.g. `lambda:InvokeFunction` for the specific function ARN, `dynamodb:PutItem` on the table, `sqs:SendMessage` on the queue), and check resource-based policies on the target. Recommend the minimal action and resource ARN to add, never a wildcard.
7. Diagnose timeouts: distinguish a state-level `TimeoutSeconds` / `HeartbeatSeconds` expiry from an underlying Lambda function timeout, and from a `.sync` integration where Step Functions waits on the downstream job (ECS task, Glue job, EMR step) far longer than expected. Identify which timer fired.
8. Check the integration pattern matches the work: confirm `.sync` is used where the workflow must wait for a job to finish, and that `waitForTaskToken` tasks actually receive a `SendTaskSuccess`/`SendTaskFailure`; a missing token callback is the classic cause of an execution stuck until its timeout.
9. Check payload size and shape: flag `States.DataLimitExceeded` from exceeding the state I/O payload limit (256 KB), and recommend passing large data by S3 pointer instead of inline. Verify `InputPath`/`ResultPath`/`OutputPath`/`Parameters` filtering produces the shape the next state expects.
10. Confirm idempotency before recommending a retry-heavy fix: if states have side effects (writes, payments, sends), verify the task is safe to retry or made idempotent, so a Retry or a manual re-run doesn't double-apply an effect.

Output as:
(a) the root-cause statement quoting the actual Error/Cause from the history,
(b) the error classification (transient vs. logic vs. permission vs. timeout vs. data-limit),
(c) the specific ASL or IAM fix with the exact field or action to change, and
(d) any secondary risks (swallowing Catch, non-idempotent retry, oversized payload) ranked by severity.

Recommend reproducing in a non-production state machine where feasible. Apply least-privilege to every IAM change (name the specific action and the specific resource ARN, never a wildcard) and present all state-machine and role changes as a reviewed proposal to apply and re-run against a non-production execution first, never an automatic edit to a live workflow.

Why this prompt works

Step Functions executions fail in layers, and the trap is treating the loudest error as the cause. A downstream state usually fails only because an upstream one did, so this prompt starts where the truth lives — the execution history — and walks the event timeline to the earliest real failure, reading the Error and Cause fields verbatim. It then forces a precise classification, because the fix for a transient Lambda.ServiceException is the opposite of the fix for an application logic error, and conflating them produces either pointless retries or a workflow that gives up when it should have waited.
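The "earliest real failure, then classify" step can be sketched in a few lines of Python. This is a minimal illustration, not a definitive tool: it assumes the history has already been fetched as a list of event dicts in the shape returned by the GetExecutionHistory API (`id`, `type`, and a `...EventDetails` object carrying `error` and `cause`). The helper names `first_failure` and `classify`, the sample history, and the choice to treat any unrecognized error name as a logic error are all assumptions made for this example.

```python
# Event types that mark a failure, mapped to the key holding their details.
FAILURE_DETAIL_KEYS = {
    "TaskFailed": "taskFailedEventDetails",
    "TaskTimedOut": "taskTimedOutEventDetails",
    "LambdaFunctionFailed": "lambdaFunctionFailedEventDetails",
    "LambdaFunctionTimedOut": "lambdaFunctionTimedOutEventDetails",
    "ExecutionFailed": "executionFailedEventDetails",
    "ExecutionAborted": "executionAbortedEventDetails",
    "ExecutionTimedOut": "executionTimedOutEventDetails",
}

# Transient errors named in the prompt; extend this set for your own services.
TRANSIENT_ERRORS = {"Lambda.ServiceException", "Lambda.TooManyRequestsException"}


def first_failure(events):
    """Return the earliest failure event (lowest id) with its Error and Cause."""
    for event in sorted(events, key=lambda e: e["id"]):
        detail_key = FAILURE_DETAIL_KEYS.get(event["type"])
        if detail_key is None:
            continue  # not a failure event, keep walking the timeline
        details = event.get(detail_key, {})
        return {
            "id": event["id"],
            "type": event["type"],
            "error": details.get("error"),
            "cause": details.get("cause"),
        }
    return None


def classify(error):
    """Map an error name onto the prompt's five categories."""
    if error == "States.Permissions":
        return "permission"
    if error in ("States.Timeout", "States.HeartbeatTimeout"):
        return "timeout"
    if error == "States.DataLimitExceeded":
        return "data-limit"
    if error in TRANSIENT_ERRORS:
        return "transient"
    # Assumption: anything else is treated as an application/logic error
    # and should be confirmed by a human reading the Cause field.
    return "logic"


# Hypothetical history: the ExecutionFailed event is only downstream noise.
SAMPLE_HISTORY = [
    {"id": 7, "type": "ExecutionFailed",
     "executionFailedEventDetails": {"error": "Lambda.ServiceException", "cause": "..."}},
    {"id": 2, "type": "TaskStateEntered"},
    {"id": 5, "type": "TaskFailed",
     "taskFailedEventDetails": {"error": "Lambda.ServiceException", "cause": "..."}},
]

root = first_failure(SAMPLE_HISTORY)
print(root["id"], root["type"], classify(root["error"]))
```

Sorting by `id` rather than trusting list order keeps the sketch correct even when the history was requested in reverse order.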

The Retry and Catch audit is where most state-machine bugs actually hide. Engineers add retries that don’t match the error name, or that hammer a non-transient failure, or they add a Catch whose ResultPath clobbers downstream state. The most dangerous pattern of all is a Catch that routes a genuine failure to a success branch, so the execution reports success while the work silently fails — the prompt singles this out because it is both common and nearly invisible until data goes missing. Checking the Retry-then-Catch ordering closes the remaining surprise about why an execution failed fast instead of recovering.
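A rough version of this audit can be run against the `States` map of an ASL definition loaded as a Python dict. The sketch below is illustrative only: `matches` and `audit_task` are hypothetical helper names, the sample states are invented, and the matching logic models only exact names and the `States.ALL` wildcard (which, per the AWS documentation, does not catch `States.DataLimitExceeded` or `States.Runtime`). Other wildcard behavior, such as `States.TaskFailed`, is deliberately not modeled, and a Catch is only flagged as "swallowing" when it jumps directly to a `Succeed` state.

```python
# Errors that States.ALL does not match, per the AWS error-handling docs.
NOT_MATCHED_BY_ALL = {"States.DataLimitExceeded", "States.Runtime"}


def matches(error_equals, error_name):
    """True if an ErrorEquals list would match the given error name."""
    for pattern in error_equals:
        if pattern == error_name:
            return True
        if pattern == "States.ALL" and error_name not in NOT_MATCHED_BY_ALL:
            return True
    return False


def audit_task(name, states, error_name):
    """Return human-readable findings for one state and one observed error."""
    state = states[name]
    findings = []

    # Retry runs before Catch, so a missing match means the error goes
    # straight to Catch (or fails the execution) with no retry at all.
    retriers = state.get("Retry", [])
    if not any(matches(r.get("ErrorEquals", []), error_name) for r in retriers):
        findings.append(f"{name}: no Retry matches {error_name}")

    for catcher in state.get("Catch", []):
        if not matches(catcher.get("ErrorEquals", []), error_name):
            continue
        target = catcher.get("Next")
        if "ResultPath" not in catcher:
            # Without ResultPath the error output replaces the whole input.
            findings.append(f"{name}: Catch -> {target} has no ResultPath")
        if states.get(target, {}).get("Type") == "Succeed":
            findings.append(f"{name}: Catch -> {target} reports success on failure")
    return findings


# Hypothetical definition fragment used only for illustration.
SAMPLE_STATES = {
    "ChargeCard": {
        "Type": "Task",
        "Resource": "arn:aws:states:::lambda:invoke",
        "Retry": [{"ErrorEquals": ["Lambda.TooManyRequestsException"], "MaxAttempts": 3}],
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "Done"}],
        "Next": "Done",
    },
    "Done": {"Type": "Succeed"},
}

for finding in audit_task("ChargeCard", SAMPLE_STATES, "Lambda.ServiceException"):
    print(finding)
```

Treat the findings as prompts for review, not verdicts: a Catch that leads to a compensation step and then succeeds can be perfectly legitimate.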

The remaining steps cover the failures unique to Step Functions’ integration model — States.Permissions denials traced to a missing IAM action on a specific ARN, the difference between a state timeout and an underlying job timeout, stuck waitForTaskToken tasks with no callback, and the 256 KB payload limit that forces large data through S3 pointers. By insisting every IAM fix name a specific action and resource rather than a wildcard, by checking idempotency before recommending retries, and by presenting state-machine and role changes as reviewed proposals tested on a non-production execution first, the prompt keeps a human in control while still driving straight to the real fault.
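The payload-limit check lends itself to a quick estimate before a state ever runs. The following is a minimal sketch, assuming the payload is JSON-serializable and that its compact UTF-8 encoding is a fair approximation of what Step Functions measures. The function names, the 90% headroom default, and the `s3Bucket`/`s3Key` pointer shape are hypothetical choices for this example; uploading the object to S3 is left out.

```python
import json

# The 256 KB state input/output limit named in the prompt, in bytes.
MAX_STATE_PAYLOAD_BYTES = 256 * 1024


def payload_size_bytes(payload):
    """Approximate size of a payload as compact, UTF-8 encoded JSON."""
    encoded = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def needs_s3_pointer(payload, headroom=0.9):
    """True if the payload is at or near the limit and should move to S3.

    headroom is an assumed safety margin: states often add fields to their
    input, so flag payloads before they reach the hard limit.
    """
    return payload_size_bytes(payload) > MAX_STATE_PAYLOAD_BYTES * headroom


def as_pointer(bucket, key):
    """Hypothetical pointer shape to pass between states instead of the data."""
    return {"s3Bucket": bucket, "s3Key": key}


small = {"orderId": "123"}
large = {"blob": "x" * 300_000}

print(payload_size_bytes(small), needs_s3_pointer(small))
print(payload_size_bytes(large), needs_s3_pointer(large))
```

Downstream tasks then read the object from S3 using the pointer, so only the small reference travels through the state machine.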
