You are a senior ECS platform engineer. You diagnose task failures by reading the stoppedReason and container exitCode literally, then tracing them to the pull -> network setup -> start -> health-check lifecycle stage where the task actually died.

I will provide:
- `aws ecs describe-tasks` output (stoppedReason, container exitCode, lastStatus): [DESCRIBE_TASKS]
- Service events from `aws ecs describe-services`: [SERVICE_EVENTS]
- The task definition essentials (image, CPU/memory, networkMode, execution vs task role, log config): [TASK_DEF]
- The symptom (task never starts, starts then stops, fails ALB health checks, OOM): [SYMPTOM]

Do the following, numbered:
1. Classify the lifecycle stage of failure from the evidence: image pull, ENI/network attachment, secret/parameter resolution, container start, application crash, or health-check failure. Quote the exact stoppedReason or event that decides it.
2. Map the common signatures: `CannotPullContainerError` (ECR auth via execution role, or no route to ECR endpoints); `ResourceInitializationError` for secrets (execution role lacks `secretsmanager`/`ssm` or KMS, or no endpoint route); exitCode 137 (OOM-killed, raise memory or fix the leak); exitCode 1 / non-zero (app crash, read application logs); health-check failures (wrong path/port, grace period too short).
3. Distinguish execution-role problems (pull, secrets, logging; these happen before the app runs) from task-role problems (the app's own AWS API calls; these happen after it runs). State which role is implicated.
4. For health-check flaps, check the container health check vs the ALB target-group health check, the path/port, and whether `healthCheckGracePeriodSeconds` is long enough for app startup.

Output as: (a) the failing lifecycle stage with the evidence line, (b) the matched signature and root cause, (c) the minimal fix (role permission, memory, endpoint route, health-check setting), (d) the command to confirm (e.g. re-run the task and tail the awslogs stream).
Never grant the execution or task role broad permissions to clear a secrets error; scope to the exact secret ARNs. Never push a task-definition change straight to a production service; register it as a new revision, deploy that revision, and watch the rollout.
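
The least-privilege guardrail can be made concrete. The following is a minimal sketch, not part of the original prompt, of an IAM policy document scoped to exact secret ARNs. The helper name `scoped_secrets_policy` and the sample ARN are hypothetical; the actions `secretsmanager:GetSecretValue` and `kms:Decrypt` are the ones a Secrets Manager lookup typically needs, but confirm them against your own setup.

```python
import json


def scoped_secrets_policy(secret_arns, kms_key_arn=None):
    """Build an IAM policy document limited to the given secret ARNs.

    Hypothetical helper for illustration; it only builds a dict and
    does not call AWS.
    """
    if not secret_arns:
        raise ValueError("list at least one secret ARN")
    if any("*" in arn for arn in secret_arns):
        raise ValueError("wildcards defeat the point of scoping")

    statements = [{
        "Effect": "Allow",
        "Action": ["secretsmanager:GetSecretValue"],
        "Resource": sorted(secret_arns),
    }]
    # Only needed when the secret is encrypted with a customer-managed key.
    if kms_key_arn:
        statements.append({
            "Effect": "Allow",
            "Action": ["kms:Decrypt"],
            "Resource": [kms_key_arn],
        })
    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    # Made-up ARN, for illustration only.
    arn = "arn:aws:secretsmanager:us-east-1:123456789012:secret:app/db-AbCdEf"
    print(json.dumps(scoped_secrets_policy([arn]), indent=2))
```

Rejecting wildcards in the helper means a rushed fix cannot quietly widen the role.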

Why this prompt works

ECS task failures are noisy because a task passes through several stages before your code ever runs — image pull, ENI attachment, secret resolution, container start, then health checks — and a failure at any stage surfaces as a generic “task stopped.” The stoppedReason and exitCode contain the real signal, but only if you read them against the lifecycle. This prompt forces the model to name the failing stage and quote the deciding line first, which immediately rules out most causes and prevents the common mistake of debugging the application when the task died during the pull.
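
To illustrate reading the evidence against the lifecycle, here is a rough sketch of a stage classifier. The function name `classify_stage` is hypothetical, and the substrings it matches are assumptions about how stoppedReason text is typically worded; check them against your own `describe-tasks` output before relying on them.

```python
def classify_stage(stopped_reason, exit_code=None):
    """Map a stoppedReason string and container exitCode to a lifecycle stage.

    The matched substrings are assumptions about typical wording,
    not an exhaustive list of ECS error strings.
    """
    reason = (stopped_reason or "").lower()

    if "cannotpullcontainererror" in reason:
        return "image pull"
    if "resourceinitializationerror" in reason:
        if "secret" in reason or "ssm" in reason:
            return "secret/parameter resolution"
        return "ENI/network attachment or other resource initialization"
    if "cannotstartcontainererror" in reason:
        return "container start"
    if "health check" in reason:
        return "health-check failure"
    # 137 = 128 + 9 (SIGKILL); treated here as the OOM signature.
    if "outofmemory" in reason or exit_code == 137:
        return "application crash (killed, likely OOM)"
    if exit_code not in (None, 0):
        return "application crash"
    return "unclassified"


if __name__ == "__main__":
    print(classify_stage("CannotPullContainerError: pull failed"))
    print(classify_stage("Essential container in task exited", 137))
```

The checks run in lifecycle order, so a pre-start failure is named before the exit code is ever considered.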

The execution-role versus task-role distinction is where engineers often lose hours. Both are IAM roles attached to a task, but the execution role acts before your container runs (pulling the image, fetching secrets, writing logs) while the task role acts after (the app’s own AWS API calls). A ResourceInitializationError points to the execution role or networking, not to an application bug, yet teams routinely chase it in their code. Making the model state which role is implicated keeps the fix on the right side of the boundary.
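
One way to picture the boundary is the sketch below. It assumes a task entry from `describe-tasks` carries a `startedAt` field only if the task actually reached a running state; that is an assumption to verify against your own output. The function name `implicated_side` is hypothetical.

```python
def implicated_side(task):
    """Return which side of the start boundary a stopped task died on.

    `task` is one entry from the `tasks` list in describe-tasks output.
    Assumption: `startedAt` is absent when the task never started.
    """
    evidence = task.get("stoppedReason", "<no stoppedReason>")
    if "startedAt" not in task:
        side = "execution role or networking (before the app ran)"
    else:
        side = "task role, application, or health check (after the app ran)"
    return side, evidence


if __name__ == "__main__":
    # Trimmed, made-up task entries for illustration.
    never_ran = {"stoppedReason": "ResourceInitializationError: unable to pull secrets"}
    ran = {"startedAt": "2024-01-01T00:00:00Z",
           "stoppedReason": "Essential container in task exited"}
    print(implicated_side(never_ran))
    print(implicated_side(ran))
```

Returning the evidence string with the verdict keeps the diagnosis tied to a quoted line instead of a guess.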

The signature catalog — CannotPullContainerError, exitCode 137 for OOM, non-zero for app crashes, health-check flaps — turns cryptic strings into root causes with specific remedies. Pairing each with a scoped fix and a confirmation command (re-run and tail the log stream) means the engineer validates the diagnosis directly rather than trusting it, and the guardrails ensure permission fixes stay least-privilege and task-definition changes roll out behind a reviewed revision.
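
The catalog and its confirmation step can be sketched as a lookup table. The root-cause and fix wording below is taken from the prompt. The dictionary keys, the function names, and the sample cluster, task, and log-group values are hypothetical. The two CLI commands (`aws ecs describe-tasks` and `aws logs tail --follow`) are assembled as strings only and are not executed.

```python
import shlex

# Signature -> (root cause, minimal fix); wording follows the prompt.
CATALOG = {
    "CannotPullContainerError": (
        "ECR auth via execution role, or no route to ECR endpoints",
        "grant the execution role ECR pull access or add the endpoint route",
    ),
    "ResourceInitializationError": (
        "execution role lacks secretsmanager/ssm or KMS access, or no endpoint route",
        "scope a permission to the exact secret ARNs or add the endpoint route",
    ),
    "exitCode 137": ("container was OOM-killed", "raise memory or fix the leak"),
    "exitCode non-zero": ("application crash", "read the application logs"),
    "health check": (
        "wrong path/port, or grace period too short",
        "correct the path/port or lengthen healthCheckGracePeriodSeconds",
    ),
}


def confirm_commands(cluster, task_id, log_group):
    """Build (not run) the commands used to confirm a diagnosis."""
    return [
        "aws ecs describe-tasks --cluster {} --tasks {}".format(
            shlex.quote(cluster), shlex.quote(task_id)),
        "aws logs tail {} --follow".format(shlex.quote(log_group)),
    ]


def remedy(signature, cluster, task_id, log_group):
    """Look up a signature and pair it with its confirmation commands."""
    cause, fix = CATALOG[signature]
    return {
        "root_cause": cause,
        "fix": fix,
        "confirm": confirm_commands(cluster, task_id, log_group),
    }


if __name__ == "__main__":
    # Made-up cluster, task ID, and log group.
    print(remedy("exitCode 137", "prod", "abc123", "/ecs/app"))
```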

ECS Fargate Task Failure Diagnosis Prompt

Related prompts

CloudWatch Logs Insights and Alarm Design Prompt
