Diagnosing ECS and Fargate Task Failures With AI

The first time a Fargate task started crash-looping in our production cluster, I instinctively reached for ssh and remembered, too late, that there’s no host to SSH into. Fargate hands you a stopped task, a one-line stoppedReason, and an exit code, and expects you to reconstruct the whole story from that. The reasons are famously terse — Essential container in task exited, CannotPullContainerError, ResourceInitializationError — and each one maps to a completely different root cause. Decoding them used to mean a lot of tab-switching between the ECS console, CloudWatch, and IAM. Now I let AI do the decoding, and it’s faster because the model has seen every one of these stopped reasons a thousand times.

The boundary stays the same as always: AI interprets the symptoms and points me at the likely cause; I confirm against the live task data and own the fix. The model can’t see your cluster — it reasons over what you give it.

Pull the full stopped-task record

The single highest-value command is describing the stopped task, because it carries the stop reason, the exit code, and the per-container detail all at once.

# Recently stopped tasks (ECS only lists them for a short window after they stop, in no guaranteed order)
aws ecs list-tasks --cluster prod --desired-status STOPPED \
  --query 'taskArns' --output text

aws ecs describe-tasks --cluster prod --tasks <task-arn> \
  --query 'tasks[0].{stopped:stoppedReason,containers:containers[].{name:name,reason:reason,exit:exitCode,status:lastStatus}}'

That JSON is the whole crime scene. Combined with the task definition and the application logs, it’s everything AI needs.
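
To make the "give the model everything" step repeatable, here is a minimal sketch that bundles the stopped-task record with the task definition the task actually ran. The function name collect_stopped_task_bundle is hypothetical, and the choice to include stopCode and to print environment variable names without their values is my assumption about what is safe and useful to paste into a prompt.

```shell
# Sketch: gather the stopped-task record and its task definition into one
# text bundle for pasting into a prompt. The function name is hypothetical.
collect_stopped_task_bundle() {
  local cluster="$1" task_arn="$2" task_def_arn

  # Look up which task definition revision the stopped task actually ran.
  task_def_arn=$(aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].taskDefinitionArn' --output text)

  echo "== stopped task =="
  aws ecs describe-tasks --cluster "$cluster" --tasks "$task_arn" \
    --query 'tasks[0].{stopCode:stopCode,stopped:stoppedReason,containers:containers[].{name:name,reason:reason,exit:exitCode,status:lastStatus}}' \
    --output json

  echo "== task definition =="
  # Assumption: environment values may be sensitive, so only their names are kept.
  aws ecs describe-task-definition --task-definition "$task_def_arn" \
    --query 'taskDefinition.{cpu:cpu,memory:memory,exec:executionRoleArn,containers:containerDefinitions[].{name:name,image:image,memory:memory,envNames:environment[].name,secrets:secrets}}' \
    --output json
}
```

Called as collect_stopped_task_bundle prod <task-arn> > bundle.txt, it produces one file to paste alongside the prompt. Skim it before sharing, since image names and secret ARNs still appear in it.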

Let AI map the stopped reason to a cause

The reason text is a category, not an answer, and the model knows the decoder ring well. Here’s the prompt I use:

This is a stopped ECS Fargate task. Below is the describe-tasks output (stoppedReason + per-container exitCode and reason) and the task definition JSON. For the stoppedReason and exit code, tell me the most likely root cause and the specific thing to check. Distinguish clearly between: image-pull failures (registry auth / ECR perms / missing tag), resource init failures (ENI / secrets / EFS mount), application crashes (exit code from the app), and OOM kills (exitCode 137). Be specific about which AWS resource or task-def field to inspect.

For one crash loop it returned exactly the right fork:

stoppedReason “Essential container in task exited”, exitCode 137. 137 = 128 + 9 (SIGKILL), and on Fargate that almost always means the container exceeded its memory and was OOM-killed by the platform, not your app exiting cleanly. Check the task definition: containerDefinitions[].memory is 512 but your app’s logs show it climbing past that before death. Either raise the container memory limit or the task-level memory, and check for a leak. This is NOT an application bug per se — it’s a memory limit too low for the workload.

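Exit code 137 routinely gets misdiagnosed as an app crash. The model catching the SIGKILL math and pointing at the memory limit is the kind of decode that saves an hour. One caveat worth keeping in mind: 137 also shows up when a container ignores SIGTERM during a normal stop and gets force-killed once the stop timeout expires, so it is a strong OOM hint rather than proof. On a real OOM kill, the per-container reason field typically says so with an OutOfMemoryError message.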
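
The signal arithmetic is simple enough to script. The sketch below follows the usual shell convention that exit codes above 128 mean "killed by signal (code - 128)". The function name decode_exit_code and the wording of the hints are mine, not anything ECS emits.

```shell
# Sketch: turn a container exit code into a rough first-guess category.
# Codes above 128 conventionally mean "terminated by signal (code - 128)".
decode_exit_code() {
  local code="$1" sig
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "signal 9 (SIGKILL): suspect OOM kill or a forced stop" ;;
      15) echo "signal 15 (SIGTERM): suspect a normal stop or deploy" ;;
      *)  echo "signal $sig" ;;
    esac
  else
    echo "application exit code $code"
  fi
}
```

Running decode_exit_code 137 prints the SIGKILL line, and decode_exit_code 143 prints the SIGTERM line. It only narrows the search; the memory limit and the logs still decide which it was.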

The pull and init failures are usually IAM or config

CannotPullContainerError and ResourceInitializationError are the other two big buckets, and they’re almost never the app’s fault. Gather the relevant config and let AI cross-check:

aws ecs describe-task-definition --task-definition checkout:42 \
  --query 'taskDefinition.{exec:executionRoleArn,task:taskRoleArn,containers:containerDefinitions[].{image:image,secrets:secrets,logs:logConfiguration}}'

The stopped reason is ResourceInitializationError: unable to pull secrets or registry auth. The task def references a secret via secretsmanager ARN and pulls from a private ECR repo. The executionRoleArn is set. Check whether the execution role has BOTH secretsmanager:GetSecretValue on that secret ARN and the ECR pull actions (ecr:GetAuthorizationToken, ecr:BatchGetImage, ecr:GetDownloadUrlForLayer). Also confirm the task is in a subnet with a route to ECR/Secrets Manager (NAT gateway or VPC endpoints), since init pulls happen over the network.

That last sentence is the one people miss: a private-subnet Fargate task with no NAT and no VPC endpoint can’t reach ECR, and the symptom looks identical to an IAM problem. The model surfacing both possibilities is what lets me check the right thing first. I confirm the IAM half with a quick policy simulation against the execution role (aws iam simulate-principal-policy), and the network half with the routing checks from the VPC connectivity guide.
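
For the IAM half, a minimal sketch of the simulation might look like the following. The function name check_execution_role and its three arguments are hypothetical. It assumes your caller is allowed to run iam:SimulatePrincipalPolicy, and it only evaluates the role's identity policies, so a restrictive ECR repository policy or a customer-managed KMS key on the secret can still block the task even when every line says allowed.

```shell
# Sketch: simulate the execution role against the actions a Fargate task
# needs during init. Function and argument names are hypothetical.
check_execution_role() {
  local role_arn="$1" secret_arn="$2" repo_arn="$3"
  local q='EvaluationResults[].[EvalActionName,EvalDecision]'

  # Secret read, scoped to the one secret the task definition references.
  aws iam simulate-principal-policy --policy-source-arn "$role_arn" \
    --action-names secretsmanager:GetSecretValue \
    --resource-arns "$secret_arn" --query "$q" --output text

  # GetAuthorizationToken is not resource-scoped, so it is simulated
  # against the default resource "*".
  aws iam simulate-principal-policy --policy-source-arn "$role_arn" \
    --action-names ecr:GetAuthorizationToken --query "$q" --output text

  # Layer and manifest reads, scoped to the repository.
  aws iam simulate-principal-policy --policy-source-arn "$role_arn" \
    --action-names ecr:BatchGetImage ecr:GetDownloadUrlForLayer ecr:BatchCheckLayerAvailability \
    --resource-arns "$repo_arn" --query "$q" --output text
}
```

Each output line pairs an action with allowed, implicitDeny, or explicitDeny. If every action comes back allowed, the network path becomes the prime suspect.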

Verify against the actual logs before fixing

Whatever the model concludes, I prove it from the application logs, because the stopped reason can mislead. The container’s CloudWatch log stream has the last words before death:

# --follow is a boolean flag; leaving it off gives a one-shot read
aws logs tail /ecs/checkout --since 30m \
  | grep -iE "error|fatal|out of memory|panic" | tail -20

If the model said OOM and the logs show heap-allocation failures right before exit, two independent signals agree, and that is my green light. If they disagree (the model says OOM but the logs show a clean SIGTERM shutdown from a deploy), then the “crash” was actually a normal task replacement and there’s no bug at all. That cross-check is non-negotiable.
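
The agree-or-disagree check can be roughed out as a filter over the log tail. This is a heuristic sketch only: the function name classify_last_words and the match patterns are assumptions that depend entirely on what your application logs, and a hard OOM kill may leave no final log line at all.

```shell
# Sketch: classify the last log lines before a task stopped.
# The patterns are guesses; tune them to your application's log format.
classify_last_words() {
  local lines
  lines=$(cat)   # read the piped log lines from stdin
  if printf '%s\n' "$lines" | grep -qiE 'out of memory|outofmemory|cannot allocate memory'; then
    echo "memory-pressure"
  elif printf '%s\n' "$lines" | grep -qiE 'sigterm|shutting down|graceful'; then
    echo "graceful-shutdown"
  else
    echo "inconclusive"
  fi
}
```

Piping aws logs tail /ecs/checkout --since 30m into classify_last_words gives a one-word verdict to compare against the model's. An "inconclusive" result means read the logs by hand, not that nothing happened.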

Fix narrowly and confirm the loop stops

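For the OOM case, the fix is a one-field bump, not a panic over-provision. On Fargate the container limit can’t exceed the task-level memory, so raise that too (to a value valid for the task’s CPU size) if the new number no longer fits: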

{
  "name": "checkout",
  "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/checkout:1.4.2",
  "memory": 1024,
  "essential": true
}

Register the new revision, update the service, and watch the running count stabilize. AI is useful one more time here: ask it to review the new task def for other limits that are now too tight relative to the workload you just described — it’ll often catch the CPU/memory ratio being off before it bites you next.
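
A minimal sketch of the register, update, and watch sequence follows. The function name deploy_task_def is hypothetical, and it assumes the file you pass holds a complete task definition (family, task-level cpu and memory, roles, and all container definitions), not just the container fragment shown above.

```shell
# Sketch: roll out a new task definition revision and wait for the service
# to settle. Function and argument names are hypothetical.
deploy_task_def() {
  local cluster="$1" service="$2" taskdef_file="$3" new_arn

  # Register the edited task definition and capture the new revision's ARN.
  new_arn=$(aws ecs register-task-definition \
    --cli-input-json "file://$taskdef_file" \
    --query 'taskDefinition.taskDefinitionArn' --output text) || return 1

  # Point the service at the new revision; this starts a rolling deployment.
  aws ecs update-service --cluster "$cluster" --service "$service" \
    --task-definition "$new_arn" \
    --query 'service.taskDefinition' --output text || return 1

  # Blocks until the service reports stable; non-zero exit if the waiter gives up.
  aws ecs wait services-stable --cluster "$cluster" --services "$service"
}
```

Something like deploy_task_def prod checkout taskdef.json would fit the examples here, with the service name checkout being my assumption. If the waiter gives up, the crash loop is probably still running, and it is time to pull a fresh stopped-task record.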

The takeaway

Fargate’s opacity — no host, terse reasons, bare exit codes — is exactly what makes AI valuable for it: the model has the decoder ring for every stopped reason and exit code memorized, and it reasons across the task def, IAM, and networking at once. But it’s decoding symptoms, not watching your cluster. So the loop holds: pull the full stopped-task record, let AI map reason-and-code to a likely cause, confirm against the real application logs, fix the one thing, and watch the loop stop.

The IAM and networking root causes here overlap heavily with the rest of the AWS guides, and the stopped-reason decode prompts are in the prompts collection.