AWS Error Guide: 'CannotPullContainerError' ECS Task Image Pull Failures
Fix the ECS CannotPullContainerError: diagnose ECR auth, missing images and tags, no route to the registry, private subnet endpoints, and Docker Hub rate limits.
Overview
CannotPullContainerError is the stopped-task reason ECS reports when the container runtime on the host could not download the image referenced by a task definition. The agent asked the runtime to pull the image, the pull failed, and the task moved to STOPPED before any container started. The failure happens at image-pull time, so application logs are empty — the diagnosis lives in the task’s stoppedReason.
You see it in the stopped task or service events:
CannotPullContainerError: pull image manifest has been retried 5 time(s): failed to resolve ref "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v42": failed to do request: Head "https://123456789012.dkr.ecr.us-east-1.amazonaws.com/v2/api/manifests/v42": dial tcp 10.0.3.17:443: i/o timeout
It can occur on any task launch: a service deployment, a scheduled task, or a run-task call. The cause is almost always one of three things: failed authentication, an image or tag that does not exist, or no network path from the task’s subnet to the registry.
Symptoms
- Tasks cycle PROVISIONING to STOPPED with stoppedReason containing CannotPullContainerError.
- A service deployment never reaches steady state; events repeat the pull failure.
- The same image runs locally but fails in a private subnet.
- stoppedReason includes i/o timeout, manifest unknown, 403 Forbidden, or toomanyrequests.
aws ecs describe-tasks --cluster prod --tasks <TASK_ARN> \
--query 'tasks[0].[lastStatus,stoppedReason]' --output text
STOPPED CannotPullContainerError: failed to resolve ref "123456789012.dkr.ecr.us-east-1.amazonaws.com/api:v42": ... dial tcp 10.0.3.17:443: i/o timeout
aws ecs describe-services --cluster prod --services api \
--query 'services[0].events[0:3].message' --output text
(service api) failed to launch a task with (error CannotPullContainerError: ...).
Common Root Causes
1. The execution role cannot authenticate to ECR
ECS uses the task execution role to pull from ECR. If it lacks ecr:GetAuthorizationToken, ecr:BatchGetImage, or ecr:GetDownloadUrlForLayer, the pull returns 403.
aws ecs describe-task-definition --task-definition api \
--query 'taskDefinition.executionRoleArn' --output text
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/ecsTaskExecutionRole \
--action-names ecr:GetAuthorizationToken ecr:BatchGetImage \
--query 'EvaluationResults[].[EvalActionName,EvalDecision]' --output text
ecr:GetAuthorizationToken implicitDeny
ecr:BatchGetImage allowed
implicitDeny on GetAuthorizationToken blocks every ECR pull — attach AmazonECSTaskExecutionRolePolicy or the missing action.
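If attaching the managed policy is not an option, the ECR pull actions can be granted with an inline policy. The following is a minimal sketch, not a definitive policy: ecr_pull_policy is a hypothetical helper name, the role and policy names in the commented commands are placeholders, and Resource is left as "*" for simplicity (ecr:GetAuthorizationToken does not support resource-level scoping, but the other three actions could be narrowed to a repository ARN).

```shell
# Hypothetical helper: print an inline IAM policy containing the four ECR
# actions a pull needs. Names and scoping here are illustrative assumptions.
ecr_pull_policy() {
  cat <<'JSON'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken",
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage"
      ],
      "Resource": "*"
    }
  ]
}
JSON
}

# Example (not run here): attach it to the execution role as an inline policy.
# aws iam put-role-policy --role-name ecsTaskExecutionRole \
#   --policy-name ecr-pull --policy-document "$(ecr_pull_policy)"
#
# Or attach the AWS managed policy instead:
# aws iam attach-role-policy --role-name ecsTaskExecutionRole \
#   --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
```

Re-running the simulate-principal-policy command afterwards should show allowed for each action.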
2. The image tag does not exist in the repository
The task definition references a tag that was never pushed (typo, failed CI push, or the tag was deleted). The registry returns manifest unknown.
aws ecr describe-images --repository-name api \
--query 'reverse(sort_by(imageDetails,&imagePushedAt))[0:5].imageTags' --output text
v41	latest
v40
v39
The task wants v42 but the newest tag is v41 — the image was never pushed. Fix the tag or push the build.
3. No network route from a private subnet to the registry
A task in a private subnet with no NAT gateway and no ECR VPC endpoints cannot reach ECR or S3 (ECR layers live in S3). The pull times out (i/o timeout).
aws ec2 describe-vpc-endpoints \
--filters Name=vpc-id,Values=vpc-0abc1234 \
--query 'VpcEndpoints[].ServiceName' --output text
com.amazonaws.us-east-1.ssm
Only the SSM endpoint exists — no ecr.api, ecr.dkr, or s3 endpoint and (presumably) no NAT. Add the three endpoints or a NAT route.
4. Wrong region or account in the image URI
The image URI points to ECR in a different region or account than the task is running in, so the host resolves a registry it cannot reach or is not authorized for.
aws ecs describe-task-definition --task-definition api \
--query 'taskDefinition.containerDefinitions[0].image' --output text
123456789012.dkr.ecr.us-west-2.amazonaws.com/api:v42
The task runs in us-east-1 but the URI references us-west-2 — use the same-region repo or set up cross-region/replication access.
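The region and account comparison can be scripted by pulling both values out of the image URI. This is a sketch under stated assumptions: ecr_uri_account and ecr_uri_region are hypothetical helper names, and the pattern only covers private ECR URIs of the form ACCOUNT.dkr.ecr.REGION.amazonaws.com/REPO (not the amazonaws.com.cn partition or public registries).

```shell
# Hypothetical helpers: extract the account ID and region from a private ECR
# image URI. Both return non-zero for anything that is not such a URI.
ecr_uri_account() {
  _host="${1%%/*}"                       # registry host: text before the first "/"
  case "$_host" in
    *.dkr.ecr.*.amazonaws.com) echo "${_host%%.*}" ;;
    *) return 1 ;;
  esac
}

ecr_uri_region() {
  _host="${1%%/*}"
  case "$_host" in
    *.dkr.ecr.*.amazonaws.com) ;;
    *) return 1 ;;
  esac
  _rest="${_host#*.dkr.ecr.}"            # e.g. "us-west-2.amazonaws.com"
  echo "${_rest%.amazonaws.com}"
}

# Example: compare against the region the task runs in.
# [ "$(ecr_uri_region "$IMAGE_URI")" = "us-east-1" ] || echo "region mismatch"
```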
5. Docker Hub or public registry rate limit
Pulling a public image (nginx, redis) anonymously hits Docker Hub’s pull-rate limit, returning toomanyrequests.
aws ecs describe-tasks --cluster prod --tasks <TASK_ARN> \
--query 'tasks[0].stoppedReason' --output text
CannotPullContainerError: ... toomanyrequests: You have reached your pull rate limit.
Use ECR Public, mirror the image into ECR, or supply registry credentials via repositoryCredentials.
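Mirroring a public image into ECR could look roughly like the following. Treat it as a sketch: ecr_mirror_uri is a hypothetical helper, the account, region, and nginx:1.27 tag are example values, and the commented commands assume Docker is installed locally and that an ECR repository named nginx already exists.

```shell
# Hypothetical helper: build the ECR URI that a mirrored image would live at.
# Arguments: ACCOUNT_ID REGION IMAGE[:TAG]
ecr_mirror_uri() {
  printf '%s.dkr.ecr.%s.amazonaws.com/%s\n' "$1" "$2" "$3"
}

# Example (not run here):
# target="$(ecr_mirror_uri 123456789012 us-east-1 nginx:1.27)"
# aws ecr get-login-password --region us-east-1 |
#   docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# docker pull nginx:1.27
# docker tag nginx:1.27 "$target"
# docker push "$target"
```

The task definition would then reference the mirrored URI instead of the Docker Hub name.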
6. Security group or NACL blocks egress 443
The task’s security group or the subnet NACL blocks outbound HTTPS, so even with a NAT/endpoint the TLS connection to the registry fails.
aws ec2 describe-security-groups --group-ids sg-0task1234 \
--query 'SecurityGroups[0].IpPermissionsEgress[].[IpProtocol,FromPort,ToPort]' --output text
tcp 443 443
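This output is healthy: outbound TCP 443 is allowed. If no rule covers port 443 and there is no all-traffic rule, or the subnet NACL denies it, the pull cannot establish a connection. NACLs are stateless, so they must also allow the return traffic on ephemeral ports.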
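The port check on that output can be automated. The sketch below assumes rows of IpProtocol, FromPort, ToPort as printed by the describe-security-groups query above, with all-traffic rules appearing as protocol -1 (an assumption about the text output: their port columns are expected to print as None). allows_https_egress is a hypothetical helper; it looks only at protocol and ports, not at destination CIDRs or NACLs.

```shell
# Hypothetical helper: read "IpProtocol FromPort ToPort" rows on stdin and
# succeed if any rule permits outbound TCP 443.
allows_https_egress() {
  awk '
    $1 == "-1"                            { ok = 1 }  # all protocols, all ports
    $1 == "tcp" && $2 <= 443 && $3 >= 443 { ok = 1 }  # TCP range covering 443
    END { exit (ok ? 0 : 1) }
  '
}

# Example:
# aws ec2 describe-security-groups --group-ids sg-0task1234 \
#   --query 'SecurityGroups[0].IpPermissionsEgress[].[IpProtocol,FromPort,ToPort]' \
#   --output text | allows_https_egress || echo "no egress rule covers 443"
```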
Diagnostic Workflow
Step 1: Read the precise stoppedReason
aws ecs describe-tasks --cluster <CLUSTER> --tasks <TASK_ARN> \
--query 'tasks[0].[stopCode,stoppedReason]' --output text
The tail of the message classifies it: i/o timeout = network, 403 Forbidden = auth, manifest unknown = missing tag, toomanyrequests = rate limit.
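That mapping is simple enough to script. A minimal sketch, assuming the four substrings listed above are the ones present in the message; classify_pull_failure is a hypothetical helper, not part of the AWS CLI.

```shell
# Hypothetical helper: map a CannotPullContainerError stoppedReason to a
# failure category using the substrings named in Step 1.
classify_pull_failure() {
  case "$1" in
    *toomanyrequests*)     echo "rate-limit" ;;
    *"manifest unknown"*)  echo "missing-tag" ;;
    *"403 Forbidden"*)     echo "auth" ;;
    *"i/o timeout"*)       echo "network" ;;
    *)                     echo "unknown" ;;
  esac
}

# Example:
# classify_pull_failure "$(aws ecs describe-tasks --cluster <CLUSTER> --tasks <TASK_ARN> \
#   --query 'tasks[0].stoppedReason' --output text)"
```

An unknown result means the message should be read in full rather than classified by its tail.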
Step 2: Verify the image and tag exist
aws ecr describe-images --repository-name <REPO> \
--image-ids imageTag=<TAG> --query 'imageDetails[0].imageTags' --output text
An ImageNotFoundException here means the tag is the problem, not the network.
Step 3: Confirm the execution role’s ECR permissions
aws iam simulate-principal-policy \
--policy-source-arn <EXECUTION_ROLE_ARN> \
--action-names ecr:GetAuthorizationToken ecr:BatchGetImage ecr:GetDownloadUrlForLayer
Any implicitDeny/explicitDeny explains a 403.
Step 4: Check the network path from the task’s subnet
aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=<VPC_ID> \
--query 'VpcEndpoints[].ServiceName' --output text
aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=<SUBNET> \
--query 'RouteTables[].Routes[?contains(to_string(@),`nat`)]' --output json
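A private subnet needs either a NAT route or the ecr.api, ecr.dkr, and s3 (gateway) endpoints. The association.subnet-id filter matches only explicit associations, so an empty result from the second command can also mean the subnet uses the VPC’s main route table; check that table before concluding there is no NAT route.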
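The endpoint list can be checked mechanically. In this sketch, missing_pull_endpoints is a hypothetical helper that reads the whitespace-separated service names from the first command and prints whichever of the three pull endpoints are absent. It does not distinguish gateway from interface endpoints, and a missing endpoint is only a problem when there is no NAT route either.

```shell
# Hypothetical helper: given a region as $1 and endpoint service names on
# stdin, print each required pull endpoint (ecr.api, ecr.dkr, s3) not present.
missing_pull_endpoints() {
  _region="$1"
  _have=" $(tr '\t\n' '  ') "            # flatten tabs/newlines to single-space separators
  for _svc in ecr.api ecr.dkr s3; do
    case "$_have" in
      *" com.amazonaws.${_region}.${_svc} "*) ;;   # present
      *) echo "$_svc" ;;                           # missing
    esac
  done
}

# Example:
# aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=<VPC_ID> \
#   --query 'VpcEndpoints[].ServiceName' --output text | missing_pull_endpoints us-east-1
```

Empty output means all three endpoints exist in the VPC.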
Step 5: Re-run the task and watch the new reason
aws ecs run-task --cluster <CLUSTER> --task-definition <TD> \
--launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets=[<SUBNET>],securityGroups=[<SG>]}"
aws ecs describe-tasks --cluster <CLUSTER> --tasks <NEW_TASK_ARN> \
--query 'tasks[0].stoppedReason' --output text
A changed (or empty) reason confirms the fix or points to the next layer.
Example Root Cause Analysis
A Fargate service, payments, deployed a new task definition and immediately began failing with CannotPullContainerError. The stopped reason ended in i/o timeout, pointing at the network rather than auth.
The service had recently been moved to private subnets. Checking endpoints:
aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=vpc-0abc1234 \
--query 'VpcEndpoints[].ServiceName' --output text
com.amazonaws.us-east-1.ecr.api com.amazonaws.us-east-1.ecr.dkr
The ecr.api and ecr.dkr interface endpoints existed, but the S3 gateway endpoint was missing. ECR stores image layers in S3, so the manifest HEAD resolved but the layer download timed out.
aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=vpc-0abc1234 \
--query 'VpcEndpoints[?contains(ServiceName,`s3`)]' --output text
Empty — no S3 endpoint and no NAT. Fix: create the S3 gateway endpoint and associate it with the private subnets’ route tables.
aws ec2 create-vpc-endpoint --vpc-id vpc-0abc1234 \
--service-name com.amazonaws.us-east-1.s3 --vpc-endpoint-type Gateway \
--route-table-ids rtb-0priv5678
The next deployment pulled the image and reached steady state.
Prevention Best Practices
- Always attach AmazonECSTaskExecutionRolePolicy (or its ECR actions) to the task execution role. It is the role ECS uses to pull, distinct from the task role.
- For tasks in private subnets, provision all three pull endpoints: ecr.api, ecr.dkr, and the s3 gateway endpoint; the S3 endpoint is the one most often forgotten.
- Pin task definitions to immutable image digests or build-stamped tags, not latest, so a missing/overwritten tag never silently breaks a deploy.
- Mirror public base images into ECR (or use ECR Public) to avoid Docker Hub toomanyrequests rate limits in CI and at scale.
- Keep the image URI’s region and account aligned with where tasks run, or configure cross-region replication.
Quick Command Reference
# Precise stop reason
aws ecs describe-tasks --cluster <CLUSTER> --tasks <TASK_ARN> \
--query 'tasks[0].[stopCode,stoppedReason]' --output text
# Does the tag exist?
aws ecr describe-images --repository-name <REPO> --image-ids imageTag=<TAG> \
--query 'imageDetails[0].imageTags' --output text
# Execution role ECR permissions
aws iam simulate-principal-policy --policy-source-arn <EXEC_ROLE> \
--action-names ecr:GetAuthorizationToken ecr:BatchGetImage ecr:GetDownloadUrlForLayer
# VPC endpoints present (need ecr.api, ecr.dkr, s3)
aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=<VPC_ID> \
--query 'VpcEndpoints[].ServiceName' --output text
# Image URI in the task definition
aws ecs describe-task-definition --task-definition <TD> \
--query 'taskDefinition.containerDefinitions[0].image' --output text
Conclusion
CannotPullContainerError means the container runtime could not download the image before the task could start. The usual root causes:
- The task execution role lacks ecr:GetAuthorizationToken / ecr:BatchGetImage.
- The referenced image tag does not exist in the repository.
- No network route (NAT or ECR + S3 endpoints) from a private subnet to the registry.
- A wrong region or account in the image URI.
- A Docker Hub / public registry rate limit (toomanyrequests).
- A security group or NACL blocking outbound 443.
Read the tail of stoppedReason first. It classifies the failure as auth, missing tag, network, or rate limit. Then fix that one layer and re-run the task.