AWS Error Guide: 'Waiter ... failed: Max attempts exceeded'

Exact Error Message

Waiter StackCreateComplete failed: Max attempts exceeded

Other common variants from SDKs and tools:

botocore.exceptions.WaiterError: Waiter ClusterActive failed: Max attempts exceeded
ResourceNotReady: Waiter encountered a terminal failure state for resource
Waiter InstanceRunning failed: Waiter encountered a terminal failure state

What the Error Means

The AWS CLI and SDKs include waiters: helpers that poll a resource on a fixed interval until it reaches a desired state (a stack in CREATE_COMPLETE, an EKS cluster ACTIVE, an EC2 instance running). Each waiter ships with two default parameters: a delay (seconds between polls) and max_attempts. Most SDKs let you override them; the CLI wait commands use the defaults. Multiply them for the approximate time budget: the default StackCreateComplete waiter polls every 30 seconds up to 120 times (roughly an hour), while many EC2 waiters give up in minutes. Max attempts exceeded means the waiter consumed that budget and the resource still had not reached the target state.
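
The budget arithmetic can be sketched in a few lines. The function name is illustrative; the 30-second/120-attempt pair is the StackCreateComplete default quoted above, while the 15/40 pair is a hypothetical fast waiter used only for contrast.

```python
def waiter_budget_seconds(delay: int, max_attempts: int) -> int:
    """Approximate upper bound on how long a waiter polls before giving up."""
    return delay * max_attempts


# StackCreateComplete defaults quoted above: 30 s between polls, 120 attempts.
stack_budget = waiter_budget_seconds(delay=30, max_attempts=120)
print(stack_budget / 60, "minutes")

# Hypothetical fast waiter (assumed values, not a documented default).
fast_budget = waiter_budget_seconds(delay=15, max_attempts=40)
print(fast_budget / 60, "minutes")
```

Comparing this figure with the realistic provisioning time of the resource is a quick way to tell whether a timeout was ever plausible.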

ResourceNotReady / terminal failure state is a different outcome: the resource entered a state the waiter treats as a definitive failure (such as CREATE_FAILED or ROLLBACK_COMPLETE). The waiter does not wait out the budget here — it fails fast the moment it observes that state, because more polling could not change the result.
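
One way automation could tell the two outcomes apart is by the message text. The sketch below is a minimal, assumption-laden example: the function name and the returned labels are made up for illustration, and it matches only the two phrases shown in the messages above.

```python
def classify_waiter_failure(message: str) -> str:
    """Map a waiter error message to 'timeout', 'terminal', or 'unknown'."""
    text = message.lower()
    if "max attempts exceeded" in text:
        return "timeout"   # budget used up; the resource may still be in progress
    if "terminal failure state" in text:
        return "terminal"  # resource reached a state the waiter treats as final
    return "unknown"


print(classify_waiter_failure("Waiter ClusterActive failed: Max attempts exceeded"))
print(classify_waiter_failure(
    "Waiter InstanceRunning failed: Waiter encountered a terminal failure state"
))
```

With botocore, the input would typically be `str(err)` for a caught `botocore.exceptions.WaiterError`. Either label is only a hint about where to look next; the resource's own status remains the source of truth.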

The crucial point is that the waiter error is a symptom, never the root cause. You can only find the real cause by inspecting the resource’s own status and the events it emits.

Common Causes

The resource genuinely failed. A CloudFormation stack hit CREATE_FAILED/ROLLBACK_COMPLETE due to an underlying error. Here the waiter is doing its job correctly, and increasing the timeout only delays the inevitable.
Provisioning is slower than the waiter’s timeout. Large stacks, RDS instances, or EKS clusters take longer than the default poll budget. A multi-AZ RDS instance can run past a waiter tuned for fast resources — the resource is healthy, but the waiter ran out of attempts.
A dependency is stuck. A nested resource (NAT gateway, ENI, custom resource Lambda) hangs, blocking the parent. CloudFormation will not mark the parent complete until every child settles, so one wedged dependency holds the stack open.
Insufficient capacity or quota. The resource cannot be created because of a service quota or capacity limit. EC2 capacity shortfalls, EIP limits, or vCPU quotas surface deep inside the stack and leave the parent stuck, and the waiter never names the limit that was hit.
Networking misconfiguration. EKS nodes cannot reach the control plane, or a subnet lacks routes, so the cluster never reports healthy. Missing NAT routes, restrictive security groups, or an absent VPC endpoint keep nodes from registering.
Custom resource never signals. A CloudFormation custom resource Lambda fails to send a response, so the stack waits until timeout. If the Lambda errors before its cfn-response call, CloudFormation receives no signal and blocks for the full custom-resource window.

How to Reproduce the Error

Wait on a resource that will not reach the desired state within the budget. For example, watch a stack that is destined to roll back:

aws cloudformation wait stack-create-complete --stack-name demo-stack

Waiter StackCreateComplete failed: Waiter encountered a terminal failure state: For expression "Stacks[].StackStatus" we matched expected path: "ROLLBACK_COMPLETE"

Or wait on a cluster that takes longer than the default attempts:

aws eks wait cluster-active --name demo-cluster

Waiter ClusterActive failed: Max attempts exceeded

Diagnostic Commands

Confirm the caller and region:

aws sts get-caller-identity

For CloudFormation, read the current status and the failure events. Events are returned newest first, so the earliest *_FAILED event, which names the real cause, is the last row of the table:

aws cloudformation describe-stacks --stack-name demo-stack \
  --query 'Stacks[0].StackStatus' --output text

aws cloudformation describe-stack-events --stack-name demo-stack \
  --query 'StackEvents[?contains(ResourceStatus, `FAILED`)].[LogicalResourceId,ResourceStatus,ResourceStatusReason]' \
  --output table
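
Picking out the earliest failure can also be scripted. The sketch below assumes events shaped like the `StackEvents` list returned by describe-stack-events (each with `ResourceStatus` and a `Timestamp`); the function name and the sample events are invented for illustration.

```python
from datetime import datetime, timezone


def first_failed_event(events):
    """Return the earliest event whose ResourceStatus contains FAILED, or None."""
    failed = [e for e in events if "FAILED" in e.get("ResourceStatus", "")]
    return min(failed, key=lambda e: e["Timestamp"], default=None)


# Made-up events, listed newest first as the API returns them.
sample = [
    {"LogicalResourceId": "demo-stack", "ResourceStatus": "ROLLBACK_IN_PROGRESS",
     "Timestamp": datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc)},
    {"LogicalResourceId": "AppServer", "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "Resource creation cancelled",
     "Timestamp": datetime(2024, 1, 1, 12, 4, tzinfo=timezone.utc)},
    {"LogicalResourceId": "NatGateway", "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "hypothetical root-cause message",
     "Timestamp": datetime(2024, 1, 1, 12, 3, tzinfo=timezone.utc)},
]

root = first_failed_event(sample)
print(root["LogicalResourceId"], "-", root["ResourceStatusReason"])
```

Sorting by timestamp rather than trusting list order keeps the helper correct even if the events were collected from several pages.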

For EKS, check the cluster’s status and any health issues:

aws eks describe-cluster --name demo-cluster \
  --query 'cluster.[status,health.issues]' --output json

For EC2, read the instance state and reason:

aws ec2 describe-instances --instance-ids i-0abcd1234 \
  --query 'Reservations[].Instances[].[State.Name,StateReason.Message]' --output text

Step-by-Step Resolution

Treat the waiter error as a pointer, not the cause. Immediately inspect the resource’s real status with the describe-* command for its service. Its current state tells you whether it is mid-flight, failed, or rolling back — which determines every step that follows.
For CloudFormation, read describe-stack-events and find the earliest *_FAILED event — its ResourceStatusReason is the actual error (e.g., a quota, an IAM permission, or a custom resource timeout). Always work from the first failure: later events are usually cascading rollback noise, while the first *_FAILED names the resource that broke.
For EKS, check health.issues and node/subnet wiring; a cluster stuck in CREATING usually points to networking or IAM role problems. Confirm the cluster IAM role has AmazonEKSClusterPolicy and that subnets have routes and security groups that let the control plane and nodes talk.
If it is purely a timing issue, the resource will eventually become ready — increase the waiter budget so automation does not give up early. Most SDK waiters accept custom max_attempts/delay, and the CLI lets you re-run wait. Raise the budget only after confirming the resource is genuinely progressing, not sitting in a failure state.
Resolve the dependency. Clear the stuck nested resource (capacity, ENI, custom resource signal) so the parent can complete — terminating unused instances to free capacity, releasing a leaked ENI, or redeploying the custom resource Lambda. Once the child settles, the parent advances on its next poll.
Verify by re-running the describe-* command until the resource reports the desired state, then re-run the waiter. Confirming the steady state directly ensures your automation will pass on the next attempt.
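
The "confirm first, then raise the budget" step might look like the sketch below. It assumes `cfn` behaves like `boto3.client("cloudformation")`, whose waiters accept a `WaiterConfig` with `Delay` and `MaxAttempts`; the function name and the 240-attempt default are illustrative choices, not recommendations.

```python
def _stack_status(cfn, stack_name):
    """Read the stack's current status (same field as Stacks[0].StackStatus)."""
    return cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]


def wait_if_progressing(cfn, stack_name, delay=30, max_attempts=240):
    """Re-run the create waiter with a larger budget, but only while creating.

    Returns the stack status: unchanged if the stack had already settled,
    otherwise the status observed after the waiter returns.
    """
    status = _stack_status(cfn, stack_name)
    if status != "CREATE_IN_PROGRESS":
        return status  # failed, rolling back, or done: more polling cannot help
    waiter = cfn.get_waiter("stack_create_complete")
    waiter.wait(
        StackName=stack_name,
        WaiterConfig={"Delay": delay, "MaxAttempts": max_attempts},
    )
    return _stack_status(cfn, stack_name)
```

A call such as `wait_if_progressing(boto3.client("cloudformation"), "demo-stack")` would still raise the usual waiter error if the larger budget is also exhausted, so the caller keeps its normal failure path.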

Prevention and Best Practices

Never rely on a waiter alone for diagnosis — pair every wait with status/event inspection in your automation’s failure path, so a failed wait automatically dumps describe-stack-events into your logs.
Size waiter timeouts to the resource’s realistic provisioning time (RDS, EKS, and large stacks routinely exceed defaults). Benchmark how long real workloads take and set max_attempts/delay with headroom rather than accepting SDK defaults.
Ensure CloudFormation custom resources always send a response (success or failure) so stacks never hang to timeout. Wrap the handler in a try/finally that calls cfn-response even when the function throws.
Pre-check service quotas and capacity before provisioning so resources do not stall on limits — a quick Service Quotas lookup before a large rollout is far cheaper than discovering the wall halfway through a stack.
Validate networking (subnets, routes, security groups, IAM roles) before creating clusters and nodes. Most “stuck CREATING” EKS incidents trace back to a missing route or over-restrictive security group a pre-flight check would have caught.
Emit the first *_FAILED stack event into CI logs so the real cause is visible without a manual console dig.
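
Pairing a wait with diagnostics in the failure path can be sketched generically. Both callables below are hypothetical stand-ins: `wait` would wrap the real waiter call, and `collect_diagnostics` would return lines built from the describe-* output for that service.

```python
def wait_with_diagnostics(wait, collect_diagnostics, log=print):
    """Run `wait()`; if it raises, log diagnostic lines and re-raise."""
    try:
        wait()
    except Exception:
        # With boto3 this would usually be narrowed to
        # botocore.exceptions.WaiterError.
        for line in collect_diagnostics():
            log(line)
        raise
```

Re-raising keeps the pipeline red while ensuring the first failed event (or whatever the collector returns) is already in the CI log.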

Related Errors

CREATE_FAILED / ROLLBACK_COMPLETE: the underlying CloudFormation state the waiter detected.
ResourceNotReady: the SDK exception form of a waiter terminal failure.
Rate exceeded: API throttling of the waiter's own polling calls can also surface during long waits.
InsufficientInstanceCapacity / quota errors: common root causes that leave a resource stuck.

Frequently Asked Questions

Does the waiter error tell me what went wrong? Not directly. It only says the resource did not reach the desired state in time. The waiter is just a polling loop with a budget and has no visibility into the cause. To learn what happened, inspect the resource’s own status and events with the relevant describe-* command.

Max attempts exceeded vs. terminal failure state — what’s the difference? “Max attempts exceeded” means the waiter used up its poll budget while the resource was still in progress — it may yet succeed if given more time. “Terminal failure state” means the resource entered a definitive failure (like ROLLBACK_COMPLETE). The first is a timing problem; the second is a real failure that more time will never fix.

Should I just increase the timeout? Only if the resource is genuinely still provisioning — confirm the current state with a describe-* call first. If it has already failed or is rolling back, more time will not help; fix the underlying cause and re-create it. Raising max_attempts on a failed resource just makes your pipeline fail slower.

Why does my stack hang until timeout with no error? Almost always a custom resource Lambda never sent a response back to CloudFormation, so the service waits the full custom-resource window before failing. Ensure the handler always signals success or failure, even on an exception.
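
A handler that always signals could be structured like the sketch below. It is an outline under stated assumptions: `provision` is a hypothetical function holding the real work, and `send_response` stands in for whatever reports back to CloudFormation (for example the `send` function of the cfn-response module, which takes the event, context, status, and data in that order).

```python
def handle_custom_resource(event, context, provision, send_response):
    """Run `provision(event)` and always report SUCCESS or FAILED."""
    status, data = "FAILED", {}
    try:
        data = provision(event) or {}
        status = "SUCCESS"
    except Exception as err:
        # Surface the reason instead of letting the stack wait for a timeout.
        data = {"Error": str(err)}
    finally:
        # Runs on every path, so CloudFormation always receives a signal.
        send_response(event, context, status, data)
    return status
```

Defaulting the status to FAILED means that even an unexpected exit path reports a failure rather than nothing at all.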

How do I find the real cause in CloudFormation? Read describe-stack-events and look at the earliest *_FAILED event’s reason. See the AWS guides for stack-debugging patterns.
