AWS with AI · By James Joyner IV · 9 min read

AWS Error Guide: 'Health checks failed' Unhealthy ELB Target Group Targets

Fix the AWS ELB 'Health checks failed' error: diagnose unhealthy target group targets, security groups, health check paths, and ECS deregistration timeouts.

  • #aws
  • #troubleshooting
  • #errors
  • #elb

Exact Error Message

service my-service (port 8080) is unhealthy in target-group my-tg due to (reason Health checks failed)

In the target group health view the targets show:

Target: i-0abcd1234  Port: 8080  Health status: unhealthy  Reason: Target.FailedHealthChecks  Description: Health checks failed

ECS deployments stall or roll back, with tasks stopped for failing ELB health checks against the same target group.

What the Error Means

An Elastic Load Balancer (ALB or NLB) normally routes traffic only to healthy targets. Each target group periodically probes its registered targets: for an ALB, an HTTP/HTTPS request to a configured path, expecting a status code listed in the Matcher; for an NLB, a TCP connection or an HTTP/HTTPS probe. The load balancer counts consecutive successes and failures against HealthyThresholdCount and UnhealthyThresholdCount, so a target does not flip on a single bad response. Health checks failed (Target.FailedHealthChecks) means the probe ran but did not get the expected response within the timeout for enough consecutive attempts to cross the unhealthy threshold, so the load balancer marks the target unhealthy and stops routing to it. If every target in the group is unhealthy at once, the load balancer fails open and sends requests to all of them anyway, so clients see whatever those targets return, often 502 or 504 errors. Once the failing targets are deregistered (as ECS does when it stops them) and the group is left empty, the listener returns 503 Service Unavailable.
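To make the threshold arithmetic concrete, here is a minimal sketch. It assumes a target that fails (or passes) every probe and approximates the time before its state flips as interval × threshold. The function name and the sample values are illustrative, not read from a real target group.

```shell
# Rough estimate of how long a state change takes, in seconds.
# Approximation only: it ignores the probe timeout and where inside an
# interval the failure (or recovery) actually begins.
seconds_to_flip() {
  local interval="$1" threshold="$2"
  echo $(( interval * threshold ))
}

# Illustrative values (assumptions, not defaults from any specific group):
# 30-second interval, UnhealthyThresholdCount=2, HealthyThresholdCount=5.
echo "marked unhealthy after roughly $(seconds_to_flip 30 2)s"
echo "marked healthy again after roughly $(seconds_to_flip 30 5)s"
```

With these numbers a broken target keeps receiving traffic for about a minute, and a repaired one waits about two and a half minutes before it serves again.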

This is a connectivity-or-response problem between the load balancer and the target, not an authentication problem. The probe originates from the load balancer’s own network interfaces inside your VPC, so security groups, subnet routing, and the app’s readiness all matter, and the fix is almost always in networking or app config rather than permissions.

Common Causes

  • Security group blocks the health check. The target’s security group does not allow the health check port from the load balancer’s security group. This is the most common cause and produces a Target.Timeout reason because the probe cannot open a connection.
  • Wrong health check path or port. The check hits / but the app serves health at /healthz, or the check port differs from the traffic port. A bad path returns the app’s 404 page, which the Matcher treats as a failure.
  • App not listening yet. The container or instance is slow to boot and fails checks before it is ready, and the startup grace period is too short, so the target never accumulates enough successes before deploy logic gives up.
  • Unexpected status code. The app returns 301/302/403 where the check expects 200, often from an HTTP-to-HTTPS redirect or auth middleware intercepting the probe.
  • Health endpoint depends on a failing dependency. A deep health check queries a database or downstream and fails when that dependency is degraded, taking otherwise-fine targets out.
  • Wrong subnet/AZ wiring or NACL. Network ACLs, missing routes, or a target in an unreachable subnet block the probe even when the security group is correct.
  • NLB target type mismatch. An instance-vs-IP mismatch, missing cross-zone load balancing, or client IP preservation can leave targets unreachable.
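Several of the causes above come down to the status code falling outside the Matcher. The sketch below is a local approximation of that comparison, assuming the Matcher uses the ALB-style forms "200", "200,301", or "200-299". The function name is hypothetical and the code is not how the load balancer itself is implemented.

```shell
# Return 0 when CODE satisfies MATCHER, 1 otherwise.
# MATCHER may be a single code, a comma-separated list, or a range.
matches_http_code() {
  local code="$1" matcher="$2" part lo hi
  for part in $(printf '%s' "$matcher" | tr ',' ' '); do
    case "$part" in
      *-*)
        lo="${part%-*}"
        hi="${part#*-}"
        if [ "$code" -ge "$lo" ] && [ "$code" -le "$hi" ]; then
          return 0
        fi
        ;;
      *)
        if [ "$code" -eq "$part" ]; then
          return 0
        fi
        ;;
    esac
  done
  return 1
}

# A redirect (301) or a missing path (404) fails a "200-299" matcher.
for code in 200 301 404; do
  if matches_http_code "$code" "200-299"; then
    echo "$code: pass"
  else
    echo "$code: fail"
  fi
done
```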

How to Reproduce the Error

Register a target whose security group does not allow the health check port, or point the check at a path the app does not serve:

# Inspect the health check config that will fail (read-only)
aws elbv2 describe-target-groups --names my-tg \
  --query 'TargetGroups[0].[HealthCheckPath,HealthCheckPort,Matcher.HttpCode]' \
  --output text

If the app listens on 8080 but the security group only allows 80 from the LB, every probe times out, the targets turn unhealthy after UnhealthyThresholdCount consecutive failures, and clients start seeing 5xx errors.

Diagnostic Commands

Confirm caller and region first:

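aws sts get-caller-identity   # account and principal the CLI is using
aws configure get region      # region from the active profile (AWS_REGION or AWS_DEFAULT_REGION, if set, override it)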

Show the live health state and the exact failure reason for every target:

aws elbv2 describe-target-health --target-group-arn \
  arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abcd1234 \
  --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]' \
  --output text
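When a group has many targets, it can help to reduce that output to a single number. The helper below is a sketch that assumes the tab-separated rows produced by the command above with --output text, where the second column is the state. The function name is hypothetical.

```shell
# Count rows whose state column (2nd, tab-separated) is anything but "healthy".
# Reads describe-target-health text output on stdin.
count_not_healthy() {
  awk -F'\t' '$2 != "healthy" { n++ } END { print n + 0 }'
}

# Usage (pipe the describe-target-health command shown above into it):
#   aws elbv2 describe-target-health ... --output text | count_not_healthy
```

A result of 0 means every registered target is currently passing its checks.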

Review the health check parameters that must match the app:

aws elbv2 describe-target-groups --names my-tg \
  --query 'TargetGroups[0].[HealthCheckProtocol,HealthCheckPort,HealthCheckPath,HealthCheckIntervalSeconds,HealthyThresholdCount,UnhealthyThresholdCount,Matcher.HttpCode]' \
  --output json

Check the security groups on the target and the load balancer:

aws elbv2 describe-load-balancers --names my-alb \
  --query 'LoadBalancers[0].SecurityGroups' --output text
aws ec2 describe-security-groups --group-ids sg-0target1234 \
  --query 'SecurityGroups[0].IpPermissions' --output json
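The AWS-side views can be complemented by probing the endpoint directly. The sketch below imitates an HTTP health check with curl, assuming it runs from a host inside the VPC that is allowed to reach the target. The function name, the private IP, and the 5-second default timeout are assumptions for illustration.

```shell
# Print the HTTP status code a health-check-style request receives.
# curl does not follow redirects by default, so a 301/302 shows up as-is.
# Prints 000 when the connection fails or times out.
probe_target() {
  local host="$1" port="$2" check_path="$3" timeout="${4:-5}"
  curl -s -o /dev/null -w '%{http_code}' --max-time "$timeout" \
    "http://${host}:${port}${check_path}" || true
}

# Example (hypothetical private IP of a target):
#   probe_target 10.0.1.23 8080 /healthz
```

A 000 points at the network layer, while any real status code means the request reached the app and the path or Matcher is the thing to check.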

Step-by-Step Resolution

  1. Read the reason code from describe-target-health before changing anything — it tells you which layer to fix. Target.FailedHealthChecks means the probe ran but the response was wrong; Target.Timeout means it could not connect, pointing at network or security-group issues; Target.ResponseCodeMismatch means the target answered with a status code outside the Matcher. Treating a Timeout as a code problem, or vice versa, is the usual reason a fix fails to land.

  2. Fix connectivity (Timeout). Ensure the target’s security group allows the health check port from the load balancer’s security group by referencing the LB’s group as the source rather than a wide CIDR. Confirm NACLs allow both the inbound probe and the ephemeral return traffic, and that routes connect the LB’s subnets to the target’s.

  3. Fix the probe target (FailedHealthChecks/Mismatch). Align HealthCheckPath, HealthCheckPort, and Matcher.HttpCode with what the app serves and returns. Curl the endpoint from a host in the same subnet to see the real status code, and ensure no redirect or auth layer rewrites the request first.

  4. Give slow apps time. Increase the health check interval and UnhealthyThresholdCount so a brief startup hiccup does not evict a target, and for ECS raise healthCheckGracePeriodSeconds to cover the full cold start so new tasks are not killed before they finish booting. The deregistration delay does not help here; it only controls how long a target drains after it is removed.

  5. Make health endpoints shallow where possible so an unrelated dependency outage does not flap targets. A liveness check should confirm only that the process is up; reserve deep dependency checks for a separate readiness signal.

  6. Verify by re-running describe-target-health until targets report healthy and stay there, then confirm the listener stops returning 503.
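The steps above can be sketched as a small lookup from reason code to a candidate command. The script only prints commands and never runs them. The load balancer security group ID, the cluster name, the /healthz path, port 8080, and the 120-second grace period are placeholders to replace with your own values.

```shell
# Placeholders (assumptions): substitute your own IDs and ARN.
TARGET_SG="sg-0target1234"
LB_SG="sg-0lb5678"   # hypothetical load balancer security group
TG_ARN="arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/abcd1234"

# Print (do not execute) a candidate fix for a describe-target-health reason.
suggest_fix() {
  case "$1" in
    Target.Timeout)
      # Step 2: allow the health check port from the LB security group.
      echo "aws ec2 authorize-security-group-ingress --group-id $TARGET_SG --protocol tcp --port 8080 --source-group $LB_SG"
      ;;
    Target.ResponseCodeMismatch|Target.FailedHealthChecks)
      # Step 3: align path, port, and matcher with what the app serves.
      echo "aws elbv2 modify-target-group --target-group-arn $TG_ARN --health-check-path /healthz --health-check-port traffic-port --matcher HttpCode=200"
      ;;
    *)
      echo "no suggestion for reason: $1"
      ;;
  esac
}

suggest_fix Target.Timeout

# Step 4 for ECS: widen the grace period (cluster name and value are examples).
echo "aws ecs update-service --cluster my-cluster --service my-service --health-check-grace-period-seconds 120"
```

Printing rather than executing keeps the sketch safe to run anywhere; review each command against your own setup before applying it.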

Prevention and Best Practices

  • Define a dedicated, lightweight health endpoint (for example /healthz) that returns 200 quickly and does not chain to fragile dependencies, so probe success reflects the target’s own state.
  • Open the health check port from the load balancer’s security group, not from broad CIDRs. Referencing the LB security group as the source keeps the rule correct as its private IPs change.
  • Set the health check Matcher, path, and port deliberately rather than relying on defaults; the default / and 200 rarely match a real application and silently cause failures.
  • Tune thresholds and grace periods to your app’s real startup time to avoid premature unhealthy flips during deploys; measure cold-start time under load rather than guessing.
  • Alarm on UnHealthyHostCount and HealthyHostCount in CloudWatch so you catch flapping or a shrinking pool before it becomes a 503.
  • Keep deregistration delay long enough to drain in-flight requests but short enough to deploy promptly; the default is 300 seconds, and 30 seconds suits many stateless services.
  • Exempt the health check path from any HTTP-to-HTTPS redirect and authentication middleware so the probe reaches your handler cleanly.

Related Errors

  • 503 Service Unavailable: what clients see when the target group has no registered targets left, for example after ECS has stopped and deregistered every failing task.
  • Target.Timeout: the probe could not connect at all, almost always a security group or network issue.
  • Target.ResponseCodeMismatch: the app responded but with a status code outside the Matcher.
  • ECS failed container health checks: the container-level health check (separate from the ELB check) failed.
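The CloudWatch alarm recommended under Prevention and Best Practices could look like the sketch below for an ALB. The function name, the alarm name, the load balancer dimension suffix, and the period and evaluation settings are assumptions. No alarm action is attached, so add --alarm-actions with your own notification target if you want to be paged.

```shell
# Create an alarm that fires when any target in the group is unhealthy.
# $1: TargetGroup dimension, e.g. targetgroup/my-tg/abcd1234
# $2: LoadBalancer dimension, e.g. app/my-alb/0123456789abcdef (hypothetical ID)
create_unhealthy_host_alarm() {
  local tg_dim="$1" lb_dim="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "my-tg-unhealthy-hosts" \
    --namespace AWS/ApplicationELB \
    --metric-name UnHealthyHostCount \
    --dimensions "Name=TargetGroup,Value=${tg_dim}" "Name=LoadBalancer,Value=${lb_dim}" \
    --statistic Maximum \
    --period 60 \
    --evaluation-periods 3 \
    --threshold 0 \
    --comparison-operator GreaterThanThreshold
}

# Example call (not executed here):
#   create_unhealthy_host_alarm targetgroup/my-tg/abcd1234 app/my-alb/0123456789abcdef
```

Requiring three consecutive one-minute periods is one way to ignore a single failed probe during a deploy; tune it to how quickly you need to hear about a shrinking pool.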

Frequently Asked Questions

Why is my listener returning 503 but the app works locally? Working locally only proves the app runs; it says nothing about whether the load balancer can reach and probe it inside the VPC. A 503 means the target group has no registered targets to route to, typically because every target failed its health checks and was deregistered. Check describe-target-health for the reason.

Timeout vs. FailedHealthChecks — what’s the difference? Timeout means the probe could not connect at all, almost always a security group, NACL, or routing problem. FailedHealthChecks and ResponseCodeMismatch mean it connected but the path was wrong or the status code fell outside the Matcher. The two demand fixes in different layers, so always read the reason first.

My ECS task keeps deregistering during deploys — why? The grace period is likely too short, so the ELB marks brand-new tasks unhealthy before the app finishes booting and ECS kills them in a loop. Raise healthCheckGracePeriodSeconds to cover the full cold start and increase the unhealthy threshold.

Should the health check port differ from the traffic port? It can — a sidecar or admin port is a valid pattern — but each port must be allowed by the security group from the load balancer and served by the app. Mismatches between the check, the listener, and the security group rule are a frequent, easily overlooked cause.

Does an NLB do HTTP health checks? NLBs support TCP, HTTP, and HTTPS checks; choose HTTP/HTTPS to validate the application response rather than just an open port, and ensure the target type matches your wiring.
