AWS Error Guide: '503 SlowDown' and ServiceUnavailable S3 Request-Rate Failures
Fix S3 503 SlowDown and ServiceUnavailable errors: diagnose per-prefix request-rate limits, hot key partitions, missing retries, list storms, and bad key design.
Overview
503 SlowDown (and the related 503 ServiceUnavailable) is S3 telling you to back off: your request rate against a key prefix is climbing faster than the partition behind it can scale. S3 documents a baseline of at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per partitioned prefix and scales request capacity automatically, but that scaling is gradual. A sudden burst against one prefix, or a workload concentrated on a single “hot” key range, outpaces the partition and gets throttled with a retryable 503.
You see it from the CLI or an SDK:
An error occurred (SlowDown) when calling the PutObject operation (reached max retries: 4): Please reduce your request rate.
Or the transient infrastructure variant:
An error occurred (503) when calling the GetObject operation: Service Unavailable
It occurs on PutObject, GetObject, DeleteObject, ListObjectsV2, and multipart operations — most often during bulk ingest, large parallel copies/migrations, or analytics jobs that read thousands of objects under one prefix.
Symptoms
- Bulk upload/download jobs intermittently fail with SlowDown/503 and slow down under load.
- The same operation succeeds at low concurrency but fails when parallelized.
- CloudWatch S3 request metrics show 5xxErrors rising with request count.
- Errors cluster on one prefix while other prefixes are fine.
aws s3 cp ./batch/ s3://data-lake/ingest/2026-06-23/ --recursive
upload failed: ./batch/f8123.json to s3://data-lake/ingest/2026-06-23/f8123.json An error occurred (SlowDown) when calling the PutObject operation: Please reduce your request rate.
aws cloudwatch get-metric-statistics --namespace AWS/S3 --metric-name 5xxErrors \
--dimensions Name=BucketName,Value=data-lake Name=FilterId,Value=EntireBucket \
--start-time "$(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 300 --statistics Sum \
--query 'sort_by(Datapoints, &Timestamp)[].Sum' --output text
0.0 0.0 142.0 318.0
Common Root Causes
1. Burst against a single prefix
All requests target one prefix (e.g. a date folder) faster than S3 can scale that partition. The 5xx error count rises in lockstep with request rate on that prefix.
aws cloudwatch get-metric-statistics --namespace AWS/S3 --metric-name AllRequests \
--dimensions Name=BucketName,Value=data-lake Name=FilterId,Value=EntireBucket \
--start-time "$(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 300 --statistics Sum \
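--query 'sort_by(Datapoints, &Timestamp)[].Sum' --output text
1200.0 1180.0 9800.0 14200.0
Request volume jumping from ~1,200 to ~14,200 per 5-minute period lines up with the 5xx spike. (CloudWatch does not return datapoints in time order, hence the sort_by.) The EntireBucket filter reports bucket-wide totals, so to pin the surge on one prefix, create a request-metrics filter scoped to that prefix and query its FilterId. If one prefix carries the surge, spread the load across prefixes.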
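To put a period sum in perspective, it can help to convert it to an average per-second rate. The following is a minimal sketch using the sample AllRequests values from this section. The function name `average_rps` is hypothetical, and the 300-second default simply matches the `--period 300` used in the command.

```python
def average_rps(period_sum, period_seconds=300):
    """Average requests per second implied by one CloudWatch Sum datapoint."""
    return period_sum / period_seconds


# Sample AllRequests sums from this section (5-minute periods).
sums = [1200.0, 1180.0, 9800.0, 14200.0]
rates = [round(average_rps(s), 1) for s in sums]
print(rates)  # [4.0, 3.9, 32.7, 47.3]
```

An average of a few dozen requests per second is far below S3's documented per-prefix baseline, which suggests that throttling at this volume comes from short bursts inside the period or from a prefix that has not scaled yet. A 5-minute sum averages those bursts away, so a modest average does not rule out rate-driven 503s.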
2. Hot key / sequential-prefix design
Keys that share a long common leading prefix and vary only at the end (timestamps, sequential IDs) concentrate writes on one partition. Because S3 partitions by key prefix, it has no leading characters on which to split the load.
aws s3api list-objects-v2 --bucket data-lake --prefix ingest/2026-06-23/ \
--query 'length(Contents)' --output json
48211
Tens of thousands of objects under one sequential date prefix is a classic hot-prefix pattern. Add a high-entropy prefix segment (e.g. a hash) to distribute.
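One way to add such a segment is to derive it from the object's own identifier, so readers can recompute it later. The sketch below is an illustration only: the function name `hashed_key`, the choice of SHA-256, and the `sensor-<n>` IDs are assumptions. The key layout follows the `ingest/<date>/<2-char-hash>/<sensor-id>.json` scheme used in the example analysis later in this guide.

```python
import hashlib


def hashed_key(date, sensor_id, width=2):
    """Build ingest/<date>/<hash>/<sensor-id>.json with a short hex hash segment."""
    digest = hashlib.sha256(sensor_id.encode("utf-8")).hexdigest()
    return f"ingest/{date}/{digest[:width]}/{sensor_id}.json"


# 10,000 hypothetical sensor IDs fan out over at most 16**2 = 256 sub-prefixes.
keys = [hashed_key("2026-06-23", f"sensor-{i}") for i in range(10_000)]
sub_prefixes = {k.rsplit("/", 1)[0] for k in keys}
print(len(sub_prefixes))
```

Because the segment is a pure function of the sensor ID, any reader that knows the date and ID can rebuild the full key without listing the prefix.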
3. Missing or weak client retries
S3 503s are explicitly retryable with backoff. A client that does not retry (or retries with no backoff) surfaces every transient throttle as a hard failure.
aws configure get retry_mode; aws configure get max_attempts
legacy
3
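legacy is the older retry handler and covers fewer error conditions; switch to standard or adaptive (standard plus client-side rate limiting) and raise max_attempts. An empty result means the setting is unset and the CLI default applies.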
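The AWS CLI and SDKs implement backoff for you once the retry mode is configured. For a custom client, or just to see what "retry with backoff" means in practice, here is a minimal sketch of exponential backoff with full jitter. `ThrottledError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the base and cap values are illustrative assumptions, not the SDK's actual parameters.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a retryable 503 (SlowDown / ServiceUnavailable)."""


def backoff_delay(attempt, base=0.1, cap=20.0, rng=random.random):
    """Full-jitter delay for a 0-based attempt: rng() * min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=8, sleep=time.sleep, rng=random.random):
    """Call operation(); on ThrottledError, wait with jittered backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            sleep(backoff_delay(attempt, rng=rng))
```

The jitter matters as much as the exponent: without it, many workers throttled at the same moment retry at the same moment and hit the prefix with a second synchronized burst.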
4. Excessive concurrency saturating the prefix
A high parallelism setting (many threads/workers) drives the per-prefix rate past what scaling can absorb. More concurrency on one prefix does not raise the limit — it trips it sooner.
aws configure get s3.max_concurrent_requests
40
40 concurrent requests all hitting one prefix can overwhelm a fresh partition; lower it or fan out across prefixes.
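In application code, the same idea can be expressed as a cap on in-flight requests per prefix rather than per machine. The sketch below is one possible shape, under the assumption that "prefix" means everything up to the last `/` in the key. `prefix_of`, `run_with_prefix_cap`, and the `work` callable (which would wrap the actual upload or download) are hypothetical.

```python
import threading
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def prefix_of(key):
    """Return the key up to and including its last '/', or '' if there is none."""
    head, sep, _ = key.rpartition("/")
    return head + sep


def run_with_prefix_cap(keys, work, per_prefix=4, total_workers=16):
    """Run work(key) for every key with at most per_prefix calls in flight per prefix."""
    semaphores = defaultdict(lambda: threading.BoundedSemaphore(per_prefix))
    guard = threading.Lock()

    def guarded(key):
        with guard:  # create each prefix's semaphore exactly once
            sem = semaphores[prefix_of(key)]
        with sem:
            return work(key)

    with ThreadPoolExecutor(max_workers=total_workers) as pool:
        return list(pool.map(guarded, keys))  # results come back in input order
```

With this shape, total throughput grows by adding prefixes, not by adding threads against the same one. Workers waiting on a busy prefix's semaphore still occupy pool slots, so a production version would probably queue work per prefix instead.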
5. List-heavy workloads on a large prefix
Frequent ListObjectsV2 over a prefix with millions of objects is expensive and contributes to the request rate, compounding throttling during reads/writes.
aws cloudwatch get-metric-statistics --namespace AWS/S3 --metric-name ListRequests \
--dimensions Name=BucketName,Value=data-lake Name=FilterId,Value=EntireBucket \
--start-time "$(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 300 --statistics Sum \
--query 'sort_by(Datapoints, &Timestamp)[].Sum' --output text
60.0 58.0 4200.0 5100.0
Thousands of LIST calls indicate a job enumerating a huge prefix repeatedly — use an inventory/manifest instead.
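S3 Inventory is the managed way to get such a manifest. Where a job only needs to stop re-listing within a run or between runs, a local cached manifest can serve the same purpose. The sketch below is a minimal illustration: `load_or_build_manifest` is a hypothetical helper, and `list_keys` stands for whatever callable performs the one paginated ListObjectsV2 enumeration.

```python
import json
import os


def load_or_build_manifest(path, list_keys):
    """Return the cached key list at path, calling list_keys() only if it is missing."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    keys = list(list_keys())  # the one expensive enumeration
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(keys, fh)
    os.replace(tmp, path)  # atomic rename so readers never see a partial file
    return keys
```

This sketch never refreshes the manifest; a real job would need some invalidation rule (for example, one manifest per ingest date) so that newly written objects are not missed.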
6. A transient S3 ServiceUnavailable (infrastructure)
Occasionally the 503 is ServiceUnavailable from a brief internal hiccup, not your rate. It is rare, short-lived, and resolved purely by retry.
aws s3api head-object --bucket data-lake --key ingest/2026-06-23/f8123.json 2>&1
An error occurred (503) when calling the HeadObject operation: Service Unavailable
If request rates are modest and the error vanishes on retry, treat it as transient — robust retries handle it.
Diagnostic Workflow
Step 1: Confirm SlowDown vs. ServiceUnavailable and the prefix
aws s3 cp <SOURCE> s3://<BUCKET>/<PREFIX>/ --recursive 2>&1 | grep -oE '(SlowDown|Service Unavailable)'
SlowDown: Please reduce your request rate is a rate problem; Service Unavailable may be transient. Note the prefix in the failing keys.
Step 2: Correlate request rate with 5xx in CloudWatch
aws cloudwatch get-metric-statistics --namespace AWS/S3 --metric-name 5xxErrors \
--dimensions Name=BucketName,Value=<BUCKET> Name=FilterId,Value=EntireBucket \
--start-time "$(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 300 --statistics Sum \
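--query 'sort_by(Datapoints, &Timestamp)[].Sum' --output text
(Requires S3 request metrics enabled.) Run the same query with --metric-name AllRequests and compare period by period; 5xx climbing with AllRequests confirms rate-driven throttling.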
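Comparing the two series is easier as a per-period error share. The sketch below uses the sample 5xxErrors and AllRequests sums shown earlier in this guide and assumes both lists cover the same periods in the same order. The function name `error_ratios` is hypothetical.

```python
def error_ratios(errors, requests):
    """Share of requests that returned 5xx in each period (0.0 when there was no traffic)."""
    return [round(e / r, 4) if r else 0.0 for e, r in zip(errors, requests)]


# Sample sums from the earlier outputs, one value per 5-minute period.
five_xx = [0.0, 0.0, 142.0, 318.0]
all_requests = [1200.0, 1180.0, 9800.0, 14200.0]
print(error_ratios(five_xx, all_requests))  # [0.0, 0.0, 0.0145, 0.0224]
```

A share that is zero at low volume and grows as volume grows points at rate-driven throttling. A share that stays flat or appears at low volume is more consistent with a transient ServiceUnavailable.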
Step 3: Check object distribution under the hot prefix
aws s3api list-objects-v2 --bucket <BUCKET> --prefix <HOT_PREFIX>/ \
--query 'length(Contents)' --output json
A huge count under one sequential prefix signals a hot-partition design problem.
Step 4: Inspect client retry and concurrency settings
aws configure get retry_mode; aws configure get max_attempts
aws configure get s3.max_concurrent_requests
legacy retries and high concurrency on one prefix are the controllable contributors.
Step 5: Apply backoff / spread, then re-run
export AWS_RETRY_MODE=adaptive AWS_MAX_ATTEMPTS=8
aws configure set s3.max_concurrent_requests 10
aws s3 cp <SOURCE> s3://<BUCKET>/<PREFIX>/ --recursive
Adaptive retries plus reduced concurrency (and, longer term, more prefixes) clear the throttling.
Example Root Cause Analysis
A nightly ingest job writing sensor data to s3://data-lake/ingest/<date>/ began failing with SlowDown as data volume grew. All writes for a day landed under one date prefix.
CloudWatch showed 5xx tracking request rate, and the prefix held a huge object count:
aws s3api list-objects-v2 --bucket data-lake --prefix ingest/2026-06-23/ \
--query 'length(Contents)' --output json
612400
Over 600k objects written into one sequential date prefix with 40-way concurrency — the partition could not scale fast enough for the burst. Retry mode was also legacy:
aws configure get retry_mode
legacy
Fix (two parts): immediately, enable adaptive retries and cut concurrency so the job completes:
export AWS_RETRY_MODE=adaptive AWS_MAX_ATTEMPTS=8
aws configure set s3.max_concurrent_requests 12
And durably, change the key scheme to inject a high-entropy segment so writes spread across partitions:
ingest/2026-06-23/<2-char-hash>/<sensor-id>.json
After the key change, the same volume wrote without 503s because S3 split the load across many prefixes.
Prevention Best Practices
- Design keys to spread load: avoid pure sequential/timestamp prefixes for high-write workloads; inject a high-entropy segment (hash) so S3 can partition across prefixes.
- Always retry 503s with exponential backoff and jitter: use the SDK’s adaptive retry mode (AWS_RETRY_MODE=adaptive) rather than failing on the first throttle.
- Tune concurrency to the prefix, not the machine; more parallel requests against one prefix trip the limit sooner, they do not raise it.
- Replace repeated ListObjectsV2 over huge prefixes with S3 Inventory or a stored manifest to cut request volume.
- Enable S3 request metrics, with a prefix-scoped filter where you need per-prefix detail, so you can correlate 5xxErrors with AllRequests and see which prefix is hot.
Quick Command Reference
# Confirm SlowDown vs. ServiceUnavailable
aws s3 cp <SOURCE> s3://<BUCKET>/<PREFIX>/ --recursive 2>&1 | grep -oE '(SlowDown|Service Unavailable)'
# Correlate 5xx with request rate (needs request metrics)
aws cloudwatch get-metric-statistics --namespace AWS/S3 --metric-name 5xxErrors \
--dimensions Name=BucketName,Value=<BUCKET> Name=FilterId,Value=EntireBucket \
--start-time "$(date -u -d '30 min ago' +%Y-%m-%dT%H:%M:%SZ)" \
--end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" --period 300 --statistics Sum --query 'sort_by(Datapoints, &Timestamp)[].Sum' --output text
# Object count under the hot prefix
aws s3api list-objects-v2 --bucket <BUCKET> --prefix <HOT_PREFIX>/ --query 'length(Contents)' --output json
# Client retry and concurrency settings
aws configure get retry_mode; aws configure get max_attempts
aws configure get s3.max_concurrent_requests
# Re-run with backoff and lower concurrency
AWS_RETRY_MODE=adaptive AWS_MAX_ATTEMPTS=8 aws s3 cp <SOURCE> s3://<BUCKET>/<PREFIX>/ --recursive
Conclusion
503 SlowDown / ServiceUnavailable means your request rate against a prefix is outpacing S3’s per-partition scaling. The usual root causes:
- A burst against a single prefix faster than it can scale.
- A hot-key / sequential-prefix design concentrating load on one partition.
- Missing or weak client retries (503 is retryable).
- Excessive concurrency saturating one prefix.
- List-heavy workloads inflating the request rate.
- A genuinely transient ServiceUnavailable resolved by retry.
Confirm it is rate-driven, add adaptive backoff, reduce concurrency, and spread keys across prefixes — durable fixes come from key design, not just retrying harder.