Grafana Error Guide: CloudWatch 'Rate exceeded' — Throttling the Data Source
Fix Grafana CloudWatch 'Rate exceeded' throttling errors — reduce GetMetricData API calls, raise account API limits, tune intervals and dashboards, and add retries so panels stop failing.
Overview
Grafana’s CloudWatch data source calls the AWS GetMetricData (and ListMetrics) APIs, which are rate-limited per account per region. When too many queries fire at once — many panels, short refresh intervals, many dashboards, or many concurrent users — AWS returns a ThrottlingException: Rate exceeded and the affected panels show no data. It’s a client-side/AWS-quota problem, not a Grafana bug.
The literal errors you will see:
ThrottlingException: Rate exceeded
CloudWatch error: Throttling: Rate exceeded status code: 400
logger=tsdb.cloudwatch error="failed to execute query" err="ThrottlingException: Rate exceeded"
It shows up on dashboard load and refresh, and gets worse during incidents when everyone opens CloudWatch dashboards at once.
Symptoms
- CloudWatch panels intermittently show “Rate exceeded” / no data, then recover.
- More panels fail as dashboards get larger or refresh faster.
- Errors spike when many users load CloudWatch dashboards simultaneously.
- The AWS/Usage CallCount and ThrottleCount metrics for GetMetricData climb.
journalctl -u grafana-server --no-pager | grep -i "ThrottlingException\|Rate exceeded" | tail
logger=tsdb.cloudwatch error="ThrottlingException: Rate exceeded"
Common Root Causes
1. Too many GetMetricData calls per refresh
Each CloudWatch query in a panel turns into GetMetricData requests. Big dashboards with many panels and short refresh intervals blow through the transactions-per-second (TPS) limit.
2. Refresh interval too aggressive
Dashboards set to refresh every 5–10s multiply the call rate; a heavy CloudWatch dashboard on a fast refresh throttles quickly.
3. Many concurrent users / dashboards
During an incident, dozens of people open the same CloudWatch dashboards, and combined they exceed the account’s regional API quota.
4. High-cardinality queries / wildcards
Broad dimension wildcards or * on ListMetrics fan out into many metrics and many data-point requests.
5. Default account API limit too low
The account’s CloudWatch API request-rate quota hasn’t been raised for the monitoring workload; the ceiling is simply too low.
Diagnostic Workflow
Step 1: Confirm throttling in Grafana logs
journalctl -u grafana-server --no-pager | grep -iE "ThrottlingException|Rate exceeded" | tail -20
docker logs grafana 2>&1 | grep -iE "Throttling|Rate exceeded" | tail
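To see whether throttling lines up with an incident window, it can help to bucket the matching log lines by minute. The following is a minimal sketch, not part of Grafana: the function name throttles_per_minute is made up for illustration, and it assumes each log line starts with an ISO-8601 timestamp, as produced by journalctl -o short-iso.

```shell
# Count throttling log lines per minute.
# Reads log lines on stdin; assumes the first field is an ISO-8601
# timestamp such as 2024-05-01T10:15:02+0000 (journalctl -o short-iso).
throttles_per_minute() {
  grep -iE "ThrottlingException|Rate exceeded" \
    | awk '{ count[substr($1, 1, 16)]++ }   # first 16 chars = YYYY-MM-DDTHH:MM
           END { for (m in count) print m, count[m] }' \
    | sort
}

# Example invocation (left commented out so the snippet has no side effects):
# journalctl -u grafana-server -o short-iso --no-pager --since "1 hour ago" \
#   | throttles_per_minute
```

Each output line is a minute followed by the number of throttling messages logged in that minute, which makes bursts easy to spot.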
Step 2: Measure the API call rate from AWS side
aws cloudwatch get-metric-statistics \
--namespace AWS/Usage \
--metric-name CallCount \
--dimensions Name=Type,Value=API Name=Resource,Value=GetMetricData Name=Service,Value=CloudWatch Name=Class,Value=None \
--start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
--end-time "$(date -u +%FT%TZ)" \
--period 60 --statistics Sum --region us-east-1
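The quota is expressed in transactions per second, so divide the per-minute Sum by 60 to get an average TPS and compare that against your account’s GetMetricData quota. A one-minute average hides short bursts, so throttling can start before the average reaches the quota.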
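The comparison is simple arithmetic, sketched below under stated assumptions: the helper names avg_tps and over_quota are hypothetical, and the quota value passed in is a placeholder that should come from your own get-service-quota output rather than from this example.

```shell
# avg_tps SUM_PER_MINUTE
# Prints the average calls per second for a one-minute Sum, two decimals.
avg_tps() {
  awk -v sum="$1" 'BEGIN { printf "%.2f\n", sum / 60 }'
}

# over_quota SUM_PER_MINUTE QUOTA_TPS
# Exits 0 if the average TPS is above the quota, 1 otherwise.
over_quota() {
  awk -v sum="$1" -v quota="$2" 'BEGIN { exit !((sum / 60) > quota) }'
}

# Example: a per-minute Sum of 48200 checked against an assumed quota of 50 TPS.
avg_tps 48200                                   # prints 803.33
over_quota 48200 50 && echo "average TPS is above the quota"
```

Because this works on a per-minute average, treat a result close to the quota as a warning sign rather than proof of headroom.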
Step 3: Check the data source config
# Grafana provisioned CloudWatch data source
apiVersion: 1
datasources:
  - name: CloudWatch
    type: cloudwatch
    jsonData:
      authType: default # or keys / arn
      defaultRegion: us-east-1
Step 4: Reduce demand on dashboards
- Raise dashboard refresh intervals (e.g. 1m instead of 5s).
- Split huge dashboards; reduce panels per dashboard.
- Replace dimension wildcards with explicit dimensions.
- Increase the min interval so fewer data points are requested.
Step 5: Request an AWS quota increase
aws service-quotas get-service-quota \
--service-code monitoring \
--quota-code L-yyyyyyyy --region us-east-1
# then request-service-quota-increase for the GetMetricData transaction rate
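The two remaining pieces, finding the quota code and filing the request, might look like the sketch below. It assumes the quota name shown by Service Quotas contains the string GetMetricData. The function names are made up for illustration, and the quota code and desired value are placeholders to replace with your own. The request function only prints the command unless APPLY=1 is set, so it can be reviewed first.

```shell
REGION="us-east-1"     # assumption: same region as the earlier examples
SERVICE="monitoring"   # Service Quotas service code used above for CloudWatch

# List CloudWatch quotas whose name mentions GetMetricData, to find the code.
list_gmd_quotas() {
  aws service-quotas list-service-quotas \
    --service-code "$SERVICE" --region "$REGION" \
    --query "Quotas[?contains(QuotaName, 'GetMetricData')].[QuotaCode,QuotaName,Value]" \
    --output table
}

# request_increase QUOTA_CODE DESIRED_VALUE
# Prints the request command by default; runs it only when APPLY=1.
request_increase() {
  quota_code="$1"
  desired_value="$2"
  set -- aws service-quotas request-service-quota-increase \
    --service-code "$SERVICE" --quota-code "$quota_code" \
    --desired-value "$desired_value" --region "$REGION"
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Dry run with placeholder values (prints the command, sends nothing):
# request_increase L-yyyyyyyy 100
```

Quotas are per region, so repeat the request for every region your dashboards query.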
Example Root Cause Analysis
During an outage, the primary CloudWatch dashboard shows scattered “Rate exceeded” panels while 30 responders have it open. Grafana log:
logger=tsdb.cloudwatch error="ThrottlingException: Rate exceeded status code: 400"
The dashboard has 40 CloudWatch panels and a 10s refresh. With 30 concurrent viewers that is 40 × 30 × 6 = 7,200 panel refreshes per minute, and each refresh triggers GetMetricData work. The AWS CallCount for GetMetricData confirms the account is at its regional TPS quota:
GetMetricData CallCount (per min): 48200 # over quota during the incident
Two fixes applied together:
- Reduce demand — raise the dashboard refresh to 1m, split the 40 panels into two focused dashboards, and set a sensible min interval.
- Raise the ceiling — request a GetMetricData transaction-rate quota increase for the region.
After the refresh change the throttling stops immediately (call rate drops ~6x), and the quota increase gives headroom for future incidents. Root cause: aggressive refresh × many panels × many concurrent viewers exceeding the account’s GetMetricData rate — not a Grafana defect.
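The ~6x figure follows directly from the refresh interval. As a rough sketch, the hypothetical helper below counts panel refreshes per minute. It does not model how many GetMetricData calls each refresh produces, so read the output as a relative measure, not an API call count.

```shell
# panel_refreshes_per_minute PANELS VIEWERS REFRESH_SECONDS
# Integer arithmetic; intervals that do not divide 60 evenly are truncated.
panel_refreshes_per_minute() {
  echo $(( $1 * $2 * 60 / $3 ))
}

panel_refreshes_per_minute 40 30 10   # 7200 (before: 10s refresh)
panel_refreshes_per_minute 40 30 60   # 1200 (after: 1m refresh)
```

Going from 7200 to 1200 is the sixfold reduction seen in the incident; splitting the dashboard lowers the figure further for viewers who only open one half.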
Prevention Best Practices
- Use conservative dashboard refresh intervals (1m+) for CloudWatch; avoid 5–10s on heavy dashboards.
- Keep CloudWatch dashboards lean: fewer panels, explicit dimensions instead of wildcards, sensible min interval.
- Consider metric streams / exporting CloudWatch metrics to Prometheus for high-traffic dashboards to offload the API.
- Monitor your own GetMetricData CallCount/ThrottleCount and alert before hitting the quota.
- Request a Service Quotas increase for the CloudWatch API transaction rate ahead of growth.
Quick Command Reference
# Throttling in Grafana logs
journalctl -u grafana-server | grep -iE "ThrottlingException|Rate exceeded" | tail
# GetMetricData call rate (AWS/Usage)
aws cloudwatch get-metric-statistics --namespace AWS/Usage \
--metric-name CallCount \
--dimensions Name=Type,Value=API Name=Resource,Value=GetMetricData \
Name=Service,Value=CloudWatch Name=Class,Value=None \
--start-time "$(date -u -d '1 hour ago' +%FT%TZ)" \
--end-time "$(date -u +%FT%TZ)" --period 60 --statistics Sum --region us-east-1
# Current quota
aws service-quotas get-service-quota --service-code monitoring \
--quota-code <code> --region us-east-1
Conclusion
CloudWatch “Rate exceeded” in Grafana is AWS throttling the GetMetricData API because the request rate is over quota. Typical root causes:
- Too many GetMetricData calls per refresh (big dashboards).
- Refresh intervals set too aggressively.
- Many concurrent users/dashboards during incidents.
- High-cardinality queries and dimension wildcards.
- A default account API quota that’s too low for the workload.
Confirm throttling in the logs, measure your GetMetricData call rate against the quota, then attack both sides — reduce demand (refresh, panels, wildcards) and raise the AWS quota.