Building Grafana Dashboards People Actually Use

Every team I’ve joined has a folder of forty Grafana dashboards and nobody can tell me which one to open during an incident. They were built by accretion — someone added a panel, then another, until each dashboard became a wall of graphs that answers no question in particular. A good dashboard is a tool for a job. After building observability for a lot of teams, here’s how I make dashboards people actually open.

Start with the question, not the metric

The mistake is building a dashboard around what you can graph. The fix is building it around a question someone asks under pressure.

Before I add a single panel, I write the question at the top: “Is the checkout service healthy right now?” Every panel must help answer it. If a panel doesn’t move me toward an answer, it’s clutter, and clutter during an incident is actively harmful — it’s something your eye has to skip past at 3am.

The four signals that belong on every service dashboard

For any request-driven service, four panels tell you almost everything. These map to the classic “golden signals”:

Traffic — requests per second. sum(rate(http_requests_total[5m]))
Errors — error rate as a fraction. The ratio, not the raw count.
Latency — p50, p95, p99 from histograms.
Saturation — how full the resource is: CPU, memory, queue depth, connection pool.

# error rate panel
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))

# latency panel: three quantiles on one graph
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.50, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

Put these four at the top, big, in a row. That row is the answer to “is it healthy.” Everything below is for digging deeper once the top row tells you something’s wrong.
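If you generate dashboards rather than hand-edit them, the four queries can come from one small helper. This is a minimal sketch, not a canonical implementation: it assumes the metrics carry a `service` label, and the saturation metric names (`db_pool_connections_in_use`, `db_pool_connections_max`) are hypothetical placeholders, since saturation has no universal metric.

```python
def golden_signal_queries(service: str, window: str = "5m") -> dict[str, str]:
    """Return one PromQL expression per golden-signal panel for a service."""
    sel = f'service="{service}"'  # assumes a `service` label on every metric
    requests = f"http_requests_total{{{sel}}}"
    buckets = f"http_request_duration_seconds_bucket{{{sel}}}"
    return {
        "traffic": f"sum(rate({requests}[{window}]))",
        "errors": (
            f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[{window}]))'
            f" / sum(rate({requests}[{window}]))"
        ),
        "latency_p99": (
            f"histogram_quantile(0.99, sum by (le) (rate({buckets}[{window}])))"
        ),
        # Hypothetical metric names: swap in whatever resource fills up first.
        "saturation": (
            f"max(db_pool_connections_in_use{{{sel}}}"
            f" / db_pool_connections_max{{{sel}}})"
        ),
    }


if __name__ == "__main__":
    for name, expr in golden_signal_queries("checkout").items():
        print(f"{name}: {expr}")
```

Passing "$service" instead of a literal name yields expressions that use a Grafana template variable.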

Use template variables instead of duplicating dashboards

The worst anti-pattern is one dashboard per service, all identical. Use a template variable instead. One dashboard, a dropdown to pick the service.

# variable "service": query for label values
label_values(http_requests_total, service)

Then every panel filters by service="$service". Now you maintain one dashboard, and adding a new service requires no dashboard work: once it exports the same metrics with a service label, it shows up in the dropdown. The same trick works for environment, cluster, and region. Chain them by making the service variable's query filter on the cluster variable, so picking a cluster narrows the service list to that cluster.
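In dashboard JSON, variables live under `templating.list`. Here is a rough sketch of a chained cluster-then-service pair, built as Python dicts. It assumes the metric also carries a `cluster` label, and the data source UID "prometheus" is a placeholder for your own. Field names follow the dashboard JSON model as I understand it, so check them against an export from your Grafana version.

```python
import json


def query_variable(name: str, query: str) -> dict:
    """Build one entry for the dashboard's templating.list array."""
    return {
        "name": name,
        "label": name.capitalize(),
        "type": "query",
        # Placeholder UID: replace with your Prometheus data source's UID.
        "datasource": {"type": "prometheus", "uid": "prometheus"},
        "query": query,
        "refresh": 2,  # re-run the variable query when the time range changes
        "sort": 1,     # alphabetical, ascending
        "multi": False,
        "includeAll": False,
    }


templating = {
    "list": [
        # Listed first so the service variable can depend on it.
        query_variable("cluster", "label_values(http_requests_total, cluster)"),
        # Chained: only services that report from the selected cluster.
        query_variable(
            "service",
            'label_values(http_requests_total{cluster="$cluster"}, service)',
        ),
    ]
}

if __name__ == "__main__":
    print(json.dumps(templating, indent=2))
```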

Make units and thresholds honest

A latency panel that plots raw numbers with no unit set, on a y-axis that auto-scales, is useless at a glance. Set the unit (Grafana knows “seconds,” “bytes/sec,” “percent”). Set a threshold line at your SLO so a glance tells you “above or below the line.”

For an error-rate panel, a red threshold at your alert level means anyone can read it without knowing the numbers. The dashboard should be legible to someone who’s never seen it before, because during a cross-team incident, that’s exactly who’s looking.
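Units and thresholds end up in a panel's `fieldConfig`. The sketch below shows roughly what that looks like for a time series panel. The 300 ms SLO and 1% alert level are made-up numbers for illustration, and the field names are my reading of the panel JSON, so compare against a panel exported from your own Grafana.

```python
import json


def threshold_field_config(unit: str, alert_at: float) -> dict:
    """fieldConfig for a time series panel with a unit and one red threshold."""
    return {
        "defaults": {
            "unit": unit,
            "min": 0,  # pin the bottom of the y-axis instead of auto-scaling it
            "thresholds": {
                "mode": "absolute",
                "steps": [
                    {"color": "green", "value": None},   # base step: below the line
                    {"color": "red", "value": alert_at},  # at or above the line
                ],
            },
            # Draw the threshold as a line on the graph itself.
            "custom": {"thresholdsStyle": {"mode": "line"}},
        },
        "overrides": [],
    }


# Hypothetical targets: a 300 ms latency SLO and a 1% error-rate alert level.
latency_config = threshold_field_config("s", 0.3)
error_config = threshold_field_config("percentunit", 0.01)

if __name__ == "__main__":
    print(json.dumps(latency_config, indent=2))
```

"s" and "percentunit" are Grafana unit IDs for seconds and for a 0 to 1 fraction shown as a percentage, which matches an error ratio computed by division.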

Time range and refresh: match the use

An overview dashboard wants a wide window, 6 or 24 hours, so you see trends and the deploy that started the trouble. A live-incident dashboard wants 15 minutes and a fast refresh. Set sensible defaults; don’t make the on-call person fiddle with the time picker while production burns.

Avoid aggressive auto-refresh on heavy dashboards, though. A 5-second refresh on a 24-hour range with twenty panels hammers your Prometheus and can make a stressed system worse. 30s or 1m is plenty for most cases.
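The arithmetic behind that advice is worth seeing once. This back-of-the-envelope sketch counts queries per minute from a single open browser tab, assuming one query per panel. It deliberately ignores the cost of each query, which also grows with the time range. The two defaults dicts show where the settings sit in the dashboard JSON; the specific values are just examples.

```python
def queries_per_minute(
    panels: int, refresh_seconds: float, queries_per_panel: int = 1
) -> float:
    """Queries one open browser tab sends to the data source each minute."""
    return panels * queries_per_panel * (60 / refresh_seconds)


# Dashboard-level defaults as they would appear in the dashboard JSON.
OVERVIEW_DEFAULTS = {"time": {"from": "now-6h", "to": "now"}, "refresh": "1m"}
INCIDENT_DEFAULTS = {"time": {"from": "now-15m", "to": "now"}, "refresh": "30s"}

if __name__ == "__main__":
    for refresh in (5, 30, 60):
        rate = queries_per_minute(20, refresh)
        print(f"{refresh:>2}s refresh: {rate:.0f} queries/min per viewer")
```

Twenty panels at a 5-second refresh is 240 queries per minute per viewer, against 40 at 30 seconds and 20 at one minute. Multiply by everyone who opens the dashboard during an incident.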

Annotate deploys

The most valuable line on any dashboard is a vertical marker showing when you deployed. “Latency spiked” is interesting; “latency spiked four minutes after the 14:02 deploy” is an answer. Wire your CI/CD to post Grafana annotations, or use a query-based annotation from a deploy metric. It turns correlation hunting into a glance.
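For the CI/CD route, a deploy step can POST to Grafana's annotations HTTP API. The following is a sketch using only the standard library. The URL, token, and tag names are placeholders, and I split building the request from sending it so the payload can be inspected without a live Grafana.

```python
import json
import time
import urllib.request
from typing import Optional


def build_deploy_annotation(
    grafana_url: str,
    token: str,
    service: str,
    version: str,
    when_ms: Optional[int] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/annotations request for a deploy."""
    payload = {
        # Grafana expects epoch milliseconds.
        "time": when_ms if when_ms is not None else int(time.time() * 1000),
        "tags": ["deploy", service],  # tag names are a convention, not required
        "text": f"Deployed {service} {version}",
    }
    return urllib.request.Request(
        url=f"{grafana_url.rstrip('/')}/api/annotations",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def post_annotation(request: urllib.request.Request) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

On the dashboard side, an annotation query filtered to the "deploy" tag then draws the vertical markers on every panel.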

Version your dashboards as code

Click-ops dashboards rot. Someone tweaks a panel, no one knows what changed, and the carefully built thing degrades. Export dashboard JSON and commit it. Provision it from a file or Terraform. Now changes go through review, and a broken dashboard is a git revert.

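# Grafana dashboard provisioning: load every dashboard JSON file under `path`
apiVersion: 1
providers:
  - name: 'service-dashboards'        # unique name for this provider
    folder: 'Services'                # Grafana folder the dashboards appear in
    type: file
    options:
      path: /etc/grafana/dashboards   # directory holding the committed JSON exports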
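Once dashboards are files, review can be backed by a cheap CI check. This is one possible sketch: it only confirms that each export parses as JSON and carries a few top-level keys. The list of required keys is my assumption about what a team might want to enforce, not a Grafana rule.

```python
import json
from pathlib import Path

# Assumed team policy: every committed dashboard has a stable uid, a title,
# and a panels array.
REQUIRED_KEYS = ("uid", "title", "panels")


def lint_dashboards(directory: str) -> list[str]:
    """Return human-readable problems; an empty list means every file passed."""
    problems = []
    for path in sorted(Path(directory).glob("*.json")):
        try:
            dashboard = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"{path.name}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(dashboard, dict):
            problems.append(f"{path.name}: top level is not a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if key not in dashboard:
                problems.append(f"{path.name}: missing '{key}'")
    return problems


if __name__ == "__main__":
    import sys

    found = lint_dashboards(sys.argv[1] if len(sys.argv) > 1 else ".")
    print("\n".join(found) or "all dashboards OK")
    sys.exit(1 if found else 0)
```

Run it against the dashboards directory in the pipeline; a non-zero exit fails the build before a broken export reaches Grafana.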

Where AI helps

Dashboard JSON is verbose and nobody enjoys hand-authoring it. I describe what I want — “a service dashboard with the four golden signals up top, a service template variable, latency in milliseconds with an SLO threshold line” — and let AI generate the panel queries and a JSON skeleton. I import it, fix the metric names to match reality, and iterate.

The real win is the PromQL behind each panel: getting the rate-before-sum and histogram_quantile patterns right is fiddly, and AI drafts them cleanly. We keep monitoring prompts for exactly this, and the queries our Alert Rule Generator produces drop straight into Grafana panels.

The test of a good dashboard

Open it cold, during a (simulated) incident, and time how long it takes to answer “is this healthy, and if not, where’s the problem?” If the answer takes more than ten seconds, the dashboard isn’t doing its job. Strip panels, fix units, hoist the golden signals to the top, and try again.

A dashboard isn’t a trophy case for every metric you collect. It’s a question-answering machine. Build it for the question, and people will actually open it when it counts.

Generated dashboards and queries are assistive, not authoritative. Always verify panel queries against your real metrics before relying on a dashboard during an incident.