Metric Naming Standards That Keep Prometheus Sane
Inconsistent metric names turn dashboards and alerts into archaeology. A naming convention for units, suffixes, and labels makes every metric predictable and queryable.
I once inherited a Prometheus deployment where the same concept — request latency — was instrumented four different ways: request_time, http_latency_ms, api_duration, and responseTimeSeconds. Building a cross-service dashboard meant memorizing all four and converting milliseconds to seconds in PromQL by hand. The lesson stuck: metric naming is not bikeshedding. A consistent convention is the difference between a query you can guess and a metric you have to go read source code to find. Here’s the standard I enforce now.
The base unit rule: seconds and bytes, always
The single highest-value convention: always use base units, never prefixed ones. Seconds, not milliseconds. Bytes, not megabytes. Ratios as 0-1, not percentages.
http_request_duration_seconds # not _milliseconds
memory_usage_bytes # not _megabytes
cache_hit_ratio # 0.0 - 1.0, not 0 - 100
Why? Because the Prometheus ecosystem assumes base units: the official exporters and client libraries expose seconds and bytes, and Grafana’s unit formatting scales from them. If your latency is in seconds and the panel unit is set to seconds, Grafana auto-formats it as “340ms” or “1.2s” intelligently. If it’s in milliseconds, you carry a unit override into every panel, and every cross-metric calculation needs a conversion factor you’ll get wrong at least once. Base units are non-negotiable.
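One way to hold the line is to convert at the instrumentation boundary, so nothing prefixed ever reaches the exposition. A minimal sketch, assuming a hypothetical helper named to_base_unit and a small unit table of my own choosing (neither comes from a Prometheus client library):

```python
# Hypothetical helper: normalise a measurement to its base unit
# (seconds, bytes, or a 0-1 ratio) before it is recorded.

# Units that are divided down to reach the base unit.
_DIVISORS = {"s": 1, "ms": 1_000, "us": 1_000_000, "percent": 100}

# Units that are multiplied up to reach the base unit (bytes).
_MULTIPLIERS = {"bytes": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3}


def to_base_unit(value: float, unit: str) -> float:
    """Return `value` expressed in seconds, bytes, or as a 0-1 ratio."""
    if unit in _DIVISORS:
        return value / _DIVISORS[unit]
    if unit in _MULTIPLIERS:
        return value * _MULTIPLIERS[unit]
    # Fail loudly rather than expose a value in an unknown unit.
    raise ValueError(f"unknown unit: {unit!r}")


if __name__ == "__main__":
    print(to_base_unit(340, "ms"))        # latency measured in milliseconds
    print(to_base_unit(2, "MiB"))         # memory reported in mebibytes
    print(to_base_unit(87.5, "percent"))  # cache hit rate as a percentage
```

The conversion happens once, where the value is measured, instead of in every query that touches the metric.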
The suffix convention encodes the metric type
Prometheus has a strong convention that the name’s suffix tells you what kind of metric it is and how to query it:
- _total: a monotonic counter. You almost always wrap it in rate().
- _seconds / _bytes: the base unit, for gauges and counters alike.
- _seconds_total: a counter accumulating time.
- _bucket, _sum, _count: the auto-generated parts of a histogram.
- _ratio / _info: a 0-1 ratio, or a metadata-only _info metric.
http_requests_total # counter -> rate()
http_request_duration_seconds_bucket # histogram
node_filesystem_avail_bytes # gauge
process_cpu_seconds_total # time counter
When someone sees http_requests_total they know without checking the docs to write rate(http_requests_total[5m]). That predictability is the entire point — the name is the documentation.
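To show how mechanical that inference is, here is a rough sketch that guesses a metric’s kind and a starting query from the suffix alone. The function names describe_metric and suggest_query are hypothetical, the 0.95 quantile and 5m window are arbitrary defaults, and the guess is only as good as the naming discipline behind it:

```python
def describe_metric(name: str) -> str:
    """Guess a metric's kind from its suffix, per the convention above."""
    if name.endswith("_bucket"):
        return "histogram bucket"
    if name.endswith(("_sum", "_count")):
        return "histogram/summary component"
    if name.endswith("_info"):
        return "info metric"
    if name.endswith("_total"):
        # Also covers time counters such as *_seconds_total.
        return "counter"
    return "gauge (probably)"


def suggest_query(name: str, window: str = "5m") -> str:
    """Return a reasonable first PromQL query for the metric."""
    kind = describe_metric(name)
    if kind == "counter":
        return f"rate({name}[{window}])"
    if kind == "histogram bucket":
        return f"histogram_quantile(0.95, sum by (le) (rate({name}[{window}])))"
    # Gauges and everything else are queried as-is.
    return name


if __name__ == "__main__":
    for metric in (
        "http_requests_total",
        "http_request_duration_seconds_bucket",
        "node_filesystem_avail_bytes",
        "process_cpu_seconds_total",
    ):
        print(f"{metric:40s} {suggest_query(metric)}")
```

If a ten-line function can pick the right query from the name, so can a tired engineer at 3 a.m.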
Namespacing: prefix by subsystem
Prefix every metric with the application or subsystem it belongs to. This prevents collisions and makes autocomplete useful:
payments_transactions_total
payments_gateway_latency_seconds
checkout_cart_items # a gauge; the _count suffix is reserved for histogram and summary series
In a TSDB with thousands of metric names, typing payments_ and getting only payments metrics is a real productivity gain. The prefix is part of the name, not a label — names are for what the metric is, labels are for dimensions of it.
The name-vs-label decision
The most consequential naming choice is what goes in the name versus what goes in a label. The rule: the name is the measurement; labels are the dimensions you slice by.
Right:
http_requests_total{method="GET", status="200", handler="/api/users"}
Wrong — encoding dimensions into the name:
http_requests_get_200_api_users_total
The wrong version is un-aggregatable. With proper labels you can sum by (status) or filter status=~"5..". With everything in the name you have a thousand un-relatable metrics. If you’d ever want to aggregate across a dimension, it’s a label.
But the inverse failure is just as bad: never put unbounded values in labels. User IDs, request IDs, full URLs, timestamps — these explode cardinality and can take down your Prometheus. A label’s value set should be small and bounded. status (a dozen values) is a great label; user_id (millions) is a TSDB-killer.
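The arithmetic behind that warning is simple multiplication: in the worst case a metric produces one series per combination of label values. A back-of-the-envelope sketch, where the function names, the example counts, and the 1,000-value threshold are all illustrative assumptions rather than Prometheus limits:

```python
from math import prod


def worst_case_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())


def suspicious_labels(label_values: dict[str, int], limit: int = 1_000) -> list[str]:
    """Labels whose value count exceeds an (assumed) per-label budget."""
    return sorted(name for name, count in label_values.items() if count > limit)


if __name__ == "__main__":
    bounded = {"method": 5, "status": 12, "handler": 40}
    print(worst_case_series(bounded))

    # Add a per-user label and the same metric grows by six orders of magnitude.
    with_user = {**bounded, "user_id": 1_000_000}
    print(worst_case_series(with_user))
    print(suspicious_labels(with_user))
```

Real series counts are usually lower than the product because not every combination occurs, but the product is the number to budget against.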
A label hygiene checklist
For every label, ask:
- Is the value set bounded? If it can grow without limit, it’s not a label.
- Will I ever group or filter by it? If never, it’s noise — drop it.
- Is it consistent across services? A service label spelled svc in one place and service in another breaks joins. Standardize the common dimensions: service, env, region, status, method.
# the division only matches series whose `service` label agrees on both sides
sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (service) (rate(http_requests_total[5m]))
A single inconsistent label name silently breaks vector matching across teams, group_left joins included: the query returns nothing instead of an error. Agree on the shared dimension names once, document them, and enforce in review.
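A check like this is easy to automate once the shared names are written down. The sketch below assumes you can get a mapping of metric name to label names from somewhere (a scrape, a registry dump); the ALIASES table and the function name find_label_mismatches are hypothetical and would be replaced by your own standard:

```python
# Assumed alias table: non-standard spellings mapped to the agreed label name.
ALIASES = {
    "svc": "service",
    "service_name": "service",
    "environment": "env",
}


def find_label_mismatches(
    metrics: dict[str, set[str]],
) -> list[tuple[str, str, str]]:
    """Return (metric, offending_label, canonical_label) for every known alias."""
    problems = []
    for metric, labels in sorted(metrics.items()):
        for label in sorted(labels):
            if label in ALIASES:
                problems.append((metric, label, ALIASES[label]))
    return problems


if __name__ == "__main__":
    observed = {
        "http_requests_total": {"service", "status", "method"},
        "queue_depth": {"svc", "env"},
    }
    for metric, found, wanted in find_label_mismatches(observed):
        print(f"{metric}: label '{found}' should be '{wanted}'")
```

It only catches spellings you have already listed, so treat the alias table as something that grows with each review.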
Don’t repeat the prefix or the unit in labels
Two small but common smells:
# redundant: the metric is already namespaced "payments"
payments_latency_seconds{payments_region="us"} # bad
payments_latency_seconds{region="us"} # good
# redundant: unit is in the name
disk_usage_bytes{unit="bytes"} # bad, drop the label
Every redundant label is cardinality and clutter for zero query value.
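Both smells are easy to detect mechanically. A small sketch, assuming the namespace is the first underscore-separated token of the metric name and that a label literally called unit or units is the redundant one; the function name redundant_labels is made up for this example:

```python
# Label names that merely restate the unit already present in the metric name.
UNIT_LABELS = {"unit", "units"}


def redundant_labels(metric: str, labels: list[str]) -> list[str]:
    """Return labels that repeat the metric's namespace prefix or its unit."""
    prefix = metric.split("_", 1)[0]  # assumed namespace, e.g. "payments"
    flagged = []
    for label in labels:
        if label.startswith(prefix + "_"):
            flagged.append(label)  # e.g. payments_region on a payments_* metric
        elif label in UNIT_LABELS:
            flagged.append(label)  # the unit is already in the metric name
    return flagged


if __name__ == "__main__":
    print(redundant_labels("payments_latency_seconds", ["payments_region"]))
    print(redundant_labels("payments_latency_seconds", ["region"]))
    print(redundant_labels("disk_usage_bytes", ["unit", "device"]))
```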
Enforcing it without nagging
Conventions only hold if they’re checked. Two cheap enforcement points:
- promtool check metrics in CI lints exposition for some convention violations and catches obvious mistakes before they ship.
- A shared instrumentation library. The most reliable enforcement is to not let engineers name metrics freehand: wrap your metric registration in a helper that applies the prefix and validates the suffix. Convention by construction beats convention by code review.
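The core of such a helper can be very small. This is a sketch using only the standard library: the function name metric_name, the allowed-suffix list, and the lowercase-only rule are my assumptions (and deliberately stricter than what Prometheus itself accepts). In practice you would call it from a thin wrapper around your client library’s metric constructors, and you would need an escape hatch for unitless gauges:

```python
import re

# Assumed house rule: every metric must end in one of these suffixes.
ALLOWED_SUFFIXES = ("_total", "_seconds", "_bytes", "_ratio", "_info")

# Assumed house rule: lowercase snake_case only.
_NAME_RE = re.compile(r"^[a-z][a-z0-9_]*$")


def metric_name(namespace: str, name: str) -> str:
    """Build a namespaced metric name, rejecting anything off-convention."""
    full = f"{namespace}_{name}"
    if not _NAME_RE.match(full):
        raise ValueError(f"{full!r}: use lowercase snake_case")
    if not full.endswith(ALLOWED_SUFFIXES):
        raise ValueError(
            f"{full!r}: must end with one of {', '.join(ALLOWED_SUFFIXES)}"
        )
    return full


if __name__ == "__main__":
    print(metric_name("payments", "transactions_total"))
    print(metric_name("payments", "gateway_latency_seconds"))
    for bad in ("http_latency_ms", "responseTimeSeconds"):
        try:
            metric_name("api", bad)
        except ValueError as err:
            print(f"rejected: {err}")
```

Two of the four latency names from the deployment I inherited fail this check at registration time, before they ever reach a dashboard.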
A relabeling pass can also retrofit standards onto third-party exporters you don’t control, renaming or dropping labels at scrape time.
Why it compounds
A naming convention feels like overhead on day one and pays dividends every day after. Engineers guess metric names correctly. Cross-service dashboards work without conversion math. Alert rules read cleanly. New hires onboard faster because the system is predictable. The cost is a one-page standard and a little discipline in review — cheap insurance against the four-different-names-for-latency mess I inherited.
For the cardinality control that complements good naming, see our metric-cardinality and relabeling guides in the Prometheus and monitoring category. And when inconsistent labels are breaking your alert joins, our monitoring alert assistant can flag the mismatches.
The conventions here follow common Prometheus community practice. Adapt the shared label names to whatever your organization standardizes on.