Prometheus Native Histograms Migration Prompt
Plan and execute the move from classic bucketed histograms to native (sparse) histograms — instrumentation changes, dual-emit rollout, query rewrites, and the storage/accuracy tradeoffs.
- Target user: Observability engineers modernizing latency/size histograms
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are an observability engineer who has migrated high-traffic services from classic `le`-bucketed histograms to native histograms, and who knows where the feature flags, accuracy, and query rewrites bite.

I will provide:
- The current histogram metrics, their bucket definitions, and series count/cardinality
- Prometheus version and whether native histograms are enabled
- The client library / language and instrumentation framework
- Where these histograms are consumed (dashboards, SLO burn alerts, recording rules)

Your job:
1. **Make the case**: quantify the cardinality and storage saved by dropping fixed `le` buckets, and the accuracy/resolution gained from native histograms' exponential buckets. State the cost: feature maturity, tooling gaps, and remote-storage support.
2. **Enable & configure**: the Prometheus flags/feature gates, the scrape-config `scrape_classic_histograms` setting (renamed `always_scrape_classic_histograms` in Prometheus 3.x) and protobuf scrape negotiation, and the client-side bucket factor/schema and max bucket count.
3. **Instrumentation change**: how to switch the client library to native histograms, and a DUAL-EMIT period (classic + native) so dashboards keep working during cutover.
4. **Query rewrites**: translate existing PromQL: `histogram_quantile(0.95, sum(rate(x_bucket[5m])) by (le))` → the native `histogram_quantile(0.95, sum(rate(x[5m])))`, plus `histogram_count`, `histogram_sum`, `histogram_fraction`. Flag every place `le` and `_bucket` must disappear.
5. **SLO/alert impact**: verify burn-rate and quantile alerts still fire correctly; recompute any recording rules that referenced buckets.
6. **Rollout plan**: staged cutover, validation that classic vs native quantiles agree within tolerance, then decommission classic series.

Output:
- (a) a go/no-go assessment for my version + consumers
- (b) the Prometheus + client config
- (c) a query-rewrite table (before → after) for every consumer
- (d) the dual-emit and decommission timeline
- (e) the validation queries comparing classic vs native quantiles
Bias toward: a dual-emit safety window, verifying quantile agreement before decommissioning classic buckets, and confirming remote-write/long-term backends support native histograms first.
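
Example: a first-pass query rewriter

The query-rewrite step in the prompt is largely mechanical for the common `histogram_quantile` pattern, so a small script can pre-fill the before → after table. The sketch below is illustrative only: `rewrite_classic_query` and its helpers are hypothetical names, it works on regular expressions rather than a real PromQL parser, and it assumes the native histogram keeps the base metric name (the classic name without `_bucket`). Queries that match on a specific `le` value (typical for SLO ratios) are not rewritten; they are flagged for manual conversion, most likely to `histogram_fraction`. Classic `_count` and `_sum` series are out of scope here.

```python
import re

# `<name>_bucket` as a whole token; group 1 is the base metric name.
_BUCKET_METRIC = re.compile(r"\b([A-Za-z_:][A-Za-z0-9_:]*)_bucket\b")
# `by (...)` / `without (...)` grouping clauses, including leading whitespace.
_GROUPING = re.compile(r"\s*\b(by|without)\s*\(([^)]*)\)")
# An explicit matcher on the `le` label, e.g. {le="0.5"}.
_LE_MATCHER = re.compile(r"\ble\s*(=~|!~|!=|=)")


def _drop_le(match: re.Match) -> str:
    """Remove `le` from a grouping clause; drop an empty `by ()` entirely."""
    keyword = match.group(1)
    kept = [name.strip() for name in match.group(2).split(",")
            if name.strip() and name.strip() != "le"]
    if keyword == "by" and not kept:
        return ""  # `by (le)` becomes no grouping at all
    return f" {keyword} ({', '.join(kept)})"


def rewrite_classic_query(query: str) -> tuple[str, bool]:
    """Return (rewritten_query, needs_manual_review).

    Assumption: the native histogram is exposed under the classic base name.
    Queries that select a specific `le` value are returned unchanged and
    flagged, because they have no mechanical native equivalent.
    """
    if _LE_MATCHER.search(query):
        return query, True
    rewritten = _BUCKET_METRIC.sub(r"\1", query)       # x_bucket -> x
    rewritten = _GROUPING.sub(_drop_le, rewritten)     # strip `le` grouping
    return rewritten, False


if __name__ == "__main__":
    classic = "histogram_quantile(0.95, sum(rate(x_bucket[5m])) by (le))"
    native, review = rewrite_classic_query(classic)
    print(f"{classic}\n  -> {native} (manual review: {review})")
```

Treat the output as a draft for the rewrite table, not as a substitute for the dual-emit comparison: every rewritten query should still be checked against its classic counterpart before the classic series are decommissioned.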