Prometheus Native Histograms Migration Prompt

Plan and execute the move from classic bucketed histograms to native (sparse) histograms — instrumentation changes, dual-emit rollout, query rewrites, and the storage/accuracy tradeoffs.

Target user: Observability engineers modernizing latency/size histograms
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are an observability engineer who has migrated high-traffic services from classic le-bucketed histograms to native histograms, and knows where the feature flags, accuracy, and query rewrites bite.

I will provide:
- The current histogram metrics, their bucket definitions, and series count/cardinality
- Prometheus version and whether native histograms are enabled
- The client library / language and instrumentation framework
- Where these histograms are consumed (dashboards, SLO burn alerts, recording rules)

Your job:

1. **Make the case** — quantify the cardinality and storage saved by dropping fixed `le` buckets, and the accuracy/resolution gained from native histograms' exponential buckets. State the cost: feature maturity, tooling gaps, and remote-storage support.

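2. **Enable & configure** — the Prometheus feature flag (`--enable-feature=native-histograms` on versions that still gate the feature), the per-job `scrape_classic_histograms` scrape setting (named `always_scrape_classic_histograms` in Prometheus 3.x) together with protobuf content negotiation, and the client-side bucket factor/schema and max bucket count.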

3. **Instrumentation change** — how to switch the client library to native histograms, and a DUAL-EMIT period (classic + native) so dashboards keep working during cutover.

4. **Query rewrites** — translate existing PromQL: `histogram_quantile(0.95, sum(rate(x_bucket[5m])) by (le))` → the native `histogram_quantile(0.95, sum(rate(x[5m])))`, plus `histogram_count` and `histogram_sum` in place of the classic `x_count` and `x_sum` series, and `histogram_fraction` in place of ratios built from a specific `le` bucket. Flag every place `le` and `_bucket` must disappear.

5. **SLO/alert impact** — verify burn-rate and quantile alerts still fire correctly; recompute any recording rules that referenced buckets.

6. **Rollout plan** — staged cutover, validation that classic vs native quantiles agree within tolerance, then decommission classic series.

Output: (a) a go/no-go assessment for my version + consumers, (b) the Prometheus + client config, (c) a query-rewrite table (before → after) for every consumer, (d) the dual-emit and decommission timeline, (e) the validation queries comparing classic vs native quantiles.

Bias toward: a dual-emit safety window, verifying quantile agreement before decommissioning classic buckets, and confirming remote-write/long-term backends support native histograms first.
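As a rough illustration of the query rewrite in step 4 and the agreement check in step 6, the following Python sketch rewrites the canonical classic pattern and builds a PromQL expression that returns a result only when classic and native quantiles diverge beyond a tolerance. The helper names (`rewrite_to_native`, `needs_manual_review`, `agreement_query`) are hypothetical, and the regexes assume the simple `rate(<name>_bucket[...]) ... by (le)` shape. They are not a PromQL parser, so real queries should still be reviewed by hand.

```python
import re

# Hypothetical helpers for illustration only. They cover the canonical
# "rate(<name>_bucket[...]) ... by (le)" shape, not arbitrary PromQL.
_BUCKET_SELECTOR = re.compile(r"\b([a-zA-Z_][a-zA-Z0-9_:]*)_bucket(?=\s*[\[{])")
_BY_CLAUSE = re.compile(r"\s*\bby\s*\(([^)]*)\)")


def _drop_le(match: re.Match) -> str:
    # Keep every grouping label except `le`; drop the clause if none remain.
    labels = [name.strip() for name in match.group(1).split(",")]
    kept = [name for name in labels if name and name != "le"]
    return f" by ({', '.join(kept)})" if kept else ""


def rewrite_to_native(expr: str) -> str:
    """Strip the _bucket suffix from selectors and `le` from by() clauses."""
    expr = _BUCKET_SELECTOR.sub(r"\1", expr)
    return _BY_CLAUSE.sub(_drop_le, expr)


def needs_manual_review(expr: str) -> bool:
    """True if `le` still appears, e.g. an le="0.5" matcher on one bucket."""
    return re.search(r"\ble\b", expr) is not None


def agreement_query(metric: str, q: float, window: str, tolerance: float) -> str:
    """PromQL that yields a sample only when the relative gap exceeds tolerance.

    Assumes a dual-emit period, so both <metric> (native) and
    <metric>_bucket (classic) are being ingested.
    """
    classic = f"histogram_quantile({q}, sum by (le) (rate({metric}_bucket[{window}])))"
    native = f"histogram_quantile({q}, sum(rate({metric}[{window}])))"
    return f"abs({native} - {classic}) / {classic} > {tolerance}"


if __name__ == "__main__":
    before = "histogram_quantile(0.95, sum(rate(x_bucket[5m])) by (le))"
    print(rewrite_to_native(before))
    print(agreement_query("x", 0.95, "5m", 0.05))
```

Queries that `needs_manual_review` flags, such as SLO ratios selecting a single `le` bucket, usually need a `histogram_fraction` rewrite rather than a textual substitution. Classic quantiles are interpolated within fixed buckets, so some disagreement is expected, and the tolerance should reflect the width of the classic buckets near the quantile.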

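For the "make the case" and "enable & configure" steps, the series-count and resolution arithmetic can be sketched as below. The function names and the example inputs (500 label combinations, 12 explicit buckets) are made up for illustration. The schema range of -4 to 8 and the growth factor of 2^(2^-schema) follow the standard exponential schemas of native histograms. The factor-to-schema mapping is an assumption modeled on how the Go client documents its bucket factor option, so check it against the client library actually in use.

```python
import math


def classic_series(label_combinations: int, explicit_buckets: int) -> int:
    # Per label combination, a classic histogram exposes one _bucket series per
    # explicit bound, one for le="+Inf", plus the _sum and _count series.
    return label_combinations * (explicit_buckets + 1 + 2)


def native_series(label_combinations: int) -> int:
    # A native histogram is a single series per label combination. Each sample
    # is larger, so this measures series count, not bytes on disk.
    return label_combinations


def bucket_growth_factor(schema: int) -> float:
    # Standard exponential schemas: each bucket boundary is the previous one
    # multiplied by 2**(2**-schema).
    if not -4 <= schema <= 8:
        raise ValueError("standard schemas range from -4 to 8")
    return 2 ** (2 ** -schema)


def schema_for_factor(factor: float) -> int:
    # Assumption: pick the coarsest schema whose growth factor does not exceed
    # the requested factor, clamped to the standard schema range.
    if factor <= 1:
        raise ValueError("factor must be greater than 1")
    schema = -math.floor(math.log2(math.log2(factor)))
    return max(-4, min(8, schema))


if __name__ == "__main__":
    combos, buckets = 500, 12  # made-up example inputs
    print("classic series:", classic_series(combos, buckets))
    print("native series:", native_series(combos))
    schema = schema_for_factor(1.1)
    print("schema:", schema, "growth factor:", round(bucket_growth_factor(schema), 4))
```

The series count only covers index cardinality. Actual storage savings depend on how many buckets each native sample populates, which is why the max bucket count in step 2 matters.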