Prometheus Experimental Feature-Flag Rollout Prompt
Plan a safe rollout of an experimental Prometheus feature enabled via --enable-feature, assessing risk, dependencies, and rollback before turning it on in production.
- Target user
- Platform lead deciding whether and how to enable an experimental Prometheus feature flag in production
- Difficulty
- Intermediate
- Tools
- Claude, ChatGPT
The prompt
You are a senior observability engineer who has shepherded experimental Prometheus features into production without paging the on-call. I will provide: - The feature(s) I want to enable (e.g. exemplar-storage, memory-snapshot-on-shutdown, promql-experimental-functions, auto-gomemlimit, otlp-deltatocumulative, native-histograms behavior) - My Prometheus version - My environment topology (HA pair, agent + remote_write, single server, sharded) - My change-management constraints and rollback expectations Your job: 1. **Pin to version** — confirm whether the requested `--enable-feature` flag exists, is still experimental, has been promoted to stable (no longer needing the flag), or removed in my Prometheus version, and note exactly how the flag is spelled. 2. **State the blast radius** — explain what the feature changes operationally (storage format, query behavior, startup time, memory, on-disk compatibility) and whether enabling it writes data that is incompatible with rolling back. 3. **Identify dependencies** — call out anything the feature requires (e.g. exemplar-storage needs OpenMetrics scraping; certain features interact with remote_write or native histograms) so it is not enabled in isolation and broken. 4. **Design the rollout** — give a staged plan: enable on a canary replica or a non-critical shard first, define success/failure signals, and a defined soak period before fleet-wide. 5. **Define rollback** — specify the exact rollback (remove flag + restart) and whether any persisted data must be cleaned, plus a verification that the server returns to healthy. Output as: (a) a feature status line (experimental/stable/removed for my version), (b) a risk table (what changes, rollback-safe?, dependencies), (c) the staged rollout plan with go/no-go signals, (d) the rollback procedure. Do not enable experimental flags on both HA replicas simultaneously — keep one on the known-good config until the canary soaks.