AI for Prometheus & Monitoring
Write better alert rules, PromQL queries, and Grafana dashboards with AI.
Prompts
- Advanced
Alertmanager Routing Tree Matcher Design Review Prompt
Design or review an Alertmanager routing tree — receivers, matchers, group_by, continue, and timers — so every alert reaches the right team exactly once without falling through to a catch-all.
- Claude
- ChatGPT
- Intermediate
Grafana Prometheus Dashboard Panel Query Design Prompt
Design Grafana panel PromQL with template variables, $__rate_interval, legend formatting, and unit/threshold choices so dashboards stay readable and don't hammer Prometheus on every refresh.
- Claude
- ChatGPT
- Advanced
Prometheus Active Series Cardinality Reduction Triage Prompt
Triage a TSDB active-series and head-memory blowup by finding the offending metric+label, deciding between drop relabeling, label aggregation, or instrumentation fixes, with a measurable before/after series count.
- Claude
- ChatGPT
- Intermediate
Prometheus Missing Metric End-to-End Debugging Prompt
Walk a metric that isn't showing up in Prometheus through the full path — instrumentation, exposition, target discovery, scrape success, relabeling drops, and staleness — to pinpoint exactly where it vanishes.
- Claude
- ChatGPT
- Advanced
Prometheus Multi-Window Multi-Burn-Rate SLO Alert Authoring Prompt
Author a complete multi-window, multi-burn-rate SLO alerting ruleset (fast + slow burn pairs with for/severity) from an objective and error-budget window, balancing detection speed against false-page rate.
- Claude
- ChatGPT
- Advanced
Prometheus Recording Rule Hierarchy Design and Naming Prompt
Design a layered recording-rule hierarchy that precomputes expensive aggregations once, follows the level:metric:operations naming convention, and feeds dashboards, SLOs, and alerts from cheap series.
- Claude
- ChatGPT
- Advanced
Prometheus Scrape Config Relabel Target Pruning Design Prompt
Design relabel_configs in a scrape job to keep/drop the right targets from service discovery, rewrite the instance/job labels, and prune noisy discovered endpoints before they ever scrape.
- Claude
- ChatGPT
- Advanced
PromQL group_left Metadata Enrichment Join Prompt
Write a many-to-one PromQL join with group_left to enrich a metric with labels from an info/metadata series (kube_pod_info, *_build_info) without breaking vector matching or duplicating series.
- Claude
- ChatGPT
- Advanced
PromQL Latency SLI from Histograms Aggregation Design Prompt
Build a correct latency SLI/alert from Prometheus histogram metrics — aggregating buckets before histogram_quantile, choosing percentile vs threshold-ratio, and avoiding the average-of-percentiles trap.
- Claude
- ChatGPT
- Advanced
PromQL Rate Window vs Scrape Interval Mismatch Debugging Prompt
Diagnose why a rate() or increase() query returns gaps, zeros, jagged graphs, or NaN by reconciling the range window against the scrape interval, staleness, and counter reset behaviour.
- Claude
- ChatGPT
- Intermediate
Alertmanager group_wait, group_interval & repeat_interval Tuning Prompt
Tune Alertmanager grouping and repeat timers so related alerts batch into one notification, follow-ups are timely, and re-pages don't become noise.
- Claude
- ChatGPT
- Intermediate
Grafana $__rate_interval Correctness Review Prompt
Review Grafana panel queries that use rate() to confirm they use $__rate_interval correctly, so dashboards stay accurate across zoom levels and scrape intervals.
- Claude
- ChatGPT
- Cursor
- Advanced
OpenTelemetry Collector batch & memory_limiter Processor Sizing Prompt
Size the OpenTelemetry Collector batch and memory_limiter processors so the pipeline batches efficiently, applies backpressure, and never OOMs under telemetry spikes.
- Claude
- ChatGPT
- Cursor
- Advanced
Prometheus metric_relabel_configs Drop-List Cardinality Audit Prompt
Audit and generate metric_relabel_configs drop and keep rules that cut high-cardinality series at ingest without dropping metrics your alerts and dashboards depend on.
- Claude
- ChatGPT
- Cursor
- Advanced
Prometheus query.max-samples, timeout & concurrency Tuning Prompt
Tune Prometheus server query limits (query.max-samples, query.timeout, query.max-concurrency) so heavy or runaway queries fail fast instead of OOMing the server.
- Claude
- ChatGPT
- Cursor
- Advanced
PromQL Native Histogram histogram_count & histogram_sum Debugging Prompt
Debug quantiles, averages, and rates over Prometheus native histograms using histogram_count, histogram_sum, histogram_fraction, and histogram_quantile correctly.
- Claude
- ChatGPT
- Cursor
- Advanced
PromQL quantile_over_time vs histogram_quantile Selection Prompt
Decide whether to compute a percentile with quantile_over_time over a gauge or with histogram_quantile over histogram buckets, and avoid the silent accuracy traps of each.
- Claude
- ChatGPT
- Advanced
Thanos Store Gateway Index & Caching Tier Sizing Prompt
Size the Thanos Store Gateway index cache, bucket cache, and caching-bucket tiers so long-range queries are fast without exhausting memory or hammering object storage.
- Claude
- ChatGPT
- Cursor
- Advanced
VictoriaMetrics vmagent Stream Aggregation Rules Design Prompt
Design vmagent stream aggregation rules that pre-aggregate high-cardinality metrics at ingest, cutting stored series while preserving the dimensions your queries need.
- Claude
- ChatGPT
- Cursor
- Intermediate
Prometheus Experimental Feature-Flag Rollout Prompt
Plan a safe rollout of an experimental Prometheus feature enabled via --enable-feature, assessing risk, dependencies, and rollback before turning it on in production.
- Claude
- ChatGPT
- Advanced
Prometheus honor_labels & honor_timestamps Conflict Resolution Prompt
Diagnose and fix label collisions and timestamp drift caused by honor_labels/honor_timestamps when scraping federation endpoints, Pushgateway, or exporters that expose their own job/instance labels.
- Claude
- ChatGPT
- Intermediate
Prometheus http_sd Dynamic Target Discovery Prompt
Design and debug an http_sd_config integration so Prometheus pulls its scrape targets from a custom HTTP discovery endpoint, with correct refresh, labeling, and failure handling.
- Claude
- ChatGPT
- Advanced
Prometheus Query Log Slow-Query Audit Prompt
Enable and analyze the Prometheus active query log and query_log_file to find expensive PromQL queries that strain the server, then rewrite or offload them.
- Claude
- ChatGPT
- Intermediate
Prometheus sample_limit Target Protection Prompt
Design per-target sample_limit guardrails that protect a Prometheus server from a single misbehaving exporter blowing up cardinality, without dropping legitimate metrics from healthy targets.
- Claude
- ChatGPT
- Advanced
Prometheus scrape_protocols Content Negotiation Prompt
Configure and troubleshoot Prometheus scrape_protocols / content-type negotiation so the server requests the right exposition format (OpenMetrics, PrometheusText, PrometheusProto) and unlocks features like native histograms and created timestamps.
- Claude
- ChatGPT
- Beginner
Prometheus target_limit & label_limit Guardrails Prompt
Configure target_limit, label_limit, label_name_length_limit, and label_value_length_limit to protect a Prometheus server from service-discovery explosions and abusive label sets in a multi-tenant environment.
- Claude
- ChatGPT
- Advanced
Prometheus TSDB Head Memory & Series Churn Prompt
Diagnose Prometheus memory pressure driven by the in-memory head block, distinguishing high active-series load from high series churn, and applying the right remediation for each.
- Claude
- ChatGPT
- Advanced
Prometheus WAL Replay Startup Latency Prompt
Diagnose and reduce slow Prometheus startup caused by long write-ahead-log (WAL) replay, so a restarting server returns to a healthy, scrapeable state quickly after deploys or crashes.
- Claude
- ChatGPT
- Intermediate
Prometheus Exporter TLS & Auth Hardening Prompt
Secure exporter and scrape endpoints with TLS and authentication using Prometheus web-config and exporter web.config.file so metrics endpoints exposing internal labels and topology are no longer open on the network.
- Claude
- ChatGPT
- Advanced
Prometheus External Labels & Multi-Cluster Collision Prompt
Design a coherent external_labels and identity scheme across many Prometheus instances so federation, remote-write, and global query layers never collide series, double-count, or lose the cluster/region dimension.
- Claude
- ChatGPT
- Intermediate
Prometheus Histogram Bucket Boundary Design Prompt
Choose histogram bucket boundaries that match your SLO thresholds and latency distribution so quantile estimates are accurate where it matters, without exploding series cardinality from too many buckets.
- Claude
- ChatGPT
- Intermediate
Prometheus Meta-Monitoring & Self-SLO Design Prompt
Build the monitoring-of-the-monitoring layer: alerts and SLOs that tell you when Prometheus itself is unhealthy — scrapes lagging, rules failing, WAL growing, or the whole instance dead — so your blind spots do not become silent outages.
- Claude
- ChatGPT
- Advanced
Prometheus Query API Read-Path Protection Prompt
Protect the Prometheus query API from runaway, expensive, or hostile queries using sample/time limits, query logging, timeouts, and a fronting proxy so one bad dashboard or ad-hoc query cannot OOM or stall the whole instance.
- Claude
- ChatGPT
- Advanced
Prometheus Recording Rule Layered Aggregation Prompt
Design a tiered hierarchy of recording rules — raw to job-level to service-level — that precompute hot aggregations once and reuse them, cutting dashboard and alert query cost without creating stale or circular rule dependencies.
- Claude
- ChatGPT
- Intermediate
Prometheus Scrape Timeout & Slow Target Diagnosis Prompt
Diagnose targets that exceed scrape_timeout or return partial data, distinguishing a slow exporter from a slow network from an oversized payload, and fix the cause without simply raising the timeout until scrapes overlap.
- Claude
- ChatGPT
- Intermediate
Prometheus TSDB Snapshot Backup & Restore Prompt
Design a reliable backup and restore procedure for the Prometheus TSDB using the admin snapshot API, object-storage offload, and a tested recovery runbook so you can rebuild a server without silent data loss.
- Claude
- ChatGPT
- Beginner
PromQL Clamp & Bounds Sanitization Review Prompt
Sanitize PromQL expressions that can produce misleading negatives, NaN, Inf, or out-of-range values using clamp, clamp_min/max, and division-guard patterns so dashboards and alerts never display or fire on mathematically impossible numbers.
- Claude
- ChatGPT
- Intermediate
Grafana Dashboard JSON Model Drift Review Prompt
Diff a Grafana dashboard JSON model against its provisioned/source-of-truth version to surface UI-edit drift, hardcoded datasource UIDs, broken variables, and schema-version risks.
- Claude
- ChatGPT
- Beginner
Prometheus Config Reload Validation with promtool Prompt
Validate Prometheus and rule config changes with promtool check before a hot reload, and design a safe reload pipeline that fails closed on bad config.
- Claude
- ChatGPT
- Advanced
Prometheus Out-of-Order Sample Ingestion Tuning Prompt
Configure and tune out-of-order sample ingestion (tsdb.out_of_order_time_window) to accept delayed/backfilled samples without breaking compaction or exploding memory.
- Claude
- ChatGPT
- Intermediate
Prometheus Rule Unit Testing with promtool Prompt
Generate promtool unit test files (test_rules.yml) that assert alert firing, recording-rule output, and label propagation for Prometheus rule groups in CI.
- Claude
- ChatGPT
- Intermediate
Prometheus Target-Down & Scrape Failure Triage Prompt
Systematically triage why a Prometheus target shows `up == 0` or scrape errors, distinguishing network, TLS, auth, relabel-drop, and sample-limit causes from the target's scrape metadata.
- Claude
- ChatGPT
- Advanced
Prometheus TSDB Block & Compaction Tuning Prompt
Tune TSDB block durations, head compaction, and retention so a high-cardinality Prometheus stays within memory and disk budgets without compaction stalls.
- Claude
- ChatGPT
- Advanced
Prometheus WAL & TSDB Corruption Recovery Prompt
Diagnose and safely recover a Prometheus instance that fails to start or crash-loops due to WAL replay errors, corrupt blocks, or a full data directory.
- Claude
- ChatGPT
- Intermediate
PromQL absent_over_time Gap Detection Prompt
Design absent()/absent_over_time() expressions that detect missing metrics, scrape gaps, and label-scoped absence without false-firing during restarts or expected idle periods.
- Claude
- ChatGPT
- Advanced
PromQL Counter-Reset Resilience Review Prompt
Audit rate()/increase() queries for counter-reset handling, extrapolation artifacts, range-window vs scrape-interval mismatches, and double-counting across HA replicas.
- Claude
- ChatGPT
- Beginner
Alertmanager Routing Tree Dry-Run Testing Prompt
Validate an Alertmanager routing tree before deploy by simulating sample alerts through amtool config routes test, catching misrouted pages and unreachable receivers.
- Claude
- Copilot
- Intermediate
Alertmanager Silence Automation via amtool & API Prompt
Automate creating, expiring, and auditing Alertmanager silences around deploys and maintenance windows using amtool and the v2 API, with matchers that don't over-silence.
- Claude
- Cursor
- Intermediate
Dashboard Query to Recording Rule Offload Prompt
Identify slow, repeatedly-evaluated dashboard queries and convert them into precomputed recording rules to cut load times and TSDB read pressure.
- Claude
- Copilot
- Advanced
Exporter Cardinality Budget & Label Allowlisting Prompt
Audit a custom or third-party exporter's emitted metrics, set a per-exporter cardinality budget, and apply label allowlisting via metric_relabel_configs to keep series under control.
- Claude
- ChatGPT
- Intermediate
MetricsQL WITH Templates & Query Optimization Prompt
Refactor verbose, repetitive PromQL into clean, reusable MetricsQL using WITH templates and VictoriaMetrics-specific functions to cut query latency and duplication.
- Claude
- Cursor
- Advanced
Prometheus Remote Write Queue & Backpressure Tuning Prompt
Diagnose remote_write lag, dropped samples, and WAL growth, then tune queue_config shards and batching to stabilize delivery to a long-term backend.
- Claude
- ChatGPT
- Intermediate
PromQL Apdex Score & Latency Satisfaction Prompt
Build an Apdex-style satisfaction score from Prometheus histogram buckets to express latency SLOs in a single user-centric number for dashboards and alerts.
- Claude
- Gemini
- Intermediate
Synthetic Monitoring Multi-Step Journey Checks Prompt
Design scripted multi-step synthetic checks (login, search, checkout) with Grafana Synthetic Monitoring or k6 browser, and wire the results into availability SLOs.
- Claude
- ChatGPT
- Advanced
VictoriaMetrics Cardinality Explorer & TSDB Triage Prompt
Diagnose a VictoriaMetrics cluster suffering from high active time series and churn using the built-in Cardinality Explorer and TSDB status endpoints, then produce a prioritized remediation plan.
- Claude
- ChatGPT
- Intermediate
Alertmanager PagerDuty Receiver Integration Prompt
Wire Prometheus Alertmanager to PagerDuty with correct severity mapping, dedup keys, custom details, and auto-resolve so on-call pages are actionable and noise-free.
- Claude
- ChatGPT
- Intermediate
Grafana k6 Load Test Metrics Dashboard Prompt
Stream k6 load-test results into Prometheus and build a Grafana dashboard that correlates virtual-user load, latency percentiles, error rates, and system saturation during a test run.
- Claude
- ChatGPT
- Advanced
Loki Multi-Tenancy & Retention Design Prompt
Design Grafana Loki tenant isolation, per-tenant retention, and stream/label schema that controls cardinality and cost while keeping logs queryable alongside Prometheus metrics.
- Claude
- ChatGPT
- Advanced
Long-Term Metrics Storage Backend Selection Prompt
Choose between Thanos, Grafana Mimir, and VictoriaMetrics for long-term, scalable Prometheus storage based on your scale, team size, object-storage strategy, and multi-tenancy needs.
- Claude
- ChatGPT
- Advanced
OpenTelemetry Tail Sampling Policy Design Prompt
Design an OpenTelemetry Collector tail-sampling policy that keeps every error and slow trace while cheaply down-sampling healthy traffic, and feeds clean span metrics into Prometheus.
- Claude
- ChatGPT
- Intermediate
Prometheus Scrape & Evaluation Interval Tuning Prompt
Choose scrape_interval and evaluation_interval values that balance alert latency, query resolution, storage cost, and scrape-target load without breaking rate() math.
- Claude
- ChatGPT
- Advanced
Prometheus Staleness & Stale Markers Prompt
Understand and debug Prometheus staleness handling — stale markers, the 5-minute lookback, disappearing targets, and how staleness interacts with alert rules and absent().
- Claude
- ChatGPT
- Intermediate
PromQL offset & Time-Shifted Comparison Prompt
Build week-over-week and day-over-day PromQL comparisons using offset and @ modifiers to surface regressions, seasonality, and anomalous deviations against a known-good baseline.
- Claude
- ChatGPT
- Beginner
Recording Rule Naming Convention Prompt
Adopt the standard level:metric:operations recording-rule naming convention so pre-aggregated series are self-documenting, discoverable, and safe to reuse across dashboards and alerts.
- Claude
- ChatGPT
- Advanced
Alertmanager HA Cluster & Gossip Mesh Design Prompt
Design and debug a highly available Alertmanager cluster — gossip mesh, notification deduplication across replicas, and split-brain avoidance — so alerts fire exactly once during failures.
- Claude
- ChatGPT
- Intermediate
Grafana Notification Policies & Contact Points Design Prompt
Design Grafana Alerting notification policy trees and contact points — label-based routing, nested policies, mute timings, and grouping — so the right team gets paged through the right channel.
- Claude
- ChatGPT
- Intermediate
Grafana SLO Burn-Rate Dashboard Design Prompt
Design a Grafana SLO dashboard that visualizes error-budget remaining, multi-window burn rate, and time-to-exhaustion so stakeholders see reliability health at a glance.
- Claude
- ChatGPT
- Advanced
OpenTelemetry Span Metrics Connector for RED Metrics Prompt
Configure the OpenTelemetry Collector spanmetrics connector to derive RED (rate, errors, duration) metrics from traces and export them to Prometheus without exploding cardinality.
- Claude
- ChatGPT
- Beginner
Prometheus Alert Runbook & Annotation Standardization Prompt
Standardize alert annotations and auto-generate actionable runbooks so every Prometheus alert carries a summary, impact, diagnosis steps, and a remediation link before it ever pages.
- Claude
- ChatGPT
- Beginner
Prometheus TLS Certificate Expiry Monitoring Prompt
Set up Prometheus + blackbox exporter to monitor TLS certificate expiry across endpoints and design tiered alerts that warn before, not after, a cert outage.
- Claude
- ChatGPT
- Intermediate
PromQL topk / bottomk Ranking & Top-N Dashboard Queries Prompt
Build correct, fast PromQL ranking queries with topk, bottomk, and aggregation so dashboards show the noisiest pods, hottest nodes, and worst endpoints without flapping legends.
- Claude
- ChatGPT
- Advanced
VictoriaMetrics Migration from Prometheus Prompt
Plan and execute a migration from vanilla Prometheus to VictoriaMetrics (vmagent, vmstorage, vmalert) for long-term storage and lower resource use — with backfill, PromQL/MetricsQL parity, and rollback.
- Claude
- ChatGPT
- Intermediate
Alertmanager Webhook Receiver Integration Prompt
Build a robust custom webhook receiver for Alertmanager — parsing the v4 payload, handling firing/resolved, verifying signatures, and bridging alerts into ticketing, automation, or chatops safely.
- Claude
- ChatGPT
- Advanced
Grafana Dashboards as Code with Grafonnet Prompt
Generate maintainable, DRY Grafana dashboards as code with Grafonnet/Jsonnet — reusable panel libraries, templated rows, and a CI pipeline that lints and diffs dashboards on every PR.
- Claude
- ChatGPT
- Intermediate
Grafana OnCall Escalation Chain Design Prompt
Design Grafana OnCall escalation chains, schedules, and routing so the right human is paged within minutes, noise is suppressed, and nobody gets woken up for a warning.
- Claude
- ChatGPT
- Advanced
OpenTelemetry Temporality & Prometheus Compatibility Prompt
Reconcile OpenTelemetry's delta vs cumulative temporality with Prometheus's cumulative-only model so OTel metrics don't break rate() and counters don't reset spuriously.
- Claude
- ChatGPT
- Intermediate
Prometheus for & keep_firing_for Tuning Prompt
Tune the `for` (pending) and `keep_firing_for` (resolve hysteresis) clauses on alert rules to kill flapping without delaying real incidents.
- Claude
- ChatGPT
- Advanced
Prometheus Query Frontend & Vertical Sharding Prompt
Speed up slow, heavy PromQL by putting a query-frontend in front of Prometheus/Thanos/Mimir — splitting queries by time, sharding by series, and caching results.
- Claude
- ChatGPT
- Advanced
PromQL Holt-Winters Seasonal Forecasting Prompt
Smooth noisy seasonal metrics and forecast short-term trends with double_exponential_smoothing (named holt_winters before Prometheus 3.0) so alerts account for daily/weekly cycles instead of firing every Monday morning.
- Claude
- ChatGPT
- Advanced
PromQL label_replace & label_join Rewriting Prompt
Reshape, normalize, and synthesize labels at query time with label_replace() and label_join() so heterogeneous metrics join cleanly and dashboards stay readable without re-instrumenting exporters.
- Claude
- ChatGPT
- Intermediate
SLI Specification & SLO Menu Design Prompt
Define meaningful SLIs and set defensible SLO targets from user journeys — choosing the right event ratio, window, and target before any burn-rate alerting exists.
- Claude
- ChatGPT
- Intermediate
Alertmanager Time Intervals & Mute Schedules Prompt
Design business-hours routing, maintenance-window muting, and follow-the-sun on-call handoffs in Alertmanager using time_intervals, mute_time_intervals, and active_time_intervals — without dropping real pages.
- Claude
- ChatGPT
- Intermediate
kube-state-metrics & cAdvisor Alerting Prompt
Build the essential Kubernetes workload alerts from kube-state-metrics and cAdvisor — CrashLoopBackOff, OOMKills, pending pods, throttling, and PVC pressure — with correct joins and no double-paging.
- Claude
- ChatGPT
- Intermediate
Loki LogQL Metric Queries & Log-Based Alerts Prompt
Turn logs into Prometheus-style signals with LogQL metric queries and Loki's ruler — extracting numbers from unstructured logs, counting error patterns, and firing alerts on log-derived rates without instrumenting the app.
- Claude
- ChatGPT
- Advanced
OpenTelemetry Collector to Prometheus Pipeline Prompt
Design an OpenTelemetry Collector pipeline that ingests OTLP metrics and exposes or remote-writes them to Prometheus/Mimir cleanly — handling delta-to-cumulative, resource attributes, naming normalization, and cardinality at the collector edge.
- Claude
- ChatGPT
- Advanced
Prometheus Agent Mode Deployment Prompt
Deploy Prometheus in Agent mode as a lightweight, scrape-and-remote-write-only collector feeding a central Mimir/Thanos/Cortex backend — sizing, WAL tuning, sharding, and the tradeoffs vs. full Prometheus.
- Claude
- ChatGPT
- Advanced
Prometheus Exemplars & Trace Correlation Prompt
Wire Prometheus exemplars end-to-end so a spike on a latency histogram links directly to the slow trace in Tempo — covering instrumentation, OpenMetrics exposition, storage, and Grafana exemplar links.
- Claude
- ChatGPT
- Advanced
PromQL Anomaly Detection & Z-Score Alerting Prompt
Build statistical anomaly-detection alerts in pure PromQL — z-score deviation from a rolling baseline, week-over-week seasonal comparison, and MAD-based outlier detection — so you catch weird behavior static thresholds miss.
- Claude
- ChatGPT
- Advanced
PromQL predict_linear Capacity Forecasting Prompt
Build predictive PromQL alerts that fire BEFORE disks fill, certificates expire, or quotas exhaust — using predict_linear, deriv, and seasonal-aware windows instead of static thresholds.
- Claude
- ChatGPT
- Intermediate
Tempo TraceQL Query Design Prompt
Write precise TraceQL queries to find slow, errored, or anomalous traces in Grafana Tempo — using span/resource attribute filters, structural operators, aggregates, and metrics-from-traces — instead of guessing in trace search.
- Claude
- ChatGPT
- Intermediate
Alertmanager Notification Templates Prompt
Write reusable Go notification templates for Alertmanager — custom subject/body for email, PagerDuty, webhooks, and generic receivers — with clean iteration over grouped alerts and safe defaults.
- Claude
- ChatGPT
- Intermediate
Prometheus Client Instrumentation Prompt
Instrument an application with a Prometheus client library — choosing counters/gauges/histograms/summaries, label design, the RED/USE methods, and avoiding cardinality and naming mistakes at the source.
- Claude
- ChatGPT
- Advanced
Prometheus Federation Hierarchy Prompt
Design a hierarchical or cross-service Prometheus federation topology — global aggregation, per-datacenter shards, /federate match[] selectors, and the trade-offs versus remote-write.
- Claude
- ChatGPT
- Beginner
Prometheus Metric Naming Conventions Prompt
Define and enforce a metric and label naming standard across teams — base units, suffixes, namespacing, label conventions, and a CI linter to keep new metrics consistent.
- Claude
- ChatGPT
- Advanced
Prometheus Relabeling Rules Prompt
Author and debug relabel_configs and metric_relabel_configs to filter targets, rewrite labels, drop expensive series, and normalize metadata before and after scraping.
- Claude
- ChatGPT
- Intermediate
Prometheus Rule Group Evaluation Order Prompt
Structure recording and alerting rule groups so dependent rules evaluate in the right order, intervals are sized correctly, and evaluation latency stays bounded.
- Claude
- ChatGPT
- Advanced
PromQL Vector Matching & Joins Prompt
Master many-to-one and one-to-many PromQL joins using on, ignoring, group_left, and group_right to enrich metrics with metadata or combine series across metric names.
- Claude
- ChatGPT
- Advanced
SNMP Exporter Network Metrics Prompt
Configure the Prometheus SNMP Exporter to monitor switches, routers, firewalls, and UPSes — generator.yml modules, MIB walks, auth (v2c/v3), and mapping OIDs to clean labeled metrics.
- Claude
- ChatGPT
- Advanced
Thanos Compactor & Downsampling Prompt
Configure and troubleshoot the Thanos Compactor — compaction levels, 5m/1h downsampling, retention per resolution, and the deduplication and halt pitfalls that corrupt object storage.
- Claude
- ChatGPT
- Advanced
Alertmanager Inhibition & Silence Strategy Prompt
Design inhibition rules and silences that suppress downstream noise — when a node dies, don't also page for every pod on it — without ever muting the alert that actually matters.
- Claude
- ChatGPT
- Advanced
Grafana Mimir Multi-Tenant Operations Prompt
Operate Grafana Mimir at scale — tenant isolation, per-tenant limits, ingester/store-gateway sharding, compaction, and remote-write onboarding without one tenant starving the rest.
- Claude
- ChatGPT
- Beginner
node_exporter Textfile Collector Prompt
Expose custom host-level metrics — backup freshness, cert expiry, cron job results, hardware checks — through the node_exporter textfile collector with correct format, atomic writes, and staleness handling.
- Claude
- ChatGPT
- Advanced
Prometheus Metric Cardinality Control Prompt
Find, quantify, and kill the high-cardinality label combinations that bloat your TSDB, blow up memory, and slow queries — then put guardrails in place so it never regresses.
- Claude
- ChatGPT
- Intermediate
Prometheus Dead-Man's-Switch & absent() Alerts Prompt
Build the alerts that fire when metrics STOP arriving — scrape failures, missing targets, silent exporters, and a watchdog that proves your whole alerting pipeline is alive end to end.
- Claude
- ChatGPT
- Advanced
Prometheus Native Histograms Migration Prompt
Plan and execute the move from classic bucketed histograms to native (sparse) histograms — instrumentation changes, dual-emit rollout, query rewrites, and the storage/accuracy tradeoffs.
- Claude
- ChatGPT
- Intermediate
Prometheus Operator ServiceMonitor & PodMonitor Prompt
Author and debug ServiceMonitor/PodMonitor/PrometheusRule CRDs for the Prometheus Operator so scrapes actually get discovered, with the right label/namespace selectors and relabeling.
- Claude
- ChatGPT
- Intermediate
Prometheus Pushgateway for Batch Jobs Prompt
Instrument short-lived and batch/cron jobs with the Pushgateway correctly — grouping keys, the right metrics to push, lifecycle cleanup, and alerts that catch a job that never ran.
- Claude
- ChatGPT
- Advanced
PromQL Subqueries & *_over_time Aggregation Prompt
Master PromQL subqueries and the *_over_time family to compute rolling maxima, percentiles of a rate, trends, and 'has it ever crossed X in the last hour' — without melting your query engine.
- Claude
- ChatGPT
- Intermediate
Grafana Version Upgrade & Migration Prompt
Upgrade Grafana major versions — DB migrations, plugin compatibility, deprecated features, alert migration.
- Claude
- ChatGPT
- Intermediate
Loki Log Aggregation Design Prompt
Design Loki log aggregation — single-binary vs distributed, retention, label strategy, LogQL queries, multi-tenancy.
- Claude
- ChatGPT
- Intermediate
Grafana Alloy Agent Configuration Prompt
Configure Grafana Alloy (formerly Grafana Agent): unified collector for metrics, logs, traces; Alloy configuration syntax (formerly River); component pipeline.
- Claude
- ChatGPT
- Advanced
SLO Error Budget & Multi-Window Burn Rate Alerts Prompt
Design SLO-based alerts — error budgets, multi-burn-rate alerting, SLI selection, burn budget calculation.
- Claude
- ChatGPT
- Advanced
Custom Prometheus Exporter Design Prompt
Design and write a custom Prometheus exporter — client library, metric types, registration, scrape efficiency.
- Claude
- ChatGPT
- Advanced
Grafana Pyroscope Continuous Profiling Prompt
Add continuous profiling with Pyroscope — flame graphs in Grafana, language SDKs, push vs pull, sampling overhead.
- Claude
- ChatGPT
- Intermediate
Blackbox Exporter Probe Configuration Prompt
Configure blackbox_exporter for HTTP, TCP, ICMP, DNS probes — uptime monitoring, certificate expiry, response validation.
- Claude
- ChatGPT
- Advanced
Grafana HA & Database Backend Tuning Prompt
Run Grafana in HA — multiple replicas, shared database (PostgreSQL/MySQL), session storage, Alertmanager cluster integration.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Provisioning as Code Prompt
Provision Grafana — data sources, dashboards, alerts via file provisioning, dashboards as code, sidecar pattern in Kubernetes.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Snapshots, Reports & Sharing Prompt
Share Grafana dashboards — snapshots (anonymous), PDF reports, scheduled email reports, public dashboards, embed.
- Claude
- ChatGPT
Open prompt - Beginner
Grafana Playlists & Kiosk Mode Prompt
Set up Grafana playlists for NOC dashboards — rotating views, kiosk mode, TV-friendly displays, auto-refresh.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Unified Alerting Prompt
Configure Grafana's unified alerting — contact points, notification policies, mute timings, multi-dimensional alerts, alert state.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Service Accounts & API Tokens Prompt
Manage Grafana service accounts and API tokens — automation access, scoping, rotation, replacing legacy API keys.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Templating & Variables Design Prompt
Design Grafana variables — query variables, custom, interval, chained, multi-value, regex; debug missing values and slow loads.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Dashboard Performance Prompt
Optimize Grafana dashboards — query parallelism, refresh rates, variable design, panel count, data source pressure.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana RBAC, Teams & Folder Permissions Prompt
Design Grafana access control — folders, teams, role-based permissions, viewer vs editor, dashboard / folder permissions.
- Claude
- ChatGPT
Open prompt - Advanced
Grafana SSO / SAML / OIDC Integration Prompt
Configure and debug Grafana auth — SAML, OIDC, OAuth, LDAP; role mapping, group sync, just-in-time provisioning.
- Claude
- ChatGPT
Open prompt - Advanced
Prometheus Performance Tuning Prompt
Tune Prometheus performance — head series, memory, query timeout, max samples, ingestion rate, expensive queries.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Plugin Installation & Management Prompt
Install, manage, and troubleshoot Grafana plugins — panel plugins, data source plugins, signing, sandbox, version control.
- Claude
- ChatGPT
Open prompt - Advanced
Prometheus HA & Deduplication Prompt
Run Prometheus in HA — paired servers, deduplication strategies (Thanos query, Alertmanager cluster, federation), failover.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Mixed Data Sources Panel Prompt
Build panels combining multiple data sources — Mixed DS, cross-DS variables, correlated queries from Prometheus + Loki + cloud metrics.
- Claude
- ChatGPT
Open prompt - Intermediate
Prometheus Scrape Config & Service Discovery Prompt
Configure Prometheus scrape targets — kubernetes_sd, ec2_sd, file_sd, consul_sd, relabeling, scrape interval tuning.
- Claude
- ChatGPT
Open prompt - Advanced
Grafana Tempo Distributed Tracing Prompt
Visualize traces in Grafana — Tempo data source, service graph, span metrics, trace search, OTLP integration.
- Claude
- ChatGPT
Open prompt - Advanced
Thanos Architecture & Component Debug Prompt
Operate Thanos — Sidecar, Receive, Store Gateway, Compactor, Querier, Ruler; troubleshoot dedup, downsampling, S3 issues.
- Claude
- ChatGPT
Open prompt - Advanced
Grafana Loki + Prometheus Correlation Prompt
Correlate metrics and logs in Grafana — exemplars from Prometheus to traces, derived fields from Loki, jump from spike to log line.
- Claude
- ChatGPT
Open prompt - Advanced
Prometheus Remote Write & Long-term Storage Prompt
Configure remote write to long-term storage — Thanos Receive, Cortex/Mimir, VictoriaMetrics, troubleshoot queue/backlog/back-pressure.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Time Series Best Practices Prompt
Build readable time series panels — units, legends, axis scaling, fill / stack mode, overrides, color choices.
- Claude
- ChatGPT
Open prompt - Intermediate
Prometheus Storage, Retention & TSDB Prompt
Configure Prometheus TSDB — retention, block size, compaction, WAL, disk sizing, troubleshooting OOM / disk-full.
- Claude
- ChatGPT
Open prompt - Intermediate
Alert Fatigue Reduction Strategy Prompt
Reduce alert fatigue — SLO-based alerts vs symptom-based, severity tiers, runbook integration, deprecating noisy alerts.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Logs Panel & Derived Fields Prompt
Use the Grafana Logs panel — Loki queries, derived fields (link to traces), log volume panel, streaming logs.
- Claude
- ChatGPT
Open prompt - Intermediate
Alertmanager Routing, Grouping & Receivers Prompt
Design Alertmanager routes — receivers (Slack, PagerDuty), grouping, inhibition, repeat intervals, mute timings.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Heatmap & Histogram Visualization Prompt
Configure Grafana heatmaps for latency distribution — bucket binning, classic vs new heatmap, histogram source data.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Stat Panels, Thresholds & SLA Visualization Prompt
Design stat panels with threshold colors, SLA compliance visualization, multi-value stat layouts.
- Claude
- ChatGPT
Open prompt - Intermediate
PromQL rate() vs increase() vs irate() Prompt
Use Prometheus counter functions correctly — rate vs increase vs irate, counter resets, window size choice.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Transformations Design Prompt
Use Grafana transformations — join queries, calculate fields, filter, rename, group by, organize columns; combine data without changing queries.
- Claude
- ChatGPT
Open prompt - Intermediate
PromQL Histogram & Quantile Calculation Prompt
Use Prometheus histograms correctly — histogram_quantile, bucket bounds, p99 latency calculation, histogram vs summary, native histograms.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Annotations & Event Overlays Prompt
Add deploy markers, incident events, scheduled maintenance overlays to Grafana dashboards — query-based, manual, tag filtering.
- Claude
- ChatGPT
Open prompt - Intermediate
PromQL Recording Rules Design Prompt
Design Prometheus recording rules — naming convention, evaluation interval, when to use them, retention, multi-cluster patterns.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Panel Types Selection Prompt
Choose the right Grafana panel — timeseries vs stat vs gauge vs bar gauge vs heatmap vs table; visualization principles for each.
- Claude
- ChatGPT
Open prompt - Advanced
PromQL Query Optimization Prompt
Diagnose slow PromQL queries — cardinality explosion, range vector traps, sum vs avg pitfalls, query timeouts, recording-rule opportunities.
- Claude
- ChatGPT
Open prompt - Intermediate
Grafana Dashboard Query Builder Prompt
Generate PromQL and Grafana panel JSON for service dashboards (RED, USE, golden signals).
- Claude
- ChatGPT
Open prompt - Intermediate
Prometheus Alert Rule Generator Prompt
Generate production-quality Prometheus alerting rules with sensible thresholds, labels, and runbook annotations.
- Claude
- ChatGPT
Open prompt
Guides
- · 9 min read
Prometheus Error Guide: 'alertmanager failed to join cluster' Gossip Failure
Fix Alertmanager 'failed to join cluster': open port 9094 TCP+UDP, set --cluster.advertise-address, and stop duplicate notifications from a non-converged gossip cluster.
Read guide - · 9 min read
Prometheus Error Guide: Alert Stuck 'Pending' and Never Firing
Fix Prometheus alerts stuck in Pending or missing from /alerts: tune for and evaluation_interval, verify the expression returns series, and check rule loading and silences.
Read guide - · 9 min read
Prometheus Error Guide: 'binary expression must contain only scalar and instant vector types' Type Mismatch
Fix PromQL 'binary expression must contain only scalar and instant vector types' errors: wrap range vectors in rate(), use scalar(), and add on()/ignoring() matching.
Read guide - · 9 min read
Prometheus Error Guide: 'compaction failed' TSDB Block Corruption
Fix Prometheus 'compaction failed' errors: remove corrupt blocks, free disk space, recover from unclean shutdowns, and restore from snapshots without losing your TSDB.
Read guide - · 9 min read
Prometheus Error Guide: 'Error loading config (--config.file=/etc/prometheus/prometheus.yml)' Reload Failure
Fix Prometheus 'Error loading config' and HTTP 400 reload failures: validate YAML with promtool, enable web lifecycle, and resolve indentation, regex, and env var issues.
Read guide - · 9 min read
Prometheus Error Guide: 'duplicate sample for timestamp' Colliding Label Sets
Fix Prometheus 'duplicate sample for timestamp' errors: dedupe exporters exposing repeated series, add unique instance/job labels, and stop relabeling collapsing label sets.
Read guide - · 9 min read
Prometheus Error Guide: 'found multiple scrape configs with job name' Duplicate Job
Fix Prometheus 'found multiple scrape configs with job name' errors: locate colliding job_names across included files, dedupe scrape configs, and validate with promtool.
Read guide - · 9 min read
Prometheus Error Guide: 'Empty query result' / No Data for an Existing Metric
Fix Prometheus 'Empty query result' and 'No data' when a metric should exist: label typos, stale series, stopped targets, lookback delta, and short rate() ranges.
Read guide - · 9 min read
Prometheus Error Guide: 'exceeded maximum resolution of 11000 points per timeseries' Range Query Resolution
Fix Prometheus 'exceeded maximum resolution of 11000 points' by raising the step, setting a Grafana min interval, and using recording rules for wide range queries.
Read guide - · 9 min read
Prometheus Error Guide: 'found duplicate series for the match group' Vector Matching Failure
Fix the PromQL 'found duplicate series for the match group' error: add group_left/group_right for many-to-one joins, or deduplicate a non-unique one-side.
Read guide - · 9 min read
Prometheus Error Guide: 'invalid metric type' Scrape Parse Failure
Fix Prometheus 'invalid metric type' scrape parse errors: correct misspelled # TYPE tokens, serve OpenMetrics types with the right content-type, and validate with promtool.
Read guide - · 9 min read
Prometheus Error Guide: 'found error when loading rules' Invalid Rule Group
Fix Prometheus 'found error when loading rules' and 'could not parse expression' failures: validate PromQL, fix templating, dedupe rule names, and unit-test with promtool.
Read guide - · 9 min read
Prometheus Error Guide: 'label_limit exceeded' Target Scrape Rejected
Fix Prometheus 'label_limit exceeded' scrape failures: find the offending exporter, drop or relabel oversized labels, and raise label_limit safely without target downtime.
Read guide - · 9 min read
Prometheus Error Guide: 'lock DB directory: resource temporarily unavailable' Startup Failure
Fix Prometheus 'lock DB directory: resource temporarily unavailable' at startup: find and stop the second process holding the TSDB lock file before restarting.
Read guide - · 9 min read
Prometheus Error Guide: 'node_exporter permission denied collecting' Collector Failure
Fix node_exporter 'permission denied' collector errors: relax the systemd sandbox, fix textfile ownership, add bind mounts, or disable collectors you don't need.
Read guide - · 9 min read
Prometheus Error Guide: 'OOMKilled' (exit 137) High Memory Crashes
Fix Prometheus OOMKilled (exit 137) and out-of-memory crashes: cut cardinality, drop labels, add recording rules, size memory limits, and shard before the pod dies again.
Read guide - · 9 min read
Prometheus Error Guide: 'out of bounds' Sample Too Old or Too Far in the Future
Fix Prometheus 'out of bounds' ingestion errors: correct target clock skew, enable the out-of-order window, and backfill old data with promtool instead of remote-writing it.
Read guide - · 9 min read
Prometheus Error Guide: 'parse error: unexpected' PromQL Syntax Errors
Fix PromQL 'parse error: unexpected character/identifier' and 'no arguments for aggregate expression' errors: unbalanced brackets, range selectors, and aggregation syntax.
Read guide - · 9 min read
Prometheus Error Guide: 'rate should only be used with counters' Non-Counter rate() Misuse
Fix Prometheus 'metric might not be a counter (used with rate)' info and nonsensical rate() values: apply rate() to counters only, use deriv()/delta() for gauges.
Read guide - · 9 min read
Prometheus Error Guide: 'remote_write server returned HTTP status 500' Receiver Failure
Fix Prometheus remote_write 500 errors: the receiver (Mimir, Thanos Receive, Cortex) is broken — check ingesters, object storage, and proxy timeouts, not Prometheus.
Read guide - · 9 min read
Prometheus Error Guide: 'rule manager error evaluating rule' Runtime Evaluation Failure
Fix Prometheus rule manager 'Evaluating rule failed' errors: dedupe vector matches, make recording-rule labelsets unique, and tame heavy or many-to-one queries.
Read guide - · 9 min read
Prometheus Error Guide: 'server returned HTTP status 401 Unauthorized' Scrape Auth
Fix Prometheus scrape '401 Unauthorized' and '403 Forbidden' errors: configure basic_auth, bearer_token, authorization, fix kubelet RBAC, and rotate expired tokens.
Read guide - · 9 min read
Prometheus Error Guide: 'connect: connection refused' Scrape Target DOWN
Fix Prometheus scrape 'connection refused', 'connection reset by peer', and 'no route to host' errors: diagnose dead exporters, wrong ports, firewalls, and bind addresses.
Read guide - · 9 min read
Prometheus Error Guide: 'scrape sample limit exceeded' Target Down on Cardinality
Fix Prometheus 'sample limit exceeded' target-down errors: count exposed series, identify high-cardinality exporters, drop noisy metrics, and raise sample_limit safely.
Read guide - · 9 min read
Prometheus Error Guide: 'x509: certificate signed by unknown authority' Scrape TLS
Fix Prometheus 'x509: certificate signed by unknown authority' and 'certificate is valid for X, not Y' scrape errors: set tls_config ca_file, server_name, and renew expired certs.
Read guide - · 9 min read
Prometheus Error Guide: 'x509: certificate has expired or is not yet valid' Scrape Failure
Fix Prometheus scrape failures from an expired or not-yet-valid TLS cert: confirm the clock, inspect the target cert with openssl, and rotate it — don't skip verification.
Read guide - · 9 min read
Prometheus Error Guide: 'up == 0' Target DOWN Triage Hub
Fix any Prometheus target showing DOWN with up == 0: triage with the Targets and Service Discovery pages, read the last scrape error, and route to the right root-cause guide.
Read guide - · 9 min read
Prometheus Error Guide: 'no space left on device' TSDB Disk Full
Fix Prometheus 'no space left on device' TSDB errors: set retention size and time caps, free the data dir, cut cardinality, grow the disk, and offload long-term to remote write.
Read guide - · 9 min read
Prometheus Error Guide: 'replaying WAL' Slow Startup and Not-Ready Failure
Fix slow Prometheus 'replaying WAL' startup: stop the restart loop, switch a killing livenessProbe to a startupProbe, add memory headroom, and shrink the head.
Read guide - · 10 min read
Alertmanager Grouping Timers: group_wait, group_interval, and repeat_interval
The three Alertmanager grouping timers are constantly confused. Here's what each one actually controls and how to tune them so pages batch sensibly without re-paging noise.
Read guide - · 10 min read
The $__rate_interval Trap: Why Grafana rate() Panels Lie When You Zoom
Grafana rate() panels that go flat when you zoom in are almost always using the wrong interval variable. Here's why $__rate_interval exists and when to use it.
Read guide - · 10 min read
metric_relabel_configs as a Cardinality Firewall
metric_relabel_configs drops noisy series at ingest before they ever reach the TSDB. Here's how to build a drop list that cuts cardinality without breaking alerts.
Read guide - · 11 min read
Native Histograms vs Classic Buckets: Getting Quantiles You Can Trust
Prometheus native histograms promise better percentiles than fixed buckets. Here's how their accuracy actually differs, and how to query them without carrying classic-histogram habits.
Read guide - · 11 min read
OpenTelemetry Collector Backpressure: memory_limiter, batch, and Queues
The OTel Collector OOMs for fixable reasons rooted in processor order and queue sizing. Here's how memory_limiter, batch, and the exporter queue interact under load.
Read guide - · 10 min read
Prometheus Error Guide: 'context deadline exceeded' Alertmanager Notifications Failing
Fix Alertmanager notification failures: SMTP errors, webhook timeouts, 'context deadline exceeded', and silent drops. Diagnose receivers, routing, and config reloads.
Read guide - · 10 min read
Prometheus Error Guide: 'context deadline exceeded' Scrape Timeout
Fix the Prometheus 'context deadline exceeded' scrape error: diagnose slow targets, low scrape_timeout, large /metrics payloads, DNS latency, and TLS handshake delays.
Read guide - · 9 min read
Prometheus Error Guide: 'Bad Gateway' Grafana Datasource Error / No Data
Fix Grafana 'Bad Gateway', Prometheus datasource errors, and 'No data' panels: diagnose proxy/URL config, time ranges, label mismatches, and query step issues.
Read guide - · 9 min read
Prometheus Error Guide: 'many-to-many matching not allowed' PromQL Vector Matching
Fix the PromQL 'many-to-many matching not allowed' and 'found duplicate series' errors: diagnose mismatched labels, missing on()/ignoring(), and group_left/group_right.
Read guide - · 10 min read
Prometheus Error Guide: 'opening storage failed' TSDB / WAL Corruption
Fix Prometheus 'opening storage failed' TSDB and WAL corruption: diagnose unclean shutdowns, full disks, OOM kills, and recover with WAL repair or block removal.
Read guide - · 9 min read
Prometheus Error Guide: 'out of order sample' Duplicate Sample Ingestion
Fix Prometheus 'out of order sample' and 'duplicate sample for timestamp' ingestion errors: diagnose clock skew, duplicate targets, label collisions, and OOO window settings.
Read guide - · 10 min read
Prometheus Error Guide: 'query timed out' Too Many Samples Loaded
Fix Prometheus 'query timed out' and 'query processing would load too many samples' errors: diagnose high cardinality, wide ranges, expensive PromQL, and query limits.
Read guide - · 10 min read
Prometheus Error Guide: 'remote_write 429' Server Returned HTTP Status 400
Fix Prometheus remote_write errors: 429 rate limits, 400 bad request, and 'server returned HTTP status' failures. Diagnose backpressure, label limits, and queue tuning.
Read guide - · 9 min read
Prometheus Error Guide: 'too many open files' File Descriptor Limit
Fix the Prometheus 'too many open files' error: diagnose low ulimit, leaked connections, high target counts, and TSDB block fan-out. Raise nofile and verify limits.
Read guide - · 10 min read
Protecting the Prometheus Read Path: max-samples, timeout, and Concurrency
One runaway query can OOM a shared Prometheus and take monitoring down for everyone. Here's how query.max-samples, timeout, and concurrency limits make queries fail safely.
Read guide - · 10 min read
quantile_over_time vs histogram_quantile: Which Percentile to Trust
Two PromQL functions compute percentiles in completely different ways, and picking the wrong one gives a confidently wrong number. Here's how to choose and verify.
Read guide - · 11 min read
Thanos Store Gateway Caching Tiers Explained
The Thanos Store Gateway lives or dies by three caches: index-header, index cache, and the caching bucket. Here's what each holds and how to size them without OOMing.
Read guide - · 11 min read
Cutting Cardinality at Ingest With vmagent Stream Aggregation
VictoriaMetrics stream aggregation collapses high-cardinality series into aggregates before storage. Here's how to design rules that save space without breaking queries.
Read guide - · 11 min read
Debugging Prometheus Relabeling Drops With AI Without Guessing
AI is great at reasoning through relabel_configs, but it can't see your live targets. How I use it to debug dropped Prometheus scrape targets safely.
Read guide - · 10 min read
Reviewing AI-Generated Grafana Alert Rules Before They Go Live
Grafana's unified alerting hides real complexity behind a friendly UI. How I review AI-generated Grafana alert rules so they don't fire wrong or stay silent.
Read guide - · 11 min read
AI Instrumentation Review: Catching Label Explosions at Code Time
Cardinality bombs are born in application code, not Prometheus. How I use AI to review instrumentation before high-cardinality labels ever reach the TSDB.
Read guide - · 11 min read
Building Incident Timelines From Prometheus Data With AI
AI can assemble a postmortem timeline from Prometheus metrics in minutes, but it can also invent causality. How I build accurate, evidence-backed timelines.
Read guide - · 10 min read
Catching PromQL Unit Mistakes With AI Before They Mislead
Bytes vs bits, seconds vs milliseconds, ratios vs percentages — PromQL unit bugs are silent and dangerous. How I use AI to catch them before they ship.
Read guide - · 10 min read
Enriching Prometheus Alert Annotations With Live Query Context
An alert that says only what fired wastes on-call time. How I use AI to write annotation templates that pull live PromQL context into every page.
Read guide - · 10 min read
Generating Blackbox Exporter Probe Configs With AI Safely
The Prometheus blackbox exporter is fiddly YAML that AI writes fast. How I generate probe modules and scrape configs without shipping false-green checks.
Read guide - · 11 min read
Migrating Nagios Checks to Prometheus Alerts With AI
AI can translate hundreds of Nagios checks to Prometheus alert rules fast, but a naive port recreates years of alert noise. How I migrate without the rot.
Read guide - · 11 min read
Unit Testing Prometheus Alert Rules With Promtool and AI
AI can write promtool unit tests for your alert rules in seconds, but only you can decide what they should prove. How I generate and review alert rule tests.
Read guide - · 12 min read
What Is Infrastructure Observability? A 2026 Guide
What infrastructure observability is, how it differs from monitoring, the core signals (metrics, logs, traces), and how to implement it without drowning in data.
Read guide - · 9 min read
Alertmanager Inhibition Rules and Silences Done Right
Stop alert storms with Alertmanager inhibit_rules and silences. Real source/target matcher YAML, amtool commands, expiring silences, and review tips.
Read guide - · 9 min read
Detecting Dead Targets in Prometheus with absent() and Staleness Markers
How to alert when a Prometheus metric stops existing using absent(), absent_over_time(), and up==0, plus the staleness rules that silently break no-data alerts.
Read guide - · 10 min read
Enforcing Tenant Labels in Multi-Tenant Prometheus and Mimir
How to inject and validate tenant/team labels with relabel_configs, write_relabel_configs, and X-Scope-OrgID so cost attribution and access control hold up.
Read guide - · 10 min read
Grafana Dashboards as Code with Grafonnet: A GitOps Workflow That Scales
Stop hand-editing dashboard JSON. Define Grafana panels and templating as Grafonnet code, generate JSON with jsonnet, provision via Git, and review diffs in CI.
Read guide - · 10 min read
Prometheus Federation vs Remote-Write: Which to Use and When
Federation aggregates recording-rule outputs across teams; remote-write centralizes raw series. Learn which Prometheus pattern fits, with real configs.
Read guide - · 11 min read
Prometheus TSDB Internals: Head Block, WAL, Compaction & Retention Explained
A deep dive into Prometheus TSDB internals — the head block, WAL, on-disk blocks, compaction and retention — with PromQL, flags, and disk sizing tips.
Read guide - · 10 min read
PromQL rate() vs irate() vs increase(): When Each One Lies to You
A working SRE's guide to PromQL rate, irate, and increase on counters: extrapolation, lookback gotchas, when each misleads, and reviewing AI-drafted queries.
Read guide - · 10 min read
PromQL Subqueries and _over_time: Trend Analysis Without the Guesswork
A practical guide to PromQL subqueries and the _over_time family for spotting trends, slow leaks, and daily peaks, plus why recording rules often win.
Read guide - · 10 min read
Scaling Prometheus Scraping: Functional Sharding, Hashmod, and Agent Mode
Scale Prometheus scraping horizontally with functional sharding, hashmod scrape sharding, and Agent Mode. Real relabel configs, agent-mode flags, and tradeoffs.
Read guide - · 11 min read
AI-Assisted PromQL for Latency Percentiles That Don't Lie
histogram_quantile trips up everyone. How I use AI to write correct p95/p99 latency queries and avoid the aggregation traps that quietly fake your SLOs.
Read guide - · 10 min read
AI-Assisted Recording Rules: Turning Slow PromQL Into Fast Dashboards
Heavy PromQL queries hammer Prometheus and lag dashboards. How I use AI to find expensive expressions and refactor them into correct, fast recording rules.
Read guide - · 9 min read
Using AI to Build a Runbook Annotation Library for Your Alerts
Every alert should link a runbook, but most don't because writing them is tedious. How I use AI to draft alert annotations and runbooks useful at 3am.
Read guide - · 10 min read
Debugging 'No Data' and Silently-Broken Prometheus Alerts With AI
An alert that never fires feels safe and is the most dangerous kind. How I use AI to diagnose no-data alerts, stale series, and rules that quietly broke.
Read guide - · 12 min read
Humanizing Artificial Intelligence in Metrics Analysis: Turning Raw Time-Series Into Clear DevOps Answers
How AI turns raw Prometheus metrics, PromQL, and Grafana dashboards into clear, plain-English answers about what changed and why — with a human still in control.
Read guide - · 10 min read
Investigating a Prometheus Cardinality Spike With AI as Your Co-Investigator
A cardinality explosion can OOM Prometheus overnight. How I use AI to find the offending label, trace its source, and design a relabel fix without guessing.
Read guide - · 11 min read
Refactoring Legacy Threshold Alerts to Burn-Rate Alerts With AI
Old 'error rate over 1% for 5m' alerts page too much and catch too little. How I use AI to migrate threshold alerts to SLO burn-rate alerting safely.
Read guide - · 10 min read
How to Review AI-Generated Prometheus Alert Rules Before They Page
AI writes alert rules in seconds, but a bad rule pages you at 3am or hides an outage. The review checklist I run on every AI-generated Prometheus alert.
Read guide - · 11 min read
Turning Plain-English SLO Requirements Into PromQL With AI
Your SLO lives in a doc as English prose. How I use AI to translate '99.9% of checkouts succeed' into correct SLI queries, budgets, and burn-rate alerts.
Read guide - · 10 min read
Using AI to Untangle an Inherited PromQL Query
Inherited a 200-character PromQL one-liner with no comments? How I use AI to decompose, explain, and safely refactor gnarly queries without breaking dashboards.
Read guide - · 12 min read
Best AI Tools for SRE Teams in 2026 (A Practitioner's Guide)
A practical roundup of the AI tools that actually help SRE teams in 2026 — for incident response, PromQL, postmortems, toil reduction, and IaC review.
Read guide - · 9 min read
Capacity Planning With Prometheus Queries That Predict
Most teams find out they're out of capacity when it's already a 3am page. These PromQL patterns turn your existing metrics into forecasts of when you'll run out of headroom.
Read guide - · 8 min read
Continuous Profiling With Pyroscope Alongside Prometheus
Metrics tell you a service is slow or hungry; profiling tells you which line of code is to blame. Here's how Grafana Pyroscope adds the fourth pillar next to your Prometheus stack.
Read guide - · 8 min read
Metric Naming Standards That Keep Prometheus Sane
Inconsistent metric names turn dashboards and alerts into archaeology. A naming convention for units, suffixes, and labels makes every metric predictable and queryable.
Read guide - · 9 min read
Multi-Window Burn-Rate Alerts for SLOs That Work
Single-threshold error alerts either page too late or too often. Multi-window multi-burn-rate alerting catches fast disasters and slow leaks without crying wolf. Here's the PromQL.
Read guide - · 8 min read
Prometheus Exemplars and Trace Links: Metrics to Traces
A latency spike on a dashboard tells you something is slow but not which request. Exemplars bridge metrics to traces so one click jumps from a p99 bump to the exact slow trace.
Read guide - · 9 min read
Prometheus Operator and kube-prometheus-stack Explained
Stop hand-editing prometheus.yml in Kubernetes. The Prometheus Operator turns scrape config and alerts into CRDs. Here's how ServiceMonitors and the stack actually fit together.
Read guide - · 9 min read
Prometheus Scrape Config and Relabeling Deep Dive
Relabeling is the most powerful and most confusing part of Prometheus. Master relabel_configs and metric_relabel_configs to control targets, labels, and cardinality.
Read guide - · 9 min read
Running Grafana Mimir at Scale: Multi-Tenant Metrics
Mimir promises a billion active series and multi-tenancy, but its microservices sprawl bites teams that deploy it naively. Here's how to run it without drowning in components.
Read guide - · 9 min read
VictoriaMetrics vs Prometheus: When to Switch and Why
Prometheus is the default, but at scale its memory appetite and single-node TSDB start to hurt. Here's an honest comparison with VictoriaMetrics and when to migrate.
Read guide - · 9 min read
Distributed Tracing With Grafana Tempo Alongside Prometheus
Metrics tell you something is slow; traces tell you where. Here's how to run Grafana Tempo next to Prometheus and use exemplars to jump from a latency spike to the exact trace.
Read guide - · 9 min read
Instrumenting Services With the OpenTelemetry Collector for Prometheus
The OpenTelemetry Collector is the most useful box in a modern monitoring stack — and the easiest to misconfigure. Here's how to wire it into Prometheus without losing data or your mind.
Read guide - · 9 min read
kube-state-metrics vs node_exporter: Monitoring Kubernetes Right
These two exporters answer completely different questions, and conflating them is why Kubernetes dashboards lie. Here's what each one knows and the PromQL that puts them together.
Read guide - · 8 min read
node_exporter Deep Dive: The Host Metrics That Actually Matter
node_exporter spits out thousands of series, but you reach for maybe twenty. Here are the host metrics I trust, the PromQL to compute them, and the collectors to turn off.
Read guide - · 9 min read
Prometheus High Availability and Federation, Done Right
Running two Prometheus replicas and federating across clusters sounds simple until the graphs flicker and the cardinality explodes. Here's the architecture that actually holds up.
Read guide - · 8 min read
Prometheus Pushgateway: When to Use It and When Not To
The Pushgateway is the most misused component in the Prometheus ecosystem. Here's the narrow set of jobs it's actually for, the traps it sets, and what to use instead.
Read guide - · 9 min read
Reducing Alert Fatigue With the USE and RED Methods
Most alert fatigue comes from alerting on causes instead of symptoms. The USE and RED methods give you a small, durable set of signals worth a human's sleep. Here's how to apply them in Prometheus.
Read guide - · 9 min read
Tuning Prometheus Remote Write for Reliable Metric Shipping
Remote write is how Prometheus feeds Thanos, Mimir, and Grafana Cloud — and the default queue settings will drop samples under load. Here's how to tune it so they don't.
Read guide - · 8 min read
Alertmanager Routing Without Losing Your Mind
Alertmanager's routing tree, grouping, and inhibition decide who gets paged and when. Here's how I configure it so the right person hears the right alert.
Read guide - · 8 min read
Blackbox and Synthetic Monitoring With Prometheus
Internal metrics tell you the server is fine while users get errors. Here's how I use the blackbox exporter to probe from the outside, like a user.
Read guide
-
8 min read
Building Grafana Dashboards People Actually Use
Most dashboards are graph graveyards no one reads during an incident. Here's how I build Grafana dashboards that answer real questions fast.
Read guide
-
9 min read
Designing Alert Rules That Don't Page You Falsely
A pager that cries wolf trains people to ignore it. Here's how I design Prometheus alert rules that fire on real problems and stay quiet otherwise.
Read guide
-
9 min read
Long-Term Prometheus Storage: Thanos vs Mimir, Explained
Prometheus keeps weeks of data, not years. Here's how Thanos and Mimir give you durable, queryable, long-term metrics — and how to choose.
Read guide
-
8 min read
Prometheus Recording Rules That Make Slow Queries Fast
Recording rules precompute expensive PromQL so dashboards and alerts stay snappy. Here's how I decide what to record and how to name it.
Read guide
-
9 min read
SLOs and Error Budgets With Prometheus, the Practical Way
SLOs turn 'is it healthy?' into a number you can act on. Here's how I define SLIs, set realistic SLOs, and compute error budgets in PromQL.
Read guide
-
9 min read
Taming Prometheus Metric Cardinality Before It Tames You
High cardinality is the number one way to kill a Prometheus server. Here's how I find the offending labels and cut cardinality without losing signal.
Read guide
-
8 min read
Prometheus Exporters: Choosing the Right One and Writing Your Own
Exporters turn anything into Prometheus metrics. Here's how I pick a good off-the-shelf exporter and write a custom one when none exists.
Read guide
-
6 min read
Reading Loki Logs With AI: Patterns That Work
Loki query syntax is unfamiliar to most engineers. AI can help write LogQL, but it can also produce queries that look right and return nothing. Here's how to use it well.
Read guide
-
6 min read
AI Prompt Templates for Prometheus Alerting
Production-ready prompt templates for generating Prometheus alert rules with proper thresholds, runbook annotations, and false-positive analysis.
Read guide
Recommended tools
-
Claude
by Anthropic
4.8
The most cautious and context-aware AI assistant for infrastructure work.
- Best for
- Production troubleshooting, postmortems, IaC review
- Pricing
- Free tier; Pro $20/mo; Team & Enterprise tiers
Read review
-
ChatGPT
by OpenAI
4.6
The broadest AI ecosystem with deep plugin support and the largest user community.
- Best for
- Ansible/Terraform generation, fast scaffolding, plugin-heavy workflows
- Pricing
- Free tier; Plus $20/mo; Team & Enterprise tiers
Read review
-
Datadog Bits AI
by Datadog
4.2
An AI SRE inside Datadog that auto-investigates alerts, queries your telemetry in plain English, and accelerates incident triage.
- Best for
- Investigating alerts and incidents inside Datadog, natural-language queries across metrics/logs/traces
- Pricing
- Bundled with Datadog; AI features vary by plan. Datadog billed per host/usage (often expensive at scale)
Read review
-
Microsoft Copilot for Azure
by Microsoft
4.0
An AI assistant inside the Azure portal that knows your environment: generate Bicep/CLI, troubleshoot AKS, and query Log Analytics in plain English.
- Best for
- Managing & troubleshooting Azure resources, generating Bicep/CLI, AKS diagnostics, KQL authoring
- Pricing
- Included with Azure at no additional charge (standard Azure resource usage applies)
Read review