AI for Prometheus & Monitoring Difficulty: Advanced ClaudeChatGPT

VictoriaMetrics Migration from Prometheus Prompt

Plan and execute a migration from vanilla Prometheus to VictoriaMetrics (vmagent, vmstorage, vmalert) for long-term storage and lower resource use — with backfill, PromQL/MetricsQL parity, and rollback.

Target user: Platform teams scaling Prometheus beyond single-node storage limits
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a platform engineer who has migrated production Prometheus deployments to VictoriaMetrics and knows the MetricsQL parity gaps, backfill pitfalls, and the safe rollback path.

I will provide:
- My current Prometheus setup (scrape volume, retention, storage size, HA pairs)
- My pain points (disk, query latency, retention, cost)
- Downstream consumers (Grafana, Alertmanager, recording rules)
- Whether I want single-node or cluster VictoriaMetrics

Your job:

1. **Decide the target topology** — single-node VM vs cluster (vminsert/vmselect/vmstorage). Recommend based on my ingestion rate and retention, and explain the tradeoffs honestly.

2. **Ingestion strategy** — choose between (a) Prometheus `remote_write` to VM (dual-write, lowest risk) or (b) replacing scraping with `vmagent`. Show the config for whichever I pick and explain the migration ordering.

3. **Dual-write cutover** — set up Prometheus remote_write to VM while keeping local storage, so I can compare queries in parallel before trusting VM. Give the remote_write block with queue tuning.

4. **Historical backfill** — explain how to import existing Prometheus TSDB data into VM (snapshot + `vmctl prometheus`); call out the gotchas (timestamps, dedup, disk during import).

5. **Query parity** — map my PromQL usage to MetricsQL; flag the differences (default rollup behavior, `WITH` templates, subquery syntax) and which dashboards/alerts need review.

6. **Alerting** — move recording + alerting rules to `vmalert`, pointing at VM for queries and Alertmanager for notifications; show the flags and confirm rule-group semantics match.

7. **Grafana** — switch data source to VM (Prometheus-compatible), and verify dashboards render identically during dual-write.

8. **Cutover & rollback** — a staged plan: dual-write → validate → shift reads → shrink Prometheus retention → decommission. Include the rollback trigger and steps at each stage.

9. **Validation** — concrete checks: ingestion rate parity, query result diffs on key dashboards, alert firing parity, disk/CPU before-after.

Output as: (a) target topology recommendation, (b) remote_write / vmagent config, (c) vmctl backfill commands, (d) MetricsQL parity notes, (e) vmalert config, (f) staged cutover + rollback runbook.

Bias toward dual-write validation over a hard cutover; never decommission Prometheus until read parity is proven.

Free: the DevOps AI Incident-Triage Cheat Sheet