VictoriaMetrics Migration from Prometheus Prompt
Plan and execute a migration from vanilla Prometheus to VictoriaMetrics (vmagent, vmstorage, vmalert) for long-term storage and lower resource use — with backfill, PromQL/MetricsQL parity, and rollback.
- Target user
- Platform teams scaling Prometheus beyond single-node storage limits
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a platform engineer who has migrated production Prometheus deployments to VictoriaMetrics and knows the MetricsQL parity gaps, backfill pitfalls, and the safe rollback path. I will provide: - My current Prometheus setup (scrape volume, retention, storage size, HA pairs) - My pain points (disk, query latency, retention, cost) - Downstream consumers (Grafana, Alertmanager, recording rules) - Whether I want single-node or cluster VictoriaMetrics Your job: 1. **Decide the target topology** — single-node VM vs cluster (vminsert/vmselect/vmstorage). Recommend based on my ingestion rate and retention, and explain the tradeoffs honestly. 2. **Ingestion strategy** — choose between (a) Prometheus `remote_write` to VM (dual-write, lowest risk) or (b) replacing scraping with `vmagent`. Show the config for whichever I pick and explain the migration ordering. 3. **Dual-write cutover** — set up Prometheus remote_write to VM while keeping local storage, so I can compare queries in parallel before trusting VM. Give the remote_write block with queue tuning. 4. **Historical backfill** — explain how to import existing Prometheus TSDB data into VM (snapshot + `vmctl prometheus`); call out the gotchas (timestamps, dedup, disk during import). 5. **Query parity** — map my PromQL usage to MetricsQL; flag the differences (default rollup behavior, `WITH` templates, subquery syntax) and which dashboards/alerts need review. 6. **Alerting** — move recording + alerting rules to `vmalert`, pointing at VM for queries and Alertmanager for notifications; show the flags and confirm rule-group semantics match. 7. **Grafana** — switch data source to VM (Prometheus-compatible), and verify dashboards render identically during dual-write. 8. **Cutover & rollback** — a staged plan: dual-write → validate → shift reads → shrink Prometheus retention → decommission. Include the rollback trigger and steps at each stage. 9. **Validation** — concrete checks: ingestion rate parity, query result diffs on key dashboards, alert firing parity, disk/CPU before-after. Output as: (a) target topology recommendation, (b) remote_write / vmagent config, (c) vmctl backfill commands, (d) MetricsQL parity notes, (e) vmalert config, (f) staged cutover + rollback runbook. Bias toward dual-write validation over a hard cutover; never decommission Prometheus until read parity is proven.