VictoriaMetrics vs Prometheus: When to Switch and Why

I ran vanilla Prometheus for years before the memory bills made me look elsewhere. The pattern is predictable: you start with one Prometheus, it works beautifully, and then your cardinality grows, your retention requirements grow, and one morning you’re staring at a Prometheus pod that wants 64GB of RAM and gets OOM-killed mid-scrape. That’s usually when VictoriaMetrics enters the conversation.

This is the comparison I wish I’d had before I made the call.

What VictoriaMetrics actually is

VictoriaMetrics (VM) is a time-series database that speaks Prometheus’ protocols. It accepts Prometheus remote_write, it answers PromQL queries (technically MetricsQL, a mostly backward-compatible superset), and it can scrape targets itself, either directly or through vmagent, its standalone collection agent. It comes in two flavors: a single-binary version for one node, and a clustered version (vminsert, vmselect, vmstorage) for horizontal scale.

The headline difference is resource efficiency. VM’s storage engine compresses harder and its ingestion path uses dramatically less memory per active time series. In my own migration, the same workload that needed ~48GB on Prometheus ran comfortably under 12GB on VM. Your mileage will vary with cardinality and churn, but the direction is consistent.

Where Prometheus still wins

Don’t switch reflexively. Prometheus has real advantages:

It is the reference implementation. Every exporter, every tutorial, every Stack Overflow answer assumes Prometheus. When something is weird, you’re debugging the thing everyone else runs.
The local TSDB is dead simple. One binary, one data directory, no moving parts. For a single team with modest cardinality and 15-day retention, Prometheus is the right boring choice.
PromQL purity. MetricsQL is a superset, so a query that leans on MetricsQL-only functions or syntax won’t run if you ever point it back at Prometheus.

If your Prometheus isn’t hurting, leave it alone. The best monitoring stack is the one you understand.

The signals that it’s time to look at VM

Concrete triggers, not vibes:

Memory pressure during scrapes. If you’re vertically scaling Prometheus past 32GB just to survive, that’s a smell.
Long retention. You want 6-12 months of metrics without bolting on Thanos or Mimir. VM does long retention natively: it’s one flag, -retentionPeriod, which defaults to a single month, so set it explicitly.
High-churn cardinality. Short-lived pods keep creating new series, and VM’s ingestion path handles that more gracefully.
Remote-write fan-in. You’re aggregating many Prometheis into one place. VM is a common sink for exactly this.
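
For the retention trigger in particular, a minimal sketch of a single-node VM keeping a year of data might look like the Compose fragment below. The service name, volume name, and storage path are placeholders of mine; -retentionPeriod and -storageDataPath are VM’s own flags, and a bare number for -retentionPeriod is read as months.

```yaml
# docker-compose.yml (illustrative; service, volume, and path names are assumptions)
services:
  victoriametrics:
    image: victoriametrics/victoria-metrics
    command:
      - -retentionPeriod=12          # no suffix means months, so this keeps 12 months
      - -storageDataPath=/storage    # where VM writes its data inside the container
    ports:
      - "8428:8428"                  # HTTP port for remote_write and queries
    volumes:
      - vmdata:/storage
volumes:
  vmdata: {}
```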

A low-risk migration path

You don’t have to cut over in one step. Run them side by side first by pointing Prometheus remote_write at VM:

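# prometheus.yml
remote_write:
  # Single-node VM listens on 8428 and accepts remote_write on this path
  - url: http://victoriametrics:8428/api/v1/write
    queue_config:
      max_samples_per_send: 10000  # samples batched into each request
      capacity: 20000              # samples buffered per shard
      max_shards: 30               # upper bound on parallel senders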

Now the samples Prometheus collects are also shipped to VM, where they are kept for as long as VM’s retention setting allows. Your dashboards keep pointing at Prometheus for live data; you point Grafana at VM for historical queries. Once a few weeks have built your confidence, you flip the default datasource and let Prometheus retention shrink to a couple of days as a hot buffer, or replace scraping with vmagent entirely.
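
The side-by-side Grafana setup can be sketched as a provisioning file with two datasources. This is an illustration, not a tested config: the datasource names and the Prometheus address are my assumptions, and VM is registered with Grafana’s stock prometheus type because it serves the Prometheus query API.

```yaml
# grafana/provisioning/datasources/metrics.yml (illustrative file name)
apiVersion: 1
datasources:
  - name: Prometheus (live)
    type: prometheus
    access: proxy
    url: http://prometheus:9090        # assumed Prometheus address
    isDefault: true                    # stays the default during the trial period
  - name: VictoriaMetrics (history)
    type: prometheus                   # VM answers the Prometheus query API
    access: proxy
    url: http://victoriametrics:8428   # same host as the remote_write target
    isDefault: false
```

Flipping the default later is then a one-line change to the two isDefault fields.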

vmagent config looks familiar if you know Prometheus scrape config:

# vmagent scrape config
scrape_configs:
  - job_name: node
    static_configs:
      - targets: ["node-exporter:9100"]

It supports Prometheus-compatible relabeling and most of the same service discovery mechanisms, and it adds a persistent on-disk queue so that a downstream outage buffers samples instead of dropping them.
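
As a rough sketch, running vmagent with that scrape file and the on-disk buffer could look like this. The file paths, service name, and volume name are assumptions of mine; -promscrape.config, -remoteWrite.url, and -remoteWrite.tmpDataPath are vmagent flags.

```yaml
# docker-compose.yml (illustrative; paths and names are assumptions)
services:
  vmagent:
    image: victoriametrics/vmagent
    command:
      - -promscrape.config=/etc/vmagent/scrape.yml                   # the scrape config above
      - -remoteWrite.url=http://victoriametrics:8428/api/v1/write    # where samples are sent
      - -remoteWrite.tmpDataPath=/vmagent-buffer                     # on-disk queue used while VM is unreachable
    volumes:
      - ./scrape.yml:/etc/vmagent/scrape.yml:ro
      - vmagent-buffer:/vmagent-buffer                               # keep the queue across restarts
volumes:
  vmagent-buffer: {}
```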

Query compatibility in practice

Most dashboards Just Work. The gotchas I hit:

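rate() over short ranges behaves slightly differently at window boundaries: MetricsQL takes the last sample before the window into account and doesn’t extrapolate the result the way PromQL does. MetricsQL’s rollup functions are more forgiving (you can even omit the [5m] window) but non-portable.
absent() and alerting expressions are compatible, but test your alert rules against VM before trusting them. vmalert is the VM-native rule evaluator, and it reads Prometheus-format rule files nearly unchanged.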

Run your existing rule files through vmalert pointed at VM and diff the firing behavior against Prometheus for a week. Alert semantics are exactly where you don’t want surprises.
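
One way to set up that trial run is a shadow vmalert that evaluates the rules but sends nothing. Treat this as a sketch under assumptions: the service name and rule path are mine, and -notifier.blackhole only exists in reasonably recent vmalert releases, so check vmalert -help for your version before relying on it.

```yaml
# docker-compose.yml (illustrative shadow evaluator; names and paths are assumptions)
services:
  vmalert-shadow:
    image: victoriametrics/vmalert
    command:
      - -rule=/etc/rules/*.yml                         # your existing Prometheus rule files
      - -datasource.url=http://victoriametrics:8428    # evaluate against VM, not Prometheus
      - -notifier.blackhole                            # evaluate alerts without notifying anyone
    ports:
      - "8880:8880"                                    # vmalert's default HTTP port
    volumes:
      - ./rules:/etc/rules:ro
```

With nothing being sent, you compare by looking at what each side reports as firing: vmalert’s /api/v1/alerts on port 8880 against the same endpoint on Prometheus.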

A capacity comparison query

To decide if you even have a problem, measure your active series first. On Prometheus:

# active series right now
prometheus_tsdb_head_series

# ingestion rate, samples/sec
rate(prometheus_tsdb_head_samples_appended_total[5m])

If prometheus_tsdb_head_series is in the multi-million range and climbing, you’re in VM’s sweet spot. Under a million with stable churn, Prometheus alone is fine.
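
If you’d rather be told than remember to check, the same metric can sit behind an informational alert. The 2e6 threshold and the one-hour hold are my own stand-ins for “multi-million and climbing”, and the group, alert, and label names are made up for the example.

```yaml
# capacity-rules.yml (illustrative; threshold and names are assumptions)
groups:
  - name: capacity-check
    rules:
      - alert: PrometheusActiveSeriesHigh
        expr: prometheus_tsdb_head_series > 2e6   # assumed cutoff for "multi-million"
        for: 1h                                   # ignore short spikes from deploys
        labels:
          severity: info
        annotations:
          summary: "Active series above 2M on {{ $labels.instance }}"
```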

My recommendation

For a single team, modest scale, short retention: stay on Prometheus. The ecosystem gravity is worth it.

For platform teams running monitoring as a service, with many tenants, high cardinality, and long retention needs: VictoriaMetrics is the more economical engine, and the remote_write compatibility means you migrate without a flag day. Keep Prometheus as the scraper and rule evaluator if you like — VM is happy to be just the storage layer underneath.
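
For the many-tenants case, the clustered version takes the tenant from the write URL. A minimal sketch, assuming a vminsert service reachable as vminsert on its default port 8480 and tenant (account) ID 0:

```yaml
# prometheus.yml for one tenant (illustrative; host name and tenant ID are assumptions)
remote_write:
  # Cluster write path: /insert/<accountID>/prometheus/api/v1/write
  - url: http://vminsert:8480/insert/0/prometheus/api/v1/write
```

Each tenant’s Prometheus or vmagent gets its own account ID in that path.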

Whichever way you go, instrument the decision. Track ingestion rate, active series, and query latency before and after, so the choice is data-driven rather than a rewrite you regret.
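
A sketch of what that instrumentation could look like as recording rules follows. The rule names are my own invention. The Prometheus metrics are standard self-monitoring series, and the VM-side expressions assume you scrape VM’s own /metrics endpoint.

```yaml
# baseline-rules.yml (illustrative; record names are assumptions)
groups:
  - name: migration-baseline
    rules:
      # Prometheus side
      - record: baseline:prometheus_active_series
        expr: prometheus_tsdb_head_series
      - record: baseline:prometheus_samples_appended:rate5m
        expr: rate(prometheus_tsdb_head_samples_appended_total[5m])
      - record: baseline:prometheus_query_range_latency_seconds:p99
        expr: |
          histogram_quantile(0.99, sum by (le) (
            rate(prometheus_http_request_duration_seconds_bucket{handler="/api/v1/query_range"}[5m])
          ))
      # VM side, assuming VM's /metrics endpoint is scraped
      - record: baseline:vm_active_series
        expr: sum(vm_cache_entries{type="storage/hour_metric_ids"})
      - record: baseline:vm_rows_inserted:rate5m
        expr: sum(rate(vm_rows_inserted_total[5m]))
```

Start recording a few weeks before the migration so the before and after numbers cover comparable traffic.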

Benchmark numbers depend heavily on your cardinality and query mix. Measure your own workload before committing to a migration.