VictoriaMetrics Cardinality Explorer & TSDB Triage Prompt
Diagnose a VictoriaMetrics deployment with a high active time series count and high churn rate using the built-in Cardinality Explorer and the TSDB status endpoint, then produce a prioritized remediation plan.
- Target user: SRE and platform engineers running single-node or cluster VictoriaMetrics
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior observability engineer who specializes in VictoriaMetrics capacity and cardinality management.

I will provide:
- Output from /api/v1/status/tsdb and the Cardinality Explorer UI (top metrics by series, top label=value pairs, label-value count)
- vm_cache_size_bytes, vm_slow_queries_total, and active series trends
- Our ingestion rate (vm_rows_inserted_total) and retention settings

Your job:
1. **Baseline**: establish the current active time series count and churn rate, and estimate how close we are to the number of active series our available RAM can sustain.
2. **Offender ranking**: identify the metrics and label keys driving cardinality, distinguishing legitimate growth from unbounded labels (request_id, pod hash, full URLs).
3. **Root cause**: classify each offender as churn (frequent restarts), explosion (high-cardinality label), or duplication (overlapping scrape jobs).
4. **Remediation**: propose relabeling, stream aggregation (-streamAggr.config), or -dropSamplesOnOverload only where appropriate, and include the exact metric_relabel_configs snippets.
5. **Guardrails**: recommend -maxLabelsPerTimeseries and -search.maxUniqueTimeseries limits sized to our hardware.
6. **Verification**: define the queries that confirm the series reduction without losing needed signals.
7. **Rollback**: describe how to revert each change safely.

Output as: (a) ranked offender table, (b) per-offender fix, (c) guardrail config, (d) verification checklist.

Before recommending any change, flag it if it would silently drop metrics currently used by alerting rules.
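
The first input the prompt asks for can be collected and pre-ranked before pasting it in. The following is a minimal sketch, not an official client: it assumes the endpoint returns the Prometheus-compatible TSDB stats shape (a `data` object with `totalSeries` and lists of `{"name", "value"}` entries such as `seriesCountByMetricName`), and the base URL and the helper names `fetch_tsdb_status` and `rank_offenders` are hypothetical. Check the field names against the response from your own version before relying on them.

```python
import json
import urllib.parse
import urllib.request


def fetch_tsdb_status(base_url, top_n=20):
    """Fetch TSDB status as a dict.

    base_url is an assumption, e.g. "http://localhost:8428" for single-node.
    Cluster setups expose the endpoint through vmselect under a tenant path.
    """
    query = urllib.parse.urlencode({"topN": top_n})
    url = f"{base_url}/api/v1/status/tsdb?{query}"
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)


def rank_offenders(status, field="seriesCountByMetricName"):
    """Return (name, series, share_of_total) tuples, largest first.

    The field names are assumed from the Prometheus-compatible response
    shape; share_of_total is 0.0 when totalSeries is missing or zero.
    """
    data = status.get("data", {})
    total = data.get("totalSeries", 0)
    rows = data.get(field) or []
    ranked = sorted(rows, key=lambda row: row["value"], reverse=True)
    return [
        (row["name"], row["value"], round(row["value"] / total, 4) if total else 0.0)
        for row in ranked
    ]


if __name__ == "__main__":
    # Hypothetical address; replace with your own instance.
    status = fetch_tsdb_status("http://localhost:8428")
    for name, series, share in rank_offenders(status):
        print(f"{name}\t{series}\t{share:.2%}")
```

Passing `field="seriesCountByLabelValuePair"` or `field="labelValueCountByLabelName"` ranks label pairs or label keys the same way, assuming those lists are present in the response.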