
VictoriaMetrics Cardinality Explorer & TSDB Triage Prompt

Diagnose a VictoriaMetrics deployment suffering from a high active time series count and high churn, using the built-in Cardinality Explorer and the TSDB status endpoint, then produce a prioritized remediation plan.

Target user: SREs and platform engineers running single-node or cluster VictoriaMetrics
Difficulty: Advanced
Tools: Claude, ChatGPT
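
Gathering the inputs: the prompt below asks for /api/v1/status/tsdb output. The Python sketch here is one possible way to pull that output and pre-rank metrics by their share of total series before pasting it into the chat. It assumes a single-node instance reachable at a base URL such as http://localhost:8428 (for a cluster, the base would be the vmselect tenant path instead). The helper names tsdb_status_url, fetch_tsdb_status, and top_metrics are hypothetical, and the response fields used (totalSeries, seriesCountByMetricName) should be checked against your VictoriaMetrics version.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def tsdb_status_url(base_url, top_n=20, date=None, focus_label=None):
    """Build the TSDB status URL; base_url is an assumed example value."""
    params = {"topN": top_n}
    if date:
        params["date"] = date  # YYYY-MM-DD day to inspect
    if focus_label:
        params["focusLabel"] = focus_label  # break one label down by value
    return f"{base_url.rstrip('/')}/api/v1/status/tsdb?{urlencode(params)}"


def fetch_tsdb_status(base_url, **kwargs):
    """Fetch the status document and return its "data" object."""
    with urlopen(tsdb_status_url(base_url, **kwargs), timeout=30) as resp:
        return json.load(resp)["data"]


def top_metrics(data, limit=10):
    """Return (metric, series, share_pct) tuples, largest first."""
    total = data.get("totalSeries", 0)
    entries = sorted(
        data.get("seriesCountByMetricName", []),
        key=lambda e: e["value"],
        reverse=True,
    )
    rows = []
    for entry in entries[:limit]:
        share = round(100 * entry["value"] / total, 1) if total else 0.0
        rows.append((entry["name"], entry["value"], share))
    return rows
```

A typical call would be top_metrics(fetch_tsdb_status("http://localhost:8428", top_n=20)); the resulting rows can be pasted alongside the raw JSON so the model sees both the ranking and the underlying counts.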

The prompt

You are a senior observability engineer who specializes in VictoriaMetrics
capacity and cardinality management.

I will provide:
- Output from /api/v1/status/tsdb and the Cardinality Explorer UI (top metrics by series count, top label=value pairs, and labels with the most unique values)
- vm_cache_size_bytes, vm_slow_queries_total, and active series trends
- Our ingestion rate (vm_rows_inserted_total) and retention settings

Your job:

1. **Baseline**: establish the current active time series count, the churn rate, and how close we are to the active series count our RAM can sustain.
2. **Offender ranking**: identify the metrics and label keys driving cardinality, distinguishing legitimate growth from unbounded labels (request_id, pod hash, full URLs).
3. **Root cause**: classify each offender as churn (frequent restarts), explosion (high-cardinality label), or duplication (overlapping scrape jobs).
4. **Remediation**: propose relabeling rules (relabel_configs for target labels, metric_relabel_configs for scraped series), stream aggregation (-streamAggr.config), or the vminsert flag -dropSamplesOnOverload (cluster only), each only where appropriate, and include the exact metric_relabel_configs snippets.
5. **Guardrails**: recommend -maxLabelsPerTimeseries and -search.maxUniqueTimeseries limits sized to our hardware, plus the -storage.maxHourlySeries and -storage.maxDailySeries cardinality limits if an ingestion-side cap is warranted.
6. **Verification**: define the queries that confirm the series reduction without losing needed signals.
7. **Rollback**: describe how to revert each change safely.

Output as: (a) ranked offender table, (b) per-offender fix, (c) guardrail config, (d) verification checklist.

Flag any change that would silently drop metrics currently used by alerting rules before recommending it.
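
Checking the result: for the verification step, one option is to capture /api/v1/status/tsdb before and after a change and diff the per-metric series counts. The sketch below is a minimal illustration of that idea, not part of VictoriaMetrics itself. It assumes both snapshots are the "data" object of the status response with a seriesCountByMetricName list; the function names and the protected set (metric names referenced by your alerting rules) are hypothetical. Because the endpoint only returns the top N entries, a metric missing from the second snapshot may have been dropped or may simply have fallen out of the top N, so the sketch marks it for a manual check rather than reporting zero.

```python
def series_by_metric(data):
    """Map metric name -> series count from a TSDB status "data" object."""
    return {e["name"]: e["value"] for e in data.get("seriesCountByMetricName", [])}


def compare_snapshots(before, after, protected=()):
    """Diff two snapshots; `protected` holds metrics used by alerting rules."""
    old_counts = series_by_metric(before)
    new_counts = series_by_metric(after)
    report = []
    for name, old in sorted(old_counts.items(), key=lambda kv: kv[1], reverse=True):
        new = new_counts.get(name)  # None: absent from the top N afterwards
        if new is None:
            reduction = None  # cannot tell "dropped" from "left the top N"
        elif old:
            reduction = round(100 * (old - new) / old, 1)
        else:
            reduction = 0.0
        report.append({
            "metric": name,
            "before": old,
            "after": new,
            "reduction_pct": reduction,
            # An alerting metric that vanished from the list needs a manual query.
            "needs_check": name in protected and new is None,
        })
    return report
```

Rows with needs_check set are the ones to confirm by hand, for example by querying the metric directly, before treating the change as safe.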
