Migrating Nagios Checks to Prometheus Alerts With AI

Every Nagios-to-Prometheus migration I’ve seen carries a hidden temptation: just translate the checks one-for-one and call it done. The checks already exist, the team trusts them, and AI can convert a check_command definition to a Prometheus alert rule in seconds. But a Nagios config that’s accumulated for eight years is an archaeology site — dead checks, duplicated thresholds, alerts nobody acts on — and a mechanical port carries every bit of that rot into your shiny new stack. AI makes the translation fast, which is exactly why you have to slow down and decide what’s worth migrating. Here’s how I use it without recreating a decade of alert fatigue.

The two-model migration: translate, then triage

I split the work the model does into two distinct passes, because they need different prompts and different scrutiny. The first pass is pure translation: take this Nagios check and express the equivalent condition in PromQL. The second pass is triage: should this alert exist at all, and if so, what should it actually do? AI is fast and reliable at the first and useless at the second, because triage is a judgment about your team that the model has no access to.

Translation: where the model genuinely earns its keep

A Nagios threshold check maps cleanly to a PromQL expression, and the model does this well. A disk check like check_disk -w 20% -c 10% becomes a Prometheus expression the AI writes correctly because it’s seen the node_filesystem metrics thousands of times:

- alert: "DiskSpaceLow"
  expr: |
    (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
      / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) < 0.10
  for: 15m
  labels:
    severity: critical
  annotations:
    summary: "Less than 10% disk free on {{ $labels.instance }}:{{ $labels.mountpoint }}"

The model even knows to exclude tmpfs and add a for: window — things the Nagios check did implicitly through its check interval and retry count. This is the fast-junior-engineer part: it’s read every node_exporter doc and produces a correct first draft instantly.

The retry-count trap

Nagios has max_check_attempts and retry_interval, which together mean “don’t alert until the condition has held across N checks.” That’s Nagios’s version of for:, and a naive translation drops it entirely, turning a deliberately patient check into a hair-trigger Prometheus alert that fires on the first scrape. I specifically ask the model to map Nagios’s soft/hard state logic onto a for: duration and to show its arithmetic:

Nagios: max_check_attempts=3, normal_check_interval=2m
=> condition must hold ~6m before HARD state
=> Prometheus equivalent: for: 6m

If the model translates a check without accounting for retry behavior, the new alert is louder than the old one, and the team’s first impression of Prometheus is “it pages more than Nagios did.” That kills migrations.

Pro Tip: For each Nagios check, ask the AI to estimate how many times it would have fired in the last 90 days based on the threshold and your typical metric values. Checks the model flags as “would have fired hundreds of times” are alert-fatigue generators — migrate the metric, but reconsider whether it should page at all. This turns the migration into a cleanup.

Triage: the checks you should NOT migrate

This is the pass that determines whether the migration improves anything. Plenty of Nagios checks are vestigial — they monitor a service that’s been decommissioned, duplicate another check, or page on a condition that’s auto-remediated now. The model can’t know this, so I feed it the check alongside context and ask it to flag candidates for retirement, not to decide:

Checks on hosts that no longer report metrics
Multiple checks on the same underlying condition with different thresholds
Checks whose “fix” is always “restart it and ignore”

The model proposes; a human who knows the history disposes. Every retirement is a deliberate decision, because the one check you wrongly drop is the one that catches the next outage.

Map Nagios contact groups to Alertmanager routes

Nagios routing lives in contact groups and escalations; Prometheus routing lives in Alertmanager’s tree. The model translates the structure well but can’t know your current on-call reality, so I have it produce a draft route tree and then I reconcile it against who actually carries the pager now:

route:
  receiver: default
  routes:
    - matchers: ['team = "database"']
      receiver: dba-oncall
    - matchers: ['severity = "critical"']
      receiver: pagerduty-high

I verify each team label exists on the migrated alerts, because a route that matches a label nobody sets is a silent black hole — the Prometheus equivalent of a Nagios contact group with no members.

Validate the migration with tests, not faith

Before any migrated rule goes live I generate promtool tests for it, especially the negative case proving the new for: doesn’t fire faster than the old retry logic. A migration is a perfect place for tests because you have a known baseline — the old check’s behavior — to assert against. Because the rules land in Git, our code review dashboard ensures no migrated rule merges without review of its labels, thresholds, and runbook annotations.

Run both systems in parallel before cutting over

The riskiest moment in any migration is the cutover, and AI can’t tell you when it’s safe — only the data can. I keep Nagios and Prometheus running side by side for a few weeks and compare what each fires on. The model helps here by analyzing the two alert streams and flagging mismatches: conditions Nagios caught that Prometheus missed, and the reverse. A Prometheus rule that stays silent while the equivalent Nagios check fires is a translation bug, usually a threshold or for: that drifted during conversion. A Prometheus rule that fires while Nagios stayed quiet is often the new system catching something the old one couldn’t — but I verify that before celebrating, because it’s just as likely to be a too-sensitive for:. Only once the two streams agree on real incidents do I retire the Nagios check. The model surfaces the discrepancies; a human decides which system was right each time, and that decision is the actual proof the migration worked.

Keep a human explaining every rule

The migration succeeds or fails on a simple principle: every rule that lands in Prometheus must be explainable by a human, not just “this is what Nagios did.” If nobody can say why the threshold is right and the alert is worth a page, that’s a check to retire, not migrate. The AI is the fast translator and the tireless flagger of suspects; the engineer is the one who decides what survives. I draft the translations in Claude and refine routing in Cursor, then run finished drafts through the Alert Rule Generator to enforce consistent severity and runbook annotations.

Conclusion

AI makes a Nagios-to-Prometheus migration fast enough to finish, which is both its gift and its danger. Let it translate thresholds and map retry logic to for: durations, let it flag retirement candidates — but make every migrated rule pass through human triage and promtool tests before it pages anyone. A migration is a rare chance to delete years of alert rot; don’t let the speed of translation tempt you into porting it instead. More patterns in the monitoring guides, and migration prompt templates in the prompts library.