Configuring PagerDuty and Opsgenie for Incident Response

Almost every team I’ve joined has a paging tool that was configured by someone who left two years ago, and nobody has touched the escalation policies since. It works until the night it doesn’t — the alert routes to a schedule with a gap, the escalation never fires, and an incident sits unacknowledged for forty minutes because the tool was set up wrong, not because anyone was asleep at the wheel.

PagerDuty and Opsgenie are different products, but the building blocks are the same, and the mistakes are the same. Here’s how to configure either so paging is something you trust rather than something you hope works.

The four building blocks

Whichever tool you use, you’re assembling the same primitives:

Service — the thing being monitored (a microservice, a cluster, a customer-facing product). Alerts attach to a service.
Escalation policy — the ordered ladder of who gets notified, and how long the tool waits before moving to the next rung.
Schedule (on-call rotation) — who is on the hook right now, on a rotating calendar.
Routing / integration — how an alert from Prometheus, Datadog, or a healthcheck actually becomes a page.

Get these four right and the rest is decoration. Get any one wrong and you have a silent pager.

Designing services around ownership, not topology

The most common modeling mistake is creating one giant “Production” service that catches every alert. It routes everything to one schedule, buries signal, and makes it impossible to measure which system is actually noisy.

Model services around who owns the fix. If two systems are owned by the same team and always fixed by the same people, one service is fine. If they’re owned by different teams, split them — even if they share infrastructure. The service boundary should answer “whose problem is this?” instantly.
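One way to sanity-check your service boundaries is to start from an inventory of systems and owners and see which services fall out. The sketch below assumes a hypothetical inventory (the system and team names are invented for illustration) and derives one paging service per owning team.

```python
from collections import defaultdict

# Hypothetical inventory: system name -> owning team. Replace with your own.
SYSTEM_OWNERS = {
    "checkout-api": "payments",
    "billing-worker": "payments",
    "search-indexer": "search",
    "shared-postgres": "platform",
}


def services_by_owner(system_owners):
    """Group systems into one paging service per owning team."""
    grouped = defaultdict(list)
    for system, team in system_owners.items():
        grouped[team].append(system)
    return {team: sorted(systems) for team, systems in grouped.items()}


def owner_of(system, system_owners):
    """Answer "whose problem is this?" for a single alerting system."""
    return system_owners[system]


if __name__ == "__main__":
    for team, systems in services_by_owner(SYSTEM_OWNERS).items():
        print(f"{team}: {', '.join(systems)}")
```

Note that checkout-api and billing-worker collapse into one service because the same team fixes both, while shared-postgres stays separate even though the others depend on it.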

Escalation policies that actually reach someone

This is where silent pages come from. A solid escalation policy has these properties:

Multiple rungs. Primary on-call gets the page. If it is still unacknowledged after N minutes (5 is typical for SEV1-class incidents), it escalates to secondary, then to a team lead or manager.
No gaps in the schedule. The number-one cause of a missed page is an escalation pointing at a schedule with an uncovered window — a handoff hole, a deleted user, a holiday nobody filled. Both tools can show coverage gaps; check them.
Escalation to a human, not a void. The final rung should be a person who is contractually or organizationally guaranteed to respond, usually an engineering manager or a duty director. The buck has to stop somewhere real.
Time-outs short enough to matter. If the first escalation waits 15 minutes and your SEV1 response budget is an hour, you’ve burned a quarter of it before the second person even knows. Tune the timeout to the severity.

In PagerDuty these are Escalation Policies with timeout rules; in Opsgenie they’re Escalation entities with “if not acked in X minutes” steps. Same shape, different menu.
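To see how timeouts compound, it helps to model the ladder and ask who has been paged after a given number of unacknowledged minutes. This is a minimal sketch of that arithmetic, not either tool's actual engine; the rung names and timeouts are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rung:
    target: str           # schedule or person to notify (hypothetical names)
    timeout_minutes: int  # how long to wait for an ack before escalating


def notified_by(ladder, minutes_unacked):
    """Return every target paged so far, assuming nobody has acknowledged."""
    notified = []
    rung_starts_at = 0
    for rung in ladder:
        if minutes_unacked < rung_starts_at:
            break
        notified.append(rung.target)
        rung_starts_at += rung.timeout_minutes
    return notified


SEV1_LADDER = [
    Rung("primary-on-call", 5),
    Rung("secondary-on-call", 5),
    Rung("engineering-manager", 5),
]

SLOW_LADDER = [
    Rung("primary-on-call", 15),
    Rung("secondary-on-call", 15),
]

if __name__ == "__main__":
    for minute in (0, 5, 10, 14):
        print(minute, notified_by(SEV1_LADDER, minute), notified_by(SLOW_LADDER, minute))
```

With the 15-minute ladder, only the primary has been paged at minute 14. With 5-minute rungs, all three rungs have been reached by minute 10.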

Routing: turn alerts into the right page

The integration layer is where you decide which alerts page and which just log. Resist the urge to page on everything.

Severity-based routing. Use Event Orchestration rules (PagerDuty, the successor to its older event rules) or alert policies (Opsgenie) to map alert severity to actions: critical pages immediately, warning creates a low-priority alert with no notification, info is suppressed entirely.
Deduplication keys. Set a dedup/alias key (dedup_key in PagerDuty’s Events API, alias in Opsgenie’s Alert API) so a flapping alert collapses into one incident instead of fifty pages. This is the single biggest noise reducer available to you, and it’s one config field.
Auto-resolve. Wire the resolve signal back so a recovered alert closes its incident automatically. Stale “still open?” incidents erode trust in the queue.
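The three routing rules above can be sketched as a function that builds a PagerDuty Events API v2 body. The field names (routing_key, event_action, dedup_key, payload) are the API's own; the severity-to-action map, the dedup key format, and the placeholder routing key are my assumptions. The sketch only builds the body. Sending it would mean a POST to https://events.pagerduty.com/v2/enqueue, and any low-priority handling for warnings would still be configured on the PagerDuty side.

```python
import json

# Assumed policy: what each alert severity should do.
ACTION_BY_SEVERITY = {
    "critical": "page",
    "warning": "low_priority",
    "info": "suppress",
}


def build_event(routing_key, alert_name, source, severity, firing):
    """Build an Events API v2 body, or return None if the alert is suppressed."""
    if severity not in ACTION_BY_SEVERITY:
        raise ValueError(f"unknown severity: {severity!r}")
    if ACTION_BY_SEVERITY[severity] == "suppress":
        return None

    event = {
        "routing_key": routing_key,
        # The same dedup_key on trigger and resolve ties them to one incident.
        "dedup_key": f"{source}:{alert_name}",
        "event_action": "trigger" if firing else "resolve",
    }
    if firing:
        # A payload is only needed when triggering.
        event["payload"] = {
            "summary": f"{alert_name} on {source}",
            "source": source,
            "severity": severity,
        }
    return event


if __name__ == "__main__":
    # "R0UTINGKEY" is a placeholder, not a real integration key.
    fired = build_event("R0UTINGKEY", "HighErrorRate", "checkout-api", "critical", True)
    print(json.dumps(fired, indent=2))
```

Because the dedup key is derived only from the source and alert name, a flapping alert keeps landing on the same incident, and the recovery event resolves it.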

A starter configuration checklist

Whether you’re standing up a new account or auditing an old one, walk this list:

Every production service has exactly one owning team and one escalation policy.
Every escalation policy has at least two notification rungs plus a guaranteed final responder.
Every schedule has zero coverage gaps for the next 90 days.
High-severity escalation timeouts are 5 minutes or less.
Each integration sets a dedup key and an auto-resolve path.
Notification rules cover more than one channel (push + SMS + phone call for critical).
There is a documented “override” path to grab someone immediately, bypassing the ladder.
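Parts of this checklist are mechanical enough to script. The sketch below checks the rung count, the final responder, the timeouts, and schedule coverage over the next 90 days. It assumes you have already exported your policy and shifts into plain Python structures; the dictionary keys and the shift format are hypothetical, not either tool's export format.

```python
from datetime import datetime, timedelta


def coverage_gaps(shifts, start, end):
    """Return (gap_start, gap_end) windows in [start, end) that no shift covers."""
    gaps = []
    cursor = start
    for shift_start, shift_end in sorted(shifts):
        if shift_start > cursor:
            gaps.append((cursor, min(shift_start, end)))
        cursor = max(cursor, shift_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


def audit(policy, shifts, now, horizon_days=90, max_timeout_minutes=5):
    """Return a list of human-readable problems; an empty list means it passed."""
    problems = []
    rungs = policy.get("rungs", [])
    if len(rungs) < 2:
        problems.append("fewer than two notification rungs")
    if not policy.get("final_responder"):
        problems.append("no guaranteed final responder")
    if any(rung["timeout_minutes"] > max_timeout_minutes for rung in rungs):
        problems.append(f"an escalation timeout exceeds {max_timeout_minutes} minutes")
    horizon_end = now + timedelta(days=horizon_days)
    for gap_start, gap_end in coverage_gaps(shifts, now, horizon_end):
        problems.append(
            f"schedule gap from {gap_start:%Y-%m-%d %H:%M} to {gap_end:%Y-%m-%d %H:%M}"
        )
    return problems


if __name__ == "__main__":
    now = datetime(2024, 1, 1)
    policy = {"rungs": [{"target": "primary-on-call", "timeout_minutes": 15}]}
    # Two shifts with a two-hour handoff hole, then nothing after day 3.
    shifts = [
        (now, now + timedelta(days=1)),
        (now + timedelta(days=1, hours=2), now + timedelta(days=3)),
    ]
    for problem in audit(policy, shifts, now):
        print("FAIL:", problem)
```

Running something like this on a schedule turns the audit from a one-off into a habit, which is the point of the checklist.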

Test it like you mean it

Configuration you haven’t tested is configuration you’re hoping works. Run a deliberate test page at each on-call handoff: trigger a synthetic alert and confirm it pages the right person, escalates on timeout, and auto-resolves. It takes two minutes and it surfaces the dead schedule before a real incident does.

Both tools also support maintenance windows — use them during planned work so you don’t page yourself during a deploy and train people to ignore the pager. A pager that cries wolf during every release is a pager people silence.
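Both tools apply maintenance windows for you, so you would not normally write this yourself. Still, the rule is simple enough to sketch, which helps when reasoning about why a page did or did not fire during a deploy. The service names and window format here are assumptions.

```python
from datetime import datetime


def in_maintenance(alert_time, windows):
    """True if alert_time falls inside any (start, end) window; end is exclusive."""
    return any(start <= alert_time < end for start, end in windows)


def should_page(service, alert_time, windows_by_service):
    """Suppress pages for a service while one of its maintenance windows is open."""
    return not in_maintenance(alert_time, windows_by_service.get(service, []))


if __name__ == "__main__":
    # Hypothetical one-hour deploy window for a single service.
    windows = {
        "checkout-api": [(datetime(2024, 5, 1, 22, 0), datetime(2024, 5, 1, 23, 0))],
    }
    print(should_page("checkout-api", datetime(2024, 5, 1, 22, 30), windows))
    print(should_page("search-indexer", datetime(2024, 5, 1, 22, 30), windows))
```

Scoping the window to the service being deployed matters: a blanket window would also silence unrelated services that break during the same hour.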

Keep it as code where you can

Both PagerDuty and Opsgenie have Terraform providers. Managing services, escalation policies, and schedules as code gives you review, history, and the ability to recreate the whole setup if an account is misconfigured or lost. It also makes “why did this route here?” answerable from a diff instead of archaeology through a UI.

You don’t have to do this on day one, but the day your escalation policies sprawl past what one person can hold in their head, codifying them pays for itself.
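To illustrate why a diff beats UI archaeology, here is a small sketch that is not Terraform: it keeps a policy as plain data and diffs two versions with Python's standard library. The policy shape and names are invented for the example. With the Terraform providers you would get the equivalent from a plan or a pull request diff.

```python
import difflib
import json


def policy_diff(old, new):
    """Return a unified diff of two policy definitions as text."""
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    return "\n".join(
        difflib.unified_diff(
            old_lines, new_lines, fromfile="before", tofile="after", lineterm=""
        )
    )


# Hypothetical policy before and after a review.
before = {
    "name": "checkout-sev1",
    "rungs": [{"target": "primary-on-call", "timeout_minutes": 15}],
}
after = {
    "name": "checkout-sev1",
    "rungs": [
        {"target": "primary-on-call", "timeout_minutes": 5},
        {"target": "secondary-on-call", "timeout_minutes": 5},
    ],
}

if __name__ == "__main__":
    print(policy_diff(before, after))
```

The output shows the timeout dropping from 15 to 5 and the new secondary rung in a form a reviewer can approve or question, which is the review and history the section describes.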

Where it connects to the rest of incident response

Paging tools are the trigger, not the response. Once the page fires, your severity matrix, runbooks, and roles take over. We keep prompts and templates for the whole flow in our incident-response collection — the goal is that the page reaches the right human fast, and everything they do next is already written down.

Configuration recommendations are general starting points. Validate every escalation path and schedule against your own org structure before relying on it for production paging.