Generating Remediation Code From Incidents With AI

The worst part of an incident isn’t the incident. It’s solving the same one again four months later, at the same hour, because the fix lived in one engineer’s shell history and nowhere else. I’ve watched a team rediscover the same disk-cleanup dance three separate times. The manual heroics happen, the postmortem gets written, and the actual fix — the sequence of commands that worked — evaporates. AI changes the economics of that loss. It can take the timeline of what you did and draft the playbook you should have had. But “draft” is the operative word. It writes like a fast junior engineer: quick, mostly right, and absolutely not allowed to merge its own code or touch prod.

Here’s the workflow that turns a one-time manual fix into durable automation without letting the model anywhere near a credential.

Feed the model the timeline, not the keys

The raw material is the incident itself: the timeline, the commands actually run, the output that confirmed the fix. That’s a text artifact, and text is all the model ever sees. You hand it the story of the remediation and ask for the code — never an SSH key, never a kubeconfig, never a token.

import anthropic

client = anthropic.Anthropic()  # ANTHROPIC_API_KEY only — no infra creds in scope

incident = """
INCIDENT-4471  disk full on log shippers
- 02:10 PagerDuty: /var at 98% on log-ship-{01..04}
- 02:14 ran: du -sh /var/log/* | sort -h  -> journal 6G, old gz logs 9G
- 02:19 ran: journalctl --vacuum-size=500M
- 02:21 ran: find /var/log -name '*.gz' -mtime +14 -delete
- 02:25 /var back to 41%, alert cleared
- root cause: logrotate maxage not set, no journal cap
"""

resp = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=2048,
    system=(
        "You are a senior SRE drafting reusable automation from an incident. "
        "Output a single idempotent Ansible task file. Every task MUST be safe "
        "to re-run. No shell where a module exists. Add a # REVIEW comment on any "
        "destructive step. You are drafting for human review — do not assume merge."
    ),
    messages=[{"role": "user", "content": incident}],
)
print(resp.content[0].text)

Notice what the system prompt insists on: idempotency, modules over shell, and an explicit acknowledgment that a human reviews this. I’m setting the model’s expectations the same way I’d set a junior’s — here’s the standard, and your code isn’t going in until someone signs off.
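Timelines pasted from shell history can carry credentials by accident, such as a token in a curl command. The sketch below scrubs a few well-known secret shapes from the incident text before it goes into the prompt. The function name `scrub_timeline` and the pattern list are my own illustration, not part of any library, and three regexes are nowhere near a complete secret scanner.

```python
import re

# Hypothetical, deliberately small pattern list; extend it for your own environment.
_SECRET_PATTERNS = [
    # AWS access key IDs: "AKIA" followed by 16 uppercase letters or digits.
    (re.compile(r"\bAKIA[0-9A-Z]{16}\b"), "[REDACTED_AWS_KEY_ID]"),
    # PEM private key blocks, including everything between the markers.
    (
        re.compile(
            r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
            re.DOTALL,
        ),
        "[REDACTED_PRIVATE_KEY]",
    ),
    # HTTP bearer tokens, e.g. from a pasted curl command.
    (re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]{8,}"), "Bearer [REDACTED_TOKEN]"),
]


def scrub_timeline(text: str) -> str:
    """Return the timeline with anything matching a known secret shape replaced."""
    for pattern, replacement in _SECRET_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Calling `scrub_timeline(incident)` before building `messages` keeps the model's input limited to the story of the fix. A human should still skim the timeline, since regexes only catch the shapes they know about.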

Demand idempotent code, because automation re-runs

The manual fix ran once. The automation will run a hundred times, sometimes on hosts that are already fine. A find ... -delete is fine once; an idempotent module is fine forever. The difference between a remediation script and a playbook is that the playbook can run against an already-healthy host and do nothing. Here’s the kind of task I want back:

# roles/log_hygiene/tasks/main.yml — idempotent by construction
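- name: Ensure the journald drop-in directory exists
  ansible.builtin.file:
    path: /etc/systemd/journald.conf.d
    state: directory
    owner: root
    mode: "0755"

- name: Cap systemd journal size persistently
  ansible.builtin.copy:
    dest: /etc/systemd/journald.conf.d/size.conf
    content: |
      [Journal]
      SystemMaxUse=500M
    owner: root
    mode: "0644"
  notify: restart journald          # needs a matching handler in handlers/main.yml; only fires if the file actually changed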

- name: Enforce logrotate maxage so .gz logs don't pile up
  ansible.builtin.lineinfile:
    path: /etc/logrotate.conf
    regexp: '^\s*maxage'
    line: "maxage 14"
    state: present                  # converges; re-runs are no-ops

- name: Find archived logs older than 14 days
  ansible.builtin.find:
    paths: /var/log
    patterns: "*.gz"
    recurse: true                   # the incident's `find /var/log` was recursive; the module is not by default
    age: 14d
    age_stamp: mtime
  register: stale_logs

- name: Delete the stale archives found   # REVIEW: destructive
  ansible.builtin.file:
    path: "{{ item.path }}"
    state: absent
  loop: "{{ stale_logs.files }}"
  loop_control: { label: "{{ item.path }}" }

This is materially better than the 2am one-liners. It fixes the root cause (journal cap, logrotate maxage) instead of just the symptom, and the cleanup is expressed as converging state. The model did the tedious translation from “commands I ran” to “state I want.” That’s exactly the junior-engineer job: fast, structured, and in need of a senior’s eyes.
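To make "converging state" concrete, here is a minimal Python sketch loosely modelled on what a line-in-file style task does: describe the line you want, change the file only if it differs, and report whether anything changed. The function `ensure_line` is hypothetical and much simpler than Ansible's real module (it replaces the first match rather than handling every edge case).

```python
import re
from pathlib import Path


def ensure_line(path: Path, pattern: str, line: str) -> bool:
    """Replace the first line matching `pattern` with `line`, or append `line`.

    Returns True only if the file content actually changed.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    regex = re.compile(pattern)
    result, replaced = [], False
    for current in existing:
        if not replaced and regex.search(current):
            result.append(line)      # converge the matching line to the desired value
            replaced = True
        else:
            result.append(current)
    if not replaced:
        result.append(line)          # no match anywhere: add the line once
    if result == existing:
        return False                 # already in the desired state: no write, no change
    path.write_text("\n".join(result) + "\n")
    return True
```

The first call on a file without `maxage` returns `True`; every later call with the same arguments returns `False` and leaves the file untouched. That "second run is a no-op" property is what the playbook tasks are expected to have.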

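Pro Tip: Idempotency is the one thing you cannot take the model’s word for. Prove it by applying the playbook to a canary and then running it again in check mode: the second pass must report zero changes. Two dry-runs in a row prove nothing, because check mode never applies the first pass, so the second reports the same pending changes. If the pass after a real apply still reports changed, a task is non-converging and will fight itself forever.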

A human reviews it — every line, no exceptions

The model wrote a draft. A person now owns it. This isn’t a rubber stamp; it’s the part where the senior engineer reads each task and asks the questions the model can’t: Is /var/log really the right scope, or did we just teach a fleet to delete things? Is 14 days correct for this environment’s compliance retention? Does lineinfile on logrotate.conf clobber a setting another role manages?

I route generated remediation through the same review surface as everything else — the code review dashboard — so an AI-drafted playbook gets the identical scrutiny a hand-written one would. The # REVIEW comments the model left on destructive steps become the reviewer’s checklist. Approval here is a human accepting authorship. From the moment it merges, it’s the team’s code, not the model’s.
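One way to turn those `# REVIEW` markers into an actual checklist is to pull them out of the draft mechanically. This is a small sketch under my own assumptions: the helper name `review_checklist` is hypothetical, and it does plain string matching on each line rather than parsing YAML.

```python
def review_checklist(task_yaml: str) -> list[str]:
    """Build one markdown checkbox per line that carries a '# REVIEW' marker."""
    items = []
    for number, raw in enumerate(task_yaml.splitlines(), start=1):
        if "# REVIEW" not in raw:
            continue
        code, _, note = raw.partition("# REVIEW")
        # Use the task name as the label when the marker sits on a '- name:' line.
        label = code.strip().removeprefix("- name:").strip() or "(comment-only line)"
        note = note.lstrip(": ").strip()
        suffix = f" ({note})" if note else ""
        items.append(f"- [ ] line {number}: {label}{suffix}")
    return items
```

The output can be pasted into the PR description so the reviewer ticks off each destructive step explicitly instead of scanning for comments by eye.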

Dry-run before it’s allowed near prod

Reviewed isn’t proven. Before this playbook touches a real host, it runs in check mode against a canary. That is the same discipline I’d apply to any change, covered in depth in the dry-run and simulation thinking on evidence gates.

# Predict the effect without mutating anything (this alone does not prove idempotency).
ansible-playbook log_hygiene.yml --check --diff --limit log-ship-canary

# Run it for real on the canary only, then re-check: must be 0 changed.
ansible-playbook log_hygiene.yml --limit log-ship-canary
ansible-playbook log_hygiene.yml --check --limit log-ship-canary   # expect: ok=N changed=0

That second --check is the idempotency proof in practice. If the playbook genuinely converged the canary, re-checking it reports zero changes. If it doesn’t, the playbook goes back to review before it ever sees the fleet. The model’s draft has to earn its way to prod through evidence, not assertion.
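If you want that check enforced rather than eyeballed, the PLAY RECAP at the end of the run can be parsed. The sketch below assumes the default, uncoloured recap format (`host : ok=N changed=N ...`) captured as text; the function names `recap_counters` and `is_converged` are hypothetical.

```python
import re

# One recap line looks like: "host : ok=5 changed=0 unreachable=0 failed=0 ..."
_RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")


def recap_counters(output: str) -> dict[str, dict[str, int]]:
    """Extract per-host counters from the PLAY RECAP section of captured output."""
    hosts: dict[str, dict[str, int]] = {}
    in_recap = False
    for line in output.splitlines():
        if line.startswith("PLAY RECAP"):
            in_recap = True
            continue
        if not in_recap:
            continue
        match = _RECAP_LINE.match(line.strip())
        if match:
            pairs = (item.split("=") for item in match["counters"].split())
            hosts[match["host"]] = {key: int(value) for key, value in pairs}
    return hosts


def is_converged(output: str) -> bool:
    """True only if at least one host reported and none changed, failed, or was unreachable."""
    hosts = recap_counters(output)
    return bool(hosts) and all(
        c.get("changed", 0) == 0 and c.get("failed", 0) == 0 and c.get("unreachable", 0) == 0
        for c in hosts.values()
    )
```

Feeding it the captured output of the post-apply `--check` run gives a yes/no answer a gate can act on. An empty or unparseable recap counts as not converged, so a crashed run cannot pass by accident.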

Commit via PR — never apply, never push to main

The generated code lands in the runbook repo as a pull request, not an apply. This is the credential boundary again, expressed in git: the model produced text, the text becomes a PR, and a human merges it. The model itself has no write access to the repo and no path to production; the CI job that wraps it can push a branch and open a PR, and nothing more. The whole pipeline is gated on a person clicking merge.

# .github/workflows/remediation-pr.yml
name: ai-remediation-draft
on:
  workflow_dispatch:
    inputs:
      incident_id: { required: true }
jobs:
  draft:
    runs-on: ubuntu-latest
    permissions:
      contents: write
      pull-requests: write        # may open a PR; may NOT merge it
    steps:
      - uses: actions/checkout@v4
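      - name: Generate playbook from incident
        run: |
          python gen_remediation.py "$INCIDENT_ID" \
            --out "roles/generated/$INCIDENT_ID/tasks/main.yml"
        env:
          INCIDENT_ID: ${{ inputs.incident_id }}                # passed via env, not spliced into the shell
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}   # only secret present

      - name: Lint and syntax gate
        env:
          INCIDENT_ID: ${{ inputs.incident_id }}
        run: |
          ansible-lint "roles/generated/$INCIDENT_ID"
          # --syntax-check expects a playbook, so wrap the bare tasks file in a throwaway play.
          printf -- '- hosts: localhost\n  tasks:\n    - ansible.builtin.import_tasks: %s\n' \
            "$GITHUB_WORKSPACE/roles/generated/$INCIDENT_ID/tasks/main.yml" > "$RUNNER_TEMP/syntax.yml"
          ansible-playbook --syntax-check "$RUNNER_TEMP/syntax.yml"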

      - name: Open PR for human review
        uses: peter-evans/create-pull-request@v6
        with:
          branch: ai/remediation-${{ inputs.incident_id }}
          title: "Draft remediation for ${{ inputs.incident_id }} (AI — review required)"
          body: |
            Auto-drafted from incident timeline. **Do not merge without review.**
            - [ ] Idempotency proven (apply on canary, then re-run in check mode: 0 changed)
            - [ ] Destructive steps (`# REVIEW`) confirmed safe
            - [ ] Blast radius and back-out documented
          labels: ai-generated, needs-review

The job can open a PR. It cannot merge one — branch protection requires a human approval, and ansible-lint plus --syntax-check block obviously broken drafts before a reviewer wastes time on them. The only secret in the whole workflow is the Anthropic key; there isn’t an AWS credential or a prod kubeconfig anywhere in scope, by design.
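The "only secret in scope" property can also be asserted at the top of the drafting script, so a misconfigured runner fails loudly instead of quietly widening the blast radius. This is a sketch under assumptions: the deny-list below is illustrative, and the article does not show what `gen_remediation.py` actually contains.

```python
import os
from collections.abc import Mapping

# Hypothetical deny-list; extend it for whatever infrastructure credentials your runners could carry.
FORBIDDEN_ENV = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN", "KUBECONFIG")


def leaked_credentials(env: Mapping[str, str]) -> list[str]:
    """Return the names of forbidden variables that are set to a non-empty value."""
    return sorted(name for name in FORBIDDEN_ENV if env.get(name))


def assert_clean_environment(env: Mapping[str, str] = os.environ) -> None:
    """Refuse to continue if any infrastructure credential is visible to the drafting job."""
    leaked = leaked_credentials(env)
    if leaked:
        raise RuntimeError(f"infra credentials present in drafting job: {', '.join(leaked)}")
```

Calling `assert_clean_environment()` before the model request makes the boundary a checked invariant rather than a convention someone has to remember.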

Pro Tip: Put the back-out path in the PR template as a required checkbox. A remediation playbook without a documented reverse isn’t done — and forcing the reviewer to write it is how you guarantee the rollback exists before the forward action ships.
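A required checkbox only works if something checks it. Here is a minimal sketch that lists unticked markdown checkboxes in a PR description; the function name `unchecked_items` is hypothetical, and fetching the PR body from your forge and wiring the result into a status check are left out.

```python
import re

# Matches markdown task-list items such as "- [ ] label" or "* [x] label".
_CHECKBOX = re.compile(r"^\s*[-*]\s+\[(?P<mark>[ xX])\]\s+(?P<label>.+?)\s*$", re.MULTILINE)


def unchecked_items(pr_body: str) -> list[str]:
    """Return the labels of every task-list item that is still unticked."""
    return [m["label"] for m in _CHECKBOX.finditer(pr_body) if m["mark"] == " "]
```

A status check that fails while `unchecked_items(body)` is non-empty keeps the back-out box from being skipped under time pressure.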

Close the loop in the postmortem

The last step makes this a habit instead of a one-off. The postmortem action item isn’t “be more careful with disk.” It’s “merge PR #812, the generated log_hygiene role.” The incident produced durable, reviewed, tested automation that prevents the next occurrence — and the next time /var creeps up, the playbook already exists. This is how toil actually dies, the theme of identifying and eliminating toil with AI: each incident pays for itself by becoming code. The prompts I use for this live in the prompt library and prompt packs.

AI is genuinely good at the tedious translation from “what I did at 2am” to “the idempotent code we should keep.” Let it draft like the fast junior it is. Then make a human review every line, prove idempotency in dry-run, and merge through a PR a person owns — never an apply, never prod creds in the model’s hands. Do that, and your incidents stop being losses and start compounding into a runbook that gets stronger every time something breaks.