ChatOps Approval Gates for AI-Suggested Actions

I used to think the danger with AI remediation was bad suggestions. It isn’t. The danger is good suggestions that run without anyone deciding to run them. A model can propose the right fix and still be wrong about whether now is the time, whether this is the right host, or whether the person who’d object is awake. The fix for that isn’t a smarter model. It’s a button. A human reads the proposal in Slack, sees the blast radius and the back-out, and clicks Approve — and only then does anything happen. The model suggests; a person commits. That gap, mediated by ChatOps, is where I put all my trust.

Here’s the architecture I run, end to end, and the rule that holds it together: the bot holds the credentials, the model never does, and execution happens only after a verified human says yes.

The interaction pattern: propose, don’t act

When an alert fires, the AI doesn’t remediate. It drafts a proposal and posts it. The proposal is a complete package, because a human can’t approve what they can’t see. Every card carries four things: the proposed action, the plan in plain English, the blast radius, and the back-out path.

That structure isn’t decoration. It’s the minimum a person needs to make a real decision in fifteen seconds at 3am. “Restart the pod” is not approvable. “Restart api-7f9 on prod-cluster, affects 1 of 6 replicas, back-out is kubectl rollout undo” is approvable. The model is a fast junior engineer writing up the proposal; the senior on call decides.
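
One way to hold that line is to make an incomplete proposal impossible to post. Here is a minimal sketch, assuming a hypothetical `Proposal` type (the name and field names are mine for illustration, not part of any library) that refuses to build if any of the four parts is blank and renders the card body otherwise.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Proposal:
    """Hypothetical container for the four things every card must carry."""
    action: str        # what the bot would run, in one line
    plan: str          # plain-English explanation of why
    blast_radius: str  # what is affected if this goes wrong
    back_out: str      # how to undo it

    def __post_init__(self):
        # Refuse to construct a proposal a human could not meaningfully approve.
        missing = [f.name for f in fields(self) if not getattr(self, f.name).strip()]
        if missing:
            raise ValueError(f"proposal is not approvable, missing: {', '.join(missing)}")

    def to_mrkdwn(self) -> str:
        """Render the four parts as the Slack mrkdwn body of the card."""
        return (
            f"*Action:* {self.action}\n"
            f"*Why:* {self.plan}\n"
            f"*Blast radius:* {self.blast_radius}\n"
            f"*Back-out:* {self.back_out}"
        )
```

With this in place, "Restart the pod" with nothing else filled in fails at construction time, before it ever reaches a channel.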

Post a Block Kit card with Approve and Deny

Slack’s Block Kit gives you interactive buttons. The bot posts this; nothing executes yet. The value carries an opaque action ID that maps to a stored, scoped plan on the server — never the command itself, so a spoofed payload can’t smuggle in rm -rf.

{
  "channel": "C0INCIDENT",
  "blocks": [
    { "type": "header",
      "text": { "type": "plain_text", "text": "🤖 Proposed remediation — needs approval" } },
    { "type": "section",
      "text": { "type": "mrkdwn",
        "text": "*Action:* restart deployment `api` (rolling)\n*Why:* OOMKilled x4 in 5m on `api-7f9`\n*Blast radius:* 1/6 replicas at a time, prod\n*Back-out:* `kubectl rollout undo deployment/api`" } },
    { "type": "actions",
      "block_id": "remediation_gate",
      "elements": [
        { "type": "button", "style": "primary",
          "text": { "type": "plain_text", "text": "✅ Approve" },
          "action_id": "approve",
          "value": "act_5f3c9a",
          "confirm": {
            "title": { "type": "plain_text", "text": "Run this in production?" },
            "text": { "type": "mrkdwn", "text": "Restarts `api`. You own this decision." },
            "confirm": { "type": "plain_text", "text": "Approve" },
            "deny": { "type": "plain_text", "text": "Cancel" } } },
        { "type": "button", "style": "danger",
          "text": { "type": "plain_text", "text": "🚫 Deny" },
          "action_id": "deny", "value": "act_5f3c9a" }
      ] }
  ]
}

The confirm dialog is a deliberate speed bump — it makes the human acknowledge they own the decision before the action runs. One click is too easy to misfire.
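
The card shows the opaque ID but not where it comes from. A rough sketch of that step, assuming an in-memory `PENDING` dict shaped like the one the handler reads; `register_plan` and `build_actions_block` are hypothetical helper names, and the confirm dialog is left out for brevity.

```python
import secrets
import time
from typing import Optional

PENDING = {}  # act_id -> {"plan": ..., "created": ts}; in-memory is an assumption


def register_plan(plan: dict, now: Optional[float] = None) -> str:
    """Store the vetted plan server-side and return an opaque ID for the buttons."""
    act_id = "act_" + secrets.token_hex(8)   # random; carries no command text
    PENDING[act_id] = {"plan": plan, "created": time.time() if now is None else now}
    return act_id


def build_actions_block(act_id: str) -> dict:
    """Build the Approve/Deny actions block; both buttons carry only the ID."""
    def button(label: str, action_id: str, style: str) -> dict:
        return {"type": "button", "style": style,
                "text": {"type": "plain_text", "text": label},
                "action_id": action_id, "value": act_id}

    return {"type": "actions", "block_id": "remediation_gate",
            "elements": [button("Approve", "approve", "primary"),
                         button("Deny", "deny", "danger")]}
```

The plan's argv never leaves the server. All Slack sees, and all a spoofed payload could send back, is a random token that either matches a stored plan or doesn't.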

Authorize the approver — not everyone can say yes

A button anyone can press is not a gate; it’s a suggestion box. The handler must verify who clicked before it does anything. Approval authority is a property of the human, checked against an allowlist or your IdP group, not a property of the channel.

import time, hmac, hashlib, logging
from fastapi import FastAPI, Request, HTTPException

app = FastAPI()
APPROVERS = {"U_ALICE", "U_BOB"}          # on-call leads only
SIGNING_SECRET = b"<slack signing secret>"
PENDING = {}  # act_id -> {"plan":..., "created": ts}  populated when card posts
APPROVAL_TTL = 600   # 10 minutes

def verify_slack_sig(req_body: bytes, ts: str, sig: str):
    try:
        age = abs(time.time() - int(ts))
    except ValueError:                                # missing or non-numeric timestamp
        raise HTTPException(403, "bad timestamp")
    if age > 60 * 5:
        raise HTTPException(403, "stale request")     # replay protection
    base = b"v0:" + ts.encode() + b":" + req_body
    mine = "v0=" + hmac.new(SIGNING_SECRET, base, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(mine, sig):
        raise HTTPException(403, "bad signature")

@app.post("/slack/interactions")
async def interactions(request: Request):
    raw = await request.body()
    # .get() so a request without Slack's headers gets a 403, not a KeyError/500
    verify_slack_sig(raw, request.headers.get("X-Slack-Request-Timestamp", ""),
                     request.headers.get("X-Slack-Signature", ""))
    payload = parse_payload(raw)            # Slack form-encodes a JSON blob
    user = payload["user"]["id"]
    action = payload["actions"][0]
    act_id = action["value"]

    if action["action_id"] == "deny":
        PENDING.pop(act_id, None)           # a denial retires the proposal for good
        audit(act_id, user, "denied")
        return {"text": "Denied. Nothing ran."}

    # --- the four gates ---
    if user not in APPROVERS:
        audit(act_id, user, "rejected_unauthorized")
        raise HTTPException(403, "you are not an approver")
    plan = PENDING.get(act_id)
    if plan is None:
        raise HTTPException(410, "unknown or already-handled action")
    if time.time() - plan["created"] > APPROVAL_TTL:
        del PENDING[act_id]; audit(act_id, user, "expired")
        return {"text": "⏱️ Approval window expired. Re-run the diagnosis."}

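    del PENDING[act_id]                     # consume: one approval, one run
    audit(act_id, user, "approved")
    # Slack expects an HTTP 200 within 3 seconds. If a plan can run longer, ack
    # here and run it in a background task, reporting back via the response_url.
    result = execute_scoped(plan["plan"])   # bot's creds, not the model's
    return {"text": f"✅ Ran (approved by <@{user}>). {result}"}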

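Four checks are doing the heavy lifting. The signature check proves the request really came from Slack (and isn’t a replay). The APPROVERS set proves the clicker is allowed. The PENDING lookup proves the action ID refers to a plan the server actually stored. And the TTL check rejects approvals that arrive too late. Deleting the entry from PENDING before execution means an action runs at most once: no double-execution, no stale re-approval.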
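
The handler calls parse_payload without showing it. Slack sends interaction requests as a form-encoded body whose `payload` field holds the JSON, so a minimal version, written here as my own sketch using only the standard library, could look like this:

```python
import json
from urllib.parse import parse_qs


def parse_payload(raw: bytes) -> dict:
    """Decode Slack's form-encoded interaction body into the payload dict."""
    form = parse_qs(raw.decode("utf-8"))      # {"payload": ["<json string>"]}
    values = form.get("payload")
    if not values:
        raise ValueError("no payload field in request body")
    return json.loads(values[0])
```

It raises ValueError on a body with no payload field. Mapping that to an HTTP 400 in the handler is a choice I have left out of the sketch.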

Time-box the approval

An approval is only valid in context. A proposal to restart a pod made sense when the pod was OOMing; if someone approves it forty minutes later, the situation may have changed entirely. So approvals expire. The APPROVAL_TTL above kills any proposal older than ten minutes. After that, the operator has to re-run diagnosis and get a fresh proposal — which is correct, because the world moved on.

Pro Tip: Set the TTL shorter than your alert’s natural recovery time. If an alert auto-resolves in fifteen minutes, a twenty-minute approval window means you might “fix” something that already healed itself.
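
One gap worth noting: the handler only notices an expired proposal when someone clicks it, so cards nobody touches sit in PENDING indefinitely. A small sketch of a periodic sweep, assuming the same entry shape as PENDING; `is_expired` and `sweep_expired` are hypothetical names, and how you schedule the sweep is up to you.

```python
import time
from typing import Optional


def is_expired(created: float, ttl: float, now: Optional[float] = None) -> bool:
    """True once more than ttl seconds have passed since the card was posted."""
    now = time.time() if now is None else now
    return now - created > ttl


def sweep_expired(pending: dict, ttl: float, now: Optional[float] = None) -> list:
    """Remove proposals older than ttl seconds and return their IDs."""
    now = time.time() if now is None else now
    stale = [act_id for act_id, entry in pending.items()
             if is_expired(entry["created"], ttl, now)]
    for act_id in stale:        # delete after the scan, never while iterating
        del pending[act_id]
    return stale
```

The returned IDs are the ones to audit as expired, and if you keep the message timestamp alongside the plan, the ones whose cards you can update to say the window closed.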

The bot executes with a scoped service account — the model never gets creds

This is the rule everything else exists to protect. The language model produces text: a proposed plan, a summary, an explanation. It is never handed a kubeconfig, an AWS key, or a database password. Execution happens in execute_scoped, which uses a narrow service account the bot owns — one that can roll a deployment but cannot, say, delete a namespace or read secrets.

import subprocess

# RBAC for this SA allows ONLY get/patch on deployments in ns=prod-api, roughly
# what `kubectl rollout restart` needs ("rollout" is a kubectl subcommand, not an
# RBAC verb). It cannot create, delete, exec, or touch secrets. Blast radius is
# capped at the RBAC layer, not by trusting the caller to behave.
SA_KUBECONFIG = "/etc/chatops/scoped-kubeconfig"

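def execute_scoped(plan: dict) -> str:
    cmd = plan["argv"]                      # pre-vetted argv list, never a shell string
    allowed = {"rollout", "scale", "annotate"}
    # "kubectl" is prepended below, so the subcommand is cmd[0], not cmd[1].
    if not cmd or cmd[0] not in allowed:    # defense in depth atop RBAC
        raise PermissionError(f"verb {cmd[:1]} not permitted via chatops")
    out = subprocess.run(
        ["kubectl", f"--kubeconfig={SA_KUBECONFIG}", *cmd],
        capture_output=True, text=True, timeout=30,
    )
    if out.returncode != 0:                 # surface failures instead of hiding them
        return f"kubectl exited {out.returncode}: {out.stderr.strip()}"
    return out.stdout.strip()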

Two layers of scoping: the service account’s RBAC physically can’t do dangerous things, and the handler refuses verbs outside an allowlist. Even if the model hallucinated kubectl delete ns prod, the SA has no permission to honor it and the verb check rejects it first. The credential boundary is the hard wall. I dig into the credential-isolation pattern more in building self-healing infrastructure with AI, and you can wire AI proposals straight into the incident response dashboard.
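
The verb allowlist has a blind spot: an argv can start with a permitted verb and still carry a global kubectl flag that points the command at a different identity or cluster. Here is a sketch of a stricter check, assuming argv excludes the leading kubectl (as the subprocess call implies). `validate_argv` is a hypothetical helper and the flag list is illustrative, not exhaustive.

```python
ALLOWED_VERBS = {"rollout", "scale", "annotate"}   # same allowlist as above

# kubectl global flags that would swap the identity or the target cluster.
FORBIDDEN_FLAGS = {"--kubeconfig", "--as", "--as-group", "--as-uid", "--token",
                   "--server", "-s", "--context", "--cluster", "--user"}


def validate_argv(argv: list) -> list:
    """Return argv unchanged if it is safe to pass to kubectl, else raise."""
    if not argv or argv[0] not in ALLOWED_VERBS:
        raise PermissionError(f"verb {argv[:1]} not permitted via chatops")
    for arg in argv[1:]:
        name = arg.split("=", 1)[0]          # handles --flag=value and --flag value
        if name in FORBIDDEN_FLAGS:
            raise PermissionError(f"flag {name} not permitted via chatops")
    return argv
```

RBAC is still the wall. This check only keeps a plan from stepping around the scoped kubeconfig before RBAC gets a say.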

Audit everything — the log is the receipt

Every proposal, approval, denial, expiry, and unauthorized click gets written down with who, what, and when. Not for blame — for trust. When someone asks “why did the API restart at 3:14am,” the answer is a row: proposed by the bot, approved by Bob, executed by the scoped SA, here’s the back-out that was on the card.

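audit_log = logging.getLogger("chatops_audit")
audit_log.setLevel(logging.INFO)   # the default WARNING threshold would drop these
# Attach a handler whose formatter emits the `extra` fields; the stock format omits them.

def audit(act_id, user, decision):
    audit_log.info("chatops_audit", extra={
        "action_id": act_id, "actor": user,
        "decision": decision, "ts": time.time(),
    })  # ship to your immutable log store, not just stdout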

The audit trail closes the loop: a machine proposed, a named human decided, a scoped account acted, and there’s a record. That’s the same accountability backbone behind runbooks engineers trust at 3am. You can find the Slack proposal prompts I use in the prompt packs and across the automation category.
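
One caveat about the audit call: Python's logging attaches `extra` fields to the record, but the stock formatter doesn't print them, and an unconfigured logger drops info-level records entirely. Here is a sketch of a formatter that writes each decision as one JSON line. `JsonAuditFormatter` and `make_audit_logger` are hypothetical names, and the handler you pass in stands in for whatever ships to your log store.

```python
import json
import logging

AUDIT_FIELDS = ("action_id", "actor", "decision", "ts")


class JsonAuditFormatter(logging.Formatter):
    """Render the audit fields passed via `extra` as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        row = {name: getattr(record, name, None) for name in AUDIT_FIELDS}
        row["event"] = record.getMessage()
        return json.dumps(row, sort_keys=True)


def make_audit_logger(handler: logging.Handler) -> logging.Logger:
    """Return a dedicated logger that emits info-level audit rows as JSON."""
    logger = logging.getLogger("chatops_audit")
    logger.setLevel(logging.INFO)     # root's default WARNING would drop info()
    handler.setFormatter(JsonAuditFormatter())
    logger.addHandler(handler)
    logger.propagate = False          # keep audit rows out of the app's own logs
    return logger
```

Each row then holds the answer to "why did the API restart": an action ID, an actor, a decision, and a timestamp, in a form a log store can index.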

ChatOps approval gates turn AI from an actor into an advisor — a fast, tireless one that drafts the fix and shows its work, while a verified human owns the decision and a scoped account does the touching. Keep the credentials in the bot, keep the model in the text, and time-box every yes. Do that and you get the speed of automation without ever ceding the one thing that should never be automated: the choice to act in production.