Turning Tribal Knowledge Into Automation With AI
The senior engineer who just knows how to fix the flaky job. Use AI to extract that tacit knowledge into structured runbooks and safe, idempotent automation.
- #automation
- #runbooks
- #ai
- #knowledge
- #ansible
Every team has a Priya. When the nightly ETL job goes red, six people ping her, she squints at the dashboard for forty seconds, runs three commands nobody else remembers, and it’s green again. She’s never written it down because it’s “obvious,” and it’s only obvious because she’s been carrying it in her head for four years. Then Priya goes on vacation, the job fails, and the whole team relearns at 1 AM what she could have told us in a sentence. That sentence — the tacit, undocumented, “I just know” fix — is the most valuable and most fragile asset most ops teams own. AI is finally a decent tool for prying it loose, but only if you treat the model as a fast junior scribe and Priya as the authority. The model drafts. The human ratifies. Get that backwards and you’ve automated a procedure nobody actually understands.
Tribal knowledge is a liability, not a badge of honor
We romanticize the engineer who “just knows.” It’s a single point of failure wearing a hoodie. The knowledge isn’t documented because writing it down is tedious and the expert is busy doing the very work that would justify documenting. That tedium is the bottleneck — and tedium is exactly what AI eats. The point isn’t to replace Priya’s judgment. It’s to capture the procedure so the judgment scales past her. This is the same toil-elimination instinct from identifying and eliminating toil with AI: find the repetitive, knowledge-bound work and make it legible.
Mine the artifacts you already have
Before you interview anyone, harvest the trail. Slack threads, incident postmortems, PagerDuty notes, and shell history all contain the procedure in fragments. An LLM is good at reassembling those fragments into a draft.
# Pull the raw material an expert already left lying around
sources = {
"slack": fetch_threads(channel="#etl-alerts", query="nightly job failed", days=180),
"incidents": fetch_postmortems(tag="etl"),
"shell": read_history(host="etl-runner-01", user="priya", grep="airflow|psql|s3"),
}
prompt = f"""You are drafting a runbook from messy source material.
Extract the recurring fix for the nightly ETL failures.
List the diagnostic steps, the remediation commands actually used,
and flag anything ambiguous with TODO(verify).
Sources:
{render(sources)}
Output the runbook as YAML following this schema: {SCHEMA}.
Do not invent commands. If a step is unclear, write TODO(verify: ...)."""
The “do not invent” and TODO(verify) instructions are load-bearing. A model asked to fill gaps will happily fabricate a plausible kubectl flag that doesn’t exist. You want it to leave holes, not paper over them.
Interview the expert with a chat that asks the right follow-ups
Artifact mining gets you maybe 60%. The rest lives only in Priya’s head, and the fastest way out is a structured interview. I have the model conduct it — it never gets bored, never skips the dumb clarifying question, and turns rambling answers into structure.
SYSTEM = """You are interviewing a senior engineer to capture a runbook.
Ask ONE question at a time. After each answer, ask the most useful
follow-up to remove ambiguity: exact commands, expected output,
how to know it worked, what to do if it didn't, and the back-out step.
When you have enough, output the runbook YAML for the human to correct."""
The follow-ups are where the gold is. Experts skip steps that feel automatic — “then I just restart the worker” — and a good interviewer model pins down which worker, how you confirm it came back, and what you do if it doesn’t. A tool like Claude works fine as the interview surface; the prompt workspace keeps the interview prompts versioned.
Pro Tip: Record the interview transcript verbatim alongside the runbook. Six months later when someone asks “why do we do step 4?”, the answer is in Priya’s own words, not a lossy summary.
Draft into a structured runbook schema
Free-text runbooks rot. A schema forces completeness — every step needs a verification and the whole thing needs a back-out path before it’s allowed to exist. Here’s the schema I have the model target:
# runbooks/etl-nightly-recovery.yaml
id: rb-etl-nightly-recovery
title: "Nightly ETL job failure recovery"
owner: "data-platform"
expert_source: "priya@ — interviewed 2026-06-10"
preconditions:
- "Alert: AirflowDagFailed dag_id=nightly_etl"
- "On-call has read access to the airflow-prod namespace"
diagnostics:
- step: "Check for a stuck worker pod"
cmd: "kubectl -n airflow-prod get pods -l role=worker"
expect: "All workers Running; a Pending/Error worker is the usual culprit"
remediation:
- step: "Clear the failed task instance and let it reschedule"
cmd: "airflow tasks clear nightly_etl -t transform_load -y"
verify: "airflow tasks state nightly_etl transform_load <ds> == success"
idempotent: true
- step: "If the worker is wedged, roll it"
cmd: "kubectl -n airflow-prod rollout restart deploy/airflow-worker"
verify: "kubectl -n airflow-prod rollout status deploy/airflow-worker"
idempotent: true
back_out:
- "Re-running clear is safe; if data looks double-loaded, run reconcile.sql"
escalation: "Page #data-platform if transform_load fails twice after clearing"
auto_safe: false # NOT yet approved for unattended execution
Two fields earn their keep: verify on every remediation step, and back_out at the bottom. A step you can’t verify isn’t a procedure, it’s a hope. A procedure you can’t undo isn’t safe to automate.
Turn the ratified runbook into idempotent automation
Once a human has corrected and signed off on the runbook — emphasis on ratified, not just generated — you can have the model draft automation from it. Idempotency is the rule: running it twice must be safe, because retries and overlapping alerts will run it twice.
# generated, then reviewed: etl_recovery.yml
- name: Recover nightly ETL
hosts: airflow_workers
gather_facts: false
tasks:
- name: Clear the failed transform_load task (idempotent)
ansible.builtin.command:
cmd: "airflow tasks clear nightly_etl -t transform_load -y"
register: clear_result
changed_when: "'cleared' in clear_result.stdout"
- name: Confirm task reached success before declaring victory
ansible.builtin.command:
cmd: "airflow tasks state nightly_etl transform_load {{ ds }}"
register: state
until: "'success' in state.stdout"
retries: 6
delay: 30
changed_when: false
The until block matters: the playbook doesn’t just do the fix, it confirms the fix worked, which is the part humans always remember and generated code always forgets unless you make the verify step a first-class requirement of the schema.
Never automate a procedure you only half understand
Here’s the failure I’ve watched twice. A team mines a runbook, the model produces clean-looking automation, someone schedules it to run on the alert — and nobody on the team can actually explain step 4. Then step 4’s assumptions change, the automation keeps firing, and it makes a small problem worse at machine speed.
The rule: a procedure is only eligible for unattended automation once a human who understands it has flipped auto_safe and the automation has run attended enough times to earn trust. Until then it stays a suggested runbook a human executes. And the automation runs under its own narrowly scoped service account — never the expert’s personal admin credentials, never anything that can touch prod broadly. The model that drafted the runbook should never hold those credentials at all. If you want the staged path from “suggested” to “unattended,” confidence-gated auto-remediation lays it out.
Make capture a habit, not a heroic project
The reason tribal knowledge persists is that documenting it is a one-time heroic slog nobody schedules. Lower the activation energy: at the close of every incident, have the bot draft the runbook delta from the incident channel automatically, and ask the resolver to ratify it in two minutes while it’s fresh.
def on_incident_resolved(incident):
draft = llm_draft_runbook(incident.timeline, incident.commands_run)
ask_for_ratification(incident.resolver, draft) # human edits + approves
Done at every incident, the backlog of undocumented procedures shrinks instead of growing. The automation category and our prompt packs have ready-made interview and drafting prompts if you’d rather not write them from scratch.
Conclusion
Tribal knowledge isn’t a culture problem you can fix with a stern memo about documentation; it’s a tedium problem, and tedium is what AI removes. Mine the artifacts, interview the expert, draft into a strict schema, and let a human ratify before a single line becomes unattended automation. The goal isn’t to replace Priya. It’s to make sure the team doesn’t go dark the week she’s on a beach.