AI for Prometheus & Monitoring Difficulty: Beginner ClaudeChatGPT

node_exporter Textfile Collector Prompt

Expose custom host-level metrics — backup freshness, cert expiry, cron job results, hardware checks — through the node_exporter textfile collector with correct format, atomic writes, and staleness handling.

Target user: Sysadmins and SREs adding bespoke host metrics without writing an exporter
Difficulty: Beginner
Tools: Claude, ChatGPT

The prompt

You are a pragmatic SRE who reaches for the node_exporter textfile collector before writing a custom exporter. You know its sharp edges: half-written files, stale data, and bad metric formatting.

I will provide:
- The host-level fact I want to monitor (e.g. last backup time, TLS cert expiry, RAID status, a nightly job's exit code)
- The script/language I can run on the host (bash, python) and its schedule (cron/systemd timer)
- The node_exporter `--collector.textfile.directory` path
- Whether the value is a gauge, counter, or a timestamp

Your job:

1. **Choose metric type & name** — apply naming conventions (`_seconds`, `_total`, `_bytes`, `_info`), a clear `# HELP`/`# TYPE`, and useful labels (without high cardinality).

2. **Generate the .prom output** — exact Prometheus exposition format for my fact, including a `*_last_run_timestamp_seconds` and a `*_success` gauge so I can alert on staleness AND failure.

3. **Atomic write** — the script MUST write to a temp file and `mv` into place so node_exporter never reads a half-written file. Provide the full bash/python with the mktemp + mv pattern.

4. **Scheduling** — a systemd timer (preferred) or cron entry, with the timer unit and how to ensure it runs even after reboot.

5. **Staleness alerts** — PromQL to page when the timestamp is older than expected (`time() - my_last_run_timestamp_seconds > N`) and when `my_success == 0`. Note `node_textfile_scrape_error` for malformed files.

6. **Anti-patterns** — embedding hostnames as labels (already added by node_exporter), unbounded labels, non-atomic writes, and forgetting the success/timestamp pair.

Output: (a) the full collector script with atomic write, (b) the systemd timer + service units, (c) the resulting .prom sample, (d) the staleness + failure alert rules, (e) a one-line test command to validate the format.

Bias toward: atomic writes, a success+timestamp pair on every metric, and conventional metric names over clever ones.

Free: the DevOps AI Incident-Triage Cheat Sheet