Set Operations in Bash: comm, join, and sort for Inventory Reconciliation

Inventory drift is a permanent fact of operations. The cloud provider knows about 412 instances, your CMDB lists 408, your monitoring agent reports 415, and your Ansible inventory has 401. Somewhere in those gaps live untracked machines, decommissioned hosts that still page, and the one box nobody is watching. The instinct is to write a Python script with a couple of sets. But for the common case of “compare two lists of identifiers,” the standard toolbox that ships alongside Bash already has everything you need: sort, comm, and join. They are specified by POSIX, they are installed nearly everywhere, and they cope with files larger than memory: comm and join read their inputs in a single streaming pass, and sort spills to temporary files when the data does not fit in RAM.

I still write Python when the reconciliation involves real joins across structured records. But for the daily question of “what is in list A that is not in list B,” reaching for comm is faster to write, faster to run, and trivially auditable. This guide covers the set operations, the one rule you absolutely must follow, and how I let AI draft the field-juggling parts while keeping verification in my own hands.

The One Rule: Everything Must Be Sorted

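Both comm and join assume their inputs are sorted on the comparison key. This is not a suggestion. If the inputs are not sorted, the results are wrong, and you cannot rely on an error to tell you: GNU comm and join print a “not in sorted order” warning in some cases, but other implementations may stay silent. So step one for every workflow below is to normalize and sort.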

# Pull clean, sorted host lists from each source of truth
aws ec2 describe-instances \
  --query 'Reservations[].Instances[].PrivateDnsName' --output text \
  | tr '\t' '\n' | sort -u > /tmp/cloud.txt

# Skip blank lines, comments, and [group] headers in the INI inventory
awk '!/^[ \t]*($|#|\[)/ {print $1}' /etc/ansible/hosts | sort -u > /tmp/ansible.txt

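The sort -u both sorts and removes duplicates, which is what you want for set semantics. Use a consistent locale (LC_ALL=C sort) if you care about byte-order stability across machines, especially when comparing files generated on different systems. Whichever locale you pick, run comm and join under the same one, because they expect exactly the collation order that sort produced.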
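
Raw exports rarely arrive clean, so a small normalization step before the sort can prevent false mismatches. The sketch below is one way to do it. The function names and the cleanup rules (lowercasing, trimming, dropping a trailing dot, skipping blank lines and # comments) are assumptions chosen for illustration, not rules taken from any particular source of truth.

```shell
#!/usr/bin/env bash
set -euo pipefail

# normalize_hosts: read identifiers on stdin, write a clean sorted set to stdout.
# The cleanup rules below are illustrative assumptions; adjust them to your data.
normalize_hosts() {
  tr -d '\r' \
    | tr '[:upper:]' '[:lower:]' \
    | awk '{ gsub(/^[ \t]+|[ \t]+$/, ""); sub(/\.$/, "") }
           $0 != "" && $0 !~ /^#/' \
    | LC_ALL=C sort -u
}

# is_sorted_set: succeed only if FILE is already sorted with no duplicate
# lines in the C locale. sort -c checks order; adding -u also rejects repeats.
is_sorted_set() {
  LC_ALL=C sort -c -u "$1" 2>/dev/null
}
```

A typical call would be normalize_hosts < raw-export.txt > /tmp/cloud.txt, followed by is_sorted_set /tmp/cloud.txt as a cheap precondition before any comm or join.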

comm: The Three-Way Set Comparison

comm compares two sorted files and prints three columns: lines unique to file 1, lines unique to file 2, and lines common to both.

comm /tmp/cloud.txt /tmp/ansible.txt
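
The three columns are separated only by leading tabs, which makes the raw output hard to read at a glance. As a rough illustration, the sketch below runs comm on a tiny made-up fixture and replaces each tab with the visible text <TAB>. The helper name comm_visible and the fixture contents are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# comm_visible: run comm on two sorted files and render each tab as the
# literal text <TAB> so the column of every line is unambiguous.
comm_visible() {
  comm "$1" "$2" | awk '{ gsub(/\t/, "<TAB>"); print }'
}

# Hypothetical fixture files, created only for this demonstration.
tmp=$(mktemp -d)
printf 'host01\nhost02\nhost03\n' > "$tmp/cloud.txt"
printf 'host02\nhost03\nhost04\n' > "$tmp/ansible.txt"

comm_visible "$tmp/cloud.txt" "$tmp/ansible.txt"
# host01                 <- column 1: only in cloud.txt
# <TAB><TAB>host02       <- column 3: in both files
# <TAB><TAB>host03
# <TAB>host04            <- column 2: only in ansible.txt

rm -rf "$tmp"
```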

By default all three columns print. The real power is suppressing columns with -1, -2, -3 (the number is the column to hide). This turns comm into a clean set-difference and intersection tool:

# In cloud but NOT in Ansible -> unmanaged hosts
comm -23 /tmp/cloud.txt /tmp/ansible.txt

# In Ansible but NOT in cloud -> stale inventory entries (host is gone)
comm -13 /tmp/cloud.txt /tmp/ansible.txt

# In both -> correctly tracked hosts (the intersection)
comm -12 /tmp/cloud.txt /tmp/ansible.txt

Read the flags as “suppress these columns.” comm -23 suppresses columns 2 and 3, leaving only column 1: lines unique to the first file. This is the canonical way to compute A minus B in the shell, and it is the workhorse of inventory reconciliation. Wrap it in a tiny report:

#!/usr/bin/env bash
set -euo pipefail

cloud=/tmp/cloud.txt
ansible=/tmp/ansible.txt

printf '=== Unmanaged (in cloud, not in Ansible) ===\n'
comm -23 "$cloud" "$ansible"

printf '\n=== Stale (in Ansible, host gone from cloud) ===\n'
comm -13 "$cloud" "$ansible"

printf '\n=== Tracked correctly: %d hosts ===\n' \
  "$(comm -12 "$cloud" "$ansible" | wc -l)"

That single script answers the two questions that actually matter during an audit: what are we not managing, and what are we managing that no longer exists.
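
If the report is going to run from cron or CI, it helps to have an exit status as well as text. The following is a minimal sketch of that idea, not a finished tool: the function name drift_check, the UNMANAGED and STALE labels, and the 0/1/2 exit convention are all assumptions made for this example. It uses comm -3, which prints columns 1 and 2 together, and lets awk tell them apart by the leading tab.

```shell
#!/usr/bin/env bash
set -euo pipefail

# drift_check A B: label every entry found in only one list.
# Returns 0 when the lists match, 1 when there is drift, 2 on unreadable input.
drift_check() {
  [[ -r $1 && -r $2 ]] || return 2
  # Re-sort both inputs under one locale so comm sees the order it expects.
  LC_ALL=C comm -3 <(LC_ALL=C sort -u "$1") <(LC_ALL=C sort -u "$2") \
    | awk -F'\t' '
        $1 != "" { print "UNMANAGED", $1; n++ }  # column 1: only in first file
        $1 == "" { print "STALE", $2; n++ }      # column 2: only in second file
        END { exit (n > 0) }'
}
```

Because the script runs under set -e, call it in a conditional, for example if ! drift_check /tmp/cloud.txt /tmp/ansible.txt; then ... fi, so that detected drift does not abort the caller before it can react.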

join: When Each Line Carries More Than an Identifier

comm works on whole lines. When each record has a key plus extra fields, you want join, which performs a relational join on a shared key column.

Say you have one file of “hostname role” pairs and another of “hostname owner” pairs, both sorted on the hostname:

sort -k1,1 roles.txt > /tmp/roles.sorted
sort -k1,1 owners.txt > /tmp/owners.sorted

# Inner join on column 1: only hosts present in BOTH files
join -1 1 -2 1 /tmp/roles.sorted /tmp/owners.sorted
# host01 web alice
# host02 db bob

-1 1 -2 1 says join on field 1 of file 1 and field 1 of file 2. By default join is an inner join, emitting only matched keys. To find unmatched records, the trick is -a (output unpairable lines from the named file) combined with -o and -e to format the gaps:

# Hosts that have a role but no recorded owner
join -1 1 -2 1 -a 1 -e 'NO-OWNER' -o '1.1 1.2 2.2' \
  /tmp/roles.sorted /tmp/owners.sorted \
  | awk '$3 == "NO-OWNER"'

-a 1 keeps unpaired lines from file 1, -e 'NO-OWNER' fills missing fields, and -o '1.1 1.2 2.2' specifies the exact output layout. That field-spec syntax is fiddly, which brings us to where AI earns its keep.
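
Two other join forms are worth knowing as cross-checks on the -a/-e/-o invocation above. The sketch below uses a made-up fixture (host03 has a role but no owner, host04 the reverse); the file contents and the helper names no_owner and full_outer are hypothetical. join -v 1 prints only the unpairable lines from file 1, which should agree with the awk-filtered result, and -a 1 -a 2 with a 0 in the -o list gives a full outer join keyed on the join field.

```shell
#!/usr/bin/env bash
set -euo pipefail

tmp=$(mktemp -d)
# Hypothetical fixture: host03 has a role but no owner, host04 the reverse.
printf 'host01 web\nhost02 db\nhost03 cache\n' > "$tmp/roles.sorted"
printf 'host01 alice\nhost02 bob\nhost04 carol\n' > "$tmp/owners.sorted"

# no_owner ROLES OWNERS: lines from ROLES whose key never appears in OWNERS.
no_owner() { join -v 1 "$1" "$2"; }

# full_outer ROLES OWNERS: every key from either file, gaps filled with '-'.
# Field 0 in an -o list means "the join key", whichever file supplied it.
full_outer() { join -a 1 -a 2 -e '-' -o '0 1.2 2.2' "$1" "$2"; }

no_owner "$tmp/roles.sorted" "$tmp/owners.sorted"
# host03 cache

full_outer "$tmp/roles.sorted" "$tmp/owners.sorted"
# host01 web alice
# host02 db bob
# host03 cache -
# host04 - carol

rm -rf "$tmp"
```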

Where AI Drafts and Where You Verify

The comm flag direction and especially the join -o field specifiers are easy to get backwards. I let an assistant draft the invocation and then I verify it against tiny known inputs. When I am reconciling a messy inventory I lean on the drift detection prompt for the broader workflow, and for the raw shell field-juggling the config diff and safe merge prompt is a good drafting partner.

A representative exchange:

Prompt: Two sorted files of hostnames. Give me the hosts in prod.txt that are missing from monitored.txt.

Response: That is a set difference, A minus B:
comm -23 <(sort -u prod.txt) <(sort -u monitored.txt)
-23 suppresses columns 2 and 3, leaving only lines unique to the first file. The <(sort -u ...) re-sorts defensively in case the inputs were not already ordered.

Correct, and the defensive re-sort is a nice touch. But I never take a reconciliation result on faith, because a wrong comm flag produces a plausible-looking but inverted answer. So I verify with a known fixture before pointing it at real data:

# Verification with tiny known inputs
printf 'a\nb\nc\n' | sort > /tmp/A
printf 'b\nc\nd\n' | sort > /tmp/B

echo "A minus B (expect: a)"
comm -23 /tmp/A /tmp/B

echo "B minus A (expect: d)"
comm -13 /tmp/A /tmp/B

echo "intersection (expect: b c)"
comm -12 /tmp/A /tmp/B

If those three print a, then d, then b and c, your flag direction is correct and you can trust the same command on the real lists. AI drafts, human verifies, every time.
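
Eyeballing three outputs works, but the same fixture can also check itself. Here is one possible sketch, with hypothetical helper names assert_eq and verify_comm_flags, that compares each comm result against the answer worked out by hand and returns non-zero on any mismatch.

```shell
#!/usr/bin/env bash
set -euo pipefail

# assert_eq LABEL EXPECTED ACTUAL: print a verdict, return 1 on mismatch.
assert_eq() {
  if [[ $2 == "$3" ]]; then
    printf 'ok   %s\n' "$1"
  else
    printf 'FAIL %s: expected [%s], got [%s]\n' "$1" "$2" "$3" >&2
    return 1
  fi
}

# verify_comm_flags: build the a/b/c versus b/c/d fixture and check all three
# flag combinations. paste -s folds multi-line output onto one line.
verify_comm_flags() {
  local dir rc=0
  dir=$(mktemp -d)
  printf 'a\nb\nc\n' > "$dir/A"
  printf 'b\nc\nd\n' > "$dir/B"

  assert_eq 'A minus B'    'a'   "$(comm -23 "$dir/A" "$dir/B" | paste -s -d ' ' -)" || rc=1
  assert_eq 'B minus A'    'd'   "$(comm -13 "$dir/A" "$dir/B" | paste -s -d ' ' -)" || rc=1
  assert_eq 'intersection' 'b c' "$(comm -12 "$dir/A" "$dir/B" | paste -s -d ' ' -)" || rc=1

  rm -rf "$dir"
  return "$rc"
}

verify_comm_flags
```

Swapping in the AI-drafted command for one of the comm calls turns this into a quick regression check before the command touches real lists.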

A Note on Counting and Sanity Checks

Before acting on any diff, sanity-check the magnitudes. A reconciliation that suddenly reports 300 unmanaged hosts probably means one input failed to populate, not that 300 machines appeared overnight.

printf 'cloud=%d ansible=%d unmanaged=%d stale=%d\n' \
  "$(wc -l < /tmp/cloud.txt)" \
  "$(wc -l < /tmp/ansible.txt)" \
  "$(comm -23 /tmp/cloud.txt /tmp/ansible.txt | wc -l)" \
  "$(comm -13 /tmp/cloud.txt /tmp/ansible.txt | wc -l)"

If either source count is zero or implausible, stop before you automate any remediation off the result.
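
That stop condition can be encoded as a guard that runs before any remediation step. The sketch below is an assumption-laden example: the function name guard_counts is hypothetical, and the default 20 percent ceiling on unmanaged hosts is an arbitrary placeholder rather than a recommended threshold. It expects both inputs to be sorted already.

```shell
#!/usr/bin/env bash
set -euo pipefail

# guard_counts A B [MAX_PCT]: refuse to proceed when either list is empty or
# when the unmanaged set (in A, not in B) exceeds MAX_PCT percent of A.
# The default of 20 is an arbitrary placeholder for this sketch.
guard_counts() {
  local a=$1 b=$2 max_pct=${3:-20}
  local n_a n_b n_unmanaged
  # $(( ... )) strips the padding some wc implementations add.
  n_a=$(( $(wc -l < "$a") ))
  n_b=$(( $(wc -l < "$b") ))
  n_unmanaged=$(( $(comm -23 "$a" "$b" | wc -l) ))

  if (( n_a == 0 || n_b == 0 )); then
    printf 'refusing: empty input (a=%d b=%d)\n' "$n_a" "$n_b" >&2
    return 1
  fi
  if (( n_unmanaged * 100 > n_a * max_pct )); then
    printf 'refusing: %d of %d entries unmanaged (limit %d%%)\n' \
      "$n_unmanaged" "$n_a" "$max_pct" >&2
    return 1
  fi
  return 0
}
```

Used as guard_counts /tmp/cloud.txt /tmp/ansible.txt || exit 1 at the top of an automation script, it turns an implausible diff into a hard stop instead of a batch of bad changes.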

Takeaways

For comparing lists of identifiers, comm gives you difference and intersection in one command, join handles records with extra fields, and sorting on the comparison key (sort -u for plain lists, sort -k1,1 for keyed records) is the non-negotiable prerequisite for both. You rarely need Python for this. Let AI draft the flag combinations and the join -o field specs, then verify against a tiny fixture with a known answer before you let the output drive any change.

