Linux Bonding / LACP Troubleshooting Prompt

Diagnose Linux network bonding (802.3ad LACP, active-backup, balance-tlb) — slave failures, LACP partner mismatch, throughput below sum-of-links, asymmetric traffic.

Target user
Linux sysadmins managing bonded NICs on servers
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior Linux network engineer who has stood up and debugged bonded NICs on switches from Cisco, Arista, Juniper, and Mellanox. You can read `/proc/net/bonding/bond0` like a chart and tell whether LACP failed at L1 or L2.

I will provide:
- The bond mode (`active-backup` / `balance-tlb` / `balance-alb` / `802.3ad` / `balance-rr`)
- The symptom (slave down, full bond down, throughput well below sum of links, traffic only on one link, LACP not forming)
- Output of `cat /proc/net/bonding/bond0`
- `ip -d link show bond0` and `ip -d link show <slave>`
- `ethtool <slave>` and `ethtool -S <slave>` per slave
- The switch side: LACP enabled? LAG group config? port channel summary if Cisco?
- dmesg lines around bonding/LACP events

Your job:

1. **Decode the bond mode-specific expectations**:
   - **`active-backup`** (mode 1) — only ONE slave active; failover on link down. Throughput = single-link.
   - **`balance-tlb`** (mode 5) — outbound balanced by load; inbound on a single slave. No switch config needed.
   - **`balance-alb`** (mode 6) — both directions balanced via ARP negotiation. No switch config needed.
   - **`802.3ad` / LACP** (mode 4) — requires switch-side LAG/port-channel. Hash-based distribution.
   - **`balance-rr`** (mode 0) — packet-level round-robin. Can cause TCP reordering; rarely used.
2. **For LACP (mode 4) failures**:
   - **Aggregator ID** — both slaves should be in the SAME aggregator (visible in `/proc/net/bonding/bond0`)
   - **Partner Mac Address** — must be non-zero (received LACPDU); zero = no LACP from switch side
   - **Partner Key** — must match within the aggregator
   - **Actor / Partner state** — LACPDUs include flags; "Activity," "Timeout," "Aggregation," "Synchronization," "Collecting," "Distributing"
   - All-zeros partner = switch not sending LACP, or wrong VLAN/trunk config
3. **Hash policy** (`xmit_hash_policy`):
   - **`layer2`** (default) — hash on source/destination MAC. All traffic between the same pair of MACs (including everything routed through one gateway) uses the same slave.
   - **`layer2+3`** — MAC + IP. Adds IP differentiation; better for routed traffic.
   - **`layer3+4`** — IP + TCP/UDP port. Best for many flows between two hosts (same MAC).
   - For server-to-server bulk transfers (e.g., backups), `layer3+4` gets parallelism. Default `layer2` results in single-link throughput.
4. **For throughput < sum of links** in mode 4:
   - **Single flow** (one TCP connection) is hashed to ONE slave — it cannot exceed single-link speed. This is by design.
   - **Many flows** should distribute. If they don't, check `xmit_hash_policy` matches the workload.
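   - `xmit_hash_policy` only controls what the server transmits. Traffic toward the server is distributed by the switch's own port-channel hash, so confirm that it also includes L3/L4 fields.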
5. **For slave flapping**:
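   - `miimon` polls carrier state at the given interval in ms; 100 is the usual setting, and 0 disables MII monitoring
   - `arp_interval` + `arp_ip_target` probe reachability via ARP; useful when carrier stays up but the path beyond the NIC is dead. Not usable in `802.3ad`, `balance-tlb`, or `balance-alb`
   - Confirm `Link detected: yes` in `ethtool <slave>`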
   - dmesg may show MII link toggles
6. **For `active-backup` failover delay**:
   - `updelay` and `downdelay` set hysteresis; both default to 0 (immediate) and are rounded down to a multiple of `miimon`. Raise them if brief link blips cause flapping.
   - `primary` option pins which slave is preferred when both up
7. **For asymmetric traffic**:
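   - `balance-tlb` receives on one slave by design; `balance-alb` spreads inbound traffic per peer via ARP replies, so traffic from a single peer still arrives on one slave
   - LACP inbound distribution is decided by the switch's hash; check it on the switch (e.g., `show etherchannel load-balance` on Cisco IOS, `show port-channel load-balance` on NX-OS)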

Mark as DESTRUCTIVE any step that interrupts traffic: changing the bond mode (requires taking the bond down, which drops all traffic) or removing the last remaining slave from a bond.

---

Bond mode: [DESCRIBE]
Symptom: [DESCRIBE — throughput target vs actual, slave count, switch model]
`cat /proc/net/bonding/bond0`:
```
[PASTE]
```
`ip -d link show bond0` and each slave:
```
[PASTE]
```
Per-slave `ethtool <slave>` and `ethtool -S <slave>` highlights:
```
[PASTE]
```
Switch-side config or `show etherchannel summary`:
```
[PASTE]
```
Recent dmesg (bond/LACP):
```
[PASTE]
```

Why this prompt works

Bonding failures often look like “the link is fine” while throughput is half. The mode determines what’s possible — active-backup will never beat one-link throughput regardless of switch — and LACP debugging requires reading partner state precisely. This prompt forces a mode-aware diagnosis.

How to use it

  1. State the bond mode upfront. Diagnosis differs.
  2. Always include /proc/net/bonding/bond<N> — it has the LACP partner detail, aggregator ID, slave status.
  3. Include the switch side when LACP is involved. The bond can’t form alone.
  4. For throughput problems, identify how many flows are involved. Single-flow over LACP cannot exceed one link.
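
The inputs listed above can be gathered in one pass. The following is a minimal sketch, assuming the tools `ip`, `ethtool`, and `dmesg` are installed; the function name `collect_bond_info`, the `bond0` default, and the section headers are illustrative choices, not part of any existing tool.

```shell
#!/bin/sh
# Sketch: collect the bond diagnostics the prompt asks for into one text stream.
# collect_bond_info is a hypothetical helper; bond0 is only an assumed default.
collect_bond_info() {
    bond="${1:-bond0}"
    proc="/proc/net/bonding/$bond"

    echo "=== $proc ==="
    cat "$proc" 2>/dev/null || echo "(not readable: is $bond a bond?)"

    echo "=== ip -d link show $bond ==="
    ip -d link show "$bond" 2>/dev/null || echo "(ip failed)"

    # Slave names are taken from the "Slave Interface:" lines of the proc file.
    slaves=$(awk -F': ' '/^Slave Interface:/ {print $2}' "$proc" 2>/dev/null)
    for s in $slaves; do
        echo "=== ip -d link show $s ==="
        ip -d link show "$s" 2>/dev/null || echo "(ip failed)"
        echo "=== ethtool $s ==="
        ethtool "$s" 2>/dev/null || echo "(ethtool failed)"
        echo "=== ethtool -S $s (error/drop counters only) ==="
        ethtool -S "$s" 2>/dev/null | grep -Ei 'err|drop|crc|discard' \
            || echo "(no matching counters)"
    done

    echo "=== dmesg (bonding/LACP) ==="
    dmesg 2>/dev/null | grep -Ei 'bond|802\.3ad|lacp' | tail -n 50 || true
}
```

A typical call would be `collect_bond_info bond0 > bond0-diag.txt`, run as root so that the per-slave LACP details and `dmesg` are readable. Switch-side output still has to be collected separately.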

Useful commands

# Bond status (most informative)
cat /proc/net/bonding/bond0
ip -d link show bond0
ip -d link show <slave>

# Per-slave NIC
ethtool <slave>                    # link, speed, duplex
ethtool -S <slave>                 # extended (drops, errors)
ethtool -k <slave>                 # offload features
ethtool -i <slave>                 # driver

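# LACP-specific fields in /proc/net/bonding/bond0:
# - "System MAC address" (the actor side): should be a real MAC
# - "Partner Mac Address": should be the switch's MAC (NOT 00:00:00:00:00:00)
# - "Aggregator ID": all working slaves in the same aggregator
# - "Actor Key" / "Partner Key": each should be identical across the slaves of one aggregator (actor and partner values need not equal each other)
# - "port state" (actor and partner): a bitmask shown in decimal; 61 (0x3D, slow LACP rate) or 63 (0x3F, fast) means Activity, Aggregation, Synchronization, Collecting, and Distributing are set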

# Detail
sudo ip -s link show bond0          # statistics
sudo ip -s link show <slave>

# Add / remove slaves dynamically
sudo ip link set <slave> down       # a NIC must be down before it can be enslaved
sudo ip link set <slave> nomaster   # remove from bond
sudo ip link set <slave> master bond0  # add to bond

# Config files (Ubuntu/Debian netplan)
cat /etc/netplan/*.yaml
# RHEL/CentOS NetworkManager
sudo nmcli connection show
sudo nmcli connection show bond0
sudo nmcli connection modify bond0 bond.options "mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4"

# Test throughput (multi-flow for LACP)
iperf3 -c <server> -P 8 -t 30

# dmesg
dmesg | grep -Ei "bond|802\.3ad|802\.1Q|lacp" | tail -n 50
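
The per-slave port state in `/proc/net/bonding/bond0` is an 8-bit mask defined by the LACP standard (bit 0 Activity, bit 1 Timeout, bit 2 Aggregation, bit 3 Synchronization, bit 4 Collecting, bit 5 Distributing, bit 6 Defaulted, bit 7 Expired). A small decoder makes the value readable; this is a sketch, and `decode_port_state` is a hypothetical helper name, not a system command.

```shell
#!/bin/sh
# Sketch: decode an LACP port state value (decimal, or hex with a 0x prefix).
# decode_port_state is a hypothetical helper, not part of any package.
decode_port_state() {
    state=$(( $1 ))
    # Flag names in bit order, least significant bit first.
    names="Activity Timeout Aggregation Synchronization Collecting Distributing Defaulted Expired"
    bit=1
    out=""
    for n in $names; do
        if [ $(( state & bit )) -ne 0 ]; then
            out="${out:+$out }$n"
        fi
        bit=$(( bit * 2 ))
    done
    echo "$out"
}

decode_port_state 61   # Activity Aggregation Synchronization Collecting Distributing
decode_port_state 63   # the same flags plus Timeout (short timeout requested)
```

If Defaulted shows up in the actor state, the port is running on default partner information because no LACPDU has been received, which usually goes together with an all-zeros partner MAC.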

Config patterns

Active-backup (simple, no switch config needed)

# Netplan
network:
  version: 2
  bonds:
    bond0:
      interfaces: [eth0, eth1]
      addresses: [192.168.1.10/24]
      routes:                  # gateway4 is deprecated in current netplan
        - to: default
          via: 192.168.1.1
      parameters:
        mode: active-backup
        primary: eth0
        mii-monitor-interval: 100
        up-delay: 200
        down-delay: 200

802.3ad LACP (requires switch LAG)

network:
  version: 2
  bonds:
    bond0:
      interfaces: [eth0, eth1]
      addresses: [192.168.1.10/24]
      parameters:
        mode: 802.3ad
        transmit-hash-policy: layer3+4
        lacp-rate: fast        # 1s LACPDU vs default 30s
        mii-monitor-interval: 100

NetworkManager (nmcli)

sudo nmcli connection add type bond con-name bond0 ifname bond0 \
    bond.options "mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4,lacp_rate=fast"
sudo nmcli connection add type ethernet con-name slave-eth0 ifname eth0 master bond0
sudo nmcli connection add type ethernet con-name slave-eth1 ifname eth1 master bond0
sudo nmcli connection up bond0
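
After applying any of these configurations, it can help to confirm what the kernel actually accepted, since a typo in an option may be dropped or rejected by the configuration tool. The bonding driver exposes the live values under `/sys/class/net/<bond>/bonding/`. The sketch below prints a few of them; `show_bond_options` is a hypothetical helper, and its second argument exists only so the function can be pointed at a different directory for testing.

```shell
#!/bin/sh
# Sketch: print the bonding options the kernel is currently using.
# show_bond_options is a hypothetical helper; bond0 is an assumed default.
show_bond_options() {
    bond="${1:-bond0}"
    root="${2:-/sys/class/net}"   # overridable root, for testing only
    dir="$root/$bond/bonding"
    for opt in mode xmit_hash_policy lacp_rate miimon updelay downdelay slaves; do
        if [ -r "$dir/$opt" ]; then
            printf '%s=%s\n' "$opt" "$(cat "$dir/$opt")"
        else
            printf '%s=(unavailable)\n' "$opt"
        fi
    done
}
```

Running `show_bond_options bond0` and comparing the `mode` and `xmit_hash_policy` lines against the intended configuration is a quick sanity check before involving the switch side.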

Common findings this catches

  • Partner Mac Address: 00:00:00:00:00:00 → switch not sending LACP (wrong port, wrong mode). Confirm switch config.
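  • Aggregator ID differs between slaves → only one slave is in the active aggregator; the others can’t join, usually because of a speed/duplex mismatch or because the switch ports are not in the same LAG. Check ethtool <slave> and the switch-side port-channel membership.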
  • xmit_hash_policy: layer2 with all traffic to one destination MAC → all traffic hashed to one slave. Switch to layer3+4.
  • Single iperf3 over LACP shows 1 Gbps on a 4×1 Gbps bond — by design (single flow → one link). Test with -P 8.
  • Mode: balance-rr and TCP retransmits high → packet reordering; switch to LACP or tlb.
  • Slave shows Link detected: yes but the bond says it’s down → likely a link-monitoring issue (miimon, updelay); in modes that support it, try ARP monitoring (arp_interval + arp_ip_target).
  • MTU mismatch between slaves → packets get dropped silently; set MTU on bond and slaves.
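
Several of the findings above can be checked mechanically from the proc file. The following is a rough sketch, assuming the usual field labels (`Slave Interface:`, `MII Status:`, `Aggregator ID:`, `Partner Mac Address:`); `check_bond_proc` is a hypothetical helper and its warning texts are illustrative.

```shell
#!/bin/sh
# Sketch: flag common problems in /proc/net/bonding/<bond> output.
# check_bond_proc is a hypothetical helper; it reads a file argument or stdin.
check_bond_proc() {
    awk '
        /^Slave Interface:/ { slave = $3; next }
        /Partner Mac Address:/ {
            if ($NF == "00:00:00:00:00:00")
                print "WARN: partner MAC is all zeros (no LACPDU seen from the switch)"
        }
        /^MII Status:/ && slave != "" {
            status = $0
            sub(/^MII Status:[ \t]*/, "", status)
            if (status != "up")
                print "WARN: " slave " MII status is " status
        }
        /Aggregator ID:/ {
            if (slave == "")
                active = $NF    # bond-level "Active Aggregator Info" block
            else if (active != "" && $NF != active)
                print "WARN: " slave " is in aggregator " $NF ", active is " active
        }
    ' "$@"
}
```

Usage would be `check_bond_proc /proc/net/bonding/bond0`. No output means none of these three conditions was found; it does not prove the bond is healthy.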

Mode selection cheatsheet

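| Goal | Mode | Switch config? |
|---|---|---|
| Simple failover | active-backup (1) | No |
| Outbound load distribution, no switch config | balance-tlb (5) | No |
| In + out, no switch config | balance-alb (6) | No |
| Standardized link aggregation | 802.3ad LACP (4) | Yes (LAG/port-channel) |
| Maximum throughput single flow | None of the modes above: each flow is capped at one link. Only balance-rr (0) stripes a single flow, at the cost of TCP reordering | |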
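
The per-flow cap in the last row follows from how a hash policy picks a slave: the choice is a pure function of the flow's addresses and ports, so one flow always maps to one link. The sketch below is a simplified illustration in the spirit of the `layer3+4` formula in the kernel bonding documentation (ports XOR IPs, folded, modulo slave count). It is not the exact kernel code path, which varies by kernel version, and `ip_to_int` and `pick_slave` are hypothetical helpers.

```shell
#!/bin/sh
# Sketch: simplified layer3+4-style slave selection (illustration only).
ip_to_int() {
    # Convert a dotted-quad IPv4 address to an integer; the subshell keeps
    # the IFS change from leaking into the caller.
    ( IFS=.; set -- $1; echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 )) )
}

pick_slave() {
    # Args: src_ip dst_ip src_port dst_port slave_count
    sip=$(ip_to_int "$1")
    dip=$(ip_to_int "$2")
    hash=$(( (($3 << 16) | $4) ^ sip ^ dip ))
    hash=$(( hash ^ (hash >> 16) ))
    hash=$(( hash ^ (hash >> 8) ))
    echo $(( hash % $5 ))
}

# Same 4-tuple, same slave index on every call:
pick_slave 10.0.0.1 10.0.0.2 50000 5201 2
# A different source port can land on the other slave:
pick_slave 10.0.0.1 10.0.0.2 50001 5201 2
```

With this simplified formula the two calls return different indices on a two-slave bond, which is why a multi-stream test such as `iperf3 -P 8` can use several links while a single stream cannot.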

When to escalate

  • Switch-side LACP not forming despite correct partner key/state — pull in network team; usually a switch config issue.
  • Asymmetric traffic causing throughput cap that LACP shouldn’t have — check switch’s hash distribution; may need re-hash.
  • Driver-specific issues (e.g., MLX or BNX2X under specific kernel versions) — driver upgrade or firmware update; check vendor advisories.
