Linux Bonding / LACP Troubleshooting Prompt
Diagnose Linux network bonding (802.3ad LACP, active-backup, balance-tlb) — slave failures, LACP partner mismatch, throughput below sum-of-links, asymmetric traffic.
- Target user: Linux sysadmins managing bonded NICs on servers
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux network engineer who has stood up and debugged bonded NICs on switches from Cisco, Arista, Juniper, and Mellanox. You can read `/proc/net/bonding/bond0` like a chart and tell whether LACP failed at L1 or L2.

I will provide:
- The bond mode (`active-backup` / `balance-tlb` / `balance-alb` / `802.3ad` / `balance-rr`)
- The symptom (slave down, full bond down, throughput well below sum of links, traffic only on one link, LACP not forming)
- Output of `cat /proc/net/bonding/bond0`
- `ip -d link show bond0` and `ip -d link show <slave>`
- `ethtool <slave>` and `ethtool -S <slave>` per slave
- The switch side: LACP enabled? LAG group config? Port-channel summary if Cisco?
- dmesg lines around bonding/LACP events

Your job:

1. **Decode the bond mode-specific expectations**:
   - **`active-backup`** (mode 1): only ONE slave active; failover on link down. Throughput = single-link.
   - **`balance-tlb`** (mode 5): outbound balanced by load; inbound on a single slave. No switch config needed.
   - **`balance-alb`** (mode 6): both directions balanced via ARP negotiation. No switch config needed.
   - **`802.3ad` / LACP** (mode 4): requires switch-side LAG/port-channel. Hash-based distribution.
   - **`balance-rr`** (mode 0): packet-level round-robin. Can cause TCP reordering; rarely used.
2. **For LACP (mode 4) failures**:
   - **Aggregator ID**: both slaves should be in the SAME aggregator (visible in `/proc/net/bonding/bond0`)
   - **Partner Mac Address**: must be non-zero (an LACPDU was received); zero = no LACP from the switch side
   - **Partner Key**: must match within the aggregator
   - **Actor / Partner state**: LACPDUs carry state flags: Activity, Timeout, Aggregation, Synchronization, Collecting, Distributing. A healthy port has Synchronization, Collecting, and Distributing set on both sides.
   - All-zeros partner = switch not sending LACP, or wrong VLAN/trunk config
3. **Hash policy** (`xmit_hash_policy`):
   - **`layer2`** (default): hash on MAC. Traffic between the same two hosts always uses the same slave.
   - **`layer2+3`**: MAC + IP. Adds IP differentiation; better for routed traffic.
   - **`layer3+4`**: IP + TCP/UDP port. Best for many flows between two hosts (same MAC pair).
   - For server-to-server bulk transfers (e.g., backups), `layer3+4` gets parallelism across flows. The default `layer2` results in single-link throughput.
4. **For throughput < sum of links** in mode 4:
   - **Single flow** (one TCP connection) is hashed to ONE slave, so it cannot exceed single-link speed. This is by design.
   - **Many flows** should distribute. If they don't, check that `xmit_hash_policy` matches the workload.
   - The switch-side hash policy governs the return direction and must complement the host's (most modern switches support symmetric hashing).
5. **For slave flapping**:
   - `miimon` polls link state via MII. The kernel default is 0 (monitoring disabled); 100 ms is the commonly used value.
   - `arp_interval` + `arp_ip_target` poll via ARP, which is useful for switches that hide MII status. ARP monitoring is not available in `802.3ad`, `balance-tlb`, or `balance-alb`, which rely on `miimon`.
   - Confirm `Link detected: yes` in `ethtool <slave>`
   - dmesg may show MII link toggles
6. **For `active-backup` failover delay**:
   - `updelay` and `downdelay` set hysteresis; the default is 0 (immediate). Raise them if brief link blips cause flapping.
   - The `primary` option pins which slave is preferred when both are up
7. **For asymmetric traffic**:
   - `tlb`/`alb` modes have an inbound-on-one-slave property by design
   - LACP relies on the switch's hash for inbound traffic; check the switch-side load-balance setting (for example `show port-channel load-balance` on Cisco NX-OS or `show etherchannel load-balance` on IOS)

Mark as DESTRUCTIVE any step that interrupts traffic: changing the bond mode (requires taking the bond down, so all traffic is lost) or removing a slave from a single-slave bond.

---

Bond mode: [DESCRIBE]

Symptom: [DESCRIBE: throughput target vs actual, slave count, switch model]

`cat /proc/net/bonding/bond0`:
```
[PASTE]
```

`ip -d link show bond0` and each slave:
```
[PASTE]
```

Per-slave `ethtool <slave>` and `ethtool -S <slave>` highlights:
```
[PASTE]
```

Switch-side config or `show etherchannel summary`:
```
[PASTE]
```

Recent dmesg (bond/LACP):
```
[PASTE]
```
Why this prompt works
Bonding failures often look like “the link is fine” while throughput is half. The mode determines what’s possible — active-backup will never beat one-link throughput regardless of switch — and LACP debugging requires reading partner state precisely. This prompt forces a mode-aware diagnosis.
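Reading partner state precisely comes down to decoding one byte. The sketch below maps an LACP port state value (as reported in the `port state` fields of `/proc/net/bonding/bond<N>`) to flag names, assuming the standard bit order of the LACP state octet. The function name `decode_port_state` is made up for this illustration.

```python
# Bit layout of the LACP actor/partner state octet (IEEE 802.1AX), lowest bit first.
LACP_STATE_BITS = [
    (0x01, "Activity"),
    (0x02, "Timeout"),
    (0x04, "Aggregation"),
    (0x08, "Synchronization"),
    (0x10, "Collecting"),
    (0x20, "Distributing"),
    (0x40, "Defaulted"),
    (0x80, "Expired"),
]


def decode_port_state(value):
    """Return the names of the flags set in an LACP port state value."""
    return [name for mask, name in LACP_STATE_BITS if value & mask]


if __name__ == "__main__":
    # 61 and 63 are healthy-looking states; 69 has Defaulted set.
    for state in (61, 63, 69):
        print(state, hex(state), decode_port_state(state))
```

A state with `Defaulted` set and `Synchronization` clear usually means the port is running on default partner information because no LACPDU has arrived, which points at the switch side.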
How to use it
- State the bond mode upfront. Diagnosis differs.
- Always include `/proc/net/bonding/bond<N>`: it has the LACP partner detail, aggregator ID, and slave status.
- Include the switch side when LACP is involved. The bond can’t form alone.
- For throughput problems, identify how many flows are involved. Single-flow over LACP cannot exceed one link.
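To see why the flow count matters, here is a toy model of transmit hashing. It is a simplified illustration in the spirit of the formulas described in the kernel bonding documentation, not the exact hash any given kernel uses; the MAC addresses, IPs, ports, and function names are all invented for the example.

```python
import ipaddress


def _mac_last_byte(mac):
    """Last octet of a MAC address as an integer."""
    return int(mac.split(":")[-1], 16)


def layer2_slave(src_mac, dst_mac, n_slaves):
    """Simplified layer2 policy: XOR of MAC bytes, modulo slave count."""
    return (_mac_last_byte(src_mac) ^ _mac_last_byte(dst_mac)) % n_slaves


def layer34_slave(src_ip, dst_ip, src_port, dst_port, n_slaves):
    """Simplified layer3+4 policy: XOR of ports and low IP bits, modulo slave count."""
    ip_xor = int(ipaddress.ip_address(src_ip)) ^ int(ipaddress.ip_address(dst_ip))
    return ((src_port ^ dst_port) ^ (ip_xor & 0xFFFF)) % n_slaves


# Eight hypothetical TCP flows between the same two hosts on a 4-slave bond.
N_SLAVES = 4
flows = [(50000 + i, 5201) for i in range(8)]

l2_choices = [
    layer2_slave("52:54:00:aa:bb:01", "52:54:00:aa:bb:02", N_SLAVES) for _ in flows
]
l34_choices = [
    layer34_slave("192.168.1.10", "192.168.1.20", sport, dport, N_SLAVES)
    for sport, dport in flows
]

if __name__ == "__main__":
    print("layer2   slaves used:", sorted(set(l2_choices)))
    print("layer3+4 slaves used:", sorted(set(l34_choices)))
```

In this model every flow lands on the same slave under `layer2` because the MAC pair never changes, while `layer3+4` spreads the flows because the source port differs. Real source ports are not necessarily consecutive, so a real distribution can be less even.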
Useful commands
```bash
# Bond status (most informative)
cat /proc/net/bonding/bond0
ip -d link show bond0
ip -d link show <slave>

# Per-slave NIC
ethtool <slave>      # link, speed, duplex
ethtool -S <slave>   # extended (drops, errors)
ethtool -k <slave>   # offload features
ethtool -i <slave>   # driver

# LACP-specific in /proc/net/bonding/bond0:
# - "Actor Mac address": should be a real MAC
# - "Partner Mac Address": should be the switch's MAC (NOT 00:00:00:00:00:00)
# - "Aggregator ID": all working slaves in the same aggregator
# - Actor key / partner key: each should be identical across the slaves of one aggregator
#   (the actor key does not have to equal the partner key)
# - Actor/partner port state: typically 61 (0x3D: Activity, Aggregation, Sync, Collecting,
#   Distributing) with slow LACP rate, or 63 (0x3F, which adds Timeout) with fast rate

# Traffic counters
sudo ip -s link show bond0    # statistics
sudo ip -s link show <slave>

# Add / remove slaves dynamically (interrupts traffic on that slave)
sudo ip link set <slave> down
sudo ip link set <slave> nomaster        # remove from bond
sudo ip link set <slave> master bond0    # add to bond

# Config files (Ubuntu/Debian netplan)
cat /etc/netplan/*.yaml

# RHEL/CentOS NetworkManager
sudo nmcli connection show
sudo nmcli connection show bond0
sudo nmcli connection modify bond0 bond.options "mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4"

# Test throughput (multi-flow for LACP)
iperf3 -c <server> -P 8 -t 30

# dmesg
dmesg | grep -E "bond|802.3ad|802.1Q" | tail -50
```
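The two most common LACP checks (partner MAC and aggregator IDs) can be automated with a small parser. The sketch below runs against a hand-written, heavily trimmed sample rather than captured output; real `/proc/net/bonding/bond0` content has many more fields and its layout varies between kernel versions, so treat the regular expressions and the `check_bond` helper as assumptions to adapt.

```python
import re

# Hypothetical, trimmed sample for illustration only (not captured from a real host).
SAMPLE = """\
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer2 (0)
MII Status: up

802.3ad info
LACP rate: slow
Active Aggregator Info:
    Aggregator ID: 1
    Number of ports: 1
    Partner Mac Address: 00:00:00:00:00:00

Slave Interface: eth0
MII Status: up
Speed: 1000 Mbps
Duplex: full
Aggregator ID: 1

Slave Interface: eth1
MII Status: up
Speed: 100 Mbps
Duplex: full
Aggregator ID: 2
"""


def check_bond(text):
    """Return a list of human-readable findings for an 802.3ad bond status dump."""
    findings = []
    # Everything before the first "Slave Interface:" line is the bond-level header.
    header, *slave_blocks = re.split(r"^Slave Interface: ", text, flags=re.M)

    partner = re.search(r"Partner Mac Address: (\S+)", header)
    if partner and partner.group(1) == "00:00:00:00:00:00":
        findings.append("partner MAC is all zeros: no LACPDUs received from the switch")

    agg_ids = {}
    for block in slave_blocks:
        name = block.split("\n", 1)[0].strip()
        agg = re.search(r"^\s*Aggregator ID: (\d+)", block, flags=re.M)
        if agg:
            agg_ids[name] = int(agg.group(1))
    if len(set(agg_ids.values())) > 1:
        findings.append(f"slaves are in different aggregators: {agg_ids}")
    return findings


if __name__ == "__main__":
    for finding in check_bond(SAMPLE):
        print("FINDING:", finding)
```

On a live host the same function could be fed `open("/proc/net/bonding/bond0").read()`, after confirming the field names match what that kernel prints.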
Config patterns
Active-backup (simple, no switch config needed)
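```yaml
# Netplan
network:
  version: 2
  ethernets:
    # member NICs must be declared before the bond can reference them
    eth0: {}
    eth1: {}
  bonds:
    bond0:
      interfaces: [eth0, eth1]
      addresses: [192.168.1.10/24]
      gateway4: 192.168.1.1  # deprecated in newer netplan releases; prefer a default route under `routes:`
      parameters:
        mode: active-backup
        primary: eth0
        mii-monitor-interval: 100
        up-delay: 200
        down-delay: 200
```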
802.3ad LACP (requires switch LAG)
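```yaml
network:
  version: 2
  ethernets:
    # member NICs must be declared before the bond can reference them
    eth0: {}
    eth1: {}
  bonds:
    bond0:
      interfaces: [eth0, eth1]
      addresses: [192.168.1.10/24]
      parameters:
        mode: 802.3ad
        transmit-hash-policy: layer3+4
        lacp-rate: fast  # 1s LACPDU vs default 30s
        mii-monitor-interval: 100
```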
NetworkManager (nmcli)
```bash
sudo nmcli connection add type bond con-name bond0 ifname bond0 \
  bond.options "mode=802.3ad,miimon=100,xmit_hash_policy=layer3+4,lacp_rate=fast"
sudo nmcli connection add type ethernet con-name slave-eth0 ifname eth0 master bond0
sudo nmcli connection add type ethernet con-name slave-eth1 ifname eth1 master bond0
sudo nmcli connection up bond0
```
Common findings this catches
- `Partner Mac Address: 00:00:00:00:00:00` → switch not sending LACP (wrong port, wrong mode). Confirm switch config.
- `Aggregator ID` differs between slaves → only one slave in the active aggregator (others can’t join, usually because of a speed/duplex mismatch). Check `ethtool <slave>`.
- `xmit_hash_policy: layer2` with all traffic to one destination MAC → all traffic hashed to one slave. Switch to `layer3+4`.
- Single iperf3 over LACP shows 1 Gbps on a 4×1 Gbps bond → by design (single flow → one link). Test with `-P 8`.
- `Mode: balance-rr` and TCP retransmits high → packet reordering; switch to LACP or `tlb`.
- Slave shows `Link detected: yes` but bond says it’s down → `miimon` issue; try ARP monitoring (`arp_ip_target`) in a mode that supports it, such as `active-backup`.
- MTU mismatch between slaves → packets get dropped silently; set MTU on bond and slaves.
Mode selection cheatsheet
| Goal | Mode | Switch config? |
|---|---|---|
| Simple failover | active-backup (1) | No |
| Outbound load distribution, no switch config | balance-tlb (5) | No |
| In + out, no switch config | balance-alb (6) | No |
| Standardized link aggregation | 802.3ad LACP (4) | Yes (LAG/port-channel) |
| Single flow faster than one link | None in practice: hash-based modes pin each flow to one link; balance-rr (0) can stripe a flow but reorders packets | n/a |
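The expectations in the table reduce to simple arithmetic. The helper below is a rough sketch of the best-case ceiling for the two most common modes, assuming equal-speed links and a hash that spreads flows perfectly (real hashing can collide and land lower). The function name and its parameters are invented for this example.

```python
def throughput_ceiling_gbps(mode, link_gbps, n_links, n_flows):
    """Best-case aggregate throughput for a bond, in Gbps.

    Assumes equal-speed links and, for 802.3ad, an ideal spread of flows.
    """
    if n_links < 1 or n_flows < 1:
        raise ValueError("need at least one link and one flow")
    if mode == "active-backup":
        # Only one slave carries traffic at a time.
        return link_gbps
    if mode == "802.3ad":
        # Each flow is pinned to one link, so at most min(links, flows) links are busy.
        return link_gbps * min(n_links, n_flows)
    raise ValueError(f"mode not covered by this sketch: {mode}")


if __name__ == "__main__":
    print(throughput_ceiling_gbps("802.3ad", 1.0, 4, 1))        # one iperf3 stream
    print(throughput_ceiling_gbps("802.3ad", 1.0, 4, 8))        # iperf3 -P 8
    print(throughput_ceiling_gbps("active-backup", 1.0, 2, 8))  # failover only
```

If measured throughput sits well below the ceiling for the actual flow count, the hash policy or the switch-side distribution is the next thing to check; if it matches the ceiling, the bond is behaving as designed.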
When to escalate
- Switch-side LACP not forming despite correct partner key/state — pull in network team; usually a switch config issue.
- Asymmetric traffic causing throughput cap that LACP shouldn’t have — check switch’s hash distribution; may need re-hash.
- Driver-specific issues (e.g., MLX or BNX2X under specific kernel versions) — driver upgrade or firmware update; check vendor advisories.
Related prompts
- Linux Host Network Connectivity Debug Prompt: Diagnose single-host Linux networking (broken routes, firewall blocks, DNS, conntrack exhaustion, ephemeral port exhaustion, MTU issues) without confusing it with cloud/SDN problems.
- Linux Network Performance Tuning Prompt: Diagnose slow network throughput, high latency, retransmits, ephemeral port exhaustion, and tune TCP/UDP stack parameters (BBR, buffers, queues) safely.
- Linux VLAN & Bridge Troubleshooting Prompt: Diagnose Linux bridge and VLAN issues, including tagged/untagged traffic confusion, bridge fdb mysteries, vlan_filtering, and VXLAN overlay debugging.