Redis Error Guide: Latency Spikes — Fork, AOF Rewrite, THP

Overview

Redis is single-threaded for command execution, so anything that stalls the main thread — a slow command, a fork() for RDB/AOF, transparent hugepages (THP) inflating copy-on-write, or the process touching swapped-out memory — shows up directly as latency spikes for all clients. There is no single “error string” here; instead you see p99/p999 latency jumps, client timeouts, and telltale entries in SLOWLOG, LATENCY LATEST, and LATENCY DOCTOR. Redis has built-in latency monitoring precisely because these stalls are subtle.

Representative signals from Redis’s own tooling:

127.0.0.1:6379> LATENCY LATEST
1) 1) "fork"
   2) (integer) 1719950400
   3) (integer) 812          # last fork latency in ms
   4) (integer) 1140         # max fork latency in ms

127.0.0.1:6379> LATENCY DOCTOR
Dave, I have observed the system, the events sampled are 'fork'.
This might be a problem with your Linux kernel: transparent huge pages ...

The goal is to attribute the spike to a source — a slow command vs. persistence/fork vs. the OS (THP/swap) — because each has a different fix.

Symptoms

Periodic p99/p999 latency spikes; clients occasionally time out even though throughput looks normal.
Spikes align with RDB snapshots or AOF rewrites (INFO persistence).
SLOWLOG fills with KEYS, big SORT, SMEMBERS, HGETALL, or large EVAL.
LATENCY DOCTOR calls out fork and warns about transparent hugepages.

redis-cli --latency-history -i 1

min: 0, max: 214, avg: 3.10 (5 samples) -- periodic 200ms spikes

Common Root Causes

1. Slow O(N) commands blocking the main thread

KEYS *, SMEMBERS on huge sets, SORT, HGETALL, unbounded LRANGE, or EVAL scanning big structures block every other command.

redis-cli SLOWLOG GET 10

1) 1) (integer) 512
   2) (integer) 1719950400
   3) (integer) 92341        # microseconds ~ 92ms
   4) 1) "KEYS"
      2) "*"
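KEYS walks the entire keyspace in one blocking call, whereas SCAN covers the same ground in small batches so other commands can run in between. A minimal sketch of the swap, assuming a hypothetical key pattern `session:*`; `count_keys` and the `REDIS_CLI` override variable are illustrative names, not part of Redis:

```shell
# Count keys matching a pattern without one long blocking call.
# `redis-cli --scan` issues incremental SCAN calls instead of a single KEYS.
# REDIS_CLI and count_keys are illustrative names, not part of Redis.
REDIS_CLI="${REDIS_CLI:-redis-cli}"

count_keys() {
  # --scan prints one key per line; count the non-empty lines.
  "$REDIS_CLI" --scan --pattern "$1" | grep -c .
}

# Usage (hypothetical pattern):
#   count_keys 'session:*'
```

The total work is still proportional to the keyspace size, but it is spread over many short calls, so no single command holds the main thread for long.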

2. Fork latency during RDB/AOF (copy-on-write)

BGSAVE and AOF rewrite fork() the process; on large datasets the fork itself, and COW page faults after it, stall the main thread.

redis-cli INFO persistence | grep -E 'rdb_last_bgsave|aof_rewrite|aof_last_bgrewrite'
redis-cli INFO stats | grep -E 'latest_fork_usec'

latest_fork_usec:812000     # 812 ms fork stall
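Because latest_fork_usec is reported in microseconds, it helps to convert it before comparing it with client-side latency figures. A small sketch, assuming the input is the text of `redis-cli INFO stats`; `fork_ms` is an illustrative helper name:

```shell
# Read `redis-cli INFO stats` output on stdin and print the last fork time in ms.
# fork_ms is an illustrative helper name, not a Redis command.
fork_ms() {
  # INFO lines end in CRLF, so strip the carriage returns before parsing.
  tr -d '\r' | awk -F: '$1 == "latest_fork_usec" { printf "%d\n", $2 / 1000 }'
}

# Usage:
#   redis-cli INFO stats | fork_ms
```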

3. Transparent Huge Pages (THP)

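THP dramatically increases COW cost after fork: each write to a page shared with the child copies a whole huge page (2 MB on x86-64) instead of a 4 KB page, so commands that run during a snapshot or rewrite stall far longer. Redis recommends disabling it.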

cat /sys/kernel/mm/transparent_hugepage/enabled

[always] madvise never

[always] is the problem — Redis wants never.
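The active mode is the bracketed word in that file, so a check can be scripted. A minimal sketch; `thp_mode` is an illustrative helper, and its optional path argument is an assumption added only so the parser can be pointed at a copy of the file:

```shell
# Print the active THP mode (the bracketed word) from the sysfs file.
# thp_mode is an illustrative helper; the optional argument lets the parser
# read a different file, for example a saved copy.
thp_mode() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p' "${1:-/sys/kernel/mm/transparent_hugepage/enabled}"
}

# Usage:
#   [ "$(thp_mode)" = "always" ] && echo "THP is set to always"
```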

4. Swap — Redis memory paged to disk

If the process is swapped out, touching that memory faults from disk, causing large stalls.

redis-cli INFO memory | grep -E 'used_memory_rss|mem_allocator'
grep VmSwap /proc/$(pgrep -o redis-server)/status

VmSwap:  1048576 kB   # Redis is swapping — bad
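For monitoring, the VmSwap figure can be pulled out as a bare number. A sketch, assuming the standard /proc status layout shown above; `swap_kb` is an illustrative helper name:

```shell
# Print the swapped-out size in kB from a /proc/<pid>/status file.
# swap_kb is an illustrative helper name.
swap_kb() {
  awk '$1 == "VmSwap:" { print $2 }' "$1"
}

# Usage:
#   swap_kb "/proc/$(pgrep -o redis-server)/status"
```

Anything above 0 means part of the Redis process is in swap.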

5. AOF `appendfsync always` / slow disk

appendfsync always fsyncs every write; a slow disk turns each write into a latency spike.

redis-cli CONFIG GET appendfsync

appendfsync everysec   # 'always' is far slower; 'no' least safe

Diagnostic Workflow

Step 1: Measure the latency and its shape

redis-cli --latency-history -i 1     # rolling avg/max
redis-cli --intrinsic-latency 5      # baseline OS/CPU latency, no Redis load

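Run --intrinsic-latency on the Redis host itself, since it measures the machine rather than a Redis server. Then compare that intrinsic figure (OS jitter) with the observed latency: a big gap points to Redis or the workload, not the kernel scheduler alone.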

Step 2: Ask Redis’s latency monitor

redis-cli CONFIG SET latency-monitor-threshold 100   # capture events >100ms
redis-cli LATENCY LATEST
redis-cli LATENCY DOCTOR
redis-cli LATENCY RESET

LATENCY LATEST names the event class (fork, command, aof-write, expire-cycle); DOCTOR suggests fixes.

Step 3: Find slow commands

redis-cli CONFIG GET slowlog-log-slower-than   # microseconds threshold
redis-cli SLOWLOG GET 20
redis-cli SLOWLOG RESET

Step 4: Correlate with persistence and the OS

redis-cli INFO persistence | grep -E 'aof_rewrite_in_progress|rdb_bgsave_in_progress'
redis-cli INFO stats | grep latest_fork_usec
cat /sys/kernel/mm/transparent_hugepage/enabled
grep VmSwap /proc/$(pgrep -o redis-server)/status
redis-cli INFO memory | grep -E 'used_memory_rss|mem_fragmentation_ratio'
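The readings from this step can be folded into a quick list of OS-side suspects. A rough sketch only: `os_suspects` is a hypothetical helper, and the 100 ms fork threshold is an assumed example value rather than a Redis default.

```shell
# List OS-side suspects from three readings gathered in this step.
# Arguments: <latest_fork_usec> <thp_mode> <vmswap_kb>
# os_suspects is a hypothetical helper; the 100000 usec (100 ms) fork
# threshold is an assumed example value, not a Redis default.
os_suspects() {
  out=""
  [ "$1" -gt 100000 ] && out="$out fork"
  [ "$2" = "always" ] && out="$out thp"
  [ "$3" -gt 0 ] && out="$out swap"
  # Drop the leading space; prints an empty line when nothing is flagged.
  echo "${out# }"
}

# Usage:
#   os_suspects 812000 always 0
```

An empty result does not clear the OS entirely; it only means none of these three readings crossed the chosen limits, so slow commands and AOF fsync remain to be checked.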

Example Root Cause Analysis

An API saw p99 jump from 2 ms to ~200 ms roughly every 5 minutes. --latency-history confirmed periodic 200 ms spikes. LATENCY LATEST attributed them to fork:

redis-cli LATENCY LATEST

1) 1) "fork"
   2) (integer) 1719950400
   3) (integer) 198
   4) (integer) 214

The spikes lined up exactly with BGSAVE (rdb_bgsave_in_progress and latest_fork_usec ~200 ms). Checking the kernel:

cat /sys/kernel/mm/transparent_hugepage/enabled

[always] madvise never

THP was always, which inflated copy-on-write costs after each fork. Disabling THP cut fork-related stalls dramatically:

echo never > /sys/kernel/mm/transparent_hugepage/enabled   # as root; persist via GRUB/tuned
cat /sys/kernel/mm/transparent_hugepage/enabled

always madvise [never]

After disabling THP (and confirming VmSwap: 0), latest_fork_usec dropped and the periodic 200 ms spikes disappeared. As a follow-up, the snapshot cadence was tuned and maxmemory sized to keep RSS off swap.

Prevention Best Practices

Disable transparent hugepages (transparent_hugepage=never) persistently via GRUB/tuned — Redis explicitly recommends this.
Keep Redis entirely in RAM: size maxmemory below available memory and ensure VmSwap stays 0. Separately, set vm.overcommit_memory=1 so the fork() for background saves does not fail when memory is tight.
Ban O(N) commands in hot paths — replace KEYS with SCAN, cap LRANGE, avoid SMEMBERS/HGETALL on huge keys.
Prefer appendfsync everysec over always; put AOF/RDB on fast disk and tune snapshot frequency to reduce fork frequency.
Turn on latency monitoring (latency-monitor-threshold) and slowlog, and alert on latest_fork_usec and p99.

Quick Command Reference

# Measure latency
redis-cli --latency-history -i 1
redis-cli --intrinsic-latency 5

# Redis latency monitor
redis-cli CONFIG SET latency-monitor-threshold 100
redis-cli LATENCY LATEST
redis-cli LATENCY DOCTOR

# Slow commands
redis-cli SLOWLOG GET 20
redis-cli CONFIG GET slowlog-log-slower-than

# Persistence + OS correlation
redis-cli INFO persistence | grep -E 'bgsave_in_progress|aof_rewrite_in_progress'
redis-cli INFO stats | grep latest_fork_usec
cat /sys/kernel/mm/transparent_hugepage/enabled
grep VmSwap /proc/$(pgrep -o redis-server)/status

Conclusion

Redis latency spikes are stalls of the single command thread, and Redis gives you the tools to attribute them. The usual sources:

Slow O(N) commands (KEYS, big SORT/SMEMBERS/HGETALL, heavy EVAL) — find them in SLOWLOG.
fork() for RDB/AOF and its copy-on-write cost — see latest_fork_usec and LATENCY LATEST.
Transparent hugepages amplifying COW stalls — set THP to never.
Swap paging Redis memory to disk — keep VmSwap at 0.
appendfsync always on slow storage — prefer everysec.

Use --latency-history, LATENCY DOCTOR, SLOWLOG, and INFO persistence together to pin the source, then apply the matching fix — most impactfully disabling THP and keeping Redis off swap.
