Redis Error Guide: Latency Spikes — Fork, AOF Rewrite, THP and Swap via SLOWLOG/LATENCY
Fix Redis latency spikes: diagnose fork/COW stalls, AOF rewrite, transparent hugepages, swap and slow commands via SLOWLOG and LATENCY DOCTOR.
- #redis
- #troubleshooting
- #errors
- #latency
Overview
Redis is single-threaded for command execution, so anything that stalls the main thread — a slow command, a fork() for RDB/AOF, transparent hugepages (THP) inflating copy-on-write, or the process touching swapped-out memory — shows up directly as latency spikes for all clients. There is no single “error string” here; instead you see p99/p999 latency jumps, client timeouts, and telltale entries in SLOWLOG, LATENCY LATEST, and LATENCY DOCTOR. Redis has built-in latency monitoring precisely because these stalls are subtle.
Representative signals from Redis’s own tooling:
127.0.0.1:6379> LATENCY LATEST
1) 1) "fork"
2) (integer) 1719950400
3) (integer) 812 # last fork latency in ms
4) (integer) 1140 # max fork latency in ms
127.0.0.1:6379> LATENCY DOCTOR
Dave, I have observed the system, the events sampled are 'fork'.
This might be a problem with your Linux kernel: transparent huge pages ...
The goal is to attribute the spike to a source — a slow command vs. persistence/fork vs. the OS (THP/swap) — because each has a different fix.
Symptoms
- Periodic p99/p999 latency spikes; clients occasionally time out even though throughput looks normal.
- Spikes align with RDB snapshots or AOF rewrites (INFO persistence).
- SLOWLOG fills with KEYS, big SORT, SMEMBERS, HGETALL, or large EVAL.
- LATENCY DOCTOR calls out fork and warns about transparent hugepages.
redis-cli --latency-history -i 1
min: 0, max: 214, avg: 3.10 (5 samples) -- periodic 200ms spikes
Common Root Causes
1. Slow O(N) commands blocking the main thread
KEYS *, SMEMBERS on huge sets, SORT, HGETALL, unbounded LRANGE, or EVAL scanning big structures block every other command.
redis-cli SLOWLOG GET 10
1) 1) (integer) 512
2) (integer) 1719950400
3) (integer) 92341 # microseconds ~ 92ms
4) 1) "KEYS"
2) "*"
2. Fork latency during RDB/AOF (copy-on-write)
BGSAVE and AOF rewrite fork() the process; on large datasets the fork itself, and COW page faults after it, stall the main thread.
redis-cli INFO persistence | grep -E 'rdb_last_bgsave|aof_rewrite|aof_last_bgrewrite'
redis-cli INFO stats | grep -E 'latest_fork_usec'
latest_fork_usec:812000 # 812 ms fork stall
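To alert on this counter, the raw microsecond value can be converted and compared against a budget. The following is a minimal sketch; the helper names `fork_ms` and `fork_is_slow` and the 100 ms default are illustrative assumptions, not part of Redis.

```shell
# Hypothetical helper: reads `INFO stats` text on stdin and prints the
# last fork duration in whole milliseconds (fraction truncated).
fork_ms() {
  # INFO lines end in CRLF, so strip the carriage return first.
  tr -d '\r' | awk -F: '$1 == "latest_fork_usec" { print int($2 / 1000) }'
}

# Hypothetical helper: succeeds (exit 0) when the last fork took at least
# $1 milliseconds. The default of 100 is an arbitrary example budget.
fork_is_slow() {
  ms=$(fork_ms)
  [ -n "$ms" ] && [ "$ms" -ge "${1:-100}" ]
}
```

Typical use would be `redis-cli INFO stats | fork_ms`, or `redis-cli INFO stats | fork_is_slow 200 && echo "slow fork"` in a cron check.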
3. Transparent Huge Pages (THP)
THP dramatically increases COW cost after fork, turning short forks into long stalls. Redis recommends disabling it.
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
[always] is the problem — Redis wants never.
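The kernel marks the active THP mode with square brackets, so a check script only needs to extract the bracketed word. A small sketch, where `thp_active_mode` is a hypothetical helper name:

```shell
# Hypothetical helper: reads the contents of the THP "enabled" file on stdin
# and prints only the active mode (the word inside the square brackets).
thp_active_mode() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}
```

It could be run as `thp_active_mode < /sys/kernel/mm/transparent_hugepage/enabled`; a result of `always` is the case this section warns about.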
4. Swap — Redis memory paged to disk
If the process is swapped out, touching that memory faults from disk, causing large stalls.
redis-cli INFO memory | grep -E 'used_memory_rss|mem_allocator'
grep VmSwap /proc/$(pgrep -o redis-server)/status
VmSwap: 1048576 kB # Redis is swapping — bad
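For monitoring, the swap figure can be pulled out as a bare number. This is a sketch only; `vmswap_kb` is a hypothetical helper name, and it assumes the usual `VmSwap: <n> kB` layout of `/proc/<pid>/status`.

```shell
# Hypothetical helper: reads /proc/<pid>/status text on stdin and prints the
# swapped-out size in kB as a bare integer.
vmswap_kb() {
  awk '$1 == "VmSwap:" { print $2 }'
}
```

For example, `vmswap_kb < /proc/$(pgrep -o redis-server)/status` prints `0` on a healthy instance; anything above zero is worth an alert.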
5. AOF appendfsync always / slow disk
appendfsync always fsyncs every write; a slow disk turns each write into a latency spike.
redis-cli CONFIG GET appendfsync
appendfsync everysec # 'always' is far slower; 'no' least safe
Diagnostic Workflow
Step 1: Measure the latency and its shape
redis-cli --latency-history -i 1 # rolling avg/max
redis-cli --intrinsic-latency 5 # run on the Redis host itself; 5 = test duration in seconds; measures OS/CPU jitter without talking to Redis
Compare intrinsic (OS jitter) vs. observed — a big gap means Redis/workload, not the kernel scheduler alone.
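When a long `--latency-history` capture has been saved to a file, it helps to keep only the windows whose max crossed a threshold. The sketch below assumes lines in the human-readable `min: …, max: …, avg: …` form shown earlier; redis-cli may print a different format when its output is piped, so check the format of your capture first. `latency_spikes` is a hypothetical helper name.

```shell
# Hypothetical helper: reads latency-history lines on stdin and prints only
# those whose "max:" value is at least $1 milliseconds (default 100).
latency_spikes() {
  awk -v t="${1:-100}" '
    match($0, /max: [0-9]+/) {
      # Skip the 5 characters of "max: " to get the number itself.
      m = substr($0, RSTART + 5, RLENGTH - 5) + 0
      if (m >= t) print
    }
  '
}
```

For example, `latency_spikes 100 < latency.log` would leave only the sample windows that contained a spike of 100 ms or more.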
Step 2: Ask Redis’s latency monitor
redis-cli CONFIG SET latency-monitor-threshold 100 # capture events >100ms
redis-cli LATENCY LATEST
redis-cli LATENCY DOCTOR
redis-cli LATENCY RESET
LATENCY LATEST names the event class (fork, command, aof-write, expire-cycle); DOCTOR suggests fixes.
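For logging, the LATENCY LATEST reply can be flattened into one line per event. This sketch assumes the reply arrives as one value per line with four values per event (name, timestamp, latest ms, max ms), as in the sample at the top of this guide; newer Redis versions may return additional fields, in which case the grouping needs adjusting. `latency_latest_table` is a hypothetical helper name.

```shell
# Hypothetical helper: reads a flattened LATENCY LATEST reply on stdin
# (assumed: 4 lines per event) and prints "event latest=Nms max=Nms".
latency_latest_table() {
  tr -d '\r' | awk '
    NR % 4 == 1 { event = $0 }    # line 1 of each group: event name
    NR % 4 == 3 { latest = $0 }   # line 3: latest latency in ms
    NR % 4 == 0 { printf "%s latest=%sms max=%sms\n", event, latest, $0 }
  '
}
```

A possible invocation is `redis-cli LATENCY LATEST | latency_latest_table`, after confirming the piped reply really has the shape assumed here.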
Step 3: Find slow commands
redis-cli CONFIG GET slowlog-log-slower-than # microseconds threshold
redis-cli SLOWLOG GET 20
redis-cli SLOWLOG RESET
Step 4: Correlate with persistence and the OS
redis-cli INFO persistence | grep -E 'aof_rewrite_in_progress|rdb_bgsave_in_progress'
redis-cli INFO stats | grep latest_fork_usec # fork timing is reported under stats, not persistence
cat /sys/kernel/mm/transparent_hugepage/enabled
grep VmSwap /proc/$(pgrep -o redis-server)/status
redis-cli INFO memory | grep -E 'used_memory_rss|mem_fragmentation_ratio'
Example Root Cause Analysis
An API saw p99 jump from 2 ms to ~200 ms roughly every 5 minutes. --latency-history confirmed periodic 200 ms spikes. LATENCY LATEST attributed them to fork:
redis-cli LATENCY LATEST
1) 1) "fork"
2) (integer) 1719950400
3) (integer) 198
4) (integer) 214
The spikes lined up exactly with BGSAVE (rdb_bgsave_in_progress and latest_fork_usec ~200 ms). Checking the kernel:
cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never
THP was always, which inflated copy-on-write costs after each fork. Disabling THP cut fork-related stalls dramatically:
echo never > /sys/kernel/mm/transparent_hugepage/enabled # as root; also persist via GRUB/tuned
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]
After disabling THP (and confirming VmSwap: 0), latest_fork_usec dropped and the periodic 200 ms spikes disappeared. As a follow-up, the snapshot cadence was tuned and maxmemory sized to keep RSS off swap.
Prevention Best Practices
- Disable transparent hugepages (transparent_hugepage=never) persistently via GRUB/tuned; Redis explicitly recommends this.
- Keep Redis entirely in RAM: size maxmemory below available memory and set vm.overcommit_memory=1; ensure VmSwap stays 0.
- Ban O(N) commands in hot paths: replace KEYS with SCAN, cap LRANGE, avoid SMEMBERS/HGETALL on huge keys.
- Prefer appendfsync everysec over always; put AOF/RDB on fast disk and tune snapshot frequency to reduce fork frequency.
- Turn on latency monitoring (latency-monitor-threshold) and slowlog, and alert on latest_fork_usec and p99.
- See more Redis error guides for persistence and memory-fragmentation deep dives.
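As an illustration of swapping KEYS for SCAN, redis-cli's `--scan --pattern` mode iterates the keyspace incrementally instead of blocking the server in one call. The sketch below is one possible wrapper; the function name `count_keys` and the `REDIS_CLI` override variable are assumptions made for this example.

```shell
# Command used to reach Redis; overridable so the function can be pointed at
# a different binary or a wrapper (assumption made for this sketch).
REDIS_CLI=${REDIS_CLI:-redis-cli}

# Hypothetical helper: counts keys matching glob pattern $1 using incremental
# SCAN iteration rather than a single blocking KEYS call.
count_keys() {
  "$REDIS_CLI" --scan --pattern "$1" | wc -l | tr -d ' '
}
```

For example, `count_keys 'user:*'` walks the keyspace in batches. It still visits every key, so it is better suited to maintenance jobs than to request paths.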
Quick Command Reference
# Measure latency
redis-cli --latency-history -i 1
redis-cli --intrinsic-latency 5
# Redis latency monitor
redis-cli CONFIG SET latency-monitor-threshold 100
redis-cli LATENCY LATEST
redis-cli LATENCY DOCTOR
# Slow commands
redis-cli SLOWLOG GET 20
redis-cli CONFIG GET slowlog-log-slower-than
# Persistence + OS correlation
redis-cli INFO persistence | grep -E 'bgsave_in_progress|aof_rewrite_in_progress'
redis-cli INFO stats | grep latest_fork_usec
cat /sys/kernel/mm/transparent_hugepage/enabled
grep VmSwap /proc/$(pgrep -o redis-server)/status
Conclusion
Redis latency spikes are stalls of the single command thread, and Redis gives you the tools to attribute them. The usual sources:
- Slow O(N) commands (KEYS, big SORT/SMEMBERS/HGETALL, heavy EVAL): find them in SLOWLOG.
- fork() for RDB/AOF and its copy-on-write cost: see latest_fork_usec and LATENCY LATEST.
- Transparent hugepages amplifying COW stalls: set THP to never.
- Swap paging Redis memory to disk: keep VmSwap at 0.
- appendfsync always on slow storage: prefer everysec.
Use --latency-history, LATENCY DOCTOR, SLOWLOG, and INFO persistence together to pin the source, then apply the matching fix — most impactfully disabling THP and keeping Redis off swap.