AI for Redis · By James Joyner IV · 10 min read

Redis Error Guide: Latency Spikes — Fork, AOF Rewrite, THP and Swap via SLOWLOG/LATENCY

Fix Redis latency spikes: diagnose fork/COW stalls, AOF rewrite, transparent hugepages, swap and slow commands via SLOWLOG and LATENCY DOCTOR.

  • #redis
  • #troubleshooting
  • #errors
  • #latency

Overview

Redis is single-threaded for command execution, so anything that stalls the main thread — a slow command, a fork() for RDB/AOF, transparent hugepages (THP) inflating copy-on-write, or the process touching swapped-out memory — shows up directly as latency spikes for all clients. There is no single “error string” here; instead you see p99/p999 latency jumps, client timeouts, and telltale entries in SLOWLOG, LATENCY LATEST, and LATENCY DOCTOR. Redis has built-in latency monitoring precisely because these stalls are subtle.

Representative signals from Redis’s own tooling:

127.0.0.1:6379> LATENCY LATEST
1) 1) "fork"
   2) (integer) 1719950400
   3) (integer) 812          # last fork latency in ms
   4) (integer) 1140         # max fork latency in ms
127.0.0.1:6379> LATENCY DOCTOR
Dave, I have observed the system, the events sampled are 'fork'.
This might be a problem with your Linux kernel: transparent huge pages ...

The goal is to attribute the spike to a source — a slow command vs. persistence/fork vs. the OS (THP/swap) — because each has a different fix.

Symptoms

  • Periodic p99/p999 latency spikes; clients occasionally time out even though throughput looks normal.
  • Spikes align with RDB snapshots or AOF rewrites (INFO persistence).
  • SLOWLOG fills with KEYS, big SORT, SMEMBERS, HGETALL, or large EVAL.
  • LATENCY DOCTOR calls out fork and warns about transparent hugepages.

redis-cli --latency-history -i 1
min: 0, max: 214, avg: 3.10 (79 samples) -- 1.01 seconds range   # periodic ~200 ms spikes

Common Root Causes

1. Slow O(N) commands blocking the main thread

KEYS *, SMEMBERS on huge sets, SORT, HGETALL, unbounded LRANGE, or EVAL scanning big structures block every other command.

redis-cli SLOWLOG GET 10
1) 1) (integer) 512
   2) (integer) 1719950400
   3) (integer) 92341        # microseconds ~ 92ms
   4) 1) "KEYS"
      2) "*"

2. Fork latency during RDB/AOF (copy-on-write)

BGSAVE and AOF rewrite fork() the process; on large datasets the fork itself, and COW page faults after it, stall the main thread.

redis-cli INFO persistence | grep -E 'rdb_last_bgsave|rdb_bgsave_in_progress|aof_rewrite|aof_last_bgrewrite'
redis-cli INFO stats | grep -E 'latest_fork_usec'   # reported under stats, not persistence
latest_fork_usec:812000     # 812 ms fork stall
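
To judge whether a fork stall is proportionate to the dataset, it can help to normalize it by resident memory. The sketch below is illustrative and not part of Redis's own tooling: `fork_ms_per_gb` is a hypothetical helper that reads INFO output on stdin and divides latest_fork_usec (INFO stats) by used_memory_rss (INFO memory).

```shell
# Hypothetical helper: fork stall in milliseconds per GiB of resident memory.
# Reads raw INFO text on stdin; INFO lines end in CRLF, so strip the CR first.
fork_ms_per_gb() {
  tr -d '\r' | awk -F: '
    $1 == "latest_fork_usec" { fork_usec = $2 }
    $1 == "used_memory_rss"  { rss_bytes = $2 }
    END {
      if (rss_bytes + 0 == 0) { print "no used_memory_rss in input"; exit 1 }
      # microseconds -> milliseconds, bytes -> GiB
      printf "%.1f\n", (fork_usec / 1000) / (rss_bytes / 1073741824)
    }'
}

# Usage against a reachable instance (plain INFO includes both sections):
#   redis-cli INFO | fork_ms_per_gb
```

A value that stays roughly flat as the dataset grows suggests the stall simply scales with memory; a sudden jump at the same RSS points more toward the OS (THP, swap) than toward Redis itself.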

3. Transparent Huge Pages (THP)

THP dramatically increases COW cost after fork, turning short forks into long stalls. Redis recommends disabling it.

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

The brackets mark the active setting. [always] is the problem; Redis wants never.
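
Because the active value is the bracketed one, a check script has to pull it out rather than grep for the word "never", which appears in the file either way. A minimal sketch, with `thp_mode` and `thp_verdict` as hypothetical helper names:

```shell
# Hypothetical helper: print the active (bracketed) THP setting from stdin.
thp_mode() {
  sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

# Hypothetical helper: turn the active setting into a one-line verdict.
thp_verdict() {
  case "$(thp_mode)" in
    never)   echo "ok: THP disabled" ;;
    madvise) echo "check: THP is opt-in (madvise), this guide recommends never" ;;
    always)  echo "problem: THP enabled system-wide" ;;
    *)       echo "unknown THP setting" ;;
  esac
}

# Usage on a Linux host:
#   thp_verdict < /sys/kernel/mm/transparent_hugepage/enabled
```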

4. Swap — Redis memory paged to disk

If parts of the Redis process have been swapped out, touching those pages faults them back in from disk, causing large stalls.

redis-cli INFO memory | grep -E 'used_memory_rss|mem_allocator'
grep VmSwap /proc/$(pgrep -o redis-server)/status
VmSwap:  1048576 kB   # Redis is swapping — bad
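
For alerting, the same VmSwap line can be turned into a pass/fail check. The sketch below assumes the /proc status format shown above; `check_swap` is a hypothetical name, and the non-zero exit status on any swap usage is a design choice for use in scripts, not a Redis convention.

```shell
# Hypothetical helper: read /proc/<pid>/status text on stdin and report swap use.
# Exits 1 when VmSwap is above zero so it can gate a monitoring script.
check_swap() {
  awk '
    $1 == "VmSwap:" { kb = $2 }
    END {
      if (kb + 0 > 0) {
        printf "swapping: %d kB (%.1f MiB)\n", kb, kb / 1024
        exit 1
      }
      print "ok: VmSwap is 0"
    }'
}

# Usage on the Redis host:
#   check_swap < /proc/$(pgrep -o redis-server)/status
```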

5. AOF appendfsync always / slow disk

appendfsync always fsyncs every write; a slow disk turns each write into a latency spike.

redis-cli CONFIG GET appendfsync
appendfsync everysec   # 'always' is far slower; 'no' least safe

Diagnostic Workflow

Step 1: Measure the latency and its shape

redis-cli --latency-history -i 1     # rolling avg/max
redis-cli --intrinsic-latency 5      # run on the Redis host: 5-second baseline of OS/CPU latency, no Redis load

Compare intrinsic (OS jitter) vs. observed — a big gap means Redis/workload, not the kernel scheduler alone.

Step 2: Ask Redis’s latency monitor

redis-cli CONFIG SET latency-monitor-threshold 100   # capture events >100ms
redis-cli LATENCY LATEST
redis-cli LATENCY DOCTOR
redis-cli LATENCY RESET                              # clear recorded events once you have read them

LATENCY LATEST names the event class (fork, command, aof-write, expire-cycle); DOCTOR suggests fixes.
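
The event class is what routes the rest of the investigation. As a rough sketch of that routing, the hypothetical `next_check_for` function below maps the four event names mentioned above to the checks this guide uses; the wording of each hint is this example's own, and other event names fall through to a generic message.

```shell
# Hypothetical helper: map a LATENCY LATEST event name to the next thing to check.
next_check_for() {
  case "$1" in
    fork)         echo "persistence/OS: check latest_fork_usec, THP and snapshot cadence" ;;
    command)      echo "slow command: inspect SLOWLOG GET" ;;
    aof-write)    echo "disk: check appendfsync and storage latency" ;;
    expire-cycle) echo "expiry: many keys expiring together, consider spreading TTLs" ;;
    *)            echo "unmapped event: see LATENCY DOCTOR" ;;
  esac
}

# Usage:
#   next_check_for fork
```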

Step 3: Find slow commands

redis-cli CONFIG GET slowlog-log-slower-than   # microseconds threshold
redis-cli SLOWLOG GET 20
redis-cli SLOWLOG RESET                        # clear the log after reviewing it
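
SLOWLOG only keeps recent entries, so a complementary view is the per-command averages in INFO commandstats (a section that has to be requested by name). The sketch below is an addition to the workflow, not something the steps above rely on: `slow_by_avg` is a hypothetical helper that lists commands whose usec_per_call is at or above a threshold in microseconds.

```shell
# Hypothetical helper: list commands whose average cost meets a threshold.
# Reads INFO commandstats text on stdin, e.g.
#   cmdstat_keys:calls=12,usec=1108092,usec_per_call=92341.00
slow_by_avg() {
  threshold_usec="${1:-10000}"   # default 10 ms, chosen only for illustration
  tr -d '\r' | awk -F'[:,=]' -v t="$threshold_usec" '
    /^cmdstat_/ {
      name = substr($1, 9)                 # drop the "cmdstat_" prefix
      for (i = 2; i < NF; i += 2)          # fields alternate key, value
        if ($i == "usec_per_call" && $(i + 1) + 0 >= t)
          printf "%s %s\n", name, $(i + 1)
    }'
}

# Usage against a reachable instance:
#   redis-cli INFO commandstats | slow_by_avg 10000
```

An average hides outliers, so a command can still appear in SLOWLOG without showing up here; the two views are best read together.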

Step 4: Correlate with persistence and the OS

redis-cli INFO persistence | grep -E 'aof_rewrite_in_progress|rdb_bgsave_in_progress'
redis-cli INFO stats | grep latest_fork_usec   # fork time is reported under stats
cat /sys/kernel/mm/transparent_hugepage/enabled
grep VmSwap /proc/$(pgrep -o redis-server)/status
redis-cli INFO memory | grep -E 'used_memory_rss|mem_fragmentation_ratio'
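
Correlation is easier when these fields are sampled repeatedly next to a timestamp, so a spike can be matched to whatever was in progress at that second. A minimal sketch, assuming plain INFO output on stdin; `persistence_snapshot` is a hypothetical helper and the field list is just the one used in this step.

```shell
# Hypothetical helper: condense INFO text into one line of key=value pairs
# for the persistence and fork fields used in this step.
persistence_snapshot() {
  tr -d '\r' | awk -F: '
    BEGIN {
      n = split("rdb_bgsave_in_progress aof_rewrite_in_progress latest_fork_usec used_memory_rss", keys, " ")
      for (i = 1; i <= n; i++) want[keys[i]] = 1
    }
    ($1 in want) { out = out (out == "" ? "" : " ") $1 "=" $2 }
    END { print out }'
}

# Usage: one timestamped line per second while watching for a spike.
#   while sleep 1; do printf '%s ' "$(date +%T)"; redis-cli INFO | persistence_snapshot; done
```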

Example Root Cause Analysis

An API saw p99 jump from 2 ms to ~200 ms roughly every 5 minutes. --latency-history confirmed periodic 200 ms spikes. LATENCY LATEST attributed them to fork:

redis-cli LATENCY LATEST
1) 1) "fork"
   2) (integer) 1719950400
   3) (integer) 198
   4) (integer) 214

The spikes lined up exactly with BGSAVE: rdb_bgsave_in_progress was set during each spike, and latest_fork_usec read roughly 200,000 (about 200 ms). Checking the kernel:

cat /sys/kernel/mm/transparent_hugepage/enabled
[always] madvise never

THP was always, which inflated copy-on-write costs after each fork. Disabling THP cut fork-related stalls dramatically:

echo never > /sys/kernel/mm/transparent_hugepage/enabled   # as root; persist via GRUB/tuned
cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

After disabling THP (and confirming VmSwap: 0), latest_fork_usec dropped and the periodic 200 ms spikes disappeared. As a follow-up, the snapshot cadence was tuned and maxmemory sized to keep RSS off swap.

Prevention Best Practices

  • Disable transparent hugepages (transparent_hugepage=never) persistently via GRUB/tuned — Redis explicitly recommends this.
  • Keep Redis entirely in RAM: size maxmemory below available memory and ensure VmSwap stays 0. Separately, set vm.overcommit_memory=1 so the fork() for background saves is not refused when memory is tight.
  • Ban O(N) commands in hot paths — replace KEYS with SCAN, cap LRANGE, avoid SMEMBERS/HGETALL on huge keys.
  • Prefer appendfsync everysec over always; put AOF/RDB on fast disk and tune snapshot frequency to reduce fork frequency.
  • Turn on latency monitoring (latency-monitor-threshold) and slowlog, and alert on latest_fork_usec and p99.

Quick Command Reference

# Measure latency
redis-cli --latency-history -i 1
redis-cli --intrinsic-latency 5

# Redis latency monitor
redis-cli CONFIG SET latency-monitor-threshold 100
redis-cli LATENCY LATEST
redis-cli LATENCY DOCTOR

# Slow commands
redis-cli SLOWLOG GET 20
redis-cli CONFIG GET slowlog-log-slower-than

# Persistence + OS correlation
redis-cli INFO persistence | grep -E 'bgsave_in_progress|aof_rewrite_in_progress'
redis-cli INFO stats | grep latest_fork_usec
cat /sys/kernel/mm/transparent_hugepage/enabled
grep VmSwap /proc/$(pgrep -o redis-server)/status

Conclusion

Redis latency spikes are stalls of the single command thread, and Redis gives you the tools to attribute them. The usual sources:

  1. Slow O(N) commands (KEYS, big SORT/SMEMBERS/HGETALL, heavy EVAL) — find them in SLOWLOG.
  2. fork() for RDB/AOF and its copy-on-write cost — see latest_fork_usec and LATENCY LATEST.
  3. Transparent hugepages amplifying COW stalls — set THP to never.
  4. Swap paging Redis memory to disk — keep VmSwap at 0.
  5. appendfsync always on slow storage — prefer everysec.

Use --latency-history, LATENCY DOCTOR, SLOWLOG, and INFO persistence together to pin the source, then apply the matching fix — most impactfully disabling THP and keeping Redis off swap.
