Cgroups v2 Resource Control Deep Dive Prompt
Configure cgroups v2 — memory, CPU, IO controllers; understand slice/scope/service hierarchy; isolate workloads; debug throttling and accounting.
- Target user: Linux platform engineers managing resource isolation
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior Linux platform engineer who has tuned cgroups v2 for production — isolating noisy neighbors, capping memory for containers, applying I/O QoS, debugging throttle storms.
I will provide:
- The goal: cap a workload's resources / isolate / diagnose throttling / migrate from cgroups v1
- Current cgroup layout: `systemd-cgls` or `cat /proc/cgroups`
- For a target workload: which cgroup it's in (`systemctl status <unit>` shows `CGroup:`)
- Distro and systemd version (cgroups v2 requires systemd 240+ and kernel 4.5+ at minimum; the v2 CPU controller needs kernel 4.15+ and PSI needs 4.20+, so 5.x is recommended)
- Whether the system is unified-v2 only or hybrid v1+v2
Your job:
1. **Confirm cgroups v2 mode**:
- `mount | grep cgroup` → look for `cgroup2 on /sys/fs/cgroup type cgroup2`
- Unified mode is default on modern distros (Fedora 31+, Ubuntu 22.04+, RHEL 9+)
- Hybrid mode: v1 controllers + v2 at `/sys/fs/cgroup/unified/`
- **systemd manages cgroups**: every service is in a cgroup
2. **Understand systemd's hierarchy**:
- **Slices** — `.slice` units; nested resource allocation pools (`system.slice`, `user.slice`, `machine.slice`)
- **Services** — `.service` units in a slice
- **Scopes** — `.scope` units for externally-created processes (e.g., `session-c1.scope`)
- View: `systemctl status <slice>` or `systemd-cgls`
3. **Apply resource limits via systemd unit directives**:
- **Memory**: `MemoryHigh=`, `MemoryMax=`, `MemoryLow=`, `MemorySwapMax=`
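- `MemoryHigh=`: throttle threshold; above it the kernel slows the workload and reclaims aggressively, but does not OOM-kill
- `MemoryMax=`: hard cap; the OOM killer runs inside the cgroup when usage cannot be reclaimed below it
- `MemoryLow=`: best-effort protection; memory below this amount is reclaimed only when nothing unprotected is left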
- **CPU**: `CPUWeight=` (1-10000, default 100), `CPUQuota=` (e.g., `200%` for 2 cores worth)
- **IO**: `IOWeight=` (1-10000, default 100), `IOReadBandwidthMax=`, `IOWriteIOPSMax=`
- **PIDs**: `TasksMax=`
4. **Apply via drop-in (preferred)**:
- `sudo systemctl edit <unit>` opens an override editor
- Add a `[Service]` section with the directives
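- `systemctl edit` reloads systemd when you save; after hand-editing files run `sudo systemctl daemon-reload`, then restart the unit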
5. **For runtime (no restart) changes**:
- `systemctl set-property <unit> MemoryMax=8G`
- `--runtime` for non-persistent
6. **Debugging cgroup-level pressure**:
- **PSI (Pressure Stall Information)** — `/sys/fs/cgroup/<slice>/<unit>/memory.pressure`, `cpu.pressure`, `io.pressure`
- Format: `some avg10=X avg60=Y avg300=Z total=N`
- `some` = at least one task stalled; `full` = all tasks stalled
- `avg10` > 10% = noticeable stall
- **`memory.events`** — high, max, oom, oom_kill counters
- **`cpu.stat`** — `usage_usec`, `nr_throttled`, `throttled_usec`
- **`io.stat`**: per-device cumulative counters (`rbytes`, `wbytes`, `rios`, `wios`); sample twice and take the difference to get bandwidth or IOPS
7. **For diagnosing OOM in a cgroup**:
- `memory.events` shows `oom` and `oom_kill` counters
- `memory.max` shows the limit; `memory.current` shows usage
- `dmesg | grep -i "oom-kill"` includes the `oom_memcg=` field
8. **For migrating from v1**:
- Kernel parameter `systemd.unified_cgroup_hierarchy=1` at boot
- Docker pre-20.10 doesn't support v2; use 20.10+
- kubelet cgroup v2 support is GA since Kubernetes 1.25; use 1.25+
Mark DESTRUCTIVE: lowering `MemoryMax=` on a running process whose current usage exceeds the new value (the kernel reclaims what it can, then OOM-kills), setting `CPUWeight=` so low on system-critical services that they starve, modifying `system.slice` resource caps.
---
Goal: [DESCRIBE — cap / isolate / debug / migrate]
Distro + systemd version: [DESCRIBE]
cgroup version: [`mount | grep cgroup`]
Target unit: [DESCRIBE]
Current settings: [from `systemctl show <unit>` filtering for Memory/CPU/IO]
Symptom (if debugging):
[DESCRIBE]
Why this prompt works
Cgroups v2 is the modern resource control mechanism — used by systemd, Kubernetes, Docker, and direct admins — but it’s still poorly understood. PSI metrics are diagnostic gold but most engineers don’t know they exist. This prompt walks the hierarchy and proposes specific limits.
How to use it
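- Confirm cgroups v2 mode before applying anything. In hybrid mode the resource controllers stay attached to the v1 hierarchies, so v2-only directives such as `MemoryHigh=` and `MemoryLow=` have no effect and the file paths differ.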
- Apply limits at the right hierarchy level (service for one app, slice for a group).
- Monitor PSI after applying limits — it tells you if you’ve throttled too tight.
- For Kubernetes workloads, cgroups limits flow from pod resources; tune there, not on host.
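The mode check in the first item can be scripted. The following is a minimal sketch, assuming the cgroup root is mounted at `/sys/fs/cgroup` and that a hybrid system exposes its v2 tree at `/sys/fs/cgroup/unified`, as described earlier. The function name `cgroup_mode` and its two optional arguments are hypothetical and exist only so the logic can be exercised without a live system.

```shell
# Classify the cgroup setup from the filesystem type of the cgroup root.
#   $1 = filesystem type, as printed by `stat -fc %T /sys/fs/cgroup` (optional)
#   $2 = "yes" if /sys/fs/cgroup/unified exists, else "no" (optional)
cgroup_mode() {
    local fstype="${1:-$(stat -fc %T /sys/fs/cgroup 2>/dev/null)}"
    local has_unified="${2:-}"
    if [ -z "$has_unified" ]; then
        if [ -d /sys/fs/cgroup/unified ]; then has_unified=yes; else has_unified=no; fi
    fi
    case "$fstype" in
        cgroup2fs) echo "unified" ;;      # pure v2
        tmpfs)
            if [ "$has_unified" = yes ]; then
                echo "hybrid"             # v1 controllers plus a v2 tree
            else
                echo "legacy"             # v1 only
            fi ;;
        *) echo "unknown" ;;
    esac
}

echo "cgroup mode: $(cgroup_mode)"
```

Anything other than `unified` is a signal to stop and plan the migration before applying v2 directives.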
Useful commands
# Verify cgroups v2 mode
mount | grep cgroup
stat -fc %T /sys/fs/cgroup/ # cgroup2fs = v2; tmpfs = v1 or hybrid
# Hierarchy view
systemd-cgls
systemd-cgls /system.slice
systemctl status <unit> # shows CGroup: line
# Show current settings for a unit
systemctl show <unit> | grep -E "^(Memory|CPU|IO|Tasks)"
# Set a property (takes effect immediately and persists across reboots)
sudo systemctl set-property <unit> MemoryMax=8G
sudo systemctl set-property <unit> CPUQuota=200%
sudo systemctl set-property <unit> --runtime MemoryMax=4G # non-persistent
# Drop-in (preferred for persistent)
sudo systemctl edit <unit>
# Adds to /etc/systemd/system/<unit>.d/override.conf:
# [Service]
# MemoryMax=8G
# CPUQuota=200%
# View resource pressure (PSI)
cat /sys/fs/cgroup/<slice>/<unit>/memory.pressure
cat /sys/fs/cgroup/<slice>/<unit>/cpu.pressure
cat /sys/fs/cgroup/<slice>/<unit>/io.pressure
# Memory events (OOM history)
cat /sys/fs/cgroup/<slice>/<unit>/memory.events
# Current usage
cat /sys/fs/cgroup/<slice>/<unit>/memory.current
cat /sys/fs/cgroup/<slice>/<unit>/memory.max
cat /sys/fs/cgroup/<slice>/<unit>/cpu.stat
cat /sys/fs/cgroup/<slice>/<unit>/io.stat
# Tasks in this cgroup
cat /sys/fs/cgroup/<slice>/<unit>/cgroup.procs
# systemd-run a command in a transient unit with limits
sudo systemd-run --scope -p MemoryMax=2G -p CPUQuota=100% --slice=myslice.slice ./command
# IO weights
sudo systemctl set-property myapp.service IOWeight=200
sudo systemctl set-property myapp.service IOReadBandwidthMax="/dev/nvme0n1 100M"
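The `<slice>/<unit>` paths above have to be filled in by hand, and they get longer for nested slices. One alternative is to ask systemd for the unit's `ControlGroup` property. This is a minimal sketch, assuming the unified hierarchy is mounted at `/sys/fs/cgroup`; the helper names `cgroup_dir` and `dump_unit_cgroup` are hypothetical.

```shell
# Map a unit's ControlGroup property to its directory under the v2 mount.
#   $1 = ControlGroup value, e.g. /system.slice/myapp.service
cgroup_dir() {
    printf '/sys/fs/cgroup%s\n' "$1"
}

# Print the files most useful when debugging a single unit.
#   $1 = unit name; ControlGroup is empty if the unit is not running.
dump_unit_cgroup() {
    local unit="$1" dir f
    dir="$(cgroup_dir "$(systemctl show -p ControlGroup --value "$unit")")"
    for f in memory.current memory.max memory.events cpu.stat; do
        echo "== $dir/$f"
        cat "$dir/$f"
    done
}
```

Usage on a live system would look like `dump_unit_cgroup myapp.service`, where `myapp.service` is a placeholder unit name.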
Common limit patterns
Memory-cap a service (hard limit)
# systemctl edit myapp.service
[Service]
MemoryMax=4G
# No swap for this service (unit files do not allow trailing comments)
MemorySwapMax=0
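Before applying a hard cap to a service that is already running, it helps to compare the proposed limit with current usage. The sketch below assumes integer values with systemd's base-1024 suffixes (`K`, `M`, `G`, `T`); the helper names `to_bytes` and `fits_under_limit` are hypothetical, and the usage figure is invented for the example.

```shell
# Convert a systemd-style size such as 4G to bytes (suffixes are base 1024).
to_bytes() {
    local number unit
    number="${1%[KMGT]}"
    unit="${1#"$number"}"
    case "$unit" in
        K) echo $(( $number * 1024 )) ;;
        M) echo $(( $number * 1048576 )) ;;
        G) echo $(( $number * 1073741824 )) ;;
        T) echo $(( $number * 1099511627776 )) ;;
        *) echo "$number" ;;
    esac
}

# Exit 0 when current usage is below the proposed limit.
#   $1 = current usage in bytes (the content of memory.current)
#   $2 = proposed MemoryMax value, e.g. 4G
fits_under_limit() {
    local current="$1" limit
    limit="$(to_bytes "$2")"
    [ "$current" -lt "$limit" ]
}

# Invented example: 3 GiB in use, proposed cap of 4G.
if fits_under_limit 3221225472 4G; then
    echo "safe to apply"
else
    echo "usage exceeds the proposed cap; expect reclaim or an OOM kill"
fi
```

On a live system the first argument would come from the unit's `memory.current` file.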
Memory protect-and-throttle (graceful degradation)
[Service]
# Throttle and reclaim above this
MemoryHigh=3G
# OOM kill at this
MemoryMax=4G
# Protect this much from reclaim (best effort)
MemoryLow=1G
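After applying this pattern, the `high` and `max` counters in the unit's `memory.events` file show how often each threshold was hit. The following is a minimal sketch for reading a single counter; the helper name `memory_event` is hypothetical and the sample values are invented.

```shell
# Print one counter from memory.events content read on stdin.
#   $1 = counter name: low | high | max | oom | oom_kill
# Prints 0 when the counter is not present in the input.
memory_event() {
    awk -v key="$1" '
        $1 == key { print $2; found = 1 }
        END       { if (!found) print 0 }'
}

# Invented sample content; on a live system, read the unit's memory.events.
events='low 0
high 42
max 3
oom 1
oom_kill 1'

echo "high events: $(printf '%s\n' "$events" | memory_event high)"
echo "oom kills:   $(printf '%s\n' "$events" | memory_event oom_kill)"
```

A `high` counter that keeps climbing while `max` stays flat suggests the throttle is doing its job; a climbing `max` or `oom_kill` suggests the hard cap is too tight.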
CPU cap (200% = 2 cores)
[Service]
CPUQuota=200%
# Relative weight under contention (default 100)
CPUWeight=200
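To check that a `CPUQuota=` value reached the kernel, compare it with the unit's `cpu.max` file, which holds the quota and the period in microseconds. This sketch computes the expected content, assuming systemd's default period of 100 ms (that is, `CPUQuotaPeriodSec=` is not set); the helper name `quota_to_cpu_max` is hypothetical.

```shell
# Translate a CPUQuota percentage into the expected cpu.max content.
#   $1 = quota, e.g. 200%
#   $2 = period in microseconds (optional, defaults to 100000)
quota_to_cpu_max() {
    local percent="${1%\%}"            # strip the trailing % sign
    local period_usec="${2:-100000}"
    echo "$(( $percent * $period_usec / 100 )) $period_usec"
}

echo "CPUQuota=200% -> cpu.max: $(quota_to_cpu_max 200%)"
```

The result can be compared against, for example, `cat /sys/fs/cgroup/system.slice/myapp.service/cpu.max`, where `myapp.service` is a placeholder.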
Restrict to a CPU set
[Service]
AllowedCPUs=0-3
# Pin memory allocations to NUMA node 0
AllowedMemoryNodes=0
IO QoS (per-device)
[Service]
IOWeight=200
IOReadIOPSMax=/dev/nvme0n1 10000
IOWriteIOPSMax=/dev/nvme0n1 5000
IOReadBandwidthMax=/dev/nvme0n1 100M
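IO directives only take effect when the `io` controller is available to the unit's cgroup, which its `cgroup.controllers` file lists. The following is a minimal sketch of that check; the helper name `has_controller` is hypothetical and the controller list is illustrative.

```shell
# Succeed when controller $2 appears in the space-separated list $1
# (the content of a cgroup.controllers or cgroup.subtree_control file).
has_controller() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# On a live system the list would come from, for example:
#   list="$(cat /sys/fs/cgroup/system.slice/myapp.service/cgroup.controllers)"
list="cpuset cpu io memory pids"       # illustrative content

if has_controller "$list" io; then
    echo "io controller available"
else
    echo "io controller missing: IOWeight= and the IO*Max= limits will have no effect"
fi
```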
Slice-level pool for many services
# /etc/systemd/system/team-a.slice
[Slice]
MemoryMax=32G
CPUWeight=500
Then services join via:
# In each service file
[Service]
Slice=team-a.slice
PSI interpretation
cat /sys/fs/cgroup/system.slice/myapp.service/memory.pressure
# some avg10=12.34 avg60=5.67 avg300=2.10 total=12345678901
# full avg10=2.10 avg60=1.05 avg300=0.50 total=2345678901
- `some` = AT LEAST ONE task stalled; `full` = ALL tasks stalled
- `avg10=12.34` = 12.34% of the last 10 seconds had stalled tasks
- Sustained `some.avg60 > 10%` is a real problem
- `full.*` > 0 means complete stall periods
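The threshold rule above can be checked mechanically. This sketch extracts one average from PSI output and compares it with a limit, using the sample lines shown earlier in this section; the helper names `psi_value` and `psi_exceeds` are hypothetical.

```shell
# Print one average from PSI content read on stdin.
#   $1 = line kind: some | full
#   $2 = field name: avg10 | avg60 | avg300
psi_value() {
    awk -v kind="$1" -v key="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }'
}

# Exit 0 when value $1 is greater than threshold $2 (floating-point compare).
psi_exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

sample='some avg10=12.34 avg60=5.67 avg300=2.10 total=12345678901
full avg10=2.10 avg60=1.05 avg300=0.50 total=2345678901'

some60="$(printf '%s\n' "$sample" | psi_value some avg60)"
if psi_exceeds "$some60" 10; then
    echo "some.avg60=$some60: sustained pressure"
else
    echo "some.avg60=$some60: below the 10% threshold"
fi
```

On a live system, replace the sample with the content of the unit's `memory.pressure`, `cpu.pressure`, or `io.pressure` file.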
Common findings this catches
- App OOM-killed by cgroup but host has free RAM → `MemoryMax=` set too low. Either raise or fix app’s working set.
- CPU throttling visible in `cpu.pressure` despite low total host CPU% → cgroup CPUQuota too tight for bursty workload.
- `memory.events` shows non-zero `oom_kill` while app appears healthy → silent kills; correlate with restart count.
- Service in wrong slice (e.g., system.slice when it should be in a tenant slice) → move with `Slice=` directive.
- IO weights ignored → cgroups v2 `io` controller not enabled; check `cat /sys/fs/cgroup/cgroup.controllers`.
- `MemoryHigh` too aggressive → app runs but slow due to constant reclaim. Either raise High or accept slower app.
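A throttle ratio makes the "CPUQuota too tight" finding concrete. The sketch below computes the share of enforcement periods in which the cgroup was throttled, using the `nr_periods` and `nr_throttled` fields of `cpu.stat`. The helper name `throttle_percent` is hypothetical and the sample numbers are invented.

```shell
# Percentage of enforcement periods in which the cgroup was throttled.
# Reads cpu.stat content on stdin; prints 0.0 when no periods were recorded.
throttle_percent() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else             print "0.0"
        }'
}

# Invented cpu.stat content for illustration.
cat <<'EOF' | throttle_percent
usage_usec 8123456
user_usec 6000000
system_usec 2123456
nr_periods 2000
nr_throttled 500
throttled_usec 4200000
EOF
```

With these sample numbers the script prints `25.0`. A bursty service that is throttled in a large share of periods while the host is mostly idle is a candidate for a higher `CPUQuota=`.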
When to escalate
- Migrating production K8s cluster from cgroup v1 → v2 — major change, coordinate with kubelet+runtime+app teams.
- Custom controllers / unusual configurations — check kernel docs; some features are recent.
- Cross-cgroup memory accounting issues (shared mappings, page cache attribution) — kernel-level domain.
Related prompts
- Kubernetes Resource Limits & OOMKilled Tuning Prompt
  Tune CPU/memory requests and limits to stop OOMKilled, fix throttling, right-size HPA targets, and avoid noisy-neighbor scheduling issues.
- Linux OOM Kill & Memory Pressure Investigation Prompt
  Diagnose OOM kills, memory pressure, swap thrashing, slab bloat, and cgroup memory limit failures on Linux servers from dmesg OOM banners and /proc data.
- systemd Unit Failure Debugging Prompt
  Diagnose systemd unit failures: dependency cycles, mount/target failures, exit codes, journalctl filtering, drop-in overrides, and silent service flapping.