
Cgroups v2 Resource Control Deep Dive Prompt

Configure cgroups v2 — memory, CPU, IO controllers; understand slice/scope/service hierarchy; isolate workloads; debug throttling and accounting.

Target user
Linux platform engineers managing resource isolation
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior Linux platform engineer who has tuned cgroups v2 for production — isolating noisy neighbors, capping memory for containers, applying I/O QoS, debugging throttle storms.

I will provide:
- The goal: cap a workload's resources / isolate / diagnose throttling / migrate from cgroups v1
- Current cgroup layout: `systemd-cgls` or `cat /proc/cgroups`
- For a target workload: which cgroup it's in (`systemctl status <unit>` shows `CGroup:`)
- Distro and systemd version (kernel 4.5 is the bare minimum for cgroups v2, and the v2 `cpu` controller needs 4.15+; in practice use a 5.x kernel and systemd 240+)
- Whether the system is unified-v2 only or hybrid v1+v2

Your job:

1. **Confirm cgroups v2 mode**:
   - `mount | grep cgroup` → look for `cgroup2 on /sys/fs/cgroup type cgroup2`
   - Unified mode is default on modern distros (Fedora 31+, Ubuntu 22.04+, RHEL 9+)
   - Hybrid mode: v1 controllers + v2 at `/sys/fs/cgroup/unified/`
   - **systemd manages cgroups**: every service is in a cgroup
2. **Understand systemd's hierarchy**:
   - **Slices** — `.slice` units; nested resource allocation pools (`system.slice`, `user.slice`, `machine.slice`)
   - **Services** — `.service` units in a slice
   - **Scopes** — `.scope` units for externally-created processes (e.g., `session-c1.scope`)
   - View: `systemctl status <slice>` or `systemd-cgls`
3. **Apply resource limits via systemd unit directives**:
   - **Memory**: `MemoryHigh=`, `MemoryMax=`, `MemoryLow=`, `MemorySwapMax=`
     - `MemoryHigh=` — soft pressure throttle; reclaim under pressure but don't OOM
     - `MemoryMax=` — hard cap; OOM kill when exceeded
     - `MemoryLow=` — protected from reclaim if usage below
   - **CPU**: `CPUWeight=` (1-10000, default 100), `CPUQuota=` (e.g., `200%` for 2 cores worth)
   - **IO**: `IOWeight=` (1-10000, default 100), `IOReadBandwidthMax=`, `IOWriteIOPSMax=`
   - **PIDs**: `TasksMax=`
4. **Apply via drop-in (preferred)**:
   - `sudo systemctl edit <unit>` opens an override editor
   - Add a `[Service]` section with the directives
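   - `systemctl edit` reloads systemd when the editor exits; run `systemctl daemon-reload` only after hand-editing files, then restart the unit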
5. **For runtime (no restart) changes**:
   - `systemctl set-property <unit> MemoryMax=8G`
   - `--runtime` for non-persistent
6. **Debugging cgroup-level pressure**:
   - **PSI (Pressure Stall Information)** — `/sys/fs/cgroup/<slice>/<unit>/memory.pressure`, `cpu.pressure`, `io.pressure`
     - Format: `some avg10=X avg60=Y avg300=Z total=N`
     - `some` = at least one task stalled; `full` = all tasks stalled
     - `avg10` > 10% = noticeable stall
   - **`memory.events`** — high, max, oom, oom_kill counters
   - **`cpu.stat`** — `usage_usec`, `nr_throttled`, `throttled_usec`
   - **`io.stat`**: per-device read/write bytes and I/O operation counts (`rbytes`, `wbytes`, `rios`, `wios`)
7. **For diagnosing OOM in a cgroup**:
   - `memory.events` shows `oom` and `oom_kill` counters
   - `memory.max` shows the limit; `memory.current` shows usage
   - `dmesg | grep -i "oom-kill"` includes the `oom_memcg=` field
8. **For migrating from v1**:
   - Kernel parameter `systemd.unified_cgroup_hierarchy=1` at boot
   - Docker pre-20.10 doesn't support v2; use 20.10+
   - kubelet supports v2 in K8s 1.25+

Mark DESTRUCTIVE: lowering `MemoryMax=` on a running unit whose current usage exceeds the new value (the kernel reclaims what it can, then OOM-kills), setting `CPUWeight=` so low on system-critical services that they starve, modifying `system.slice` resource caps.

---

Goal: [DESCRIBE — cap / isolate / debug / migrate]
Distro + systemd version: [DESCRIBE]
cgroup version: [`mount | grep cgroup`]
Target unit: [DESCRIBE]
Current settings: [from `systemctl show <unit>` filtering for Memory/CPU/IO]
Symptom (if debugging):
[DESCRIBE]

Why this prompt works

Cgroups v2 is the modern resource control mechanism — used by systemd, Kubernetes, Docker, and direct admins — but it’s still poorly understood. PSI metrics are diagnostic gold but most engineers don’t know they exist. This prompt walks the hierarchy and proposes specific limits.

How to use it

  1. Confirm cgroups v2 mode before applying anything. In hybrid mode the resource controllers stay on the v1 hierarchies, so v2-only settings such as MemoryHigh= have no effect.
  2. Apply limits at the right hierarchy level (service for one app, slice for a group).
  3. Monitor PSI after applying limits — it tells you if you’ve throttled too tight.
  4. For Kubernetes workloads, cgroup limits are derived from pod resource requests and limits; tune them in the pod spec, not on the host.
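
Step 1 can be scripted so the check is not skipped. The following is a minimal sketch, assuming the usual mount points (`/sys/fs/cgroup`, plus `/sys/fs/cgroup/unified` in hybrid mode); the function names `classify_cgroup_mode` and `detect_cgroup_mode` are made up for illustration.

```shell
# Classify the cgroup setup from two filesystem types:
#   $1 = type of /sys/fs/cgroup
#   $2 = type of /sys/fs/cgroup/unified (empty if the path does not exist)
classify_cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "unified" ;;
        tmpfs)
            if [ "$2" = "cgroup2fs" ]; then
                echo "hybrid"
            else
                echo "legacy"
            fi ;;
        *) echo "unknown" ;;
    esac
}

# Gather the two inputs with stat; a missing path yields an empty string.
detect_cgroup_mode() {
    classify_cgroup_mode \
        "$(stat -fc %T /sys/fs/cgroup 2>/dev/null)" \
        "$(stat -fc %T /sys/fs/cgroup/unified 2>/dev/null)"
}
```

Calling `detect_cgroup_mode` on the target host prints one word; anything other than `unified` is a cue to stop and sort out the mode first.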

Useful commands

# Verify cgroups v2 mode
mount | grep cgroup
stat -fc %T /sys/fs/cgroup/    # cgroup2fs = v2; tmpfs = v1 or hybrid

# Hierarchy view
systemd-cgls
systemd-cgls /system.slice
systemctl status <unit>        # shows CGroup: line

# Show current settings for a unit
systemctl show <unit> | grep -E "^(Memory|CPU|IO|Tasks)"

# Set a property (applies immediately and persists across reboots)
sudo systemctl set-property <unit> MemoryMax=8G
sudo systemctl set-property <unit> CPUQuota=200%
# Same, but dropped at the next reboot
sudo systemctl set-property --runtime <unit> MemoryMax=4G

# Drop-in (preferred for persistent)
sudo systemctl edit <unit>
# Adds to /etc/systemd/system/<unit>.d/override.conf:
# [Service]
# MemoryMax=8G
# CPUQuota=200%

# View resource pressure (PSI)
cat /sys/fs/cgroup/<slice>/<unit>/memory.pressure
cat /sys/fs/cgroup/<slice>/<unit>/cpu.pressure
cat /sys/fs/cgroup/<slice>/<unit>/io.pressure

# Memory events (OOM history)
cat /sys/fs/cgroup/<slice>/<unit>/memory.events

# Current usage
cat /sys/fs/cgroup/<slice>/<unit>/memory.current
cat /sys/fs/cgroup/<slice>/<unit>/memory.max
cat /sys/fs/cgroup/<slice>/<unit>/cpu.stat
cat /sys/fs/cgroup/<slice>/<unit>/io.stat

# Tasks in this cgroup
cat /sys/fs/cgroup/<slice>/<unit>/cgroup.procs

# systemd-run a command in a transient unit with limits
sudo systemd-run --scope -p MemoryMax=2G -p CPUQuota=100% --slice=myslice.slice ./command

# IO weights
sudo systemctl set-property myapp.service IOWeight=200
sudo systemctl set-property myapp.service IOReadBandwidthMax="/dev/nvme0n1 100M"
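
The `<slice>/<unit>` placeholders above have to be filled in by hand, which is easy to get wrong with nested slices. A minimal sketch of resolving the directory instead, assuming a cgroup2 mount at `/sys/fs/cgroup` and systemd's `ControlGroup` unit property; the helper names `cgroup_dir` and `unit_cgroup_dir` are hypothetical.

```shell
# Map a systemd ControlGroup value (e.g. /system.slice/myapp.service)
# to its directory under the cgroup2 mount.
#   $1 = ControlGroup value, $2 = mount point (default /sys/fs/cgroup)
cgroup_dir() {
    cg_root="${2:-/sys/fs/cgroup}"
    printf '%s%s\n' "${cg_root%/}" "$1"
}

# Ask systemd where a unit lives, then build the path.
# Fails when the unit currently has no cgroup (for example, it is inactive).
unit_cgroup_dir() {
    cg_value="$(systemctl show -p ControlGroup --value "$1")" || return 1
    [ -n "$cg_value" ] || return 1
    cgroup_dir "$cg_value"
}
```

With that in place, `cat "$(unit_cgroup_dir myapp.service)/memory.events"` reads the right file wherever the unit sits in the hierarchy.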

Common limit patterns

Memory-cap a service (hard limit)

# systemctl edit myapp.service
[Service]
MemoryMax=4G
# No swap for this unit
MemorySwapMax=0
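
Before lowering a hard cap on a running unit, it is worth comparing the proposed value with `memory.current`, since a cap below current usage can end in an OOM kill. A rough sketch, assuming systemd's base-1024 reading of the K/M/G/T suffixes for memory sizes; `to_bytes` and `fits_under_limit` are illustrative names, and values such as `infinity` or percentages are not handled.

```shell
# Convert a systemd-style memory size (plain bytes or K/M/G/T suffix,
# base 1024) to bytes.
to_bytes() {
    case "$1" in
        *K) echo $(( ${1%K} * 1024 )) ;;
        *M) echo $(( ${1%M} * 1024 * 1024 )) ;;
        *G) echo $(( ${1%G} * 1024 * 1024 * 1024 )) ;;
        *T) echo $(( ${1%T} * 1024 * 1024 * 1024 * 1024 )) ;;
        *)  echo "$1" ;;
    esac
}

# Succeed only if current usage already fits under the proposed limit.
#   $1 = current usage in bytes (contents of memory.current)
#   $2 = proposed MemoryMax value, e.g. 4G
fits_under_limit() {
    [ "$1" -le "$(to_bytes "$2")" ]
}
```

A typical guard would be `fits_under_limit "$(cat <cgroup dir>/memory.current)" 4G || echo "usage exceeds new cap"`, with the cgroup directory filled in for the unit in question.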

Memory protect-and-throttle (graceful degradation)

[Service]
# Throttle and reclaim above this
MemoryHigh=3G
# OOM kill at this
MemoryMax=4G
# Protect this much from reclaim
MemoryLow=1G
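
Whether these thresholds are actually being hit shows up in the unit's `memory.events` counters (`high`, `max`, `oom`, `oom_kill`). A small sketch of pulling one counter out of that file; the function name `memory_event_count` is hypothetical, and the input is read from stdin so it can be tried on saved output.

```shell
# Print the counter for one event name from memory.events text on stdin.
# Exits non-zero when the event name is not present.
#   $1 = event name, e.g. high, max, oom, oom_kill
memory_event_count() {
    awk -v name="$1" '
        $1 == name { print $2; found = 1 }
        END { if (!found) exit 1 }'
}
```

A steadily rising `high` count suggests the workload keeps running into `MemoryHigh=`, while any `oom_kill` count means the hard cap has already killed something.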

CPU cap (200% = 2 cores)

[Service]
CPUQuota=200%
# Relative weight (default 100)
CPUWeight=200
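
To check that a quota landed as intended, it helps to know what the percentage becomes in the cgroup's `cpu.max` file, which holds a quota and a period in microseconds. The arithmetic below is a sketch that assumes systemd's default period of 100 ms; `cpu_quota_to_cpu_max` is an illustrative name.

```shell
# Translate a CPUQuota percentage into the "<quota> <period>" pair
# expected in cpu.max (both in microseconds).
#   $1 = percentage without the % sign, e.g. 200
#   $2 = period in microseconds (default 100000, i.e. 100 ms)
cpu_quota_to_cpu_max() {
    echo "$(( $1 * ${2:-100000} / 100 )) ${2:-100000}"
}
```

For `CPUQuota=200%` this gives `200000 100000`: the cgroup may use 200 ms of CPU time in every 100 ms window, which is two cores' worth.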

Restrict to a CPU set

[Service]
AllowedCPUs=0-3
# Pin memory allocations to NUMA node 0
AllowedMemoryNodes=0

IO QoS (per-device)

[Service]
IOWeight=200
IOReadIOPSMax=/dev/nvme0n1 10000
IOWriteIOPSMax=/dev/nvme0n1 5000
IOReadBandwidthMax=/dev/nvme0n1 100M

Slice-level pool for many services

# /etc/systemd/system/team-a.slice
[Slice]
MemoryMax=32G
CPUWeight=500

Then services join via:

# In each service file
[Service]
Slice=team-a.slice

PSI interpretation

cat /sys/fs/cgroup/system.slice/myapp.service/memory.pressure
# some avg10=12.34 avg60=5.67 avg300=2.10 total=12345678901
# full avg10=2.10 avg60=1.05 avg300=0.50 total=2345678901
  • some = AT LEAST ONE task stalled; full = ALL tasks stalled
  • avg10=12.34 = 12.34% of the last 10 seconds had stalled tasks
  • Sustained some.avg60 > 10% is a real problem
  • full.* > 0 means complete stall periods
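
The thresholds above can be turned into a quick check. This is a minimal sketch that reads PSI text on stdin and compares `some avg10` with a threshold; the function names `psi_some_avg10` and `psi_stalled` are hypothetical, and the 10% default is only the rule of thumb from this page.

```shell
# Print the avg10 value from the "some" line of PSI text on stdin.
psi_some_avg10() {
    awk '$1 == "some" {
        for (i = 2; i <= NF; i++)
            if ($i ~ /^avg10=/) { sub(/^avg10=/, "", $i); print $i }
    }'
}

# Succeed when some.avg10 is above the threshold.
#   $1 = threshold in percent (default 10)
psi_stalled() {
    psi_some_avg10 | awk -v t="${1:-10}" '
        BEGIN { rc = 1 }
        { if ($1 + 0 > t + 0) rc = 0 }
        END { exit rc }'
}
```

For example, `psi_stalled 10 < /sys/fs/cgroup/system.slice/myapp.service/memory.pressure && echo "memory pressure"` flags a unit whose tasks have been stalling on memory in the last 10 seconds.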

Common findings this catches

  • App OOM-killed by cgroup but host has free RAM → MemoryMax= set too low. Either raise or fix app’s working set.
  • CPU throttling (rising nr_throttled in cpu.stat, elevated cpu.pressure) despite low total host CPU% → cgroup CPUQuota too tight for a bursty workload.
  • memory.events shows non-zero oom_kill while app appears healthy → silent kills; correlate with restart count.
  • Service in wrong slice (e.g., system.slice when it should be in a tenant slice) → move with Slice= directive.
  • IO weights ignored → the cgroups v2 io controller is not enabled for that subtree; check that io is listed in /sys/fs/cgroup/cgroup.controllers and in the parent cgroup's cgroup.subtree_control.
  • MemoryHigh too aggressive → app runs but slow due to constant reclaim. Either raise High or accept slower app.
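
For the CPU throttling case, a raw `nr_throttled` count is hard to judge on its own; the share of scheduling periods that were throttled is easier to read. A sketch, assuming the `nr_periods` and `nr_throttled` fields that `cpu.stat` exposes when the cpu controller is enabled; `throttled_percent` is an illustrative name.

```shell
# Print the percentage of enforcement periods in which the cgroup was
# throttled, from cpu.stat text on stdin.
# Prints 0.0 when nr_periods is zero or missing (no quota in effect).
throttled_percent() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else             printf "0.0\n"
        }'
}
```

Because the counters are cumulative since the cgroup was created, sampling twice and comparing the deltas gives a better picture of current throttling than a single reading.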

When to escalate

  • Migrating production K8s cluster from cgroup v1 → v2 — major change, coordinate with kubelet+runtime+app teams.
  • Custom controllers / unusual configurations — check kernel docs; some features are recent.
  • Cross-cgroup memory accounting issues (shared mappings, page cache attribution) — kernel-level domain.
