You are a senior Linux platform engineer who has tuned cgroups v2 for production: isolating noisy neighbors, capping memory for containers, applying I/O QoS, debugging throttle storms.

I will provide:
- The goal: cap a workload's resources / isolate / diagnose throttling / migrate from cgroups v1
- Current cgroup layout: `systemd-cgls` or `cat /proc/cgroups`
- For a target workload: which cgroup it's in (`systemctl status <unit>` shows `CGroup:`)
- Distro and systemd version (cgroups v2 requires systemd 240+ and kernel 4.5+ at minimum; a 5.x kernel is recommended)
- Whether the system is unified-v2 only or hybrid v1+v2

Your job:

1. **Confirm cgroups v2 mode**:
   - `mount | grep cgroup` → look for `cgroup2 on /sys/fs/cgroup type cgroup2`
   - Unified mode is the default on modern distros (Fedora 31+, Ubuntu 22.04+, RHEL 9+)
   - Hybrid mode: v1 controllers + v2 at `/sys/fs/cgroup/unified/`
   - **systemd manages cgroups**: every service is in a cgroup
2. **Understand systemd's hierarchy**:
   - **Slices**: `.slice` units; nested resource allocation pools (`system.slice`, `user.slice`, `machine.slice`)
   - **Services**: `.service` units in a slice
   - **Scopes**: `.scope` units for externally created processes (e.g., `session-c1.scope`)
   - View: `systemctl status <slice>` or `systemd-cgls`
3. **Apply resource limits via systemd unit directives**:
   - **Memory**: `MemoryHigh=`, `MemoryMax=`, `MemoryLow=`, `MemorySwapMax=`
     - `MemoryHigh=`: soft throttle; usage above it is reclaimed and the workload is slowed, but not OOM-killed
     - `MemoryMax=`: hard cap; OOM kill when usage cannot be kept under it
     - `MemoryLow=`: memory below this amount is protected from reclaim (best effort)
   - **CPU**: `CPUWeight=` (1-10000, default 100), `CPUQuota=` (e.g., `200%` for 2 cores' worth)
   - **IO**: `IOWeight=` (1-10000, default 100), `IOReadBandwidthMax=`, `IOWriteIOPSMax=`
   - **PIDs**: `TasksMax=`
4. **Apply via drop-in (preferred)**:
   - `sudo systemctl edit <unit>` opens an override editor
   - Add a `[Service]` section with the directives
   - Run `systemctl daemon-reload` if you edited files by hand, then restart the unit
5. **For runtime (no restart) changes**:
   - `systemctl set-property <unit> MemoryMax=8G`
   - Add `--runtime` to make the change non-persistent
6. **Debugging cgroup-level pressure**:
   - **PSI (Pressure Stall Information)**: `/sys/fs/cgroup/<slice>/<unit>/memory.pressure`, `cpu.pressure`, `io.pressure`
     - Format: `some avg10=X avg60=Y avg300=Z total=N`
     - `some` = at least one task stalled; `full` = all tasks stalled
     - `avg10` > 10% = noticeable stall
   - **`memory.events`**: `high`, `max`, `oom`, `oom_kill` counters
   - **`cpu.stat`**: `usage_usec`, `nr_throttled`, `throttled_usec`
   - **`io.stat`**: per-device read/write bytes and I/O operation counts
7. **For diagnosing OOM in a cgroup**:
   - `memory.events` shows `oom` and `oom_kill` counters
   - `memory.max` shows the limit; `memory.current` shows usage
   - `dmesg | grep -i "oom-kill"` includes the `oom_memcg=` field
8. **For migrating from v1**:
   - Kernel parameter `systemd.unified_cgroup_hierarchy=1` at boot
   - Docker before 20.10 doesn't support v2; use 20.10+
   - kubelet supports v2 in K8s 1.25+

Mark DESTRUCTIVE: lowering `MemoryMax=` on a running process whose current usage exceeds the new value (can trigger an immediate OOM kill), setting `CPUWeight=` so low on system-critical services that they starve, modifying `system.slice` resource caps.

---

Goal: [DESCRIBE: cap / isolate / debug / migrate]
Distro + systemd version: [DESCRIBE]
cgroup version: [`mount | grep cgroup`]
Target unit: [DESCRIBE]
Current settings: [from `systemctl show <unit>` filtering for Memory/CPU/IO]
Symptom (if debugging): [DESCRIBE]

Why this prompt works

Cgroups v2 is the modern resource control mechanism, used by systemd, Kubernetes, Docker, and administrators managing hosts directly, but it is still poorly understood. PSI metrics are diagnostic gold, yet most engineers don't know they exist. This prompt walks the hierarchy and proposes specific limits.

How to use it

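Confirm cgroups v2 mode before applying anything. In hybrid mode, resource control typically still runs through the v1 controllers, so v2-only settings such as MemoryHigh= and the per-cgroup PSI files may be unavailable.
Apply limits at the right hierarchy level (the service for one app, a slice for a group of services).
Monitor PSI after applying limits; it tells you whether you have throttled too tightly.
For Kubernetes workloads, cgroup limits are derived from pod resource requests and limits; tune them in the pod spec, not on the host.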
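
The mode check in the first step can be scripted so it runs before any limit is applied. The following is a minimal sketch, not a definitive detector: `classify_cgroup_mode` and `detect_cgroup_mode` are hypothetical helper names, and the hybrid test assumes the v2 tree is mounted at `/sys/fs/cgroup/unified`, as described in the prompt.

```shell
# Classify the cgroup setup from two facts:
#   $1 = filesystem type of /sys/fs/cgroup (output of: stat -fc %T /sys/fs/cgroup/)
#   $2 = "yes" if /sys/fs/cgroup/unified exists, anything else otherwise
# classify_cgroup_mode is a hypothetical helper name, not a systemd tool.
classify_cgroup_mode() {
    case "$1" in
        cgroup2fs) echo "unified" ;;
        tmpfs)
            if [ "$2" = "yes" ]; then echo "hybrid"; else echo "legacy"; fi ;;
        *) echo "unknown" ;;
    esac
}

# Gather the two facts from the running host and classify them.
detect_cgroup_mode() {
    fstype=$(stat -fc %T /sys/fs/cgroup/ 2>/dev/null)
    if [ -d /sys/fs/cgroup/unified ]; then has_unified=yes; else has_unified=no; fi
    classify_cgroup_mode "$fstype" "$has_unified"
}
```

Calling `detect_cgroup_mode` on a host prints one word; anything other than `unified` is a signal to stop and review before applying v2 directives.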

Useful commands

# Verify cgroups v2 mode
mount | grep cgroup
stat -fc %T /sys/fs/cgroup/    # cgroup2fs = v2; tmpfs = v1 or hybrid

# Hierarchy view
systemd-cgls
systemd-cgls /system.slice
systemctl status <unit>        # shows CGroup: line

# Show current settings for a unit
systemctl show <unit> | grep -E "^(Memory|CPU|IO|Tasks)"

# Set a property (applied immediately and persisted across reboots)
sudo systemctl set-property <unit> MemoryMax=8G
sudo systemctl set-property <unit> CPUQuota=200%
sudo systemctl set-property --runtime <unit> MemoryMax=4G    # non-persistent, lost on reboot

# Drop-in (preferred for persistent)
sudo systemctl edit <unit>
# Adds to /etc/systemd/system/<unit>.d/override.conf:
# [Service]
# MemoryMax=8G
# CPUQuota=200%

# View resource pressure (PSI)
cat /sys/fs/cgroup/<slice>/<unit>/memory.pressure
cat /sys/fs/cgroup/<slice>/<unit>/cpu.pressure
cat /sys/fs/cgroup/<slice>/<unit>/io.pressure

# Memory events (OOM history)
cat /sys/fs/cgroup/<slice>/<unit>/memory.events

# Current usage
cat /sys/fs/cgroup/<slice>/<unit>/memory.current
cat /sys/fs/cgroup/<slice>/<unit>/memory.max
cat /sys/fs/cgroup/<slice>/<unit>/cpu.stat
cat /sys/fs/cgroup/<slice>/<unit>/io.stat

# Tasks in this cgroup
cat /sys/fs/cgroup/<slice>/<unit>/cgroup.procs

# systemd-run a command in a transient unit with limits
sudo systemd-run --scope -p MemoryMax=2G -p CPUQuota=100% --slice=myslice.slice ./command

# IO weights
sudo systemctl set-property myapp.service IOWeight=200
sudo systemctl set-property myapp.service IOReadBandwidthMax="/dev/nvme0n1 100M"
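
The `<slice>/<unit>` paths above have to be filled in by hand, which is easy to get wrong for nested slices. A possible alternative is to ask systemd for the unit's cgroup. This is a sketch under two assumptions: the unified hierarchy is mounted at `/sys/fs/cgroup`, and the installed `systemctl` supports `show -p ControlGroup --value`. The function names are hypothetical.

```shell
# Turn a systemd ControlGroup value (e.g. /system.slice/myapp.service)
# into a directory under the unified mount point.
# Assumption: cgroup2 is mounted at /sys/fs/cgroup.
cgroup_dir_from_control_group() {
    printf '/sys/fs/cgroup%s\n' "$1"
}

# Ask systemd for a unit's cgroup instead of guessing <slice>/<unit>.
# Returns non-zero if systemctl fails or the unit has no cgroup (not running).
unit_cgroup_dir() {
    cg=$(systemctl show -p ControlGroup --value "$1") || return 1
    [ -n "$cg" ] || return 1
    cgroup_dir_from_control_group "$cg"
}
```

With this in place, `cat "$(unit_cgroup_dir myapp.service)/memory.current"` reads the same file as the hand-built path, where `myapp.service` is a placeholder unit name.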

Common limit patterns

Memory-cap a service (hard limit)

# systemctl edit myapp.service
[Service]
MemoryMax=4G
# No swap (unit files only allow comments on their own line)
MemorySwapMax=0
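
Because lowering MemoryMax= below current usage is flagged as destructive, it can help to compare `memory.current` with the proposed cap first. The sketch below assumes systemd's size suffixes (K, M, G, T, base 1024); `to_bytes` and `fits_under_limit` are hypothetical helper names.

```shell
# Convert a systemd-style size (plain bytes, or a K/M/G/T suffix with base 1024)
# to bytes. Other suffixes and "infinity" are not handled in this sketch.
to_bytes() {
    awk -v s="$1" 'BEGIN {
        n = s
        sub(/[KMGT]$/, "", n)              # numeric part without the suffix
        u = substr(s, length(s), 1)        # last character, maybe a suffix
        m = 1
        if (u == "K") m = 1024
        else if (u == "M") m = 1024 ^ 2
        else if (u == "G") m = 1024 ^ 3
        else if (u == "T") m = 1024 ^ 4
        printf "%.0f\n", n * m
    }'
}

# Succeed only if current usage in bytes ($1) is below the proposed limit ($2).
fits_under_limit() {
    limit=$(to_bytes "$2")
    awk -v cur="$1" -v lim="$limit" 'BEGIN { exit !(cur + 0 < lim + 0) }'
}

# Example use (path and unit name are placeholders):
#   cur=$(cat /sys/fs/cgroup/system.slice/myapp.service/memory.current)
#   fits_under_limit "$cur" 4G && sudo systemctl set-property myapp.service MemoryMax=4G
```

The check is a point-in-time snapshot, so usage can still grow past the cap between the check and the change.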

Memory protect-and-throttle (graceful degradation)

[Service]
# Throttle and reclaim above this
MemoryHigh=3G
# OOM kill at this
MemoryMax=4G
# Protect this much from reclaim (best effort)
MemoryLow=1G
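
To see which of these thresholds a service is actually hitting, the counters in `memory.events` can be read individually. A minimal sketch follows; `mem_event` is a hypothetical helper that reads the file's `key value` lines from standard input.

```shell
# Print the counter for one key from memory.events content on stdin.
# Prints 0 if the key is absent (older kernels expose fewer keys).
mem_event() {
    awk -v key="$1" '
        $1 == key { print $2; found = 1 }
        END { if (!found) print 0 }
    '
}

# Example use (path is a placeholder):
#   mem_event high     < /sys/fs/cgroup/system.slice/myapp.service/memory.events
#   mem_event oom_kill < /sys/fs/cgroup/system.slice/myapp.service/memory.events
```

A rising `high` counter with `oom_kill` at zero suggests MemoryHigh= is doing its job as a throttle; a rising `oom_kill` means the hard cap is being reached.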

CPU cap (200% = 2 cores)

[Service]
CPUQuota=200%
# Relative weight under contention (default 100)
CPUWeight=200
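
It can be useful to know what a quota turns into in the kernel interface and how often it bites. The sketch below assumes the default 100 ms period, so `CPUQuota=200%` corresponds to `200000 100000` in `cpu.max`, and it handles whole-number percentages only. `quota_to_cpu_max` and `throttle_pct` are hypothetical helper names.

```shell
# Map a CPUQuota= percentage to the "<quota_usec> <period_usec>" pair
# found in cpu.max. $2 is the period in microseconds (default 100000,
# an assumption matching a 100 ms CPUQuotaPeriodSec=).
quota_to_cpu_max() {
    pct=${1%\%}                 # strip a trailing % sign
    period=${2:-100000}
    echo "$((pct * period / 100)) $period"
}

# From cpu.stat content on stdin, print the share of enforcement periods
# in which the cgroup was throttled, as a percentage with one decimal.
throttle_pct() {
    awk '
        $1 == "nr_periods"   { p = $2 }
        $1 == "nr_throttled" { t = $2 }
        END { if (p > 0) printf "%.1f\n", 100 * t / p; else print "0.0" }
    '
}

# Example use (path is a placeholder):
#   throttle_pct < /sys/fs/cgroup/system.slice/myapp.service/cpu.stat
```

The counters are cumulative since the cgroup was created, so comparing two readings taken some time apart gives a better picture than a single value.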

Restrict to a CPU set

[Service]
AllowedCPUs=0-3
# NUMA pin
AllowedMemoryNodes=0

IO QoS (per-device)

[Service]
IOWeight=200
IOReadIOPSMax=/dev/nvme0n1 10000
IOWriteIOPSMax=/dev/nvme0n1 5000
IOReadBandwidthMax=/dev/nvme0n1 100M
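
To verify that per-device limits have an effect, the matching counters in `io.stat` can be sampled. The kernel keys each line by `MAJ:MIN` rather than by device path, so this sketch takes the device number as input; `io_stat_field` is a hypothetical helper name, and the `lsblk` call in the comment is one assumed way to look the number up.

```shell
# From io.stat content on stdin, print one field (rbytes, wbytes, rios, wios, ...)
# for the device given as MAJ:MIN.
io_stat_field() {
    awk -v dev="$1" -v key="$2" '
        $1 == dev {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")          # each field looks like key=value
                if (kv[1] == key) print kv[2]
            }
        }
    '
}

# Example use (device and path are placeholders):
#   dev=$(lsblk -dno MAJ:MIN /dev/nvme0n1)
#   io_stat_field "$dev" rbytes < /sys/fs/cgroup/system.slice/myapp.service/io.stat
```

Sampling `rbytes` twice, a second apart, gives an observed read rate that can be compared against IOReadBandwidthMax=.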

Slice-level pool for many services

# /etc/systemd/system/team-a.slice
[Slice]
MemoryMax=32G
CPUWeight=500

Then services join via:

# In each service file
[Service]
Slice=team-a.slice
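
One detail worth knowing when looking for a slice under `/sys/fs/cgroup`: systemd encodes the slice tree in the unit name, with each dash marking a level, so `team-a.slice` sits inside `team.slice`. The sketch below derives the relative cgroup path from a slice name. `slice_cgroup_path` is a hypothetical helper, and it does not handle escaped dashes (`\x2d`) or the root slice `-.slice`.

```shell
# Convert a slice unit name into its path relative to the cgroup root.
# Each dash-separated prefix of the name is a parent slice:
#   a-b-c.slice  ->  a.slice/a-b.slice/a-b-c.slice
slice_cgroup_path() {
    awk -v s="$1" 'BEGIN {
        sub(/\.slice$/, "", s)
        n = split(s, parts, "-")
        prefix = ""
        path = ""
        for (i = 1; i <= n; i++) {
            prefix = (i == 1) ? parts[i] : prefix "-" parts[i]
            path   = (i == 1) ? prefix ".slice" : path "/" prefix ".slice"
        }
        print path
    }'
}

# Example use:
#   ls "/sys/fs/cgroup/$(slice_cgroup_path team-a.slice)"
```

If a flat, top-level pool is intended, a name without dashes (for example a hypothetical `teama.slice`) avoids the implicit parent.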

PSI interpretation

cat /sys/fs/cgroup/system.slice/myapp.service/memory.pressure
# some avg10=12.34 avg60=5.67 avg300=2.10 total=12345678901
# full avg10=2.10 avg60=1.05 avg300=0.50 total=2345678901

some = AT LEAST ONE task stalled; full = ALL tasks stalled
avg10=12.34 = 12.34% of the last 10 seconds had stalled tasks
Sustained some.avg60 > 10% is a real problem
full.* > 0 means complete stall periods
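
The thresholds above can be checked mechanically. This is a minimal sketch that parses the PSI format shown and compares one average against a threshold; `psi_value` and `psi_exceeds` are hypothetical helper names, and the 10% cut-off is the rule of thumb from this section, not a kernel constant.

```shell
# From a *.pressure file on stdin, print one value.
#   $1 = line kind: some | full
#   $2 = field:     avg10 | avg60 | avg300 | total
psi_value() {
    awk -v kind="$1" -v key="$2" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == key) print kv[2]
            }
        }
    '
}

# Succeed if the selected PSI value on stdin is greater than threshold $3.
psi_exceeds() {
    value=$(psi_value "$1" "$2")
    awk -v v="$value" -v t="$3" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# Example use (path is a placeholder):
#   f=/sys/fs/cgroup/system.slice/myapp.service/memory.pressure
#   psi_exceeds some avg60 10 < "$f" && echo "sustained memory pressure"
```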

Common findings this catches

App OOM-killed by cgroup but host has free RAM → MemoryMax= set too low. Either raise it or shrink the app's working set.
CPU throttling (rising nr_throttled in cpu.stat, elevated cpu.pressure) despite low total host CPU% → CPUQuota= too tight for a bursty workload.
memory.events shows non-zero oom_kill while app appears healthy → silent kills; correlate with restart count.
Service in wrong slice (e.g., system.slice when it should be in a tenant slice) → move with Slice= directive.
IO weights ignored → io controller not enabled for that subtree; check /sys/fs/cgroup/cgroup.controllers and the parent cgroup's cgroup.subtree_control.
MemoryHigh= too aggressive → app runs but slowly due to constant reclaim. Either raise MemoryHigh= or accept a slower app.
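
For the "IO weights ignored" case, the controller check can be scripted. A minimal sketch: `controller_enabled` is a hypothetical helper that reads the space-separated controller list from `cgroup.controllers` or `cgroup.subtree_control` on standard input and matches whole names only, so `cpu` does not match `cpuset`.

```shell
# Succeed if controller $1 appears in the controller list read from stdin.
controller_enabled() {
    tr ' ' '\n' | grep -qx "$1"
}

# Example use (paths are placeholders):
#   controller_enabled io < /sys/fs/cgroup/cgroup.controllers \
#       || echo "io controller not available on this host"
#   controller_enabled io < /sys/fs/cgroup/system.slice/cgroup.subtree_control \
#       || echo "io controller not delegated to services in system.slice"
```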

When to escalate

Migrating production K8s cluster from cgroup v1 → v2 — major change, coordinate with kubelet+runtime+app teams.
Custom controllers / unusual configurations — check kernel docs; some features are recent.
Cross-cgroup memory accounting issues (shared mappings, page cache attribution) — kernel-level domain.

Cgroups v2 Resource Control Deep Dive Prompt

Related prompts

Kubernetes Resource Limits & OOMKilled Tuning Prompt

Linux OOM Kill & Memory Pressure Investigation Prompt

systemd Unit Failure Debugging Prompt
