Seccomp and AppArmor: Shrinking the Syscall Attack Surface

A typical container can invoke over 300 Linux syscalls. A typical web app uses maybe 60 of them. That gap, the couple hundred syscalls your app never calls but the kernel still happily accepts from it, is exactly where container escapes and kernel exploits live. Many serious container-breakout CVEs rely on a syscall the workload had no business making.

Seccomp and AppArmor are the two kernel-level tools that close that gap. Seccomp filters which syscalls a process may make; AppArmor restricts which files, capabilities, and operations it may use. Together they shrink a compromised container from “has the full kernel API to play with” to “can do almost nothing it wasn’t built to do.” Here’s how to use them without breaking your apps.

Seccomp: start with RuntimeDefault

The single highest-leverage thing you can do is turn on the runtime’s default seccomp profile, which blocks a few dozen dangerous syscalls (Docker documents its default as disabling around 44 of the 300+; examples include keyctl, mount, kexec_load, and other obscure kernel interfaces) while allowing everything a normal app needs. The exact list depends on the container runtime and its version. Astonishingly, Kubernetes has historically run pods unconfined unless told otherwise, so a huge number of clusters have no seccomp filtering at all.

Turn it on at the pod level:

spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault

You can also set --seccomp-default on the kubelet (or seccompDefault: true in the kubelet configuration file) to make RuntimeDefault the default for every pod on that node that doesn’t specify a profile; set it on all nodes for a cluster-wide default. Do that. RuntimeDefault is well-tested, breaks almost nothing, and removes a meaningful chunk of attack surface for free. If you do one thing from this article, do this.
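Before flipping the default, it can help to know which containers currently run without a profile. The sketch below is one way to check a pod spec that has already been loaded into a Python dict (for example from JSON output of the API). The function names are hypothetical, and it assumes the kubelet default is not yet enabled, so a missing profile is treated as unconfined.

```python
def effective_seccomp_types(pod_spec: dict) -> dict:
    """Map each container name to its effective seccomp profile type.

    A container-level securityContext.seccompProfile overrides the
    pod-level one; None means neither level sets a profile.
    """
    pod_ctx = pod_spec.get("securityContext") or {}
    pod_profile = pod_ctx.get("seccompProfile") or {}
    result = {}
    for key in ("initContainers", "containers"):
        for container in pod_spec.get(key) or []:
            ctx = container.get("securityContext") or {}
            profile = ctx.get("seccompProfile") or pod_profile
            result[container["name"]] = profile.get("type")
    return result


def unprotected_containers(pod_spec: dict) -> list:
    """Names of containers with no profile set or an explicit Unconfined."""
    return sorted(
        name
        for name, kind in effective_seccomp_types(pod_spec).items()
        if kind in (None, "Unconfined")
    )
```

Running unprotected_containers over every pod spec in a namespace gives a rough list of workloads to watch when the default changes.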

Custom seccomp profiles for the paranoid path

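RuntimeDefault is broad. For high-value workloads you can go further with a custom profile that allows only the syscalls the app actually uses: a default-deny allowlist instead of a default-allow blocklist. The list below is abbreviated for illustration; a real profile also needs syscalls such as execve and exit_group just to start and stop the process: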

{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat",
                "mmap", "brk", "rt_sigaction", "epoll_wait",
                "accept4", "futex", "nanosleep"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}

The hard part is knowing which syscalls to allow. Don’t guess — record them. Tools like the Security Profiles Operator can run a workload in a recording mode and generate the exact seccomp profile from observed syscalls, then you apply it. That turns a tedious, error-prone task into “run the app through its paces, get a profile.” Build the profile from a real exercise of the app (including its startup, shutdown, and error paths), or the first uncommon code path will hit a blocked syscall in production.
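To make the "record, then generate" idea concrete, here is a minimal sketch that does not use the Security Profiles Operator at all. It swaps in a simpler technique: parsing strace output that you captured yourself and emitting an allowlist profile. The function names are hypothetical, and it assumes standard strace line formats (optionally prefixed with a pid, as with strace -f).

```python
import json
import re

# Matches the syscall name at the start of an strace line, with an
# optional "[pid N] " or bare "N " prefix as produced by strace -f.
_SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?([a-z_][a-z0-9_]*)\(")


def syscalls_from_strace(lines):
    """Return the sorted list of distinct syscall names seen in strace output."""
    names = set()
    for line in lines:
        match = _SYSCALL_RE.match(line.strip())
        if match:
            names.add(match.group(1))
    return sorted(names)


def build_allowlist_profile(names, arch="SCMP_ARCH_X86_64"):
    """Build a default-deny profile that allows only the given syscalls."""
    return {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [arch],
        "syscalls": [
            {"names": sorted(set(names)), "action": "SCMP_ACT_ALLOW"}
        ],
    }


def profile_json(lines):
    """Convenience wrapper: strace lines in, profile JSON text out."""
    return json.dumps(build_allowlist_profile(syscalls_from_strace(lines)), indent=2)
```

A profile built this way only covers what the traced process did during the recording, and the container runtime may need additional syscalls to start the container, so treat the output as a starting point to test in a logging mode rather than a finished profile.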

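You reference a custom profile as a Localhost type. The localhostProfile path is relative to the kubelet’s seccomp profile directory on the node, and the file must already exist on every node where the pod can run: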

seccompProfile:
  type: Localhost
  localhostProfile: profiles/api-restricted.json
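As a small illustration of how that relative path maps onto the node, the sketch below resolves a localhostProfile value against the seccomp directory. It assumes the kubelet root directory is the default /var/lib/kubelet (it can be changed with --root-dir), and the helper name is hypothetical.

```python
from pathlib import PurePosixPath

# Assumed default: kubelet root dir /var/lib/kubelet, with seccomp
# profiles stored in its seccomp/ subdirectory.
DEFAULT_SECCOMP_ROOT = PurePosixPath("/var/lib/kubelet/seccomp")


def resolve_localhost_profile(localhost_profile: str,
                              root: PurePosixPath = DEFAULT_SECCOMP_ROOT) -> PurePosixPath:
    """Return the node path that a localhostProfile value refers to.

    Rejects absolute paths and ".." segments, which would point
    outside the seccomp root.
    """
    relative = PurePosixPath(localhost_profile)
    if relative.is_absolute() or ".." in relative.parts:
        raise ValueError(
            f"not a relative path under the seccomp root: {localhost_profile}"
        )
    return root / relative
```

Under that assumption, profiles/api-restricted.json would need to exist at /var/lib/kubelet/seccomp/profiles/api-restricted.json on each node.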

AppArmor: restricting the rest

Seccomp governs syscalls; AppArmor governs resources — which files a process can read or write, which capabilities it holds, whether it can mount or change network config. They’re complementary. An AppArmor profile for a web app might deny all writes outside its working directory and /tmp, deny raw network access, and drop the ability to load kernel modules.

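In Kubernetes 1.30 and later you set it via securityContext (older clusters use the container.apparmor.security.beta.kubernetes.io/<container-name> annotation instead):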

spec:
  securityContext:
    appArmorProfile:
      type: Localhost
      localhostProfile: k8s-apparmor-api

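A minimal profile that allows the app’s real I/O and explicitly denies the dangerous bits. An enforced AppArmor profile denies any file access it does not allow, and deny rules always override allow rules, so the profile lists its writable paths instead of pairing a blanket write deny with exceptions. A real web app would likely also need execute and network rules, omitted here:

profile k8s-apparmor-api flags=(attach_disconnected) {
  #include <abstractions/base>

  # Only the paths listed here are accessible. A blanket "deny /** w,"
  # is unnecessary and would also override the /tmp write rule below.
  /app/** r,
  /tmp/** rw,

  # Explicit denies for sensitive files and operations.
  deny /etc/shadow rwx,
  deny mount,
  deny /proc/sys/** w,
}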

The profile must be loaded on every node where the pod might land — that’s the operational catch with AppArmor. The Security Profiles Operator helps here too by distributing profiles to nodes, so you’re not hand-loading them and hoping the scheduler cooperates.
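One way to verify the "loaded on every node" requirement is to compare the profile listing from each node. On a node with AppArmor enabled, the kernel exposes loaded profiles in /sys/kernel/security/apparmor/profiles, one "name (mode)" entry per line. The sketch below parses that text; how you collect it from each node (SSH, a DaemonSet, and so on) is left open, and the function names are hypothetical.

```python
def parse_loaded_profiles(text: str) -> dict:
    """Parse lines of the form 'name (mode)' into a {name: mode} dict."""
    profiles = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith(")") or " (" not in line:
            continue  # skip blank or unexpected lines
        name, mode = line[:-1].rsplit(" (", 1)
        profiles[name] = mode
    return profiles


def nodes_missing_profile(profile: str, per_node_listing: dict) -> list:
    """Return nodes where the profile is absent or not in enforce mode.

    per_node_listing maps a node name to the text of that node's
    AppArmor profiles file.
    """
    missing = []
    for node, text in per_node_listing.items():
        if parse_loaded_profiles(text).get(profile) != "enforce":
            missing.append(node)
    return sorted(missing)
```

A non-empty result for k8s-apparmor-api means a pod scheduled to one of those nodes would either fail to start or run under a profile that is not enforcing.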

Why both, and why it’s worth it

Defense in depth is the whole argument. Seccomp stops the syscall an exploit needs; AppArmor stops the file write or capability use it needs. An attacker who finds a way past one often runs straight into the other. Stack them with non-root execution and dropped capabilities and a compromised container is a remarkably unproductive place to be — no privileged syscalls, no writable disk, no dangerous file access, no escalation.

The classic example: many container-escape exploits need unshare, mount, or ptrace. RuntimeDefault already blocks several of those. A tight custom profile blocks the rest. The CVE that makes headlines next quarter very likely depends on a syscall your app never calls.
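To see why an allowlist covers the syscalls you never thought about, here is a simplified model of how a profile resolves a syscall name to an action. It is a sketch only: it matches on names, ignores argument filters and architecture handling, and is not how the kernel or libseccomp evaluates filters internally. EXAMPLE_PROFILE mirrors the custom profile shown earlier in this article.

```python
EXAMPLE_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64"],
    "syscalls": [
        {
            "names": ["read", "write", "openat", "close", "fstat",
                      "mmap", "brk", "rt_sigaction", "epoll_wait",
                      "accept4", "futex", "nanosleep"],
            "action": "SCMP_ACT_ALLOW",
        }
    ],
}


def seccomp_action(profile: dict, syscall: str) -> str:
    """Return the action a profile assigns to a syscall name.

    Simplified: the first rule naming the syscall wins; anything not
    named falls through to defaultAction.
    """
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    return profile["defaultAction"]


def blocked(profile: dict, syscalls) -> list:
    """Return the subset of syscalls the profile does not allow."""
    return [s for s in syscalls if seccomp_action(profile, s) != "SCMP_ACT_ALLOW"]
```

With this profile, blocked(EXAMPLE_PROFILE, ["unshare", "mount", "ptrace", "read"]) returns ["unshare", "mount", "ptrace"]: none of the three was ever listed, so the default action rejects them without anyone having to anticipate them.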

Rolling it out without breakage

Restrictive profiles fail closed, so a missed syscall or file path means the app breaks — roll out carefully:

1. Turn on RuntimeDefault everywhere first. It’s safe and high-value. Set --seccomp-default on kubelets.
2. Record profiles for high-value workloads using the Security Profiles Operator, exercising every real code path.
3. Run profiles in a non-blocking mode before enforcing, to catch the path you missed. AppArmor has complain mode; for seccomp, set the default action to SCMP_ACT_LOG so unlisted syscalls are logged instead of rejected.
4. Enforce per workload, watching for EPERM errors and AppArmor denials in the logs.
5. Keep profiles in Git and treat changes as security-critical review items, because a loosened profile is a quiet downgrade.
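For the audit and enforce steps above, most of the work is reading logs. The sketch below pulls the useful fields out of audit-style log lines. It assumes the common key=value layout, with seccomp events tagged type=SECCOMP (or the numeric 1326 in kernel logs) and AppArmor events carrying an apparmor="DENIED" or apparmor="ALLOWED" field (the latter is what complain mode logs). Check the field names against your own node logs before relying on it; the function names are hypothetical.

```python
import re

# key=value pairs; a value is either double-quoted or runs to the next space.
_FIELD_RE = re.compile(r'(\w+)=("([^"]*)"|\S+)')


def parse_fields(line: str) -> dict:
    """Split an audit-style log line into a {key: value} dict."""
    fields = {}
    for key, raw, quoted in _FIELD_RE.findall(line):
        fields[key] = quoted if raw.startswith('"') else raw
    return fields


def classify_event(line: str):
    """Return a summary dict for a seccomp or AppArmor event, else None."""
    fields = parse_fields(line)
    if fields.get("type") in ("SECCOMP", "1326") and "syscall" in fields:
        return {
            "source": "seccomp",
            "comm": fields.get("comm"),
            # Raw audit records carry the syscall as a number.
            "syscall_nr": int(fields["syscall"]),
        }
    if fields.get("apparmor") in ("DENIED", "ALLOWED"):
        return {
            "source": "apparmor",
            "verdict": fields["apparmor"],
            "profile": fields.get("profile"),
            "operation": fields.get("operation"),
            "path": fields.get("name"),
        }
    return None
```

Aggregating these summaries per workload during the audit phase gives a list of syscall numbers and paths to add to the profile, or to investigate, before switching to enforcement.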

The trap is hand-writing a tight profile from a guess and shipping it straight to enforce. Record from real behavior, audit before enforcing, and you get the hardening without the 3am “why is the app throwing EPERM” page.
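Because a loosened profile is easy to miss in review, a small check can flag the obvious cases when a seccomp profile file changes. This is a sketch with hypothetical function names: it compares two profile dicts (the old and new versions of the JSON file) and reports a switch to a default-allow policy or any newly allowed syscall. It does not understand argument filters or AppArmor rules.

```python
def _allowed(profile: dict) -> set:
    """Collect every syscall name covered by an SCMP_ACT_ALLOW rule."""
    names = set()
    for rule in profile.get("syscalls", []):
        if rule.get("action") == "SCMP_ACT_ALLOW":
            names.update(rule.get("names", []))
    return names


def loosening_findings(old: dict, new: dict) -> list:
    """List the ways the new profile is more permissive than the old one."""
    findings = []
    if (old.get("defaultAction") != "SCMP_ACT_ALLOW"
            and new.get("defaultAction") == "SCMP_ACT_ALLOW"):
        findings.append("defaultAction changed to SCMP_ACT_ALLOW")
    for name in sorted(_allowed(new) - _allowed(old)):
        findings.append(f"newly allowed syscall: {name}")
    return findings
```

An empty list does not prove a change is safe, but a non-empty one is a clear prompt for a closer look before merging.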

More container-hardening tactics live in our security and hardening guides. When a PR swaps a profile to Unconfined or widens an AppArmor rule, the AI code review assistant helps surface the downgrade before it merges.

Profiles are illustrative. Record from real workload behavior and validate in audit/complain mode against your own apps before enforcing in production.
