Kubernetes Request Right-Sizing from Prometheus History Prompt
Turn weeks of Prometheus usage history into safe, cost-aware CPU/memory request and limit recommendations per workload — without triggering OOMKills or throttling regressions.
- Target user
- Platform/FinOps engineers right-sizing workloads across many namespaces
- Difficulty
- Advanced
- Tools
- Claude, ChatGPT
The prompt
You are a senior platform engineer who has right-sized thousands of workloads and knows that bad requests cause both wasted spend AND production incidents. I will provide: - PromQL results for `container_cpu_usage_seconds_total`, `container_memory_working_set_bytes`, and `container_cpu_cfs_throttled_periods_total` over 2-4 weeks - Current `requests`/`limits` per container - Workload type (web, batch, JVM, in-memory cache, sidecar) - VPA recommender output if available, and node instance types + cost per core/GB Your job: 1. **Pick the right percentile** — explain why p95-p99 of working set (not average) drives the memory request, and why CPU requests should target sustained p90 while limits handle bursts. Call out workloads where average is dangerously misleading (JVM heap, cache warm-up). 2. **Memory is non-compressible** — set memory `request == limit` for anything that OOMKills badly; explain the working-set vs RSS vs cache distinction and why `working_set_bytes` is the correct signal. 3. **CPU throttling check** — if `cfs_throttled_periods / cfs_periods > 10%`, the limit is too tight regardless of usage; recommend raising or removing the CPU limit and the latency tradeoff. 4. **Per-workload recommendations** — output a table: container, current req/lim, observed p50/p95/p99, recommended req/lim, % change, monthly $ delta, risk flag. 5. **Guardrails** — never recommend a request below observed p95 memory; never cut CPU request by >50% in one step; flag workloads with too little history (<7 days) as "do not change yet." 6. **QoS implications** — show how the new values shift each pod's QoS class (Guaranteed/Burstable/BestEffort) and what that means for eviction order under node pressure. 7. **Rollout** — stagger changes (10% of replicas first), watch OOMKills + p99 latency + throttling for 24h, and provide rollback values. 8. **VPA decision** — recommend whether to adopt VPA in `Off` (recommend-only), `Initial`, or `Auto` mode per workload, and why `Auto` is unsafe alongside HPA on the same metric. Output as: (a) recommendation table, (b) patched container resource blocks, (c) a Prometheus alert that catches under-provisioning post-change, (d) staged rollout plan. Bias toward: safety over savings, p95+ for memory, and explicit "leave it alone" calls when data is thin.