You are a senior cloud engineer who has tuned dozens of Cloud Run services where the fix was a startup probe, a min-instance setting, or a concurrency value — not a code rewrite. You read the revision config and the logs before you touch anything. I will provide: - Service/function type and the symptom: cold-start latency, "container failed to start", timeouts, or 5xx: [SYMPTOM] - The revision config: [`gcloud run services describe SERVICE --format=yaml`] — CPU, memory, concurrency, min/max instances, timeout, startup probe - Relevant logs: [CLOUD LOGGING ENTRIES — startup, request, and error logs] - The container's startup behavior: what it does before listening on $PORT, and image size - Traffic shape: bursty vs steady, p50/p95 latency if known Your job: 1. **Separate cold start from cold code** — distinguish slow container startup (image pull, app init, listening late on $PORT) from slow per-request work. Cite the log timestamps that prove which one it is. 2. **Startup correctness** — if the container fails to start, check it listens on the $PORT env var (not a hardcoded port), within the startup timeout, and that the startup probe (if any) is configured right. Give the exact fix. 3. **Tune the revision** — recommend concrete settings: min instances to absorb cold starts, CPU allocation (always-on vs request-time), concurrency per instance, memory, and request timeout. Justify each against the symptom and the cost it adds. 4. **Failure triage** — for 5xx/timeouts, separate app errors from platform limits (memory OOM, 60s/3600s timeout, concurrency saturation) using the logs, and point to the specific limit hit. 5. **Roll out safely** — propose deploying a new revision with traffic splitting (e.g. 10% canary) and the `gcloud run deploy` / `gcloud run services update-traffic` commands, with a rollback to the previous revision. Output: (a) cold-start-vs-runtime verdict with log evidence, (b) the specific root cause, (c) the exact config/command changes with their cost impact, (d) a canary rollout + rollback plan. Bias toward the smallest config change that fixes the symptom and toward canary rollouts. Show me the new revision settings before I deploy.

Why this prompt works

Serverless debugging goes wrong when engineers treat every symptom as a code bug and redeploy repeatedly, when the real cause is a revision setting — a container listening on the wrong port, concurrency set too high, or min instances at zero on a latency-sensitive service. This prompt makes the first move be a distinction: cold start versus cold code, proven from log timestamps. That single split routes the rest of the investigation correctly.

Each branch maps to how Cloud Run and Cloud Functions actually fail. Startup failures are almost always the $PORT contract or the startup timeout; latency is usually image size, app init, or a missing min instance; and 5xx errors are often platform limits (OOM, request timeout, concurrency saturation) masquerading as app errors. By asking the model to tie each conclusion to log evidence and to a specific limit, the output stays diagnostic rather than speculative.

The cost and safety guardrails are deliberate. Min instances and always-on CPU fix latency but bill continuously, so the prompt requires the cost trade-off to be stated. And every change ships as a new revision with traffic splitting and a rollback, so the human stays in control and production never sees an untested revision at full traffic.

Cloud Run & Cloud Functions Cold-Start & Failure Debug Prompt

Why this prompt works

Related prompts

GCP Cloud Logging & Monitoring MQL Query Builder Prompt

Why this prompt works

Related prompts

GCP Cloud Logging & Monitoring MQL Query Builder Prompt

Free: the DevOps AI Incident-Triage Cheat Sheet