Debugging Cloud Run and Cloud Functions With AI

A Cloud Run deploy kept failing with “The user-provided container failed to start and listen on the port defined by the PORT environment variable.” The app worked perfectly on my laptop. It worked in Docker locally. It just wouldn’t start on Cloud Run. The cause: the container hardcoded port 8080 in one place and read an env var in another, and Cloud Run was injecting a different PORT value for that revision. Serverless on GCP has a strict container contract that the error text rarely points you to, and when you violate it the platform gives you a generic error that describes the symptom, not the cause. That gap between symptom and cause is exactly where I lean on AI.

The Cloud Run container contract

Most “container failed to start” errors are one of a handful of contract violations: not listening on $PORT, listening on 127.0.0.1 instead of 0.0.0.0, taking too long to bind the port, or crashing on a missing env var. I paste the startup logs and the Dockerfile and let the model match the symptom to the contract.

gcloud run services logs read my-api --region=us-central1 --limit=50

Prompt: “This Cloud Run revision failed with ‘container failed to start and listen on PORT.’ Here are the last 50 log lines and the Dockerfile. Cloud Run requires the app to listen on 0.0.0.0:$PORT within the startup timeout. Tell me which specific contract rule is being violated and the exact code or Dockerfile change to fix it. Don’t suggest changes unrelated to startup.”

The model spotted that the app bound to localhost, which inside a container means the loopback interface Cloud Run’s health probe can’t reach. One-line fix, but a non-obvious one if you don’t know the contract — and the generic error never mentions the interface.

# Wrong: only reachable from inside the container
# CMD ["gunicorn", "--bind", "127.0.0.1:8080", "app:app"]

# Right: bind all interfaces and honor the injected PORT
CMD exec gunicorn --bind "0.0.0.0:${PORT}" app:app
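
If the app starts its own server rather than going through gunicorn, the same two rules apply in code. A minimal sketch, assuming a plain standard-library HTTP server; `resolve_bind_address` and `serve` are hypothetical names for illustration, and the 8080 fallback is only there for local runs where PORT is unset:

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Mapping, Tuple


def resolve_bind_address(env: Mapping[str, str] = os.environ) -> Tuple[str, int]:
    """Return the (host, port) pair the server should bind to."""
    # 0.0.0.0 accepts traffic on every interface; 127.0.0.1 would only
    # accept connections that originate inside the container.
    host = "0.0.0.0"
    # Read the injected PORT; fall back to 8080 for local runs (assumption).
    port = int(env.get("PORT", "8080"))
    return host, port


class Hello(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok\n")


def serve() -> None:
    """Bind to the resolved address and block serving requests."""
    HTTPServer(resolve_bind_address(), Hello).serve_forever()
```

Calling `serve()` from the container entrypoint keeps the port decision in one place, which is what the hardcoded-in-one-spot, env-var-in-another bug was missing.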

Cold starts and timeouts

When a service is slow or timing out under load, the question is usually whether it’s cold-start latency or in-request slowness. The request logs distinguish them, and AI is good at reading the pattern:

Prompt: “Here are Cloud Run request logs (JSON) with httpRequest.latency and the instanceId label. Separate cold-start requests (first request to a new instance) from warm ones. Tell me the latency distribution for each, and whether I should fix this with min-instances, more CPU, or by speeding up app initialization. Show the gcloud command for whichever you recommend.”
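
To sanity-check whatever the model reports, the same split can be sketched in a few lines. This is a rough heuristic, not a definitive classifier: it assumes each log entry is a dict with `timestamp`, `labels.instanceId`, and `httpRequest.latency` (a duration string such as "1.250s"), that timestamps share one format so they sort as strings, and that the first request seen per instance in the window is the cold one. The function names are hypothetical.

```python
from statistics import median
from typing import Dict, Iterable, List, Optional, Tuple


def parse_latency(value: str) -> float:
    """Convert a duration string such as '1.250s' to seconds."""
    return float(value.rstrip("s"))


def split_cold_warm(entries: Iterable[dict]) -> Tuple[List[float], List[float]]:
    """Split latencies into (cold, warm) by first appearance of each instance."""
    cold: List[float] = []
    warm: List[float] = []
    seen = set()
    # Assumes uniformly formatted timestamps, so string order is time order.
    for entry in sorted(entries, key=lambda e: e["timestamp"]):
        instance = entry["labels"]["instanceId"]
        latency = parse_latency(entry["httpRequest"]["latency"])
        if instance in seen:
            warm.append(latency)
        else:
            # First request observed on this instance: treated as cold.
            seen.add(instance)
            cold.append(latency)
    return cold, warm


def summarize(cold: List[float], warm: List[float]) -> Dict[str, Optional[float]]:
    """Median latency per group, or None when a group is empty."""
    return {
        "cold_median_s": median(cold) if cold else None,
        "warm_median_s": median(warm) if warm else None,
    }
```

An instance that started before the log window will have its first visible request miscounted as cold, so a wider window gives a more trustworthy split.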

If it’s genuinely cold starts hurting a latency-sensitive endpoint, the fix is keeping instances warm. That costs money, so I want the model to tell me it’s actually cold starts before I pay for it. The command below shows all three levers at once; in practice I apply only the ones the logs justify:

gcloud run services update my-api --region=us-central1 \
  --min-instances=1 --cpu=2 --cpu-boost

--cpu-boost gives extra CPU during startup specifically, which often fixes cold starts more cheaply than pinning a min instance. AI knowing that distinction is the difference between a $2/month fix and a $40/month one.

IAM: invoker permissions vs. the runtime identity

Two completely different IAM problems look similar on serverless. One: the caller can’t invoke the service (missing roles/run.invoker). Two: the service’s own runtime service account can’t reach a downstream resource. The error surfaces at different layers and AI helps me tell them apart fast.

# Who can invoke
gcloud run services get-iam-policy my-api --region=us-central1

# What identity the service runs as
gcloud run services describe my-api --region=us-central1 \
  --format="value(spec.template.spec.serviceAccountName)"

Prompt: “A Cloud Run service sometimes returns 403 to its caller, and at other times gets a 403 inside the handler when it calls Pub/Sub. I’ve pasted the invoker IAM policy and the runtime service account name. Help me distinguish the two cases from the log signature, and give the exact gcloud binding for each. Use the runtime SA for the downstream grant, not the invoker.”

That last instruction prevents the classic mistake of granting Pub/Sub access to the human caller instead of the service’s runtime identity.
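
The invoker side of that split is easy to check mechanically. A minimal sketch, assuming the policy was exported as JSON (for example by adding --format=json to the get-iam-policy command above) and has the usual `bindings` list of `role` and `members`; `members_with_role` and `can_invoke` are hypothetical helpers. It looks only at direct bindings on the service, so group membership, inherited project-level roles, and IAM conditions are all ignored:

```python
from typing import List


def members_with_role(policy: dict, role: str) -> List[str]:
    """Collect every member directly bound to the given role."""
    members: List[str] = []
    for binding in policy.get("bindings", []):
        if binding.get("role") == role:
            members.extend(binding.get("members", []))
    return members


def can_invoke(policy: dict, member: str) -> bool:
    """True if the member (or allUsers) holds roles/run.invoker directly."""
    allowed = members_with_role(policy, "roles/run.invoker")
    return member in allowed or "allUsers" in allowed
```

If `can_invoke` is true for the caller and the 403 persists, that points toward the second case: the runtime service account lacking a grant on the downstream resource.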

Cloud Functions: same ideas, gen2 quirks

Gen2 Cloud Functions run on Cloud Run under the hood, so the contract and IAM lessons carry over. The extra failure mode is the trigger. For an event-driven function that isn’t firing, I dump the trigger config and the Eventarc plumbing:

gcloud functions describe process-upload --gen2 --region=us-central1 \
  --format="yaml(eventTrigger, serviceConfig.serviceAccountEmail)"

Prompt: “This gen2 Cloud Function should fire on a Cloud Storage finalize event but doesn’t. Here’s the eventTrigger config and the service account. Eventarc needs the Pub/Sub and Eventarc service agents to have specific roles; for Cloud Storage events the Cloud Storage service agent also needs roles/pubsub.publisher, and the trigger SA needs roles/eventarc.eventReceiver. Check which grant is missing and give me the command.”

The honest division of labor

AI is fast at matching a generic serverless error to the specific contract rule or IAM binding behind it, because those rules are well-defined even when they’re poorly surfaced in the error text. What it can’t do is see your live traffic, your bill, or your latency SLO — so it tells me which class of fix applies and I decide whether the trade-off is worth it. I never let it pick min-instances without seeing that the latency is real and cold-start-shaped.

These prompts live in reusable form in my prompts library, and the GCP with AI series covers the IAM and networking layers a serverless incident tends to pull in. The platform’s errors are generic on purpose; your debugging doesn’t have to be.
