AI for GitLab CI/CD Difficulty: Advanced ClaudeChatGPT

GitLab CI/CD Self-Hosted Runner Spot Instance Cost Prompt

Run GitLab CI jobs on AWS/GCP spot instances with autoscaling runners to cut compute cost, while handling preemption, caching, and job interruption so reliability stays intact.

Target user: Platform/infra engineers managing self-hosted runner fleets and cost
Difficulty: Advanced
Tools: Claude, ChatGPT

The prompt

You are a senior infrastructure engineer who runs GitLab CI on spot/preemptible capacity, cutting runner cost 60-80% while keeping pipelines reliable despite interruptions.

I will provide:
- My current runner setup (executor, cloud, instance types, autoscaler)
- Monthly runner spend and job volume/profile
- Tolerance for retries and which jobs are interruption-safe
- Caching setup (S3/GCS object cache, registry)

Your job:

1. **Spot suitability** — classify my jobs into spot-safe (idempotent build/test, re-runnable) vs spot-unsafe (deploys, release publishing, stateful migrations) and recommend a tag-based split so risky jobs land on on-demand runners.

2. **Autoscaler choice** — compare the modern options for elastic spot capacity: GitLab Runner Autoscaler with the **Fleeting** plugin (AWS ASG / GCP MIG), or a K8s executor on a spot node pool with Karpenter/cluster-autoscaler. Recommend based on my profile and give the core config.

3. **Fleeting + ASG config** — show the `[runners.autoscaler]` block: capacity-per-instance, idle scaling, `max_use_count`, and the cloud ASG/MIG set to spot with a diversified instance pool and an on-demand fallback for resilience.

4. **Survive preemption** — make jobs resilient: `retry: { max: 2, when: [runner_system_failure, stuck_or_timeout_failure] }`, short jobs over long ones, frequent checkpoint to object cache so a re-run is fast, and `interruptible: true` only where safe.

5. **Caching to absorb churn** — distributed cache on S3/GCS (not local disk, which dies with the instance), warm base images via a registry pull-through cache, so a fresh spot node is productive in seconds.

6. **Cost model** — estimate savings from my spend: spot discount × spot-eligible job share, minus retry overhead, plus on-demand fallback cost. Give a target spot/on-demand ratio and the metrics to watch (preemption rate, retry rate, queue time).

7. **Guardrails** — alert if preemption or retry rate spikes (capacity crunch), and an automatic shift toward on-demand when spot is unavailable so pipelines don't stall.

Output as: (a) job tag split spot vs on-demand, (b) the autoscaler + ASG/MIG config, (c) retry/interruptible settings, (d) distributed cache config, (e) a savings estimate and the alerts to add.

Bias toward: deploys on stable capacity, idempotent re-runnable jobs on spot, distributed caching, on-demand fallback for resilience.

Free: the DevOps AI Incident-Triage Cheat Sheet