GitLab CI/CD Self-Hosted Runner Spot Instance Cost Prompt
Run GitLab CI jobs on AWS/GCP spot instances with autoscaling runners to cut compute cost, while handling preemption, caching, and job interruption so reliability stays intact.
- Target user: Platform/infra engineers managing self-hosted runner fleets and cost
- Difficulty: Advanced
- Tools: Claude, ChatGPT
The prompt
You are a senior infrastructure engineer who runs GitLab CI on spot/preemptible capacity, cutting runner cost 60-80% while keeping pipelines reliable despite interruptions.
I will provide:
- My current runner setup (executor, cloud, instance types, autoscaler)
- Monthly runner spend and job volume/profile
- Tolerance for retries and which jobs are interruption-safe
- Caching setup (S3/GCS object cache, registry)
Your job:
1. **Spot suitability** — classify my jobs into spot-safe (idempotent build/test, re-runnable) vs spot-unsafe (deploys, release publishing, stateful migrations) and recommend a tag-based split so risky jobs land on on-demand runners.
2. **Autoscaler choice** — compare the modern options for elastic spot capacity: GitLab Runner Autoscaler with the **Fleeting** plugin (AWS ASG / GCP MIG), or a K8s executor on a spot node pool with Karpenter/cluster-autoscaler. Recommend based on my profile and give the core config.
3. **Fleeting + ASG config** — show the `[runners.autoscaler]` block: `capacity_per_instance`, `max_instances`, `max_use_count`, and an idle-scaling `[[runners.autoscaler.policy]]` (`idle_count`, `idle_time`), plus the cloud ASG/MIG set to spot with a diversified instance pool and an on-demand fallback for resilience.
4. **Survive preemption** — make jobs resilient: set `retry: { max: 2, when: [runner_system_failure, stuck_or_timeout_failure] }`, prefer short jobs over long ones, checkpoint frequently to the object cache so a re-run is fast, and set `interruptible: true` only where safe (it lets GitLab auto-cancel the job when a newer pipeline supersedes it).
5. **Caching to absorb churn** — use a distributed cache on S3/GCS (not local disk, which dies with the instance) and keep base images warm via a registry pull-through cache, so a fresh spot node becomes productive quickly.
6. **Cost model** — estimate savings from my spend: (spend × spot-eligible job share × spot discount), minus retry overhead and the cost of spot-eligible jobs that fall back to on-demand. Give a target spot/on-demand ratio and the metrics to watch (preemption rate, retry rate, queue time).
7. **Guardrails** — alert if the preemption or retry rate spikes (a sign of a capacity crunch), and shift automatically toward on-demand when spot is unavailable so pipelines don't stall.
Output as: (a) job tag split spot vs on-demand, (b) the autoscaler + ASG/MIG config, (c) retry/interruptible settings, (d) distributed cache config, (e) a savings estimate and the alerts to add.
Bias toward: deploys on stable capacity, idempotent re-runnable jobs on spot, distributed caching, on-demand fallback for resilience.
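
For reference, the cost model in step 6 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a definitive formula: the names `SpotCostInputs`, `estimate_monthly_cost`, and `estimate_savings` are hypothetical, retry overhead is assumed to scale only the spot-run work, and the example figures are made-up inputs rather than measured data.

```python
from dataclasses import dataclass


@dataclass
class SpotCostInputs:
    """Hypothetical inputs; every field is an assumption to replace with real data."""
    monthly_spend: float        # current all-on-demand runner spend per month
    spot_eligible_share: float  # fraction of spend from spot-safe jobs (0..1)
    spot_discount: float        # spot price discount vs on-demand (0..1)
    retry_overhead: float       # extra spot compute from re-run jobs (0.1 = +10%)
    fallback_share: float       # fraction of spot-eligible work that falls back to on-demand


def estimate_monthly_cost(inp: SpotCostInputs) -> float:
    """Projected monthly cost after moving spot-safe jobs to spot capacity."""
    eligible = inp.monthly_spend * inp.spot_eligible_share
    pinned_on_demand = inp.monthly_spend - eligible  # deploys, releases, migrations
    fallback = eligible * inp.fallback_share         # billed at the on-demand rate
    on_spot = eligible - fallback                    # on-demand-equivalent value of spot work
    spot_cost = on_spot * (1 - inp.spot_discount) * (1 + inp.retry_overhead)
    return pinned_on_demand + fallback + spot_cost


def estimate_savings(inp: SpotCostInputs) -> float:
    """Monthly savings relative to the current spend."""
    return inp.monthly_spend - estimate_monthly_cost(inp)


# Illustrative inputs only (assumed, not benchmarks).
example = SpotCostInputs(
    monthly_spend=10_000,
    spot_eligible_share=0.8,
    spot_discount=0.7,
    retry_overhead=0.1,
    fallback_share=0.1,
)
print(f"projected cost: {estimate_monthly_cost(example):.0f}")
print(f"projected savings: {estimate_savings(example):.0f}")
```

With these assumed inputs, the 8,000 of spot-eligible spend splits into 800 of on-demand fallback and 7,200 of spot work, which costs about 2,376 after the discount and retry overhead. Raising `retry_overhead` or `fallback_share` shows how quickly a capacity crunch erodes the savings, which is why the preemption and retry rates are the metrics to alert on.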