Spot Instances and Auto Scaling With AI

Spot Instances are one of the best deals in AWS and one of the easiest to misuse. The promise is real: discounts of up to 90% off On-Demand for the same hardware. The price is that AWS can take the capacity back with two minutes of warning. I’ve seen teams put a single Spot instance type behind a critical service, get a clean run for weeks, and then lose the whole fleet in one regional capacity crunch because every instance was the same shape in the same pool. Spot doesn’t punish you for using it; it punishes you for not diversifying.

The mixed instances policy on an Auto Scaling group is the tool that makes Spot safe, and it has enough knobs — allocation strategies, instance type lists, On-Demand base capacity, spot allocation pools — that getting the combination right is genuinely fiddly. AI is good at drafting that policy and explaining the trade-offs between allocation strategies. It is not good at knowing how interruption-tolerant your actual workload is. So I let AI draft the mixed instances policy and the interruption handling, and I verify the diversification and the drain behavior myself before it touches a production ASG.

Diversify across pools or don’t bother

A Spot “capacity pool” is a single instance type in a single Availability Zone. Each pool has its own independent interruption risk. The entire resilience strategy is to spread your desired capacity across many pools so that losing one barely dents the fleet. That means listing multiple instance types — ideally across families and sizes that are interchangeable for your workload — and letting the ASG draw from whichever pools have spare capacity.
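To make the pool arithmetic concrete, here is a rough Python sketch. It assumes the fleet is spread evenly across pools, which a capacity-weighted allocation strategy will not do exactly, and the Availability Zone names are placeholders rather than anything from a real account.

```python
from itertools import product


def capacity_pools(instance_types, availability_zones):
    """Return every (instance type, AZ) pair; each pair is one Spot capacity pool."""
    return list(product(instance_types, availability_zones))


def worst_case_loss(desired_capacity, pool_count):
    """Fraction of the fleet lost when one pool is reclaimed.

    Assumes an even spread across pools (a simplification) and rounds the
    per-pool share up, so the estimate errs on the pessimistic side.
    """
    per_pool = -(-desired_capacity // pool_count)  # ceiling division
    return per_pool / desired_capacity


types = ["m6i.large", "m6a.large", "m5.large", "m5a.large", "m5n.large"]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]  # hypothetical AZs

pools = capacity_pools(types, zones)
print(f"{len(pools)} pools")
print(f"one type, one AZ: lose {worst_case_loss(30, 1):.0%} of a 30-instance fleet")
print(f"diversified:      lose {worst_case_loss(30, len(pools)):.0%}")
```

With one type in one AZ, a single reclamation takes everything; with the same fleet spread over all the pools, it takes only a small slice.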

The allocation strategy that matters most is capacity-optimized (or price-capacity-optimized, which weighs price in too). It launches from the pools with the most available capacity, which empirically lowers interruption frequency compared to chasing the absolute lowest price. The old lowest-price strategy concentrates you into cheap, shallow pools that get reclaimed first.

{
  "MixedInstancesPolicy": {
    "LaunchTemplate": {
      "LaunchTemplateSpecification": {
        "LaunchTemplateId": "lt-0abc123def4567890",
        "Version": "$Latest"
      },
      "Overrides": [
        { "InstanceType": "m6i.large" },
        { "InstanceType": "m6a.large" },
        { "InstanceType": "m5.large" },
        { "InstanceType": "m5a.large" },
        { "InstanceType": "m5n.large" }
      ]
    },
    "InstancesDistribution": {
      "OnDemandBaseCapacity": 2,
      "OnDemandPercentageAboveBaseCapacity": 0,
      "SpotAllocationStrategy": "price-capacity-optimized"
    }
  }
}

OnDemandBaseCapacity of 2 keeps a stable On-Demand floor that survives any Spot reclamation, and everything above that floor is Spot. That floor is a workload decision, not a default — a stateless web tier might run zero On-Demand base, while a service that’s painful to cold-start wants a real floor.
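A small sketch of how those two settings divide a given desired capacity can help when sizing the floor. The function name is mine. The rounding (fractions round up in favor of On-Demand) is how I understand the Auto Scaling documentation, so treat it as an assumption to confirm.

```python
def split_capacity(desired, on_demand_base, on_demand_pct_above_base):
    """Return (on_demand, spot) instance counts for a desired capacity.

    The base is filled with On-Demand first. The percentage then applies only
    to capacity above the base; fractional results are rounded up in favor of
    On-Demand (assumed behavior, verify against the Auto Scaling docs).
    """
    base = min(desired, on_demand_base)
    above_base = desired - base
    # Integer ceiling of above_base * pct / 100, avoiding float rounding.
    on_demand_above = -(-above_base * on_demand_pct_above_base // 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


# The policy above: base of 2, 0% On-Demand above the base.
print(split_capacity(10, 2, 0))   # 2 On-Demand, 8 Spot
print(split_capacity(1, 2, 0))    # below the base, everything is On-Demand
print(split_capacity(10, 2, 50))  # half of the 8 above the base is On-Demand
```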

The instance type list is where AI helps and hurts

Drafting a good Overrides list is exactly the kind of lookup-heavy task AI accelerates: it knows the instance families and can propose vCPU- and memory-equivalent types across a, i, and n suffixes. The hazard is that “equivalent on paper” isn’t equivalent for your workload. If your app is pinned to a specific architecture, or needs local NVMe, or breaks on the older Xeon in an m5 versus the newer silicon in m6i, a naive list quietly introduces instances that boot but misbehave. So I make the model justify every type against the workload’s real constraints.

Prompt: I have a stateless image-resizing service that uses AVX-512 and needs at least 8 GiB RAM. Propose a Spot-friendly instance type list for capacity-optimized allocation and flag anything risky.

Reply: For 8 GiB and a CPU-bound workload, the *.large size in the general-purpose family gives you 2 vCPU and 8 GiB, so candidates are m6i.large, m6a.large, m5.large, m5n.large, and m5dn.large — diversifying across families and AZs is what lowers your interruption risk. One flag, and it’s the one that matters for you: AVX-512 support is not uniform. The AMD-based types (m6a, m5a) do not provide the same AVX-512 instruction set as the Intel types, so if your binary requires AVX-512 it may fault or fall back to a slow path on those instances. If AVX-512 is a hard requirement, restrict the list to Intel families (m6i, m5, m5n) and accept fewer pools; if your code degrades gracefully without it, keep the AMD types for broader diversification. I can’t tell from here which is true — confirm whether your build hard-requires AVX-512 before committing the list.

That’s the model doing its best work and correctly refusing to guess the one thing that decides the answer. A wider list means more pools and fewer interruptions, but only if every type actually runs your binary. I verify the CPU requirement against the build, not against the model’s assumption.
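One way to check the instance side of that requirement is to read the CPU flags on a booted instance of each candidate type. The sketch below assumes x86 Linux, where /proc/cpuinfo has a "flags" line, and assumes avx512f is the flag your build needs. Both are assumptions to confirm, since a build may depend on other AVX-512 subsets.

```python
def missing_cpu_flags(cpuinfo_text, required):
    """Return the required CPU flags that are absent, sorted.

    Reads the first 'flags' line of /proc/cpuinfo-style text (x86 Linux).
    If no flags line is found, every required flag is reported as missing.
    """
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            present = set(value.split())
            return sorted(set(required) - present)
    return sorted(required)


REQUIRED = {"avx512f"}  # assumption: the flag this particular build depends on

# Illustrative cpuinfo fragments, not captured from real instances.
with_avx512 = "processor\t: 0\nflags\t\t: fpu sse4_2 avx avx2 avx512f avx512dq\n"
without_avx512 = "processor\t: 0\nflags\t\t: fpu sse4_2 avx avx2\n"

print(missing_cpu_flags(with_avx512, REQUIRED))     # nothing missing
print(missing_cpu_flags(without_avx512, REQUIRED))  # avx512f missing
```

On a real instance, pass in the contents of /proc/cpuinfo and fail the health check or bake step if the returned list is non-empty.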

Interruption handling is the half everyone skips

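Diversification reduces how often you’re interrupted; it never makes it zero. Every Spot instance can still get a two-minute warning, so the workload has to drain gracefully. There are two signals: the Spot interruption notice, which is the formal two-minute warning, and the rebalance recommendation, which flags an instance at elevated risk of interruption. The recommendation can arrive earlier than the notice, but AWS does not guarantee any lead time and the two can land together. Both are exposed through the EC2 instance metadata endpoint and as EventBridge events. I act on both: the rebalance recommendation to start draining proactively, the interruption notice as the hard deadline.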
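If you consume the signals from EventBridge, the routing can be as small as the sketch below. The detail-type strings are the ones EC2 publishes for these two events as far as I know. The function name and the returned action labels are hypothetical, and the sample events are trimmed to the fields the function reads.

```python
REBALANCE = "EC2 Instance Rebalance Recommendation"
INTERRUPTION = "EC2 Spot Instance Interruption Warning"


def classify_spot_event(event):
    """Map an EventBridge event to (action, instance_id), or None if unrelated.

    'drain-proactively' and 'drain-now' are labels made up for this sketch.
    """
    if event.get("source") != "aws.ec2":
        return None
    instance_id = event.get("detail", {}).get("instance-id")
    detail_type = event.get("detail-type")
    if detail_type == REBALANCE:
        return ("drain-proactively", instance_id)
    if detail_type == INTERRUPTION:
        return ("drain-now", instance_id)
    return None


rebalance_event = {
    "source": "aws.ec2",
    "detail-type": REBALANCE,
    "detail": {"instance-id": "i-0123456789abcdef0"},  # placeholder ID
}
interruption_event = {
    "source": "aws.ec2",
    "detail-type": INTERRUPTION,
    "detail": {"instance-id": "i-0123456789abcdef0", "instance-action": "terminate"},
}

print(classify_spot_event(rebalance_event))
print(classify_spot_event(interruption_event))
```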

For an ASG behind a load balancer, the cleanest path is Capacity Rebalancing plus a lifecycle hook that drains connections.

aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name img-resize-asg \
  --capacity-rebalance

aws autoscaling put-lifecycle-hook \
  --auto-scaling-group-name img-resize-asg \
  --lifecycle-hook-name spot-drain \
  --lifecycle-transition autoscaling:EC2_INSTANCE_TERMINATING \
  --heartbeat-timeout 120 \
  --default-result CONTINUE

The node-side agent watches the metadata endpoint and, on a notice, deregisters from the target group and finishes in-flight work before the two minutes run out.

TOKEN=$(curl -sX PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 60")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  http://169.254.169.254/latest/meta-data/spot/instance-action

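A 404 there means no interruption pending; a JSON body with an action (terminate, stop, or hibernate) and a time means start draining now. On the lifecycle hook, the setting that bounds the drain is heartbeat-timeout, not default-result. For a terminating hook the instance is terminated whether the result is CONTINUE or ABANDON; ABANDON only skips any remaining lifecycle hooks, while CONTINUE lets them run. For a real Spot interruption, the two-minute deadline holds regardless of the hook.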
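The agent's decision logic can be kept separate from the HTTP call, which makes it easy to test. This sketch takes the status code and body from the instance-action request and returns how long is left. The function name is mine, and the timestamp format (UTC with a trailing Z) is an assumption based on the example in the EC2 documentation.

```python
import json
from datetime import datetime, timezone


def seconds_until_action(status_code, body, now):
    """Interpret a spot/instance-action response.

    Returns None when nothing is pending (404), otherwise a tuple of
    (action, seconds remaining until the action time).
    """
    if status_code == 404:
        return None
    if status_code != 200:
        raise ValueError(f"unexpected metadata status: {status_code}")
    notice = json.loads(body)
    # Assumed format: 2017-09-18T08:22:00Z (UTC).
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return notice["action"], (deadline - now).total_seconds()


sample_body = '{"action": "terminate", "time": "2017-09-18T08:22:00Z"}'
sample_now = datetime(2017, 9, 18, 8, 20, 0, tzinfo=timezone.utc)

print(seconds_until_action(404, "", sample_now))           # nothing pending
print(seconds_until_action(200, sample_body, sample_now))  # terminate in 120 s
```

A polling loop would call this every few seconds with a fresh response and kick off the drain the first time it returns something other than None.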

On Kubernetes, push the logic to the cluster

If these nodes back an EKS cluster, the same principles move up a layer: capacity rebalancing on the node groups, the AWS Node Termination Handler (or Karpenter’s native interruption handling) to cordon and drain on the notice, and pod disruption budgets so a node drain doesn’t evict more replicas than the service can lose. The instance diversification logic is identical; Karpenter just expresses it as flexible requirements instead of a fixed Overrides list. AI drafts those requirement blocks well, and I verify the same way: does every satisfying instance actually run the workload, and does the drain finish inside two minutes.

Where I stay in the loop

Spot is safe when two things are true, and AI can draft both but verify neither. First, the fleet is diversified across enough viable pools that losing one is a non-event — and “viable” is a workload judgment AI can propose but not confirm. Second, every instance drains cleanly inside the two-minute window — which depends on how long your real requests take, not on a config flag. So the model drafts the mixed instances policy, the allocation strategy, the interruption agent, and the Karpenter requirements; I verify that the instance list truly fits the workload and that the drain actually completes under load before any of it scales out.
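Before the load test, a back-of-the-envelope budget tells me whether the drain can fit at all. The sketch below is a deliberate simplification: it adds the agent's polling interval, the longest in-flight request, and a shutdown margin, and compares the total to the 120-second window. All parameter names are hypothetical, and the inputs should come from measurements, not guesses.

```python
def drain_slack(poll_interval_s, longest_request_s, shutdown_s, window_s=120):
    """Return seconds of slack left in the interruption window.

    A negative result means the drain, as modeled here, overruns the window.
    The model is worst case: the notice lands just after a poll, and the
    longest request started just before deregistration.
    """
    return window_s - (poll_interval_s + longest_request_s + shutdown_s)


print(drain_slack(poll_interval_s=5, longest_request_s=30, shutdown_s=10))
print(drain_slack(poll_interval_s=5, longest_request_s=110, shutdown_s=10))
```

It is also worth checking the target group's deregistration delay: the load balancer default of 300 seconds is longer than the whole window, so a drain that waits for it to elapse cannot finish in time.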

The same fleet’s cost picture pairs naturally with AWS cost optimization with AI. More compute and scaling material lives under the AWS category, and I keep the Spot diversification and drain-check prompts in the prompt library so every new ASG gets interrogated the same way.