How DevOps Teams Use AI to Reduce Cloud Costs (FinOps)
How DevOps teams use AI to reduce cloud costs: surface waste from billing data, right-size Kubernetes, explain spikes, and draft IaC fixes humans approve.
- #finops
- #cloud-cost
- #ai
- #kubernetes
- #devops
DevOps teams use AI to reduce cloud costs by pointing it at the data they already have — billing exports, usage metrics, and infrastructure-as-code — so it can surface waste, recommend right-sizing, explain cost spikes in plain English, and draft the exact IaC change to fix each one. The AI does the tedious correlation work that nobody has time for: reading a 40,000-line cost CSV, cross-referencing idle resources against tags, and turning “your bill jumped $3,100 on Tuesday” into “an autoscaling group scaled to 12 nodes at 02:14 and never scaled back.” A human approves every change before it ships. AI never deletes a resource or applies a Terraform plan on its own.
I’ve watched cloud bills balloon at three different companies, and the pattern is always the same. The waste isn’t hidden — it’s in the billing console in plain sight — but nobody has the hours to dig through it line by line. That’s the gap AI closes. Below is how I actually use it, with the prompts I feed it, what comes back, and the guardrail on every step.
Where does cloud money actually leak?
Before AI helps, it’s worth naming where the money goes, because the leaks are boringly predictable across every cloud account I’ve audited:
- Idle and orphaned resources — unattached EBS volumes, idle load balancers, NAT gateways for VPCs nobody uses, dev databases running 24/7.
- Over-provisioned compute — instances sized for a load test that ended six months ago; Kubernetes pods requesting 4 CPU and using 0.3.
- Storage sprawl — snapshots from 2023, logs in hot storage that should be in archive, duplicate backups.
- Egress — cross-AZ chatter and data leaving the cloud, billed per GB and invisible until the invoice lands.
- Unused commitments — Savings Plans and Reserved Instances bought for workloads that have since been re-architected.
- No tagging — when you can’t attribute spend to a team, nobody owns the cleanup.
The recurring theme: this is a data analysis problem, not a fixing problem. The fix is usually one line of Terraform. Finding which line, out of thousands, is the expensive part — and that’s exactly what large language models are good at. FinOps tooling like Kubecost, Infracost, and your cloud provider’s cost explorer surfaces the raw numbers; AI turns those numbers into a ranked, explained, actionable list.
How does AI find waste in a billing export?
Your cloud provider’s Cost and Usage Report (AWS CUR), billing export (GCP), or cost analysis CSV (Azure) is the richest waste-finding artifact you have, and almost nobody reads it because it’s enormous and unfriendly. This is the single highest-leverage place to start.
Here’s the workflow. I export the last 30 days grouped by service and resource, then hand the AI a representative slice (or a pre-aggregated summary if the file is huge — never paste 40,000 rows blindly).
What I feed it:
Here’s a CSV of our AWS Cost and Usage Report for last month, grouped by service, usage type, and resource tag. Identify the top 10 sources of likely waste — not just the top spend. For each, explain why you think it’s waste, estimate the monthly dollar impact, and tell me what data would confirm it. Do not assume anything is safe to delete.
What comes back is a ranked table: “NAT Gateway data processing in us-east-1 — $890/mo, concentrated in one account with no production tag, likely a forgotten dev VPC; confirm by checking flow logs for active traffic.” That last clause matters — the AI flags what to verify, it doesn’t conclude. I then check the flow logs myself before touching anything.
The win here isn’t magic insight. It’s that the AI works through the whole aggregated export in seconds and ranks it by opportunity, which is the work I’d otherwise never do. For a deeper, automated version of this in your pipeline, see cutting cloud bills with Infracost in your Terraform pipeline: Infracost prices the diff before merge, and AI reads that diff for you.
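The pre-aggregation step mentioned above (summarize before you paste) can be sketched in a few lines of Python. This is a minimal sketch, not a CUR parser: the column names `service`, `usage_type`, `team_tag`, and `cost_usd` are hypothetical simplifications, and a real export uses different, longer column names that you would map first.

```python
import csv
import io
from collections import defaultdict

def summarize_costs(csv_text, top_n=10):
    """Sum cost per (service, usage_type, team_tag) and return the top_n groups by spend."""
    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Hypothetical column names; empty tags are bucketed as "untagged".
        key = (row["service"], row["usage_type"], row["team_tag"] or "untagged")
        totals[key] += float(row["cost_usd"])
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(key, round(total, 2)) for key, total in ranked[:top_n]]

# Made-up sample rows, for illustration only.
sample = """service,usage_type,team_tag,cost_usd
AmazonEC2,NatGateway-Bytes,,500.00
AmazonEC2,NatGateway-Bytes,,390.00
AmazonS3,TimedStorage-ByteHrs,data,120.50
AmazonEC2,BoxUsage:m5.large,web,300.00
"""

for key, total in summarize_costs(sample, top_n=3):
    print(key, total)
```

The summary, not the raw file, is what goes into the prompt. It keeps the context small and still lets the AI rank by opportunity rather than by raw spend.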
How does AI find idle and orphaned resources?
Orphans are resources still being billed with nothing attached. Unattached EBS volumes are the classic example — you terminate an instance, forget to delete its volume, and pay for gigabytes forever.
I pull an inventory with the cloud CLI and feed the JSON to the AI:
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[].{ID:VolumeId,Size:Size,Created:CreateTime,AZ:AvailabilityZone}' \
--output json
What I feed it: that JSON, plus “Group these unattached volumes by age and size, total the monthly cost at $0.08/GB-month, and draft an aws_ebs_volume removal list — but mark anything created in the last 7 days as ‘hold, may be in flux.’”
What comes back is a categorized list with a dollar total and a draft change. The guardrail is non-negotiable: the AI produces a list, a human reviews each ID, and the actual deletion happens through a reviewed Terraform plan or a tagged-for-deletion grace period, never an immediate aws ec2 delete-volume loop. An “available” volume might be a data disk someone detached on purpose and plans to reattach next week. AI recommends; you confirm intent before anything dies.
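If you would rather check the AI’s arithmetic than trust it, the same triage is easy to reproduce locally. The sketch below assumes the JSON shape produced by the describe-volumes query above (`ID`, `Size`, `Created`, `AZ`), uses the $0.08/GB-month rate from the prompt, and only labels volumes. It deletes nothing. The volume IDs and dates are made up.

```python
import json
from datetime import datetime, timezone

PRICE_PER_GB_MONTH = 0.08  # rate quoted in the prompt; verify for your region and volume type
HOLD_DAYS = 7              # anything younger than this is marked "hold"

def triage_volumes(volumes_json, now):
    """Label each unattached volume 'hold' or 'review' and estimate its monthly cost."""
    report = []
    for vol in json.loads(volumes_json):
        created = datetime.fromisoformat(vol["Created"].replace("Z", "+00:00"))
        age_days = (now - created).days
        report.append({
            "id": vol["ID"],
            "age_days": age_days,
            "monthly_cost": round(vol["Size"] * PRICE_PER_GB_MONTH, 2),
            "status": "hold" if age_days < HOLD_DAYS else "review",
        })
    # Most expensive first, so the human reviewer starts where the money is.
    return sorted(report, key=lambda r: r["monthly_cost"], reverse=True)

# Hypothetical inventory for illustration.
sample = json.dumps([
    {"ID": "vol-aaa", "Size": 500, "Created": "2024-01-01T00:00:00+00:00", "AZ": "us-east-1a"},
    {"ID": "vol-bbb", "Size": 100, "Created": "2024-06-12T00:00:00+00:00", "AZ": "us-east-1b"},
])
now = datetime(2024, 6, 15, tzinfo=timezone.utc)
report = triage_volumes(sample, now)
for row in report:
    print(row)
```

Note that "review" means a person looks at it next. It is not a deletion verdict.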
Can AI right-size Kubernetes requests and limits?
Kubernetes is where over-provisioning hides best, because requests are set once and never revisited. A team requests 2 CPU “to be safe,” uses 0.2, and the scheduler reserves the full 2 — so your nodes fill up on reservations, not actual usage, and you pay for nodes you don’t need.
The data you need is the gap between requested and actual usage:
kubectl top pods -A --sum=true
Then pull the current requests from the manifests or from kubectl get pods -o yaml. Kubecost or the VPA recommender adds historical percentiles, which are far better than a point-in-time snapshot.
What I feed it:
Here’s kubectl top output and the current resource requests for these deployments, plus the p95 and p99 CPU/memory from the last 14 days. Recommend new requests and limits per container. Target ~40% headroom over p95. Flag any workload where p99 is more than 3x p95 as ‘spiky, don’t squeeze.’ Output a YAML patch I can review.
What comes back is a per-container recommendation with reasoning: “checkout-api requests 2000m, p95 is 180m, p99 is 240m: drop request to 250m (about 40% over p95), keep limit at 500m, ~$140/mo saved across replicas.” The spiky-workload flag is the guardrail in action: the AI refuses to over-optimize a service that bursts, because squeezing requests and limits on a bursty pod invites CPU throttling and OOMKills, which is a reliability incident, not a savings. Better-packed reservations also mean the cluster autoscaler can run fewer nodes, which is where the actual dollar savings land.
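The rule in that prompt is simple enough to write down, which makes it a useful cross-check on whatever the AI returns. A minimal sketch, assuming CPU values in millicores and the two thresholds from the prompt (40% headroom over p95, spiky when p99 exceeds 3x p95). The function name and return shape are hypothetical.

```python
import math

HEADROOM_PCT = 40   # target headroom over p95, from the prompt
SPIKY_RATIO = 3     # p99 more than 3x p95 means "don't squeeze"

def recommend_cpu_request(current_m, p95_m, p99_m):
    """Return a suggested CPU request in millicores, or keep the current one for spiky workloads."""
    if p99_m > SPIKY_RATIO * p95_m:
        return {"request_m": current_m, "note": "spiky, don't squeeze"}
    target_m = math.ceil(p95_m * (100 + HEADROOM_PCT) / 100)
    # Never recommend more than what is already requested.
    return {"request_m": min(target_m, current_m), "note": "right-size"}

print(recommend_cpu_request(2000, 180, 240))   # steady workload
print(recommend_cpu_request(1000, 100, 400))   # bursty workload
```

With a p95 of 180m the 40% rule lands at 252m. If the AI proposes something far from that, ask it to show its reasoning before the patch goes to review.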
For the full methodology on requests, limits, and the autoscaling interplay, I wrote a dedicated guide: right-sizing pod resource requests, limits, and autoscaling. The Cost & Capacity topic in the Kubernetes Prompt Pack ships ready-made prompts for exactly this loop, so you’re not writing the analysis prompt from scratch every sprint.
How does AI explain a cost spike?
This is my favorite use, because it answers the question every engineering manager asks in a panic: “Why did the bill jump?”
Without AI, this is an afternoon of pivot tables. With AI, I export the daily cost broken down by service for the spike window plus a normal baseline week, and ask:
Here’s daily cost by service for June 1–14. Spend was flat around $400/day, then jumped to $1,500/day on June 10 and stayed there. Tell me which service drove the delta, the most likely cause, and what to check to confirm. Walk me through your reasoning.
What comes back is a diagnosis with a paper trail: “The entire $1,100/day delta is in EC2-Other → data transfer, starting June 10. This pattern — a step change that persists, isolated to egress — usually means a new cross-AZ data path or a service started replicating across regions. Check whether a deployment on June 10 changed an endpoint or added a replica in a second AZ.” Nine times out of ten that points me straight at the offending change.
The AI is doing correlation, not deciding anything. It reads the shape of the data and proposes a hypothesis I then verify against the deploy log. That “explain your reasoning” clause is doing real work — it forces a checkable chain instead of a confident guess.
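The first half of that diagnosis, finding which service drove the delta, is plain arithmetic you can run yourself before or after asking the AI. The sketch below compares average daily cost per service before and after a chosen spike date. The row format, the service names, and the year are assumptions for illustration.

```python
from collections import defaultdict

def spike_drivers(daily_rows, spike_start):
    """daily_rows: (iso_date, service, cost). Returns per-service change in
    average daily cost after spike_start, largest increase first."""
    sums = {"before": defaultdict(float), "after": defaultdict(float)}
    days = {"before": set(), "after": set()}
    for day, service, cost in daily_rows:
        # ISO dates compare correctly as strings.
        window = "after" if day >= spike_start else "before"
        sums[window][service] += cost
        days[window].add(day)
    deltas = {}
    for service in set(sums["before"]) | set(sums["after"]):
        before = sums["before"][service] / max(len(days["before"]), 1)
        after = sums["after"][service] / max(len(days["after"]), 1)
        deltas[service] = round(after - before, 2)
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Made-up data shaped like the example: ~$400/day, then ~$1,500/day.
rows = [
    ("2024-06-08", "EC2 compute", 300.0), ("2024-06-08", "EC2-Other data transfer", 100.0),
    ("2024-06-09", "EC2 compute", 300.0), ("2024-06-09", "EC2-Other data transfer", 100.0),
    ("2024-06-10", "EC2 compute", 300.0), ("2024-06-10", "EC2-Other data transfer", 1200.0),
    ("2024-06-11", "EC2 compute", 300.0), ("2024-06-11", "EC2-Other data transfer", 1200.0),
]
for service, delta in spike_drivers(rows, "2024-06-10"):
    print(service, delta)
```

This only tells you where the delta sits. The likely cause and what to check still come from the AI’s hypothesis and your deploy log.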
Can AI handle storage tiering and snapshot sprawl?
Storage waste is slow and silent. Logs sit in standard/hot storage when they’re never read after a week. Snapshots accumulate because deleting them feels risky. Lifecycle policies exist but nobody writes them.
I feed the AI a snapshot inventory (aws ec2 describe-snapshots --owner-ids self) and bucket metrics, and ask it to draft a lifecycle policy — transition objects to infrequent-access after 30 days, archive after 90, and propose a snapshot retention rule (keep daily for 7 days, weekly for 4 weeks, monthly for 12). It returns the Terraform aws_s3_bucket_lifecycle_configuration and a snapshot-pruning plan with estimated savings.
The guardrail: snapshots are someone’s disaster-recovery plan. The AI proposes a retention policy for human review and applies nothing — and I confirm the retention window with whoever owns the data before it merges. Tiering is safe to automate via lifecycle rules; deletion is not.
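To make the retention rule concrete, here is a small sketch of "daily for 7 days, weekly for 4 weeks, monthly for 12" as a planning function. It is one possible reading of that rule, assuming "weekly" means the newest snapshot per ISO week within 28 days and "monthly" means the newest per calendar month within 365 days. It returns a keep list and a prune-candidate list and deletes nothing.

```python
from datetime import date

def plan_retention(snapshot_dates, today):
    """Split snapshot dates into (keep, prune_candidates) under a 7-daily / 4-weekly / 12-monthly rule."""
    keep = set()
    newest_per_week = {}
    newest_per_month = {}
    for d in sorted(snapshot_dates):        # ascending, so later dates overwrite earlier ones
        age = (today - d).days
        if age < 7:
            keep.add(d)                      # every snapshot from the last 7 days
        if age < 28:
            newest_per_week[tuple(d.isocalendar())[:2]] = d   # key: (ISO year, ISO week)
        if age < 365:
            newest_per_month[(d.year, d.month)] = d
    keep.update(newest_per_week.values())
    keep.update(newest_per_month.values())
    prune = sorted(set(snapshot_dates) - keep)
    return sorted(keep), prune

# Hypothetical snapshot dates for illustration.
snapshots = [
    date(2024, 6, 30), date(2024, 6, 29), date(2024, 6, 20), date(2024, 6, 18),
    date(2024, 5, 31), date(2024, 5, 10), date(2023, 1, 1),
]
keep, prune = plan_retention(snapshots, today=date(2024, 6, 30))
print("keep:", keep)
print("prune candidates:", prune)
```

The prune list is a proposal for the data owner. The retention windows themselves are the part that needs their sign-off.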
What about egress and data-transfer costs?
Egress is the cost nobody budgets for because it’s invisible until the invoice. Cross-AZ traffic, NAT gateway processing, and data leaving the cloud all bill per gigabyte. AI helps by reading VPC flow logs or the data-transfer line items and identifying the heavy talkers: “82% of your cross-AZ transfer is between the payments service in AZ-a and its database in AZ-b — co-locating them or adding a read replica in AZ-a would cut this.” It’s a recommendation backed by the traffic data, and you decide whether the architecture change is worth it.
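The "heavy talkers" ranking can be sketched as a simple aggregation. This assumes flow records that have already been enriched with service names and availability zones for both ends. Raw VPC flow logs do not carry all of those fields, so the record shape here (`src_service`, `dst_service`, `src_az`, `dst_az`, `bytes`) is hypothetical.

```python
from collections import Counter

def cross_az_talkers(records):
    """Rank service pairs by cross-AZ bytes and report each pair's share of the cross-AZ total."""
    bytes_by_pair = Counter()
    for r in records:
        if r["src_az"] != r["dst_az"]:       # same-AZ traffic is ignored
            bytes_by_pair[(r["src_service"], r["dst_service"])] += r["bytes"]
    total = sum(bytes_by_pair.values())
    if total == 0:
        return []
    return [(pair, b, round(100 * b / total, 1)) for pair, b in bytes_by_pair.most_common()]

# Made-up records for illustration.
records = [
    {"src_service": "payments", "dst_service": "payments-db", "src_az": "az-a", "dst_az": "az-b", "bytes": 820},
    {"src_service": "web", "dst_service": "cache", "src_az": "az-a", "dst_az": "az-c", "bytes": 180},
    {"src_service": "web", "dst_service": "web", "src_az": "az-a", "dst_az": "az-a", "bytes": 5000},
]
for pair, b, share in cross_az_talkers(records):
    print(pair, b, f"{share}%")
```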
Can AI analyze Savings Plans and commitments?
Commitment purchases (Reserved Instances, Savings Plans, committed-use discounts) are a math problem AI is genuinely good at. You feed it your steady-state usage over the trailing 90 days and your current commitments, and ask for a coverage recommendation: how much to commit, at what term, and the break-even. It models the trade-off — more commitment means a deeper discount but less flexibility if you re-architect.
The guardrail here is bigger than usual: a commitment is a financial contract, often a one- or three-year obligation. AI does the modeling; a human with budget authority signs off. I treat its output as a well-prepared proposal for a finance conversation, never an action.
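The trade-off being modeled can be shown with a deliberately simplified cost function. The sketch assumes usage is expressed in on-demand dollars per hour, that a commitment covering `covered` dollars of on-demand usage costs `covered * (1 - discount)` every hour whether used or not, and that anything above the covered amount is billed on demand. The 25% discount is a hypothetical figure, not a quoted rate.

```python
def commitment_cost(hourly_usage, covered, discount):
    """Total cost when `covered` on-demand-equivalent dollars/hour are committed at `discount`."""
    committed_rate = covered * (1 - discount)   # paid every hour, used or not
    return sum(committed_rate + max(0.0, u - covered) for u in hourly_usage)

def best_coverage(hourly_usage, discount, candidates):
    """Pick the candidate coverage level with the lowest total cost."""
    return min(candidates, key=lambda c: commitment_cost(hourly_usage, c, discount))

# Made-up usage: steady at $10/hour with one $20 burst.
usage = [10.0, 10.0, 10.0, 20.0]
for level in (0, 10, 20):
    print(level, commitment_cost(usage, level, discount=0.25))
print("best:", best_coverage(usage, 0.25, [0, 10, 20]))
```

Under these assumptions an extra dollar of coverage only pays off if it is used in more than `1 - discount` of the hours, which is why covering the burst loses money here. Real pricing has more moving parts, so treat the output as input to the finance conversation, not a purchase order.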
How does AI fix the tagging and showback problem?
You can’t reduce costs you can’t attribute. If 40% of spend is untagged, no team owns the cleanup. AI helps two ways: it audits which resources are missing required tags and drafts the tag taxonomy, and it builds the showback summary — “Team A: $4,200, Team B: $1,800, untagged: $3,100” — from the billing export, so spend becomes a conversation each team can act on. Pair this with a code review gate that flags any new resource missing a team or cost-center tag at merge time, and the untagged pile stops growing.
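The merge-time gate can be as small as a function that lists resources missing a required tag. This is a sketch under assumptions: the input is a list of resources already extracted from a plan, each with a hypothetical `address` and `tags` field, and the required tag names are the two mentioned above.

```python
REQUIRED_TAGS = ("team", "cost-center")

def missing_tags(resources):
    """Map each resource address to the required tags it lacks (absent or empty)."""
    findings = {}
    for res in resources:
        tags = res.get("tags") or {}
        missing = [t for t in REQUIRED_TAGS if not tags.get(t)]
        if missing:
            findings[res["address"]] = missing
    return findings

# Hypothetical resources for illustration.
resources = [
    {"address": "aws_instance.web", "tags": {"team": "web", "cost-center": "cc-101"}},
    {"address": "aws_ebs_volume.scratch", "tags": {"team": "data"}},
    {"address": "aws_nat_gateway.dev", "tags": None},
]
print(missing_tags(resources))
```

A CI step can fail the check when the returned mapping is non-empty, which keeps the untagged pile from growing while the existing backlog is cleaned up.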
AI cloud-cost help, by area
Here’s how it maps across the major waste areas — and crucially, who signs off on each:
| Cost area | Typical waste | How AI helps | Who approves |
|---|---|---|---|
| Billing export analysis | Waste buried in 40k CSV rows | Reads and ranks top opportunities by dollar impact | DevOps reviews the ranked list |
| Idle / orphaned resources | Unattached volumes, idle LBs, dead NAT gateways | Inventories and drafts a removal list with grace flags | Engineer confirms each ID, then plan |
| Over-provisioned compute | Instances sized for last year’s load | Recommends right-size based on actual usage percentiles | Service owner approves the change |
| Kubernetes requests/limits | Requests 10x actual usage | Drafts YAML patch with headroom and spiky-workload flags | Platform team reviews and applies |
| Storage tiering / snapshots | Hot storage for cold data, snapshot pile-up | Drafts lifecycle and retention policy | Data owner confirms retention window |
| Egress / data transfer | Cross-AZ chatter, region replication | Identifies heavy talkers from flow logs | Architect decides on topology change |
| Commitments / Savings Plans | Over- or under-committed | Models coverage and break-even | Finance / budget owner signs |
| Tagging / showback | Untagged spend nobody owns | Audits tags, builds showback summary | Each team owns their slice |
Notice that the “who approves” column is never empty and never “the AI.” That’s the whole model.
The guardrails: AI recommends, humans approve
Every section above repeats the same rule because it’s the rule that keeps you safe:
- AI recommends, humans approve. The AI’s job ends at a reviewed proposal. A person makes the call.
- Never auto-delete. No agent runs delete-volume in a loop. Deletions go through a reviewed plan or a tag-for-deletion grace period so anything important can be rescued.
- Verify in a plan first. Cost changes are infrastructure changes. They go through terraform plan, code review, and the same pipeline as any other change, never a one-off console click an AI suggested. (AI is excellent at reading a plan to confirm it does only what you intended; see working with AI on Terraform plans.)
- Treat AI output as a draft, not a decision. The “explain your reasoning” and “tell me what to verify” clauses in my prompts exist to make the output checkable. If you can’t trace the recommendation back to data, don’t act on it.
These aren’t bureaucratic friction. The fastest way to turn a cost-savings project into an outage is to let an over-eager script delete a “useless” resource that turned out to be load-bearing.
If you want a head start on the prompts, the automation category and the prompts library have reusable templates for cost analysis, right-sizing, and billing-export review. Most of these run fine in Claude or ChatGPT — paste your data, paste the prompt, review the output.
FAQ
Can AI automatically cut my cloud bill? No, and you shouldn’t want it to. AI finds the waste, quantifies it, and drafts the fix — but a human approves every change and it ships through your normal pipeline. The savings are real; the autonomy is not. Auto-deleting resources is how you trade a cost problem for an outage.
Is FinOps just for big companies? No. The percentage of waste is often higher in small accounts, because there’s no dedicated cost team watching. A single engineer with an AI assistant and a billing export can find 20–30% savings in an afternoon. FinOps is a practice, not a headcount — AI lets a one-person team run that practice.
What’s the fastest cloud cost win? Idle and orphaned resources, every time. Export your unattached volumes, idle load balancers, and old snapshots, hand the list to AI to rank by cost, verify the top few, and remove them through a reviewed plan. It’s low-risk, high-return, and you can do it this week. Right-sizing Kubernetes requests is the second-fastest if you run a busy cluster.
Do I need expensive FinOps tooling for this? It helps but isn’t required to start. Your cloud provider’s cost explorer plus kubectl top plus an AI assistant covers most of the analysis. Tools like Kubecost and Infracost add historical percentiles and pre-merge cost diffs that make the AI’s recommendations sharper. They’re a force multiplier, not a prerequisite.
How do I keep costs from creeping back up? Shift cost-awareness left. Price infrastructure changes before merge with Infracost, gate new resources on required tags in code review, and re-run the right-sizing analysis every sprint. AI makes each of these cheap enough to do continuously, which is what stops the slow creep.
Conclusion
AI doesn’t reduce your cloud bill by being clever about infrastructure. It reduces it by reading the data you already have — the billing export, the usage metrics, the plan — faster and more thoroughly than any human has time for, then handing you a ranked, explained, ready-to-review list of fixes. The leverage is in the analysis; the safety is in keeping a human on every approval. Start with idle resources this week, wire right-sizing into your sprint cadence, and let the AI do the reading while you make the calls.