Scripting AWS with boto3 Without the Rough Edges

The AWS CLI is great for one-liners. The moment you need a loop, a conditional, or to act on the result of one call inside another, you’re better off in Python with boto3. It’s the same API surface, but you get real data structures instead of piping JSON through jq and praying.

The catch is that boto3’s friendliness hides a lot of sharp edges: silent pagination cutoffs, throttling, credential ambiguity, and the very real ability to spend money fast. I’ve debugged scripts that “worked” but only ever saw the first 50 of 800 instances, and scripts that got rate-limited into uselessness against a large account. Here’s how to avoid those.

Clients vs resources, and credential hygiene

boto3 gives you two interfaces: the low-level client (a thin wrapper over API calls) and the higher-level resource (object-oriented, more Pythonic). AWS has said it does not plan to add new features to the resource interface, so for new code I default to clients. They map directly to the API docs, which makes debugging far easier.

import boto3

# Don't bake credentials into the script. Let the default chain resolve them:
# env vars -> shared config -> IAM role. Pass a profile or region explicitly.
session = boto3.Session(profile_name="prod-readonly", region_name="us-east-1")
ec2 = session.client("ec2")

Never put access keys in the source. Use a named profile locally and an IAM role on servers. And give scripts the least privilege they need — a reporting script gets a read-only role, not your admin keys. A script with ec2:* is a script one typo away from terminating production.
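
Because the default chain can resolve to credentials you did not expect, one cheap guard is to ask STS who you are before doing anything else. The following is a minimal sketch: require_account is a hypothetical helper name, the account ID in the usage comment is a placeholder, and the function accepts any object that exposes the STS get_caller_identity call.

```python
def require_account(sts, expected_account_id):
    """Abort unless the resolved credentials belong to the expected account."""
    # get_caller_identity is a real STS operation; it returns Account, Arn and UserId.
    identity = sts.get_caller_identity()
    if identity["Account"] != expected_account_id:
        raise SystemExit(
            f"Refusing to run: credentials resolve to account {identity['Account']} "
            f"({identity['Arn']}), expected {expected_account_id}"
        )
    return identity["Arn"]


# Usage, assuming the session created above (the account ID is a placeholder):
#   require_account(session.client("sts"), "123456789012")
```

Failing fast here costs one API call and turns “ran against the wrong account” into an immediate, readable error.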

Pagination is the #1 silent bug

This is the mistake I see most. Many AWS APIs cap each response (typically somewhere between 50 and 1,000 items) and hand you a NextToken, or a similarly named marker, for the rest. If you call a paginated operation such as describe_instances() once and iterate the result, you may be processing only the first page and silently ignoring everything else. Your script “succeeds” while doing a fraction of the work.

Use paginators. Always.

def all_running_instance_ids(ec2):
    ids = []
    paginator = ec2.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    for page in pages:
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                ids.append(inst["InstanceId"])
    return ids

The paginator handles the NextToken dance for you. If a boto3 call you’re making has a paginator (ec2.can_paginate("describe_instances") tells you), use it — full stop.
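
To make that “dance” concrete, here is roughly the loop a paginator saves you from writing. It is a sketch for illustration only, not a recommendation: running_instance_ids_manual and the page_size default are made up for this example, and the function works with any object exposing describe_instances.

```python
def running_instance_ids_manual(ec2, page_size=100):
    """Roughly what the paginator does: keep calling until NextToken disappears."""
    ids = []
    kwargs = {
        "Filters": [{"Name": "instance-state-name", "Values": ["running"]}],
        "MaxResults": page_size,
    }
    while True:
        page = ec2.describe_instances(**kwargs)
        for reservation in page["Reservations"]:
            for inst in reservation["Instances"]:
                ids.append(inst["InstanceId"])
        token = page.get("NextToken")
        if not token:
            return ids  # no token means this was the last page
        kwargs["NextToken"] = token  # ask for the next page on the following call
```

Forgetting the token check or the reassignment is exactly how the first-page-only bug creeps in, which is why the paginator is the safer default.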

Throttling and retries

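Hit any AWS API hard enough and you’ll get a throttling error: ThrottlingException on many services, RequestLimitExceeded on EC2. boto3 has built-in retries, but the default (legacy) mode is conservative. For batch jobs, switch to adaptive retries, which back off under throttling (the boto3 retries guide still describes this mode as experimental):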

from botocore.config import Config

cfg = Config(retries={"max_attempts": 10, "mode": "adaptive"})
ec2 = session.client("ec2", config=cfg)

Adaptive mode adds client-side rate limiting that responds to throttle signals, which is exactly what you want when sweeping a large account. Don’t write your own retry loop on top of this; you’ll just fight the built-in one.
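
Rather than wrapping calls in a second retry loop, you can observe how hard the built-in one is working. boto3 responses carry a ResponseMetadata dictionary that includes a RetryAttempts count. The sketch below reads it; retry_attempts and call_and_report are hypothetical helper names, and the printed advice is only illustrative.

```python
def retry_attempts(response):
    """Return how many retries botocore reported for this call (0 if absent)."""
    return response.get("ResponseMetadata", {}).get("RetryAttempts", 0)


def call_and_report(operation, **kwargs):
    """Invoke a bound client method and note when retries were needed."""
    response = operation(**kwargs)
    retries = retry_attempts(response)
    if retries:
        name = getattr(operation, "__name__", "call")
        print(f"{name} needed {retries} retries; consider slowing the sweep down")
    return response


# Usage, assuming the ec2 client configured above:
#   call_and_report(ec2.describe_volumes, MaxResults=100)
```

A steadily rising retry count is a hint to reduce concurrency before the account gets throttled harder.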

Handle errors by code, not by string

Botocore raises ClientError, and the useful information is inside it. Match on the error code, not the message text (which AWS can change):

from botocore.exceptions import ClientError

def stop_instance(ec2, instance_id):
    try:
        ec2.stop_instances(InstanceIds=[instance_id])
    except ClientError as e:
        code = e.response["Error"]["Code"]
        if code == "IncorrectInstanceState":
            print(f"{instance_id} not in a stoppable state, skipping")
        elif code == "UnauthorizedOperation":
            raise SystemExit("IAM policy is missing ec2:StopInstances")
        else:
            raise

That UnauthorizedOperation branch saves real time — instead of a confusing stack trace, you get told exactly which permission is missing.
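
When a script has many call sites, it can help to pull the stable fields out of the exception in one place and log those instead of the raw message. This is a minimal sketch under the assumption that the exception carries the usual ClientError response dictionary; describe_client_error is a hypothetical helper name.

```python
def describe_client_error(exc):
    """Extract the stable fields from a botocore ClientError-style exception."""
    response = getattr(exc, "response", None) or {}
    error = response.get("Error", {})
    metadata = response.get("ResponseMetadata", {})
    return {
        "code": error.get("Code", "Unknown"),          # match on this
        "message": error.get("Message", ""),           # log this, never match on it
        "request_id": metadata.get("RequestId"),       # useful for AWS support cases
        "http_status": metadata.get("HTTPStatusCode"),
    }


# Usage inside an except block:
#   except ClientError as e:
#       info = describe_client_error(e)
#       print(f"{info['code']} (request {info['request_id']}): {info['message']}")
```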

Dry-run before you mutate

Many EC2 mutating calls support a DryRun parameter that validates permissions and parameters without doing the thing. Use it as a preflight:

def safe_terminate(ec2, instance_id):
    try:
        ec2.terminate_instances(InstanceIds=[instance_id], DryRun=True)
    except ClientError as e:
        if e.response["Error"]["Code"] == "DryRunOperation":
            pass  # we WOULD be allowed — proceed
        else:
            raise  # real problem: permissions, bad ID, etc.
    ec2.terminate_instances(InstanceIds=[instance_id])

For any destructive script, I also add a --apply flag that defaults to off. The script prints exactly what it would do, and only mutates when you explicitly opt in. The number of “oops, that wasn’t supposed to run against prod” incidents that a default-dry-run prevents is worth the ten extra lines.
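
One way the --apply pattern might look, using argparse from the standard library. Treat it as a sketch: parse_args and run are hypothetical names, the output format is arbitrary, and run accepts any object with a terminate_instances method.

```python
import argparse


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Terminate the given EC2 instances.")
    parser.add_argument("instance_ids", nargs="+", help="instance IDs to terminate")
    parser.add_argument(
        "--apply",
        action="store_true",  # defaults to False, so the script is dry-run by default
        help="actually terminate; without it, only print the plan",
    )
    return parser.parse_args(argv)


def run(ec2, instance_ids, apply=False):
    """Print the plan; mutate only when apply is True. Returns the IDs acted on."""
    acted_on = []
    for instance_id in instance_ids:
        if not apply:
            print(f"[dry-run] would terminate {instance_id}")
            continue
        print(f"terminating {instance_id}")
        ec2.terminate_instances(InstanceIds=[instance_id])
        acted_on.append(instance_id)
    return acted_on
```

Because the flag is store_true, forgetting it can only ever produce a printed plan, never a mutation.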

A few more habits

Filter server-side, not client-side. Pass Filters= to the API instead of pulling everything and filtering in Python. It’s faster, cheaper, and cuts the number of pages you have to fetch.
Tag everything your scripts create. A CreatedBy: my-script tag makes cleanup and cost attribution possible later.
Watch the cost of “harmless” reads. Some APIs are billed per request (Cost Explorer) or per metric requested (CloudWatch GetMetricData). A tight polling loop can run up a surprising bill.
Pin your region explicitly. Relying on ambient region config is how a script meant for us-east-1 quietly operates on eu-west-1.
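
The first two habits pair naturally: tag what you create, then find it again with a server-side tag filter. The helpers below are a sketch with hypothetical names (tag_filters, tag_specifications); the dictionary shapes follow the EC2 Filters and TagSpecifications parameters.

```python
def tag_filters(tags):
    """Build EC2 Filters entries that match resources carrying these tags."""
    return [{"Name": f"tag:{key}", "Values": [value]} for key, value in tags.items()]


def tag_specifications(resource_type, tags):
    """Build the TagSpecifications argument accepted by EC2 create calls."""
    return [{
        "ResourceType": resource_type,
        "Tags": [{"Key": key, "Value": value} for key, value in tags.items()],
    }]


# Usage, assuming an ec2 client (other arguments elided):
#   ec2.run_instances(..., TagSpecifications=tag_specifications("instance", {"CreatedBy": "my-script"}))
#   paginator.paginate(Filters=tag_filters({"CreatedBy": "my-script"}))
```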

boto3 rewards a little discipline. Use clients, paginate everything, configure adaptive retries, match errors by code, and default destructive operations to dry-run. Do that and your cloud scripts become boring — which is the highest compliment you can pay automation.


Cloud automation can incur cost and cause irreversible changes. Run against a non-production account with least-privilege credentials before trusting any script in prod.