Rate Limiting and Traffic Shaping with Neutron QoS

Bandwidth is a shared resource, and on a busy OpenStack cloud one tenant’s runaway backup job can starve everyone else’s network. Neutron QoS exists to stop that: it lets you cap egress and ingress bandwidth per port, guarantee minimum bandwidth for important workloads, and mark DSCP for downstream prioritization. I reached for it after a single tenant saturated a hypervisor’s uplink and paged me at midnight. It works well, but QoS failures are subtle — the policy applies cleanly and the limit silently does nothing — so this guide covers both applying it and proving it actually took effect.

Understand the rule types

A QoS policy is a container for rules, and the rules come in distinct types that behave very differently. Mixing them up is the root of most confusion.

openstack network qos policy list
openstack network qos rule type list

The three you will use: bandwidth limit (a ceiling on egress and/or ingress), minimum bandwidth (a guarantee, which requires Placement and SR-IOV or another backend that supports it), and DSCP marking (sets the priority bits in the IP header). Bandwidth limit is enforced by most backends; minimum bandwidth has real backend requirements that, if unmet, mean the “guarantee” is fiction.
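Before relying on a rule type, it helps to confirm the deployment actually advertises it. This is a minimal sketch, assuming `openstack network qos rule type list -f value` prints one underscore-style name per line (for example `bandwidth_limit`); the function name `rule_type_supported` is my own invention, not part of any client.

```shell
# Hypothetical helper: reads the supported rule types on stdin and checks
# whether the CLI-style type given as $1 (hyphenated) is among them.
rule_type_supported() {
  # The CLI takes "bandwidth-limit"; the API lists "bandwidth_limit".
  wanted=$(printf '%s' "$1" | tr '-' '_')
  grep -qx "$wanted"
}

# Usage (needs a configured openstack client):
#   openstack network qos rule type list -f value \
#     | rule_type_supported minimum-bandwidth \
#     && echo "supported" || echo "not advertised by this backend"
```

An advertised type only tells you the driver accepts the rule; it says nothing about whether a given port can enforce it.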

Create and apply a policy

You create a policy, add rules, then attach the policy to a network or, for finer control, to an individual port. Per-port attachment is what gives you tenant-level control.

openstack network qos policy create tenant-cap
openstack network qos rule create tenant-cap \
  --type bandwidth-limit --max-kbps 100000 --max-burst-kbits 10000 \
  --egress
openstack port set <port-uuid> --qos-policy tenant-cap
openstack port show <port-uuid> -f value -c qos_policy_id

Applying to a port rather than the whole network lets you cap one chatty instance without throttling its neighbors. Confirm the qos_policy_id is set on the port. If it prints None or comes back empty, the policy is not attached, often because the port belongs to a different project than the policy and the policy has not been shared.

Pro Tip: Set both egress and ingress limits explicitly. Each bandwidth-limit rule covers one direction, measured from the instance’s point of view, and the rule is egress-only when you pass neither --egress nor --ingress. A common mistake is capping only egress, then being baffled when a download-heavy tenant still saturates the uplink inbound.
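To make the symmetric cap concrete, here is a small dry-run sketch that prints one bandwidth-limit rule per direction instead of executing anything. The helper name `emit_cap_rules` is hypothetical; the flags are the same ones used in the commands above.

```shell
# Hypothetical dry-run helper: prints (does not run) one bandwidth-limit rule
# per direction so the cap is symmetric. Review the output, then run it by hand.
emit_cap_rules() {
  policy=$1 max_kbps=$2 burst_kbits=$3
  for direction in egress ingress; do
    printf 'openstack network qos rule create %s --type bandwidth-limit --max-kbps %s --max-burst-kbits %s --%s\n' \
      "$policy" "$max_kbps" "$burst_kbits" "$direction"
  done
}

emit_cap_rules tenant-cap 100000 10000
```

If the egress rule already exists on the policy, only the ingress line is needed; creating a second rule for the same direction should be rejected.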

Prove the limit is actually enforced

This is the step everyone skips and then regrets. A QoS rule that exists in the database is not the same as a tc filter installed on the hypervisor. Go check the data plane.

# On the compute host running the instance:
tc qdisc show dev <tap-or-vhost-device>
tc class show dev <tap-device>

You should see an htb or tbf qdisc with a rate matching your max-kbps; on OVS hosts the instance’s egress cap can instead show up as an ingress qdisc with a police filter, because OVS polices traffic as it enters the switch from the tap device. If tc shows nothing, the L2 agent did not program the limit, so check the agent log and confirm the backend (OVS or OVN) supports the rule type you used. I have seen a QoS policy attached and shown in the API while the data plane had no shaping at all because the backend silently ignored an unsupported rule.
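Comparing rates by eye is error-prone because tc prints them with unit suffixes (`100Mbit`) while the rule is configured in kbps. The sketch below normalizes a tc rate token to kbit/s and checks tc output for an expected value. Both function names are hypothetical, and it assumes tc prints whole-number rates with `bit`, `Kbit`, `Mbit`, or `Gbit` suffixes.

```shell
# Hypothetical helper: convert a tc rate token such as "100Mbit" to kbit/s.
# tc uses SI prefixes for rates: 1 Mbit = 1000 Kbit.
tc_rate_to_kbit() {
  case $1 in
    *Gbit) echo $(( ${1%Gbit} * 1000000 )) ;;
    *Mbit) echo $(( ${1%Mbit} * 1000 )) ;;
    *Kbit) echo "${1%Kbit}" ;;
    *bit)  echo $(( ${1%bit} / 1000 )) ;;
    *)     return 1 ;;
  esac
}

# Reads `tc qdisc show` / `tc class show` output on stdin and succeeds if any
# "rate <value>" token equals the expected kbit/s given as $1.
tc_has_rate() {
  expected=$1
  for token in $(grep -o 'rate [0-9]*[A-Za-z]*bit' | cut -d' ' -f2); do
    [ "$(tc_rate_to_kbit "$token")" = "$expected" ] && return 0
  done
  return 1
}

# Usage on the compute host:
#   tc class show dev <tap-device> | tc_has_rate 100000 || echo "cap not installed"
```

When the limit is implemented as OVS policing rather than a shaping class, the value to compare is the policing rate on the interface (`ovs-vsctl list interface <tap-device>`), not a tc class rate.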

Debug a policy that isn’t limiting

When traffic clearly exceeds the cap, work from the database to the data plane. The disconnect is almost always between what Neutron recorded and what the agent installed.

openstack network qos policy show tenant-cap
openstack port show <port-uuid> -f value -c qos_policy_id
journalctl -u neutron-openvswitch-agent -n 100 --no-pager | grep -i qos

The agent log will say whether it tried to apply the rule and whether it failed. With OVN, check the northbound database instead, since OVN handles QoS differently from the OVS agent and tc may not be where the shaping lives.
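The database-to-data-plane walk can be written down as a tiny decision helper. This is only a sketch of the reasoning above: `qos_triage` is a hypothetical name, and all three inputs are values you gather by hand from the commands in this section.

```shell
# Hypothetical triage helper. Arguments (all gathered by hand beforehand):
#   $1 = qos_policy_id from `openstack port show` ("None" or empty if unset)
#   $2 = rate found on the data plane in kbit/s (empty if nothing installed)
#   $3 = max-kbps configured in the rule
qos_triage() {
  if [ -z "$1" ] || [ "$1" = "None" ]; then
    echo "control plane: no policy attached to the port"
  elif [ -z "$2" ]; then
    echo "agent: policy attached but nothing programmed on the host"
  elif [ "$2" != "$3" ]; then
    echo "data plane: installed rate $2 kbit/s differs from configured $3"
  else
    echo "consistent: check the direction or the traffic path instead"
  fi
}

qos_triage "None" "" 100000
```

With OVN, the second argument would come from the northbound QoS table (for example `ovn-nbctl list qos`) rather than from tc.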

Where AI speeds things up

QoS debugging is correlation across three views — the policy, the port, and the tc output — and an AI assistant is a solid fast junior for that. I paste the policy definition, the port’s policy ID, and the tc qdisc show output, and ask it to confirm the installed rate matches the configured max-kbps and to flag if the direction or burst settings are off. It reliably catches “you set egress but the saturation is ingress.”

I keep it credential-free and sanitized — port UUIDs and rates are fine, tokens are not — and it never runs openstack port set against production. The model tells me where the policy and the data plane diverge; I apply the fix, because a wrong cap on a critical tenant’s port is an outage. The incident response dashboard is where I run this during a saturation event, Warp makes capturing the tc output across hosts easier, and the prompt library has data-plane verification prompts.

tc -s class show dev <tap-device>   # the stats output I hand the model
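The stats output is noisy, so before handing it to anyone (human or model) I find it useful to pull out just the drop counters. This is a minimal sketch, assuming the usual `(dropped N, overlimits N requeues N)` stats line that tc prints; `tc_drop_summary` is a hypothetical name.

```shell
# Hypothetical helper: reads `tc -s class show` or `tc -s qdisc show` output
# on stdin and prints "dropped=<n> overlimits=<n>" for each stats line.
tc_drop_summary() {
  sed -n 's/.*(dropped \([0-9]*\), overlimits \([0-9]*\).*/dropped=\1 overlimits=\2/p'
}

# Usage on the compute host:
#   tc -s class show dev <tap-device> | tc_drop_summary
```

Counters that keep climbing while the tenant pushes traffic suggest the limit is engaging; counters stuck at zero during saturation point back at a limit that was never installed or sits on the wrong direction. The exact meaning of each counter varies by qdisc, so treat this as a hint rather than proof.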

Minimum bandwidth is a different beast entirely

Bandwidth limits are easy because they are enforced locally on the port and need nothing from the scheduler. Minimum bandwidth guarantees are hard because a guarantee requires the scheduler to actually find and reserve capacity, which means it flows through Placement and only works on backends that support it. People apply a minimum-bandwidth rule, see no error, and assume it works, when in fact it silently did nothing because the port type cannot honor it.

openstack network qos rule create tenant-cap \
  --type minimum-bandwidth --min-kbps 50000 --egress
openstack port show <port-uuid> -f value -c binding_profile
openstack resource provider list   # placement must track bandwidth inventory

For a minimum-bandwidth guarantee to be real, the port usually must be SR-IOV (or a supported backend), and Placement must carry NET_BW_EGR_KILOBIT_PER_SEC inventory on the relevant resource provider. If that inventory is absent, the scheduler cannot reserve the bandwidth and the “guarantee” is decorative. I learned to check Placement before promising anyone a minimum — a guarantee the infrastructure cannot enforce is worse than no guarantee, because people plan around it.
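Checking Placement can be reduced to one question: does the relevant resource provider carry enough of the right inventory? The sketch below answers that from text input. It assumes the osc-placement plugin is installed and that `-f value -c resource_class -c total` yields `<class> <total>` lines; `has_bw_inventory` is a hypothetical helper, and it looks only at total inventory, not at what is reserved or already allocated.

```shell
# Hypothetical helper: reads "<resource_class> <total>" lines on stdin and
# succeeds if the class named in $1 has at least $2 kbit/s of total inventory.
has_bw_inventory() {
  awk -v rc="$1" -v need="$2" \
    '$1 == rc && $2 + 0 >= need + 0 { found = 1 } END { exit !found }'
}

# Usage (<rp-uuid> is the resource provider that should hold the bandwidth):
#   openstack resource provider inventory list <rp-uuid> \
#       -f value -c resource_class -c total \
#     | has_bw_inventory NET_BW_EGR_KILOBIT_PER_SEC 50000 \
#     || echo "no usable egress bandwidth inventory"
```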

DSCP marking only matters end to end

The third rule type, DSCP marking, sets priority bits in the IP header so downstream routers can prioritize the traffic. The trap is that DSCP is meaningless unless the physical network honors those marks — and many do not by default, or they re-mark at the edge. A DSCP rule that applies cleanly in Neutron can be completely ignored two hops away.

openstack network qos rule create tenant-cap --type dscp-marking --dscp-mark 26
# Verify the mark survives onto the wire. Neutron marks on the hypervisor, so capture on the receiving side, not inside the sending instance:
tcpdump -v -i any -n 'ip and (ip[1] & 0xfc) >> 2 == 26'

Capturing the actual packets is the only way to know the mark is set and surviving. Neutron will faithfully stamp the DSCP value, but if your switches and routers are not configured to act on it, you have added a header field nobody reads. Coordinate DSCP values with the network team before rolling them out, or you are shaping traffic the underlay ignores entirely.
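One thing that trips people up when reading captures: `tcpdump -v` shows the whole TOS byte in hex, not the DSCP number. DSCP sits in the upper six bits of that byte, so the value is shifted left by two. A small conversion sketch, with `dscp_to_tos` as a hypothetical helper name and the ECN bits assumed to be zero:

```shell
# Convert a DSCP value (0-63) to the TOS byte that `tcpdump -v` prints.
# DSCP occupies the upper six bits, so the byte is the value shifted left by 2.
dscp_to_tos() {
  printf '0x%02x\n' $(( $1 << 2 ))
}

dscp_to_tos 26   # prints 0x68
```

So a packet marked with DSCP 26 should appear with `tos 0x68` in the capture; a `tos 0x0` on the far side means something re-marked or cleared it along the way.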

Roll out fairly across tenants

QoS is only fair if it is consistent. Rather than hand-attaching policies, template the application so every tenant port gets the right tier, and audit for ports that slipped through.

# Find tenant ports with no QoS policy attached (compute:nova only matches instances in the "nova" availability zone):
openstack port list --device-owner compute:nova -f value -c id -c qos_policy_id \
  | awk '$2=="None"{print $1}'

That audit is how you catch the new instance that launched before your automation attached a policy and is now the one tenant with no cap. Close the gap before it becomes the next midnight page.
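To close the gap without letting a script touch production on its own, the audit output can be turned into a reviewable list of commands. This is a dry-run sketch only: `emit_attach_cmds` is a hypothetical helper that prints the attach command for each port ID it reads and executes nothing.

```shell
# Hypothetical dry-run helper: reads port IDs on stdin and prints the attach
# command for each, using the policy name in $1. Nothing is executed.
emit_attach_cmds() {
  while read -r port_id; do
    [ -n "$port_id" ] && printf 'openstack port set %s --qos-policy %s\n' "$port_id" "$1"
  done
  return 0
}

# Usage: pipe the audit's port IDs in, review the output, then run it by hand.
#   <audit command from above> | emit_attach_cmds tenant-cap
```

Keeping the generation and the execution separate preserves the rule that a human applies every cap.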

Conclusion

Neutron QoS is the difference between a cloud where one tenant can starve the rest and one where bandwidth is fair, but the catch is that an applied policy and an enforced limit are not the same thing. Pick the right rule type, attach per port, set both directions, and always verify on the data plane with tc. An AI assistant is a capable fast junior for correlating the policy, port, and tc views and spotting direction or backend mismatches — keep credentials out, verify against the data plane yourself, and run the destructive port set commands by hand. More Neutron guides live under the OpenStack category.