AI-Assisted Keystone Token and Policy Debugging in OpenStack

There’s a special kind of dread that settles in when a user opens a ticket that just says “I’m getting access denied.” Denied to what? With which token? At which scope? Is it a 401 or a 403, because those are two completely different conversations. I’ve been untangling Keystone authorization in OpenStack since the days when policy lived in JSON and everyone hand-edited it into oblivion. The model has gotten cleaner, but the debugging is still a maze — and the maze is exactly where a quick AI assistant earns its keep, as long as you keep it on the right side of the credential wall.

First, is it 401 or 403?

This is the question that resolves half of all auth tickets, and people conflate them constantly. A 401 Unauthorized means you could not be authenticated: your token is missing, expired, malformed, or invalid. A 403 Forbidden means your token validated just fine, but the policy of the service you called says your roles don’t permit this action. They live in entirely different systems: 401 is a token problem, 403 is a policy problem.

So before anything else, I issue a token and confirm it’s even valid:

openstack token issue

If that fails, you’re in 401 territory and most of this article is irrelevant: go look at credentials and clock skew, and for intermittent failures skip ahead to Fernet keys. If it succeeds and the user still can’t act, you’re in 403 land, which means policy. I’ll often paste the failing request, the HTTP status, and a sanitized token issue output into ChatGPT and ask it to classify the failure. It’s reliably good at the 401-versus-403 triage because that distinction is well documented, and being right about which branch to take saves you from debugging the wrong half of the system.
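The triage above is simple enough to write down. This is a minimal sketch, not part of any OpenStack tooling; the function name and the hint strings are my own illustration.

```python
def classify_auth_failure(status_code: int) -> tuple[str, str]:
    """Map an HTTP status code to the debugging branch it belongs to.

    Returns (branch, hint). The branch names and hints are illustrative,
    not an OpenStack API.
    """
    if status_code == 401:
        # Authentication failed: the token itself is the problem.
        return ("token", "check credentials, token expiry, clock skew, Fernet keys")
    if status_code == 403:
        # Authentication succeeded: policy rejected the action.
        return ("policy", "check token scope, role assignments, policy.yaml")
    return ("other", "not an authentication or authorization failure")


if __name__ == "__main__":
    for code in (401, 403, 500):
        print(code, classify_auth_failure(code))
```

It is trivial on purpose: the value is in forcing the ticket to state a status code before anyone starts debugging.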

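Pro Tip: Policy is enforced by the service that received the request, so the log to read is that service’s (Keystone’s own log only covers Identity API calls). With debug logging enabled, it names the policy rule that denied the request. Grep for the target action; the rule name is your search key into policy.yaml.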

Understanding the scope of the token

Keystone tokens are scoped, and the scope determines what the token can even be evaluated against. A token can be project-scoped, domain-scoped, system-scoped, or unscoped. A surprising number of 403s are really “the user authenticated with the wrong scope.” Check what roles the principal actually holds and where:

openstack role assignment list --names --user alice --project demo

The --names flag is the difference between a readable table and a wall of UUIDs — always use it. If the user has the member role on demo but they’re trying to perform a system-level operation, no project-scoped token will ever grant it. They need a system-scoped token:

openstack --os-system-scope all token issue

When I’m reasoning about which scope a given action requires, I treat AI as a fast junior who has read the docs but never run the cloud. It can remind me that, say, listing all projects across domains needs system scope — but I confirm against the actual policy.yaml in front of me, because operators override these defaults constantly and the AI is answering from upstream defaults it can’t see your deployment diverging from.
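To check which scope a token actually carries, you can look at the top-level keys of the token body that the Identity v3 API returns. The sketch below assumes you have that body as a decoded dict (the value under the "token" key) with the token ID and any secrets already stripped. The function name is hypothetical.

```python
def token_scope(token: dict) -> str:
    """Return the scope type of a Keystone v3 token body.

    `token` is the decoded JSON object found under the "token" key of the
    API response. Only top-level keys matter here: a project-scoped token
    also has a "domain" nested inside "project", which is not a domain scope.
    """
    if "system" in token:
        return "system"
    if "domain" in token:
        return "domain"
    if "project" in token:
        return "project"
    return "unscoped"


if __name__ == "__main__":
    # Sanitized example body; names are made up for illustration.
    sanitized = {
        "user": {"name": "alice", "domain": {"name": "Default"}},
        "project": {"name": "demo", "domain": {"name": "Default"}},
        "roles": [{"name": "member"}],
    }
    print(token_scope(sanitized))
```

If this prints "project" and the failing action needs system scope, no amount of policy editing on the project side will fix the ticket.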

Implied roles and the inheritance you forgot about

Modern Keystone supports role implication — assigning admin can imply member, which implies reader, under the “secure RBAC” model. This is wonderful until you’re debugging why someone has a permission you never explicitly granted them.

openstack implied role list

That output is the map of which roles silently confer others. The default secure RBAC personas — reader (read-only), member (read-write within scope), and admin (privileged) — are designed to stack. If a user can read something they “shouldn’t,” check whether a higher role they hold implies reader. I keep a small library of prompts for walking through role-implication chains, because the human brain is bad at transitive graphs and AI is decent at it — give it the implied-role list and the user’s direct assignments and it’ll trace the closure for you. Then you verify, because a wrong edge in that graph is the difference between “working as designed” and “security incident.”
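The closure is also easy to compute yourself, which gives you something to check the AI’s answer against. This is a minimal sketch that assumes you have transcribed the implied-role list into (prior role, implied role) pairs by hand. The function name is hypothetical.

```python
from collections import defaultdict


def effective_roles(direct, implications):
    """Return every role a principal holds, directly or by implication.

    direct:        iterable of role names assigned explicitly
    implications:  iterable of (prior_role, implied_role) pairs
    """
    graph = defaultdict(set)
    for prior, implied in implications:
        graph[prior].add(implied)

    seen = set()
    stack = list(direct)
    while stack:
        role = stack.pop()
        if role in seen:
            continue  # already expanded; also guards against cycles
        seen.add(role)
        stack.extend(graph[role] - seen)
    return seen


if __name__ == "__main__":
    # The default chain described above: admin -> member -> reader.
    chain = [("admin", "member"), ("member", "reader")]
    print(sorted(effective_roles({"admin"}, chain)))
```

Run it once per scope (project, domain, system), because assignments and therefore closures are per scope.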

Reading policy.yaml without losing your mind

Here’s where most of the real 403 debugging happens. Each service ships default policies (these days generated in code via oslo.policy), and operators layer overrides in policy.yaml. To see what a service actually enforces, generate the effective policy:

oslopolicy-policy-generator --namespace nova

That dumps every rule as the service will evaluate it, with any overrides from policy.yaml merged over the in-code defaults. For the pristine defaults alone, the companion tool is oslopolicy-sample-generator. A classic foot-gun: someone copies an old policy.json with deprecated rules, half of which no longer match anything, and now policy is a Frankenstein of new defaults and stale overrides. When I’m staring at a 200-line policy file trying to figure out why os_compute_api:servers:create is denied, I’ll paste the rule and its dependencies into Cursor and ask it to expand the rule references into a flat boolean expression. It’s genuinely faster at unrolling rule:admin_or_owner and not rule:foo than I am by hand. But I never let it write the override I’ll apply to production without me reading every character, because a wrong policy rule can silently grant admin to everyone.
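The unrolling itself is mechanical. The sketch below is a plain text substitution over rule: references, not the oslo.policy parser, and it assumes the rules are already loaded into a dict of name to check string. The rule names and bodies in the example are invented for illustration and are not any service’s defaults.

```python
import re

# Matches "rule:<name>" references inside a check string.
RULE_REF = re.compile(r"rule:([A-Za-z0-9_.:\-]+)")


def expand_rule(name, rules, _stack=()):
    """Recursively inline rule: references so the expression reads flat."""
    if name in _stack:
        raise ValueError("cyclic rule reference: " + " -> ".join(_stack + (name,)))
    if name not in rules:
        return "<undefined rule:" + name + ">"
    body = rules[name]
    if body == "":
        return "@"  # an empty check string always passes; "@" is its explicit form

    def substitute(match):
        inner = expand_rule(match.group(1), rules, _stack + (name,))
        return "(" + inner + ")"

    return RULE_REF.sub(substitute, body)


if __name__ == "__main__":
    # Hypothetical rules, for illustration only.
    rules = {
        "admin_api": "role:admin",
        "owner": "project_id:%(project_id)s",
        "admin_or_owner": "rule:admin_api or rule:owner",
        "locked": "role:auditor",
        "example:action": "rule:admin_or_owner and not rule:locked",
    }
    print(expand_rule("example:action", rules))
```

Undefined references are surfaced rather than silently dropped, since a stale override pointing at a rule that no longer exists is exactly the Frankenstein case described above.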

Pro Tip: A policy rule of "" (empty string) means “allow everyone.” If you see that on a sensitive action, that’s not a clever default, that’s a hole.
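A quick way to audit for that hole is to scan the rules for check strings that always pass. This sketch assumes the rules are already loaded into a dict of name to check string; the function name and example rule names are hypothetical. It also flags "@", which oslo.policy treats as the explicit always-allow check.

```python
def allow_all_rules(rules):
    """Return the names of rules whose check string always passes."""
    return sorted(
        name for name, body in rules.items()
        if body.strip() in ("", "@")
    )


if __name__ == "__main__":
    # Hypothetical rules, for illustration only.
    rules = {
        "example:list": "",
        "example:delete": "role:admin",
        "example:show": "@",
    }
    print(allow_all_rules(rules))
```

Anything this returns deserves a human decision about whether open access is really intended.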

Fernet key rotation: the invisible token invalidator

Here’s the one that produces the most baffling tickets. Keystone encrypts and signs Fernet tokens with a rotating set of keys. If you rotate keys too aggressively, or your key repositories drift between Keystone nodes behind a load balancer, tokens that were perfectly valid suddenly fail validation with a 401, intermittently, depending on which node serves the request.

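openstack token issue
sudo sha256sum /etc/keystone/fernet-keys/*

The second command fingerprints the key repository on the node you run it on (the path assumes the default key_repository setting); comparing checksums rather than file contents keeps the key material off your screen. Run it on every control-plane node. If the checksums differ between nodes, you’ve found an intermittent-401 generator. The keys must be identical and rotated in lockstep across all nodes, and max_active_keys must be high enough that a key is not rotated out while tokens it encrypted are still within your max token lifetime. This is a deeply mechanical, easy-to-get-subtly-wrong area, and AI is a good explainer of the staged/primary/secondary key lifecycle. But the rotation itself runs against real key material, and that material never goes into a chat window. Ever.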
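Both checks in that paragraph can be sketched without touching key material. The first function encodes the sizing rule of thumb as I understand it from the Keystone documentation (token lifetime divided by rotation interval, plus two). The second compares per-node sets of key-file checksums that you have collected yourself, for example with sha256sum. Function names and node names are hypothetical.

```python
import math
from collections import Counter


def min_max_active_keys(token_expiration_hours, rotation_frequency_hours):
    """Smallest max_active_keys that keeps a key until its tokens expire.

    Assumed rule of thumb: (token lifetime / rotation interval) + 2,
    rounded up so partial intervals are covered.
    """
    return math.ceil(token_expiration_hours / rotation_frequency_hours) + 2


def drifted_nodes(digests_by_node):
    """Return nodes whose checksum set differs from the most common set.

    digests_by_node maps node name -> iterable of key-file checksums.
    Pass checksums only, never the key contents.
    """
    if not digests_by_node:
        return []
    counts = Counter(frozenset(d) for d in digests_by_node.values())
    reference, _ = counts.most_common(1)[0]
    return sorted(
        node for node, digests in digests_by_node.items()
        if frozenset(digests) != reference
    )


if __name__ == "__main__":
    print(min_max_active_keys(24, 6))
    print(drifted_nodes({
        "ctl1": {"aa11", "bb22"},
        "ctl2": {"aa11", "bb22"},
        "ctl3": {"aa11", "cc33"},
    }))
```

With only two nodes that disagree there is no majority, so treat any non-empty result as "the repositories differ" rather than as proof of which node is wrong.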

The credential wall is non-negotiable

Let me state the boundary plainly. I will share with an AI: HTTP status codes, sanitized role assignment tables, policy rule text, implied-role lists, and redacted log lines. I will never share: my admin token, my clouds.yaml, application credentials, or Fernet key material. The AI is a fast junior engineer — useful for unrolling logic and classifying errors, useless and dangerous as a holder of privilege. A token pasted into a chat is a token you have to treat as compromised, and an admin token is the keys to the entire cloud.
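A small redaction pass before anything reaches a chat window helps enforce that boundary. The sketch below is deliberately incomplete: it assumes Fernet tokens appear with their usual gAAAAA prefix and that secrets show up as header or key: value pairs. The patterns and the function name are my own, so read the output before pasting it anywhere.

```python
import re

# (pattern, replacement) pairs, applied in order. Illustrative, not exhaustive.
_PATTERNS = [
    # Bare Fernet tokens: URL-safe base64 that starts with "gAAAAA".
    (re.compile(r"gAAAAA[A-Za-z0-9_\-=]{20,}"), "<REDACTED_FERNET_TOKEN>"),
    # Token headers as they appear in debug logs or curl traces.
    (re.compile(r"(?i)(x-(?:auth|subject)-token\s*[:=]\s*)\S+"), r"\1<REDACTED>"),
    # key: value or key=value secrets, e.g. lines copied from clouds.yaml.
    (re.compile(r"(?i)((?:password|secret)\s*[:=]\s*)\S+"), r"\1<REDACTED>"),
]


def redact(text):
    """Replace token-like and secret-like substrings with placeholders."""
    for pattern, replacement in _PATTERNS:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # Made-up token value for illustration.
    print(redact("X-Auth-Token: gAAAAABexampleexampleexampleexample"))
    print(redact("password: hunter2 user: alice"))
```

Redaction is a backstop, not permission: if a token did end up in a chat, revoke it anyway.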

When I formalize a fix for a recurring auth bug, I run the proposed policy change through our code review dashboard before it lands, and I file the durable runbooks under the OpenStack category. For the polished, reusable RBAC-debugging checklists, I’ve bundled them into a prompt pack so the next 403 ticket is a checklist, not an archaeology dig.

Conclusion

Keystone debugging is fundamentally about asking the right question in the right order: 401 or 403, then scope, then role assignments, then policy, then — for the truly cursed intermittent failures — Fernet keys. AI accelerates every one of those steps by being a tireless reader of tables and unroller of boolean logic. It just never crosses the line into holding your credentials or writing your production policy unreviewed. Keep it as the fast junior, keep yourself as the human who verifies, and the maze gets a lot shorter.