Managing GPUs and Accelerators with OpenStack Cyborg

The day our data science team asked for “GPU instances, like the public cloud has,” I learned that OpenStack’s answer is not as simple as adding a flavor. GPUs, FPGAs, SmartNICs, and other accelerators are special hardware with their own lifecycle — discovery, allocation, attachment, and cleanup — and bolting that onto Nova by hand is fragile. OpenStack Cyborg is the project built specifically to manage accelerators as first-class, schedulable resources, so a tenant can ask for a GPU the same way they ask for vCPUs.

Cyborg sits at an awkward intersection of hardware, Placement, and Nova, and its concepts (device profiles, deployables, attach handles) take a while to internalize. So I treat an AI assistant as my fast junior engineer here: I describe the device profile I want, it drafts the JSON and the commands, and I verify every PCI address and trait before it touches scheduling. Wrong hardware addresses do not error politely — they fail at instance boot, in front of the user.

Confirming Cyborg Sees the Hardware

Cyborg agents run on compute hosts with accelerators and report devices up to the API. First, confirm the devices are discovered:

openstack accelerator device list
openstack accelerator deployable list

If your GPU host shows no devices, the Cyborg agent’s driver is not detecting the hardware — usually a missing vendor driver or a PCI passthrough/SR-IOV config gap, not a Cyborg-API problem. This is the single most common starting point for “Cyborg is broken” tickets.
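To make that check repeatable, one option is to compare the hosts you expect to carry accelerators against the hosts Cyborg actually reports. The sketch below assumes you saved the output of `openstack accelerator device list -f json` and that each row has a `hostname` key. The key name, the sample rows, and the host names are all assumptions for illustration, so adjust them to whatever your client prints.

```python
import json


def hosts_missing_devices(device_rows, expected_hosts, host_key="hostname"):
    """Return the expected hosts that report no accelerator devices, sorted."""
    reporting = {row.get(host_key) for row in device_rows}
    return sorted(h for h in expected_hosts if h not in reporting)


# Hypothetical sample, shaped like the JSON output of the device list command.
sample_output = json.loads("""
[
  {"uuid": "d1", "type": "GPU", "vendor": "10de", "hostname": "gpu-node-1"},
  {"uuid": "d2", "type": "GPU", "vendor": "10de", "hostname": "gpu-node-1"}
]
""")

# Hosts listed here but absent from Cyborg are where to check drivers first.
missing = hosts_missing_devices(sample_output, ["gpu-node-1", "gpu-node-2"])
print(missing)
```

Any host this prints is a candidate for the vendor-driver or passthrough checks described above, before you look at the Cyborg API at all.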

Building Device Profiles

A device profile is the thing tenants actually request — it bundles the trait and resource requirements for an accelerator. You create it from a small JSON spec:

openstack accelerator device profile create gpu-t4 \
  '[{"resources:CUSTOM_ACCELERATOR_GPU": "1", "trait:CUSTOM_GPU_T4": "required"}]'
openstack accelerator device profile list

The resource class and trait names must match what Cyborg reported into Placement. This JSON is exactly the fiddly, error-prone artifact I hand to AI first. I describe “a profile for one T4 GPU” and the model produces correctly structured JSON with the right resources: and trait: keys. Then I cross-check the trait name against the traits on the Placement resource provider, because a model will confidently write CUSTOM_GPU_T4 when the real trait is CUSTOM_NVIDIA_T4, and that mismatch means the scheduler finds zero hosts.

Pro Tip: Before creating any device profile, dump the actual traits Cyborg reported into Placement with openstack resource provider trait list <provider-uuid> and paste them into your prompt. Grounding the AI in the real trait strings turns “no valid host” guesswork into a profile that is far more likely to schedule on the first try.
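The same grounding can be enforced in code before a profile is ever created. This is a minimal sketch, not Cyborg's own tooling: `build_profile_groups` is a hypothetical helper that assembles the groups portion of the spec and refuses any trait that is not in the list you copied from the real cloud. The resource class and trait strings are the article's examples, used here as placeholders.

```python
import json


def build_profile_groups(resource_class, count, traits, reported_traits):
    """Build a one-group spec, refusing traits the cloud never reported."""
    unknown = sorted(set(traits) - set(reported_traits))
    if unknown:
        raise ValueError(f"traits not reported by the cloud: {unknown}")
    # Placement-style keys: "resources:<class>" and "trait:<name>".
    group = {f"resources:{resource_class}": str(count)}
    for trait in traits:
        group[f"trait:{trait}"] = "required"
    return [group]


# Placeholder: paste the trait strings reported by your own cloud here.
reported = ["CUSTOM_GPU_T4"]

groups = build_profile_groups(
    "CUSTOM_ACCELERATOR_GPU", 1, ["CUSTOM_GPU_T4"], reported
)
print(json.dumps(groups))
```

Failing loudly at build time is a deliberate choice here: a ValueError on your laptop is cheaper than a NoValidHost in front of a tenant.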

Attaching Accelerators to Instances

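To launch an instance with an accelerator, you reference the device profile from the flavor rather than from the server create request: Nova reads the accel:device_profile extra spec on the flavor, and Nova and Cyborg coordinate the actual attach. In practice you would usually set this on a dedicated GPU flavor rather than a shared one:

openstack flavor set m1.large --property accel:device_profile=gpu-t4
openstack server create gpu-vm \
  --flavor m1.large --image ubuntu-22.04 \
  --network private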

When this works, the instance boots with the GPU passed through. When it fails, the evidence is usually split between the Nova and Cyborg logs, so plan on reading both.

Debugging the “No Valid Host” GPU Failure

This is the failure that eats afternoons. A GPU request returns NoValidHost even though you know the hardware exists. The cause is almost always a mismatch between what the profile asks for and what Placement actually has. I work it like this:

openstack accelerator device profile show gpu-t4
openstack resource provider list
openstack resource provider trait list <provider-uuid>

I paste all three outputs into Claude and ask it to find the mismatch between the profile’s required traits and the provider’s actual traits. This cross-referencing — three lists, one missing string — is exactly the mechanical work AI excels at, and it finds the dead trait far faster than I do squinting at UUIDs. When a GPU launch failure becomes a user-facing incident, I log the dig through my incident response dashboard.
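The cross-reference itself is small enough to script, which is useful as a sanity check on whatever the model reports. The sketch below assumes you have already parsed the profile's groups into a list of dicts and collected each provider's traits into a dict keyed by provider UUID. The function names and the sample UUIDs are hypothetical.

```python
def required_traits(profile_groups):
    """Collect every trait marked 'required' across a profile's groups."""
    return {
        key.split(":", 1)[1]
        for group in profile_groups
        for key, value in group.items()
        if key.startswith("trait:") and value == "required"
    }


def trait_gaps(profile_groups, provider_traits):
    """Map each provider UUID to the required traits it lacks, sorted."""
    wanted = required_traits(profile_groups)
    return {
        provider: sorted(wanted - set(traits))
        for provider, traits in provider_traits.items()
    }


# Hypothetical inputs standing in for the three command outputs.
profile = [{"resources:CUSTOM_ACCELERATOR_GPU": "1",
            "trait:CUSTOM_GPU_T4": "required"}]
providers = {
    "rp-1": ["CUSTOM_NVIDIA_T4"],
    "rp-2": ["CUSTOM_GPU_T4", "CUSTOM_OTHER"],
}

gaps = trait_gaps(profile, providers)
print(gaps)
```

A provider with an empty list can satisfy the profile's traits. If every provider has a non-empty list, the profile is asking for a string that no host offers.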

Coordinating Cyborg, Nova, and Placement

The thing that trips up everyone new to Cyborg is that no single project owns the GPU lifecycle: it is a three-way handshake. Cyborg discovers the device and reports it into Placement as a resource provider with traits. Nova’s scheduler queries Placement to find a host with the requested resources and traits. Then, during boot, Cyborg binds the accelerator request (ARQ) to a specific device and Nova’s compute driver attaches that device to the guest. If any link in that chain disagrees, you get a failed launch with a misleading error.

Understanding which project to blame is half the battle, and it is where I find an AI assistant genuinely clarifying. I describe the failure symptom and ask the model to walk the chain: “Given a NoValidHost on a GPU request, which of Cyborg, Nova, or Placement is most likely at fault, and what command confirms it?” It produces a clean diagnostic order — check Cyborg discovered the device, check Placement has the trait, check the profile asks for that exact trait. I run those checks myself and confirm the conclusion. The model maps the symptom to the likely culprit fast; I own deciding whether it is right before I change any config. Knowing where to look turns a three-hour cross-project hunt into a ten-minute one.
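That diagnostic order can be written down as a tiny triage function, so the checks always run in the same sequence. This is an illustrative sketch under simplifying assumptions: it takes already-collected data (discovered device rows, one provider's traits, and the profile's required traits) and names only the first link that looks broken. The function name and messages are my own, not output from any OpenStack tool.

```python
def first_broken_link(device_rows, provider_traits, profile_traits):
    """Return the first failing link in the Cyborg, Placement, profile chain."""
    # Step 1: did Cyborg discover anything on the host at all?
    if not device_rows:
        return "cyborg: no devices discovered; check the agent and vendor driver"
    # Step 2: did anything reach Placement for this provider?
    if not provider_traits:
        return "placement: provider has no traits; check Cyborg's reporting"
    # Step 3: does the profile ask for exactly what Placement has?
    missing = sorted(set(profile_traits) - set(provider_traits))
    if missing:
        return f"profile: asks for traits Placement lacks: {missing}"
    return "chain looks consistent; read the Nova and Cyborg logs next"


# Hypothetical data: device found, trait reported, profile asks for another.
verdict = first_broken_link(
    [{"uuid": "d1"}], ["CUSTOM_NVIDIA_T4"], ["CUSTOM_GPU_T4"]
)
print(verdict)
```

The function only points at where to look. Deciding whether the verdict is right, and changing any config, stays a human step.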

Cleaning Up Attach Handles

Accelerators have a habit of leaking when instances are deleted uncleanly. Check for orphaned attach state:

openstack accelerator arq list

ARQs (Accelerator Request objects) that outlive their instance indicate a cleanup failure. I have AI draft a reconciliation script that lists ARQs, cross-references live instances, and flags orphans — but I run the deletion by hand. An over-eager cleanup script that deletes a live ARQ takes a working GPU instance offline.
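A reconciliation script along those lines can stay strictly read-only. The sketch below assumes each ARQ row exposes an `instance_uuid` field and that you have separately gathered the set of live instance UUIDs. The field name and sample values are assumptions to verify against your own `-f json` output. It only flags candidates and never deletes anything.

```python
def find_orphan_arqs(arqs, live_instance_ids, instance_key="instance_uuid"):
    """Return ARQs bound to an instance that no longer exists. Read-only."""
    live = set(live_instance_ids)
    return [
        arq for arq in arqs
        # ARQs with no instance recorded are skipped, not treated as orphans.
        if arq.get(instance_key) and arq[instance_key] not in live
    ]


# Hypothetical rows standing in for parsed ARQ list output.
arqs = [
    {"uuid": "arq-1", "instance_uuid": "vm-a"},
    {"uuid": "arq-2", "instance_uuid": "vm-gone"},
    {"uuid": "arq-3", "instance_uuid": None},
]

orphans = find_orphan_arqs(arqs, ["vm-a", "vm-b"])
print([arq["uuid"] for arq in orphans])
```

Keeping deletion out of the function is the point: the output is a review list for a human, in line with running the actual cleanup by hand.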

Guardrails

Cyborg touches scheduling and real hardware passthrough, so mistakes show up as failed boots and leaked devices — not silent, but expensive. My rules:

The AI drafts device profiles, JSON specs, and reconciliation scripts; it never holds production credentials or runs the attach/detach against the live cloud.
Every new profile is validated against the real reported traits before any tenant uses it.
Destructive cleanup (deleting ARQs, tearing down deployables) is always human-run after I confirm the device is genuinely orphaned.

Because device-profile JSON is essentially code, I run it through my code review dashboard before it ships. My vetted Cyborg prompts live in the prompt workspace, and the reusable templates are in the OpenStack prompt pack. For editing the JSON specs inline, GitHub Copilot handles the completion.

Wrapping Up

Cyborg makes GPUs and FPGAs first-class OpenStack citizens, which is exactly what your data science and ML teams want. The cost is a stack of fiddly trait matching across Cyborg, Placement, and Nova — and that is precisely where an AI assistant turns hours of UUID-squinting into a quick cross-reference. Keep the model drafting, ground it in real trait names, and keep every attach and delete under your own hand.

As accelerated workloads become the norm rather than the exception, Cyborg stops being a niche project and starts being core infrastructure. Teams that get the trait-and-profile model right early — with a small, well-named library of device profiles tied to real reported traits — avoid the painful retrofit later. An AI assistant makes building that library fast, but the discipline of grounding every profile in actual hardware traits is what keeps GPU launches from becoming a recurring support ticket. Get the foundation clean and your ML teams self-serve the GPUs they need without paging you.
