NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend That Compiles SIMT GPU Kernels Directly to PTX

What You Need Before You Begin
cuda-oxide has version-specific requirements for each dependency. Before installing anything, make sure your system meets all of them. The project is currently Linux-only (tested on Ubuntu 24.04).
Linux (Ubuntu 24.04)
Rust nightly
CUDA Toolkit 12.x+
LLVM 21+
Clang 21 / libclang-common-21-dev
Git
ⓘ Why LLVM 21?
Simple kernels may work with LLVM 20, but anything targeting Hopper or Blackwell (TMA, tcgen05, WGMMA) needs llc from LLVM 21 or later. This is a strict requirement, not a recommendation.
Check your current version of CUDA to ensure compatibility:
nvcc --version
Set Up the Rust Nightly Toolchain
cuda-oxide needs Rust nightly and two additional components: rust-src and rustc-dev. The toolchain is pinned to nightly-2026-04-03 via rust-toolchain.toml in the repository, so it is installed automatically the first time you build inside the repo.
If you need to install it manually:
# Install the pinned nightly toolchain
rustup toolchain install nightly-2026-04-03
# Add required components
rustup component add rust-src rustc-dev --toolchain nightly-2026-04-03
# Confirm the toolchain is active
rustup show
ⓘ Why these components?
rustc-dev exposes the internal compiler APIs that the custom codegen backend plugs into. rust-src is required so the compiler can find and build the standard library sources for the device target.
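For reference, a rust-toolchain.toml that pins the nightly and components described above would look something like this (a sketch based on the pin named in this guide, not copied from the repository):

```toml
[toolchain]
channel = "nightly-2026-04-03"
components = ["rust-src", "rustc-dev"]
```

With this file at the workspace root, any cargo invocation inside the repo installs and selects the pinned toolchain automatically.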
Install LLVM 21 with NVPTX Backend
The cuda-oxide pipeline emits LLVM IR (.ll files) and invokes an external llc binary to generate PTX. You need LLVM 21 or later with the NVPTX backend enabled.
# Ubuntu/Debian
sudo apt install llvm-21
# Verify the NVPTX backend is present
llc-21 --version | grep nvptx
The pipeline auto-detects llc-22 and llc-21 in your PATH, in that order. To pin a specific binary, set an environment variable:
# Pin to a specific llc binary
export CUDA_OXIDE_LLC=/usr/bin/llc-21
⚠ Common Failure
If nvptx does not appear in the output of llc-21 --version, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository rather than your distro's default packages, which may omit the GPU backend.
Install Clang 21 for the cuda-bindings Crate
The cuda-bindings crate uses bindgen to generate FFI bindings to cuda.h at build time. bindgen requires libclang, and in particular it needs Clang's resource headers (including stddef.h). A bare libclang1-* runtime package is not enough.
# Install the full clang-21 package (includes resource headers)
sudo apt install clang-21
# Alternatively, the -dev header package also works
sudo apt install libclang-common-21-dev
⚠ Missing Clang Headers
If you install only the runtime library without the headers, the host build will fail with a 'stddef.h' file not found error during bindgen. Run cargo oxide doctor in the next step to catch this before attempting a build.
Clone the Repo and Install cargo-oxide
cargo-oxide is the Cargo subcommand that drives the entire build pipeline. It powers cargo oxide build, cargo oxide run, cargo oxide debug, and cargo oxide pipeline.
Inside the repo (with examples to try):
git clone https://github.com/NVlabs/cuda-oxide.git
cd cuda-oxide
# cargo oxide works out of the box via a workspace alias
cargo oxide run vecadd
Without the repo (for your projects):
# Install globally from the git source
cargo install --git https://github.com/NVlabs/cuda-oxide cargo-oxide
# On first run, cargo-oxide fetches and builds the codegen backend
Then verify that all requirements are present with the built-in health check:
cargo oxide doctor
ⓘ What the doctor examines
It verifies your Rust toolchain (nightly, rust-src, rustc-dev), CUDA Toolkit, LLVM version and NVPTX support, Clang/libclang headers, and codegen backend binaries. Fix any red items before proceeding.
Build and Run the vecadd Example
The first canonical example is vecadd – a vector addition kernel that adds two arrays of 1,024 f32 values on the GPU and verifies the result on the host.
# Build and run end-to-end
cargo oxide run vecadd
If everything is configured correctly, you will see:
✓ SUCCESS: All 1024 elements correct!
To see the complete integration pipeline – from Rust MIR through each Pliron dialect down to PTX – use:
# Print the full trace: Rust MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX
cargo oxide pipeline vecadd
Debugging with cuda-gdb:
cargo oxide debug vecadd --tui
ⓘ Output artifacts
A successful build produces two files: target/debug/vecadd (host binary) and target/debug/vecadd.ptx (device code). The host binary loads the PTX file via the CUDA driver at runtime.
Writing Your Own #[kernel] Function
A kernel is a function annotated with #[kernel]. Use DisjointSlice<T> for mutable output and &[T] for read-only input. Get the thread's unique hardware index with thread::index_1d().
use cuda_device::{kernel, thread, DisjointSlice};
// Tier 1 safety: race-free by construction, no `unsafe` needed.
// DisjointSlice::get_mut() only accepts a ThreadIndex —
// a hardware-derived opaque type guaranteeing unique writes per thread.
#[kernel]
pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}
ⓘ Tier 1 Safety: how it works
ThreadIndex is an opaque newtype around usize that can only be constructed from the hardware built-in registers (threadIdx, blockIdx, blockDim). Because each thread receives a unique value, and DisjointSlice::get_mut() accepts only a ThreadIndex, writes cannot alias by construction. No unsafe appears anywhere in the kernel.
Launching the Kernel from Host Code
Host and device code live in the same .rs file. The host side uses CudaContext, DeviceBuffer, and the cuda_launch! macro to manage GPU memory and kernel launches.
use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};
fn main() {
    // Initialize GPU context on device 0
    let ctx = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(&ctx, "scale_example").unwrap();

    // Upload input data to GPU memory
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Dispatch the kernel — LaunchConfig auto-sizes blocks/grids
    cuda_launch! {
        kernel: scale,
        stream: stream,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Download result back to host
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] - 2.5).abs() < 1e-5);
    println!("✓ Kernel ran successfully!");
}
ⓘ What cuda_launch! does
It marshals the argument list (slices, scalars, and captured closures) into PTX kernel parameters and dispatches the kernel on the given stream. No manual argument marshalling is required.
What to Check Out Next
You have a working cuda-oxide setup. Here are the most valuable things to explore next, in order of complexity:
- Generic kernels with monomorphization – try the generic example (cargo oxide run generic) to see how a generic fn scale is monomorphized into distinct PTX kernels per type.
- Closure capture – the host_closure example shows how a move |x: f32| x * factor closure is expanded and passed as PTX kernel parameters automatically.
- Async GPU execution – cuda_launch_async! returns a lazy DeviceOperation that resolves on .sync() or .await. See the async_mlp and async_vecadd examples.
- Shared memory and warp intrinsics – these require unsafe blocks with documented safety contracts. See Tier 2 in the safety model documentation.
- GEMM at the Speed-of-Light – the gemm_sol example reaches 868 TFLOPS on the B200 (58% of cuBLAS SoL) using cta_group::2, CLC, and a 4-stage pipeline.
- Blackwell tensor cores – the tcgen05 example targets sm_100a with TMEM, MMA, and cta_group::2. Requires LLVM 21+.
ⓘ Known Limitation in v0.1.0
index_2d(stride) is currently marked unsound: if threads in the same kernel call it with different stride values, two threads can obtain &mut T to the same element with no unsafe in sight. Until the fix lands (lifting the stride into a type parameter), bind the stride once in a single let and reuse it at all call sites.
Full documentation: nvlabs.github.io/cuda-oxide · Source: github.com/NVlabs/cuda-oxide



