
NVIDIA AI Just Released cuda-oxide: An Experimental Rust-to-CUDA Compiler Backend That Lowers SIMT GPU Kernels Directly to PTX

Step 01 of 09 · Prerequisites

What You Need Before You Begin

cuda-oxide has version-specific requirements for each dependency. Before installing anything, make sure your system meets all of them. The project is currently Linux-only (tested on Ubuntu 24.04).

Linux (Ubuntu 24.04)
Rust nightly
CUDA Toolkit 12.x+
LLVM 21+
Clang 21 / libclang-common-21-dev
Git

ⓘ Why LLVM 21?
Simple kernels may work with LLVM 20, but anything targeting Hopper or Blackwell – TMA, tcgen05, WGMMA – needs llc from LLVM 21 or later. This is a strict requirement, not a recommendation.

Check your current version of CUDA to ensure compatibility:

nvcc --version
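
The remaining prerequisites can be checked the same way before you start. The package suffixes below assume Ubuntu's versioned llvm/clang packages, so adjust them to your install:

# Rust toolchain, LLVM/llc, Clang, Git
rustup show
llc-21 --version
clang-21 --version
git --version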

Step 02 of 09 · Install Rust Nightly

Set Up the Rust Nightly Toolchain

cuda-oxide needs Rust nightly plus two additional components: rust-src and rustc-dev. The toolchain is pinned to nightly-2026-04-03 via rust-toolchain.toml in the repository – it is installed automatically the first time you build inside the repo.
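
For reference, here is a minimal sketch of what such a pin looks like (the repository's actual rust-toolchain.toml may carry more fields; the channel and components below follow from the requirements above):

[toolchain]
channel = "nightly-2026-04-03"
components = ["rust-src", "rustc-dev"]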

If you need to install it manually:

# Install the pinned nightly toolchain
rustup toolchain install nightly-2026-04-03

# Add required components
rustup component add rust-src rustc-dev \
  --toolchain nightly-2026-04-03

# Confirm the toolchain is active
rustup show

ⓘ Why these components?
rustc-dev exposes the internal compiler APIs that the custom codegen backend plugs into. rust-src is required so the compiler can find and build the standard-library sources for the device target.

Step 03 of 09 · Install LLVM 21

Install LLVM 21 with NVPTX Backend

The cuda-oxide pipeline emits LLVM IR (.ll files) and runs it through an external llc binary to generate PTX. You need LLVM 21 or later with the NVPTX backend enabled.

# Ubuntu/Debian
sudo apt install llvm-21

# Verify the NVPTX backend is present
llc-21 --version | grep nvptx
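
To see roughly what the pipeline does externally, you can invoke llc by hand on an emitted .ll file (the file name and GPU target below are illustrative – cargo oxide runs this step for you):

# Manually lower LLVM IR to PTX, targeting an sm_90 GPU
llc-21 -march=nvptx64 -mcpu=sm_90 -o kernel.ptx kernel.ll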

The pipeline auto-detects llc-22 and llc-21 on your PATH, in that order. To pin a specific binary, set the environment variable:

# Pin to a specific llc binary
export CUDA_OXIDE_LLC=/usr/bin/llc-21

⚠ Common Failure
If nvptx does not appear in the output of llc-21 --version, your LLVM build was compiled without the NVPTX target. Install from the official LLVM apt repository rather than your distro's default packages, which may exclude the GPU backend.

Step 04 of 09 · Install Clang

Install Clang 21 for the cuda-bindings Crate

The cuda-bindings crate uses bindgen to generate FFI bindings to cuda.h at build time. bindgen requires libclang – and in particular, it requires Clang's resource headers (including stddef.h). A bare libclang1-* runtime package is not enough.

# Install the full clang-21 package (includes resource headers)
sudo apt install clang-21

# Alternatively, the -dev header package also works
sudo apt install libclang-common-21-dev
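
If several Clang versions are installed and bindgen locates the wrong libclang, you can pin it explicitly. LIBCLANG_PATH is standard bindgen (clang-sys) behavior rather than anything cuda-oxide adds, and the path below is illustrative for Ubuntu's layout:

# Point bindgen at a specific libclang
export LIBCLANG_PATH=/usr/lib/llvm-21/lib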

⚠ Missing Clang Headers
If you install only the runtime and not the headers, the host build will fail with a 'stddef.h' file not found error during bindgen. Run cargo oxide doctor in the next step to catch this before attempting a build.

Step 05 of 09 · Add cargo-oxide

Clone the Repo and Install cargo-oxide

cargo-oxide is the Cargo subcommand that drives the entire build pipeline – it provides cargo oxide build, cargo oxide run, cargo oxide debug, and cargo oxide pipeline.

Inside the repo (with examples to try):

git clone https://github.com/NVlabs/cuda-oxide
cd cuda-oxide

# cargo oxide works out of the box via a workspace alias
cargo oxide run vecadd

Without the repo (for your projects):

# Install globally from the git source
cargo install \
  --git https://github.com/NVlabs/cuda-oxide \
  cargo-oxide

# On first run, cargo-oxide fetches and builds the codegen backend

Then verify that all requirements are present with the built-in health check:

cargo oxide doctor

ⓘ What the doctor examines
It verifies your Rust toolchain (nightly, rust-src, rustc-dev), CUDA Toolkit, LLVM version and NVPTX support, Clang/libclang headers, and codegen backend binaries. Fix any red items before proceeding.

Step 06 of 09 · Run Your First Kernel

Build and Run the vecadd Example

The canonical first example is vecadd – a vector-addition kernel that adds two arrays of 1,024 f32 values on the GPU and verifies the result on the host.

# Build and run end-to-end
cargo oxide run vecadd

If everything is configured correctly, you will see:

✓ SUCCESS: All 1024 elements correct!

To see the complete compilation pipeline – from Rust MIR through each Pliron dialect down to PTX – use:

# Print the full Rust MIR → dialect-mir → mem2reg → dialect-llvm → LLVM IR → PTX trace
cargo oxide pipeline vecadd

Debugging with cuda-gdb:

cargo oxide debug vecadd --tui

ⓘ Output artifacts
A successful build produces two artifacts: target/debug/vecadd (the host binary) and target/debug/vecadd.ptx (the device code). The host loads the PTX file through the CUDA driver at runtime.
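
Both artifacts can be inspected directly; since PTX is plain text, you can read the generated device code (paths as in the callout above):

# Host binary and device PTX produced by the build
ls -l target/debug/vecadd target/debug/vecadd.ptx

# Peek at the generated PTX
head target/debug/vecadd.ptx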

Step 07 of 09 · Write the Kernel

Writing Your Own #[kernel] Function

A kernel is a function annotated with #[kernel]. It takes a DisjointSlice<T> for mutable output and &[T] for read-only input. Each thread obtains its unique hardware index with thread::index_1d().

use cuda_device::{kernel, thread, DisjointSlice};

// Tier 1 safety: race-free by construction, no `unsafe` needed.
// DisjointSlice::get_mut() only accepts a ThreadIndex —
// a hardware-derived opaque type guaranteeing unique writes per thread.
#[kernel]
pub fn scale(input: &[f32], factor: f32, mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()] * factor;
    }
}

ⓘ Tier 1 Safety – how it works
ThreadIndex is an opaque newtype around usize that can only be constructed from the hardware built-in registers (threadIdx, blockIdx, blockDim). Since each thread gets a unique value, and DisjointSlice::get_mut() only accepts a ThreadIndex, writes cannot alias by construction – no unsafe anywhere in the kernel.
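
As a second sketch built from the same API surface shown above (the #[kernel] attribute, thread::index_1d(), and DisjointSlice::get_mut()), a vecadd-style kernel with two read-only inputs would look like this:

use cuda_device::{kernel, thread, DisjointSlice};

// Same Tier 1 pattern: read-only inputs, one disjoint write per thread
#[kernel]
pub fn vecadd(a: &[f32], b: &[f32], mut out: DisjointSlice<f32>) {
    let idx = thread::index_1d();
    if let Some(elem) = out.get_mut(idx) {
        *elem = a[idx.get()] + b[idx.get()];
    }
}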

Step 08 of 09 · Launch from Host

Launching the Kernel from Host Code

Host and device code live in the same .rs file. The host side uses CudaContext, DeviceBuffer, and the cuda_launch! macro to manage GPU memory and kernel launches.

use cuda_core::{CudaContext, DeviceBuffer, LaunchConfig};
use cuda_host::{cuda_launch, load_kernel_module};

fn main() {
    // Initialize GPU context on device 0
    let ctx    = CudaContext::new(0).unwrap();
    let stream = ctx.default_stream();
    let module = load_kernel_module(&ctx, "scale_example").unwrap();

    // Upload input data to GPU memory
    let data: Vec<f32> = (0..1024).map(|i| i as f32).collect();
    let input  = DeviceBuffer::from_host(&stream, &data).unwrap();
    let mut output = DeviceBuffer::<f32>::zeroed(&stream, 1024).unwrap();

    // Dispatch the kernel — LaunchConfig auto-sizes blocks/grids
    cuda_launch! {
        kernel: scale,
        stream: stream,
        module: module,
        config: LaunchConfig::for_num_elems(1024),
        args: [slice(input), 2.5f32, slice_mut(output)]
    }.unwrap();

    // Download result back to host
    let result = output.to_host_vec(&stream).unwrap();
    assert!((result[1] - 2.5).abs() < 1e-5);
    println!("✓ Kernel ran successfully!");
}

ⓘ What cuda_launch! does
It marshals the argument list – slices, scalars, and captured closures – into PTX kernel parameters and dispatches the kernel on the given stream. No manual argument marshalling is required.

Step 09 of 09 · Next Steps

What to Check Out Next

You now have a working cuda-oxide setup. Here is what to explore next, roughly in order of complexity:

  • Generic kernels with monomorphization – try the generic example (cargo oxide run generic) to see how a generic fn scale is compiled into distinct PTX kernels per type.
  • Closures with captures – the host_closure example shows how a move |x: f32| x * factor closure is expanded and passed as PTX kernel parameters.
  • Async GPU execution – cuda_launch_async! returns a lazy DeviceOperation that resolves on .sync() or .await (see the sketch after this list). See the async_mlp and async_vecadd examples.
  • Shared memory and warp intrinsics – these require unsafe blocks with documented safety contracts. See Tier 2 in the safety-model documentation.
  • GEMM at speed-of-light – the gemm_sol example reaches 868 TFLOPS on B200 (58% of cuBLAS SoL) using cta_group::2, CLC, and a 4-stage pipeline.
  • Blackwell tensor cores – the tcgen05 example targets sm_100a with TMEM, MMA, and cta_group::2. Requires LLVM 21+.
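
For the async bullet above, here is a minimal sketch, assuming cuda_launch_async! mirrors cuda_launch!'s field syntax and that DeviceOperation exposes .sync() as described – check the async_vecadd example for the actual API:

// Hypothetical async variant of the launch from Step 08 – nothing runs yet
let op = cuda_launch_async! {
    kernel: scale,
    stream: stream,
    module: module,
    config: LaunchConfig::for_num_elems(1024),
    args: [slice(input), 2.5f32, slice_mut(output)]
};

// The lazy DeviceOperation resolves on .sync() (or .await in async code)
op.sync().unwrap();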

ⓘ Known Limitation in v0.1.0
index_2d(stride) is currently marked unsound – if threads in the same kernel use different stride values, two threads can obtain &mut T to the same element without any unsafe in sight. Until the fix lands (lifting the stride into a type parameter), bind the stride in a single let and reuse it at every call site.
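
A sketch of that workaround, assuming index_2d lives alongside the thread::index_1d() shown in Step 07 (the kernel name and body are illustrative):

#[kernel]
pub fn copy_2d(input: &[f32], mut out: DisjointSlice<f32>) {
    let stride = 1024;                   // bind the stride once
    let idx = thread::index_2d(stride);  // reuse that one binding in every index_2d call
    if let Some(elem) = out.get_mut(idx) {
        *elem = input[idx.get()];
    }
}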

Full documentation: nvlabs.github.io/cuda-oxide · Source: github.com/NVlabs/cuda-oxide
