OxiCUDA 0.1.0 Released — A Pure Rust Replacement for the NVIDIA CUDA Toolkit

The NVIDIA CUDA Toolkit, rebuilt in pure Rust — type-safe, memory-safe, and free of nvcc.

Today we released OxiCUDA 0.1.0 — a pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack, generating optimized PTX directly from Rust data structures with the NVIDIA driver as its only runtime dependency.

No CUDA SDK. No nvcc. No C/C++ toolchain. OxiCUDA loads the NVIDIA driver at runtime (libcuda.so on Linux, nvcuda.dll on Windows) through libloading, and that is the whole of its native footprint. Everything above the driver — the PTX it emits, the kernels it tunes, the BLAS and DNN routines it exposes — is written in safe Rust and compiles with cargo build alone. The result links into a single static binary (or targets WASM), and runs across the modern NVIDIA lineup from Turing through Blackwell.

Why OxiCUDA matters

The CUDA Toolkit is the foundation of modern GPU computing, and it carries the costs of its age. Its libraries are C and C++, with the memory-unsafety that implies. Building against it means dragging in nvcc, the full SDK, and a C/C++ toolchain at build time. It locks you to one vendor’s tooling, it ports poorly, and it is awkward to integrate cleanly with safe Rust — you end up wrapping FFI surfaces and trusting that nothing on the other side misbehaves.

OxiCUDA takes a different path. The kernels are described in a Rust PTX DSL and emitted as optimized PTX assembly; a built-in autotuner benchmarks kernel variants per GPU architecture to push throughput toward the peak of each device. Memory is owned and typed — DeviceBuffer<T> instead of raw pointers — so the borrow checker, not a runtime crash, catches your mistakes. And because the only thing it talks to is the driver, integration with the rest of your safe-Rust stack is direct.
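To make the "owned and typed" point concrete, here is a minimal host-only sketch. TypedBuffer and CopyError are hypothetical names invented for this illustration, and a Vec<T> stands in for device memory so the code runs without a GPU; it is not OxiCUDA's DeviceBuffer<T> implementation.

```rust
/// Hypothetical, host-only stand-in for a typed device buffer.
/// A `Vec<T>` plays the role of device memory so the sketch runs anywhere.
#[derive(Debug)]
pub struct TypedBuffer<T> {
    data: Vec<T>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CopyError {
    LengthMismatch { expected: usize, got: usize },
}

impl<T: Copy + Default> TypedBuffer<T> {
    /// Allocates `len` elements set to `T::default()` (zero for numeric types).
    pub fn zeroed(len: usize) -> Self {
        Self { data: vec![T::default(); len] }
    }

    /// Writing takes `&mut self`, so no reader can alias the buffer meanwhile.
    pub fn copy_from_host(&mut self, src: &[T]) -> Result<(), CopyError> {
        if src.len() != self.data.len() {
            return Err(CopyError::LengthMismatch {
                expected: self.data.len(),
                got: src.len(),
            });
        }
        self.data.copy_from_slice(src);
        Ok(())
    }

    /// Reading only needs `&self`; the element type is fixed by `T`.
    pub fn copy_to_host(&self, dst: &mut [T]) -> Result<(), CopyError> {
        if dst.len() != self.data.len() {
            return Err(CopyError::LengthMismatch {
                expected: self.data.len(),
                got: dst.len(),
            });
        }
        dst.copy_from_slice(&self.data);
        Ok(())
    }
}
```

A length mismatch becomes an Err value rather than an out-of-bounds write, and passing a &[f64] to a TypedBuffer<f32> does not compile at all.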

The 0.1.0 README states performance targets against the corresponding CUDA libraries (these are goals the kernels and autotuner aim for, not audited measurements):

SGEMM FP32: ≥95% of cuBLAS
HGEMM FP16: ≥95% of cuBLAS (Tensor Core WMMA/MMA)
Batched GEMM: ≥95% of cuBLAS (Stream-K)
Conv FP16: ≥90% of cuDNN (implicit GEMM + Winograd)
FlashAttention: ≥90% of FlashAttention-2
FFT (power-of-two): ≥90% of cuFFT
SpMV CSR: ≥85% of cuSPARSE
LU / QR / SVD: ≥85% of cuSOLVER
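
For readers who want to check a target like "≥95% of cuBLAS" on their own hardware, the arithmetic is simple. The sketch below uses the standard 2·m·n·k operation count for GEMM; the function names are made up for this example and are not part of OxiCUDA.

```rust
/// Floating-point operations in one GEMM: each of the m*n outputs
/// needs k multiplies and k adds.
pub fn gemm_flops(m: u64, n: u64, k: u64) -> f64 {
    2.0 * m as f64 * n as f64 * k as f64
}

/// Converts an operation count and a wall-clock time into GFLOP/s.
pub fn gflops_per_second(flops: f64, seconds: f64) -> f64 {
    flops / seconds / 1e9
}

/// True when `candidate` reaches at least `fraction` of `baseline` throughput.
pub fn meets_target(candidate: f64, baseline: f64, fraction: f64) -> bool {
    candidate >= fraction * baseline
}
```

Measure both libraries on the same shapes, data types, and device, then compare the two GFLOP/s figures with meets_target(candidate, baseline, 0.95).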

Technical Deep Dive: Rebuilding the CUDA Toolkit in Pure Rust

OxiCUDA is organized as 10 Volumes plus 7 backends — 28 crates in total, layered from the driver up to full training and inference stacks.

Foundation (Vol. 1, 4 crates). oxicuda-driver wraps the CUDA Driver API via dynamic loading — devices, contexts, streams, events, modules, a multi-GPU pool, and occupancy queries. oxicuda-memory provides the typed memory model: DeviceBuffer<T>, PinnedBuffer<T>, unified memory, an async pool, virtual memory, and 2D/3D and peer copies. oxicuda-launch carries Dim3, LaunchParams, the launch! macro, and cooperative, cluster (Hopper), and graph launches. oxicuda-runtime is the high-level cudaRT-style surface — streams, events, textures, surfaces.
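
Launch geometry is the part of this layer that is easiest to get wrong by one. The sketch below shows the usual ceiling-division rule for covering N elements with fixed-size blocks. The Dim3 type here is a hypothetical stand-in defined for the example; the real oxicuda-launch type may differ in fields and methods.

```rust
/// Hypothetical stand-in for a launch dimension triple.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Dim3 {
    pub x: u32,
    pub y: u32,
    pub z: u32,
}

impl Dim3 {
    pub const fn new(x: u32, y: u32, z: u32) -> Self {
        Self { x, y, z }
    }

    pub const fn linear(x: u32) -> Self {
        Self::new(x, 1, 1)
    }

    /// Total threads (or blocks) described by this triple.
    pub fn volume(self) -> u64 {
        u64::from(self.x)
            .saturating_mul(u64::from(self.y))
            .saturating_mul(u64::from(self.z))
    }
}

/// Smallest 1D grid whose blocks cover `elements` items.
/// Returns None when there is nothing to launch, the block is empty,
/// or the block count does not fit in a u32 (real hardware limits are tighter).
pub fn grid_1d(elements: u64, block: Dim3) -> Option<Dim3> {
    let threads = block.volume();
    if threads == 0 || elements == 0 {
        return None;
    }
    // Ceiling division without the overflow risk of (a + b - 1) / b.
    let blocks = elements / threads + u64::from(elements % threads != 0);
    u32::try_from(blocks).ok().map(Dim3::linear)
}
```

Because the last block is usually only partly full, a kernel launched this way still needs a bounds check on its global index.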

PTX Codegen + Autotuner (Vol. 2, 2 crates). oxicuda-ptx is a full PTX IR type system with a Rust DSL spanning SM 7.5–10.0, Tensor Core WMMA/MMA/WGMMA, kernel templates (GEMM, elementwise, reduction, softmax, scan, transpose, attention, batch-norm, MoE, conv), plus register-pressure analysis, dead-code elimination, constant folding, and strength reduction. The crate comes to 29,206 SLoC with 873 tests. oxicuda-autotune defines the search space, benchmarks on-GPU with statistics, drives Bayesian optimization / simulated annealing / genetic search, and dispatches through a 3-tier cached/tuned/default path backed by a disk PTX cache.
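
Two of the optimizations named above, constant folding and strength reduction, are easy to show on a toy expression tree. The Expr type and simplify function below are invented for this illustration and bear no relation to the actual oxicuda-ptx IR; arithmetic is assumed to be wrapping 32-bit unsigned.

```rust
/// Toy integer expression IR (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Const(u32),
    Reg(&'static str),
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    Shl(Box<Expr>, u32),
}

/// Bottom-up pass: folds constant subtrees and rewrites
/// multiplication by a power of two into a left shift.
pub fn simplify(e: Expr) -> Expr {
    match e {
        Expr::Add(a, b) => match (simplify(*a), simplify(*b)) {
            // Constant folding: both operands known at compile time.
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x.wrapping_add(y)),
            (a, b) => Expr::Add(Box::new(a), Box::new(b)),
        },
        Expr::Mul(a, b) => match (simplify(*a), simplify(*b)) {
            (Expr::Const(x), Expr::Const(y)) => Expr::Const(x.wrapping_mul(y)),
            // Strength reduction: x * 2^s == x << s under wrapping arithmetic.
            (other, Expr::Const(c)) if c.is_power_of_two() => {
                Expr::Shl(Box::new(other), c.trailing_zeros())
            }
            (Expr::Const(c), other) if c.is_power_of_two() => {
                Expr::Shl(Box::new(other), c.trailing_zeros())
            }
            (a, b) => Expr::Mul(Box::new(a), Box::new(b)),
        },
        Expr::Shl(a, s) => match simplify(*a) {
            Expr::Const(x) if s < 32 => Expr::Const(x << s),
            a => Expr::Shl(Box::new(a), s),
        },
        leaf => leaf,
    }
}
```

For example, tid * (3 + 5) first folds to tid * 8 and then reduces to tid << 3, while tid * 6 is left alone because 6 is not a power of two.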

BLAS (Vol. 3). oxicuda-blas is the cuBLAS equivalent — L1/L2/L3, GEMM (SIMT / Tensor Core / Split-K), batched variants, precisions F16/BF16/TF32/F32/F64/FP8, elementwise ops, reductions, and epilogue fusion.
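
Split-K is the least obvious of the GEMM strategies listed, so here is a CPU sketch of the idea: the K dimension is cut into slices, each slice produces a partial product, and the partials are summed. This is a plain reference written for this post, assuming row-major storage; it says nothing about how oxicuda-blas schedules the slices on a GPU.

```rust
/// Computes C = A * B for row-major A (m x k) and B (k x n),
/// splitting the K dimension into `splits` independent slices.
pub fn gemm_split_k(
    a: &[f32],
    b: &[f32],
    m: usize,
    n: usize,
    k: usize,
    splits: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must be m x k");
    assert_eq!(b.len(), k * n, "B must be k x n");
    let splits = splits.clamp(1, k.max(1));
    let chunk = (k + splits - 1) / splits;

    // Phase 1: each slice of K yields its own partial m x n product.
    // On a GPU these slices would run concurrently.
    let partials: Vec<Vec<f32>> = (0..splits)
        .map(|s| {
            let lo = (s * chunk).min(k);
            let hi = (lo + chunk).min(k);
            let mut part = vec![0.0f32; m * n];
            for i in 0..m {
                for p in lo..hi {
                    let a_ip = a[i * k + p];
                    for j in 0..n {
                        part[i * n + j] += a_ip * b[p * n + j];
                    }
                }
            }
            part
        })
        .collect();

    // Phase 2: reduce the partial products element-wise.
    let mut c = vec![0.0f32; m * n];
    for part in &partials {
        for (dst, src) in c.iter_mut().zip(part) {
            *dst += *src;
        }
    }
    c
}
```

The trade-off is extra workspace and a reduction pass in exchange for more parallelism when m and n are small but k is large.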

DNN (Vol. 4). oxicuda-dnn is the cuDNN equivalent: convolution (implicit GEMM, im2col, Winograd 3×3, direct, fused Conv+BN+Act), FlashAttention v2 forward and backward, PagedAttention, MoE, normalization (BN/LN/RMSNorm/GroupNorm), pooling, resize, and speculative decoding. The crate comes to 34,711 SLoC with 960 tests.
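
The trick that lets FlashAttention-style kernels avoid materializing the full score matrix is an online softmax: keys and values are consumed block by block while a running maximum, denominator, and weighted sum are rescaled as needed. The single-query CPU sketch below illustrates only that recurrence; it is written for this post and is not the oxicuda-dnn kernel. It assumes a non-empty query and the usual 1/sqrt(d) scaling.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum::<f32>()
}

/// Reference: softmax(q . K^T / sqrt(d)) . V computed in one pass over all scores.
pub fn attention_naive(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scale = 1.0 / (q.len() as f32).sqrt();
    let scores: Vec<f32> = keys.iter().map(|k| dot(q, k) * scale).collect();
    let max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let weights: Vec<f32> = scores.iter().map(|s| (*s - max).exp()).collect();
    let denom: f32 = weights.iter().sum();
    let mut out = vec![0.0f32; values.first().map_or(0, |v| v.len())];
    for (w, v) in weights.iter().zip(values) {
        for (o, vi) in out.iter_mut().zip(v) {
            *o += *w * *vi;
        }
    }
    out.iter().map(|o| *o / denom).collect()
}

/// Same result, but keys/values are visited in blocks of `block` rows
/// and only O(d_v) state is kept between blocks.
pub fn attention_streaming(
    q: &[f32],
    keys: &[Vec<f32>],
    values: &[Vec<f32>],
    block: usize,
) -> Vec<f32> {
    let block = block.max(1);
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; values.first().map_or(0, |v| v.len())];

    for (kb, vb) in keys.chunks(block).zip(values.chunks(block)) {
        let scores: Vec<f32> = kb.iter().map(|k| dot(q, k) * scale).collect();
        let block_max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(block_max);
        // Rescale what was accumulated under the old maximum.
        // On the first block this factor is exp(-inf) = 0.
        let rescale = (running_max - new_max).exp();
        denom *= rescale;
        for x in acc.iter_mut() {
            *x *= rescale;
        }
        for (s, v) in scores.iter().zip(vb) {
            let w = (*s - new_max).exp();
            denom += w;
            for (x, vi) in acc.iter_mut().zip(v) {
                *x += w * *vi;
            }
        }
        running_max = new_max;
    }
    acc.iter().map(|x| *x / denom).collect()
}
```

Both functions agree up to floating-point rounding for any block size, which is the property a tiled GPU kernel relies on.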

Scientific (Vol. 5, 4 crates). oxicuda-fft mirrors cuFFT (Stockham radix-2/4/8, mixed-radix, Bluestein, C2C/R2C/C2R, 1D/2D/3D). oxicuda-sparse mirrors cuSPARSE (CSR/CSC/COO/BSR/ELL/HYB/CSR5, SpMV/SpMM/SpGEMM/SDDMM, ILU(0)/IC(0), Krylov). oxicuda-solver mirrors cuSOLVER (dense LU/QR/SVD/Cholesky/eig, CG/BiCGSTAB/GMRES, matrix functions). oxicuda-rand mirrors cuRAND (Philox / MRG32k3a / XORWOW / Sobol, with uniform/normal/Poisson/exponential/gamma distributions).
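
As a reference point for the sparse formats listed, CSR stores one row-pointer array plus parallel column-index and value arrays, and SpMV walks each row's slice. The CPU sketch below shows that layout; CsrMatrix and spmv are names chosen for this example, not the oxicuda-sparse API.

```rust
/// Compressed Sparse Row matrix (illustrative layout).
/// Row r owns entries row_ptr[r]..row_ptr[r + 1] of col_idx and values.
pub struct CsrMatrix {
    pub rows: usize,
    pub row_ptr: Vec<usize>,
    pub col_idx: Vec<usize>,
    pub values: Vec<f32>,
}

/// Computes y = A * x, one dot product per row over the stored non-zeros.
pub fn spmv(a: &CsrMatrix, x: &[f32]) -> Vec<f32> {
    (0..a.rows)
        .map(|r| {
            let (start, end) = (a.row_ptr[r], a.row_ptr[r + 1]);
            a.col_idx[start..end]
                .iter()
                .zip(&a.values[start..end])
                .map(|(&c, &v)| v * x[c])
                .sum::<f32>()
        })
        .collect()
}
```

Each row is independent, which is why a GPU version can assign rows (or groups of rows) to separate threads or warps.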

Signal, Graph, Training, Inference, RL (Vols. 6–10). oxicuda-signal (audio MFCC/STFT/Mel, image filters, DCT, DWT, IIR/FIR), oxicuda-graph (CUDA Graph capture and dependency-sorted execution), oxicuda-train + oxicuda-quant (AMP, fused optimizers, INT8/INT4/FP8 quantization), oxicuda-infer + oxicuda-dist-infer + oxicuda-lm (paged KV cache, tensor/pipeline parallelism, tokenizer + sampling), and oxicuda-rl (replay buffers, policy distributions, PPO/DQN/SAC/TD3 losses).
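
"Dependency-sorted execution" for a computation graph usually means a topological order. The sketch below uses Kahn's algorithm on plain node indices; it is a generic illustration, and whether oxicuda-graph uses this particular algorithm is an assumption on our part.

```rust
use std::collections::VecDeque;

/// Returns an execution order in which every node runs after its dependencies.
/// Each edge (from, to) means `to` depends on `from`.
/// Returns None if an edge is out of range or the graph contains a cycle.
pub fn execution_order(node_count: usize, edges: &[(usize, usize)]) -> Option<Vec<usize>> {
    let mut successors: Vec<Vec<usize>> = vec![Vec::new(); node_count];
    let mut pending = vec![0usize; node_count];
    for &(from, to) in edges {
        if from >= node_count || to >= node_count {
            return None;
        }
        successors[from].push(to);
        pending[to] += 1;
    }

    // Start with every node that has no unmet dependency.
    let mut ready: VecDeque<usize> =
        (0..node_count).filter(|&n| pending[n] == 0).collect();
    let mut order = Vec::with_capacity(node_count);

    while let Some(node) = ready.pop_front() {
        order.push(node);
        for &next in &successors[node] {
            pending[next] -= 1;
            if pending[next] == 0 {
                ready.push_back(next);
            }
        }
    }

    // Nodes left unscheduled can only be part of a cycle.
    (order.len() == node_count).then_some(order)
}
```

Nodes that become ready at the same time have no ordering constraint between them, so a scheduler is free to run them concurrently.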

Backends (7 crates). A ComputeBackend trait (oxicuda-backend), CUB-equivalent primitives (oxicuda-primitives), and portability targets for Metal, Vulkan Compute, WebGPU, AMD ROCm/HIP, and Intel Level Zero.

Tying it together, the umbrella oxicuda crate re-exports every subcrate and exposes the ComputeBackend and CudaBackend entry points along with global init and the device pool.
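
To give a feel for what a backend abstraction buys you, here is a deliberately small sketch: one trait, two implementations, and a helper that falls back when a backend is unavailable. Every name below (Backend, CpuReference, OfflineGpu, axpy_with_fallback) is hypothetical; the article does not spell out the real ComputeBackend trait's methods, and this is not them.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum BackendError {
    ShapeMismatch,
    Unavailable,
}

/// Hypothetical minimal backend interface with a single operation.
pub trait Backend {
    fn name(&self) -> &'static str;
    /// y <- alpha * x + y
    fn axpy(&self, alpha: f32, x: &[f32], y: &mut [f32]) -> Result<(), BackendError>;
}

/// Plain CPU implementation, always available.
pub struct CpuReference;

impl Backend for CpuReference {
    fn name(&self) -> &'static str {
        "cpu-reference"
    }

    fn axpy(&self, alpha: f32, x: &[f32], y: &mut [f32]) -> Result<(), BackendError> {
        if x.len() != y.len() {
            return Err(BackendError::ShapeMismatch);
        }
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += alpha * *xi;
        }
        Ok(())
    }
}

/// Stand-in for a GPU backend whose device or driver is missing.
pub struct OfflineGpu;

impl Backend for OfflineGpu {
    fn name(&self) -> &'static str {
        "offline-gpu"
    }

    fn axpy(&self, _alpha: f32, _x: &[f32], _y: &mut [f32]) -> Result<(), BackendError> {
        Err(BackendError::Unavailable)
    }
}

/// Tries backends in order and reports which one did the work.
pub fn axpy_with_fallback(
    backends: &[&dyn Backend],
    alpha: f32,
    x: &[f32],
    y: &mut [f32],
) -> Result<&'static str, BackendError> {
    for backend in backends {
        match backend.axpy(alpha, x, y) {
            Ok(()) => return Ok(backend.name()),
            Err(BackendError::Unavailable) => continue,
            Err(other) => return Err(other),
        }
    }
    Err(BackendError::Unavailable)
}
```

Calling code depends only on the trait, so swapping CUDA for another target is a change in which implementation is passed in, not a rewrite of the caller.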

Getting Started

Add the umbrella crate:

cargo add oxicuda

A minimal SGEMM, end to end. BLAS is an opt-in feature, so enable it first with cargo add oxicuda --features blas:

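use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Square 32x32 problem: every matrix holds 1024 elements and every
    // leading dimension is 32 regardless of storage order.
    // (Integer types are left to inference; adjust if the API expects others.)
    let (m, n, k) = (32, 32, 32);
    let (lda, ldb, ldc) = (32, 32, 32);

    // Host inputs: A is all ones, B is all twos.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![2.0f32; k * n];

    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;

    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the GEMM to finish before reading C back.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    // With these inputs every entry of C should equal 64.0 (k * 1.0 * 2.0).
    Ok(())
}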

Default features are driver, memory, and launch — the core needed to move data and launch work. The larger subsystems (blas, dnn, fft, sparse, solver, rand, and the rest) are opt-in feature flags, so you compile only what you use.

What’s inside

This is the first release, so here is the whole shape of it in plain terms:

10 Volumes: Foundation, PTX codegen + autotuner, BLAS, DNN, Scientific (FFT / Sparse / Solver / Rand), Signal, Computation Graph, Training, Inference, and Reinforcement Learning.
7 backends: a ComputeBackend abstraction, CUB-equivalent primitives, and portability layers for Metal, Vulkan, WebGPU, ROCm, and Level Zero.
238,672 SLoC across 28 crates (roughly 239K lines of safe Rust), with 7,026 passing tests.
Feature-flag design: driver / memory / launch on by default; everything heavier is opt-in.
GPU architectures: Turing (SM 7.5, INT8 Tensor Cores) through Ampere (TF32 / FP64 TC), Ada (FP8 TC), Hopper (WGMMA / TMA / FP8 / DPX), and Blackwell (FP4, 5th-gen Tensor Cores).
Platforms: full support on Linux x86_64 and Windows x86_64; macOS is compile-only, with GPU calls returning UnsupportedPlatform at runtime.
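
The platform split above can be mirrored in your own code with ordinary cfg checks. The sketch below is a standalone illustration: GpuError is a hypothetical enum whose variant borrows the UnsupportedPlatform name, and the library file names are the ones quoted earlier in this post. It does not call into OxiCUDA.

```rust
/// Hypothetical error type for this example only.
#[derive(Debug, PartialEq, Eq)]
pub enum GpuError {
    UnsupportedPlatform,
}

/// Name of the NVIDIA driver library to load on this OS,
/// or UnsupportedPlatform where no driver is expected (for example macOS).
pub fn driver_library_name() -> Result<&'static str, GpuError> {
    if cfg!(target_os = "linux") {
        Ok("libcuda.so")
    } else if cfg!(target_os = "windows") {
        Ok("nvcuda.dll")
    } else {
        Err(GpuError::UnsupportedPlatform)
    }
}

/// Convenience check for test gating: true only where GPU calls can work.
pub fn gpu_tests_enabled() -> bool {
    driver_library_name().is_ok()
}
```

A GPU test can start with an early return when gpu_tests_enabled() is false, so the same suite passes on a macOS laptop and exercises the device on Linux or Windows CI runners.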

Tips

Enable only what you need. Pull in subsystems explicitly in Cargo.toml to keep build times and binary size down:
```
oxicuda = { version = "0.1.0", features = ["blas", "dnn"] }
```
Turn on the autotuner for per-GPU tuned kernels. The autotune feature benchmarks variants per architecture and persists results in a disk PTX cache, so the cost is paid once and reused.
Build with zero CUDA SDK. There is no nvcc, no SDK, and no C/C++ toolchain to install — cargo build is the entire build step. Only the NVIDIA driver needs to be present at runtime.
On macOS, expect UnsupportedPlatform. The crates compile there so you can develop and run CI, but GPU calls return UnsupportedPlatform at runtime — plan your test gating accordingly.
Use the full feature when you want everything at once and don’t want to manage the flag list by hand.
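
On the autotuning tip: the cached/tuned/default dispatch path mentioned in the deep dive can be pictured as a three-step lookup. The sketch below is our reading of that phrase, with in-memory maps standing in for the disk cache; ProblemKey, TileConfig, Tier, and Dispatcher are hypothetical names, and the meaning assigned to each tier is an assumption rather than documented behavior.

```rust
use std::collections::HashMap;

/// Identifies one GEMM problem on one architecture (hypothetical key).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct ProblemKey {
    pub sm: u32,
    pub m: usize,
    pub n: usize,
    pub k: usize,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct TileConfig {
    pub tile_m: u32,
    pub tile_n: u32,
    pub tile_k: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Cached,
    Tuned,
    Default,
}

pub struct Dispatcher {
    /// Tier 1: results for exact problems seen before.
    cached: HashMap<ProblemKey, TileConfig>,
    /// Tier 2: best known configuration per SM version.
    tuned: HashMap<u32, TileConfig>,
    /// Tier 3: conservative configuration that works everywhere.
    fallback: TileConfig,
}

impl Dispatcher {
    pub fn new(fallback: TileConfig) -> Self {
        Self { cached: HashMap::new(), tuned: HashMap::new(), fallback }
    }

    pub fn record_exact(&mut self, key: ProblemKey, cfg: TileConfig) {
        self.cached.insert(key, cfg);
    }

    pub fn record_arch(&mut self, sm: u32, cfg: TileConfig) {
        self.tuned.insert(sm, cfg);
    }

    /// Picks the most specific configuration available and says where it came from.
    pub fn select(&self, key: &ProblemKey) -> (Tier, TileConfig) {
        if let Some(cfg) = self.cached.get(key) {
            return (Tier::Cached, *cfg);
        }
        if let Some(cfg) = self.tuned.get(&key.sm) {
            return (Tier::Tuned, *cfg);
        }
        (Tier::Default, self.fallback)
    }
}
```

The point of the layering is that a cold start never blocks on benchmarking: the default tier always answers, and better answers appear as tuning results accumulate.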

The foundation of a sovereign GPU stack

OxiCUDA is the GPU layer of the COOLJAPAN ecosystem. Its architecture diagram places SciRS2, OxiONNX, TrustformeRS, and ToRSh directly above it as consumers — scientific computing, ONNX inference, transformers, and tensors that run on top of the GPU layer — while OxiCUDA itself sits on libcuda.so, the NVIDIA driver, at runtime only. Alongside it stand siblings like OxiBLAS and OxiFFT for linear algebra and FFT. The goal is a GPU stack that the whole ecosystem can stand on without reaching for the vendor SDK.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a pure-Rust GPU stack is something you’d build on. Stars are the clearest signal we have for deciding what to prioritize.

Pure Rust GPU computing is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ April 13, 2026