OxiCUDA 0.1.0 Released — A Pure Rust Replacement for the NVIDIA CUDA Toolkit

OxiCUDA 0.1.0 is a pure-Rust, type-safe, memory-safe replacement for the entire NVIDIA CUDA Toolkit software stack (cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND, and more) in ~239K lines across 28 crates. The only runtime dependency is the NVIDIA driver. It provides PTX code generation and a built-in autotuner, all from safe Rust.

The NVIDIA CUDA Toolkit, rebuilt in pure Rust — type-safe, memory-safe, and free of nvcc.

Today we released OxiCUDA 0.1.0 — a pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack, generating optimized PTX directly from Rust data structures with the NVIDIA driver as its only runtime dependency.

No CUDA SDK. No nvcc. No C/C++ toolchain. OxiCUDA loads the NVIDIA driver at runtime (libcuda.so on Linux, nvcuda.dll on Windows) through libloading, and that is the whole of its native footprint. Everything above the driver — the PTX it emits, the kernels it tunes, the BLAS and DNN routines it exposes — is written in safe Rust and compiles with cargo build alone. The result links into a single static binary (or targets WASM), and runs across the modern NVIDIA lineup from Turing through Blackwell.

Why OxiCUDA matters

The CUDA Toolkit is the foundation of modern GPU computing, and it carries the costs of its age. Its libraries are C and C++, with the memory-unsafety that implies. Building against it means dragging in nvcc, the full SDK, and a C/C++ toolchain at build time. It locks you to one vendor’s tooling, it ports poorly, and it is awkward to integrate cleanly with safe Rust — you end up wrapping FFI surfaces and trusting that nothing on the other side misbehaves.

OxiCUDA takes a different path. The kernels are described in a Rust PTX DSL and emitted as optimized PTX assembly; a built-in autotuner benchmarks kernel variants per GPU architecture to push throughput toward the peak of each device. Memory is owned and typed — DeviceBuffer<T> instead of raw pointers — so the borrow checker, not a runtime crash, catches your mistakes. And because the only thing it talks to is the driver, integration with the rest of your safe-Rust stack is direct.
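To make the "owned and typed" point concrete, here is a minimal host-only sketch. TypedBuffer<T> and BufferError are hypothetical stand-ins backed by a Vec<T>; they are not OxiCUDA's DeviceBuffer<T>, and only the method names zeroed, copy_from_host, and copy_to_host are borrowed from the example later in this post. The sketch assumes writes take &mut self and reads take &self, which is what lets the borrow checker reject overlapping access at compile time.

```rust
/// Hypothetical host-side stand-in for a typed device buffer.
/// Backed by a Vec<T>; this is NOT OxiCUDA's DeviceBuffer<T>.
#[derive(Debug)]
pub struct TypedBuffer<T> {
    data: Vec<T>,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
pub enum BufferError {
    LengthMismatch { expected: usize, got: usize },
}

impl<T: Copy + Default> TypedBuffer<T> {
    /// Allocate `len` default-initialized elements (zero for numeric types).
    pub fn zeroed(len: usize) -> Self {
        Self { data: vec![T::default(); len] }
    }

    pub fn len(&self) -> usize {
        self.data.len()
    }

    /// Writing requires `&mut self`, so no other borrow of the buffer can
    /// be alive while the copy runs.
    pub fn copy_from_host(&mut self, src: &[T]) -> Result<(), BufferError> {
        if src.len() != self.data.len() {
            return Err(BufferError::LengthMismatch {
                expected: self.data.len(),
                got: src.len(),
            });
        }
        self.data.copy_from_slice(src);
        Ok(())
    }

    /// Reading needs only `&self`; the element type is checked statically.
    pub fn copy_to_host(&self, dst: &mut [T]) -> Result<(), BufferError> {
        if dst.len() != self.data.len() {
            return Err(BufferError::LengthMismatch {
                expected: self.data.len(),
                got: dst.len(),
            });
        }
        dst.copy_from_slice(&self.data);
        Ok(())
    }
}
```

Passing a &[f64] to a TypedBuffer<f32> fails to compile, and a size mismatch surfaces as an Err value rather than an out-of-bounds write.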

The 0.1.0 README states performance targets against the corresponding CUDA libraries. These are goals the kernels and autotuner aim for, not audited measurements.

Technical Deep Dive: Rebuilding the CUDA Toolkit in Pure Rust

OxiCUDA is organized as 10 Volumes plus 7 backends — 28 crates in total, layered from the driver up to full training and inference stacks.

Foundation (Vol. 1, 4 crates). oxicuda-driver wraps the CUDA Driver API via dynamic loading — devices, contexts, streams, events, modules, a multi-GPU pool, and occupancy queries. oxicuda-memory provides the typed memory model: DeviceBuffer<T>, PinnedBuffer<T>, unified memory, an async pool, virtual memory, and 2D/3D and peer copies. oxicuda-launch carries Dim3, LaunchParams, the launch! macro, and cooperative, cluster (Hopper), and graph launches. oxicuda-runtime is the high-level cudaRT-style surface — streams, events, textures, surfaces.

PTX Codegen + Autotuner (Vol. 2, 2 crates). oxicuda-ptx is a full PTX IR type system with a Rust DSL spanning SM 7.5–10.0, Tensor Core WMMA/MMA/WGMMA, kernel templates (GEMM, elementwise, reduction, softmax, scan, transpose, attention, batch-norm, MoE, conv), plus register-pressure analysis, dead-code elimination, constant folding, and strength reduction: 29,206 SLoC with 873 tests. oxicuda-autotune defines the search space, benchmarks on-GPU with statistics, drives Bayesian optimization / simulated annealing / genetic search, and dispatches through a 3-tier cached/tuned/default path backed by a disk PTX cache.
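The idea of emitting PTX from Rust data structures, with a constant-folding pass in between, can be sketched in a few lines. Everything here (Operand, Inst, fold_constants, emit) is a hypothetical illustration and not the oxicuda-ptx API; it covers only three f32 instructions and prints instruction lines, not a complete PTX module.

```rust
/// Hypothetical mini-IR; not the oxicuda-ptx types.
#[derive(Debug, Clone, PartialEq)]
pub enum Operand {
    /// Virtual f32 register, printed as %f<N>.
    Reg(u32),
    /// f32 immediate, printed in PTX's hexadecimal form 0fXXXXXXXX.
    Imm(f32),
}

#[derive(Debug, Clone, PartialEq)]
pub enum Inst {
    Add { dst: u32, a: Operand, b: Operand },
    Mul { dst: u32, a: Operand, b: Operand },
    Mov { dst: u32, src: Operand },
}

fn fmt_operand(op: &Operand) -> String {
    match op {
        Operand::Reg(n) => format!("%f{}", n),
        Operand::Imm(v) => format!("0f{:08X}", v.to_bits()),
    }
}

/// Constant folding: an add or mul of two immediates becomes a mov of the
/// precomputed value. All other instructions pass through unchanged.
pub fn fold_constants(insts: &[Inst]) -> Vec<Inst> {
    insts
        .iter()
        .map(|inst| match inst {
            Inst::Add { dst, a: Operand::Imm(x), b: Operand::Imm(y) } => {
                Inst::Mov { dst: *dst, src: Operand::Imm(x + y) }
            }
            Inst::Mul { dst, a: Operand::Imm(x), b: Operand::Imm(y) } => {
                Inst::Mov { dst: *dst, src: Operand::Imm(x * y) }
            }
            other => other.clone(),
        })
        .collect()
}

/// Print one PTX instruction per line.
pub fn emit(insts: &[Inst]) -> String {
    let mut out = String::new();
    for inst in insts {
        let line = match inst {
            Inst::Add { dst, a, b } => {
                format!("add.f32 %f{}, {}, {};", dst, fmt_operand(a), fmt_operand(b))
            }
            Inst::Mul { dst, a, b } => {
                format!("mul.f32 %f{}, {}, {};", dst, fmt_operand(a), fmt_operand(b))
            }
            Inst::Mov { dst, src } => {
                format!("mov.f32 %f{}, {};", dst, fmt_operand(src))
            }
        };
        out.push_str(&line);
        out.push('\n');
    }
    out
}
```

Because the program is ordinary Rust data, optimization passes are plain functions over a slice, and the emitter is the only place that knows PTX syntax.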

BLAS (Vol. 3). oxicuda-blas is the cuBLAS equivalent — L1/L2/L3, GEMM (SIMT / Tensor Core / Split-K), batched variants, precisions F16/BF16/TF32/F32/F64/FP8, elementwise ops, reductions, and epilogue fusion.
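For readers unfamiliar with Split-K, the following CPU reference sketches the idea: the shared K dimension is cut into slices, each slice produces a partial M×N result, and the partials are summed. The function name gemm_split_k and the row-major layout are assumptions made for this illustration; this is not oxicuda-blas code and says nothing about how its kernels are scheduled.

```rust
/// CPU reference for Split-K GEMM: C = A * B, with A (m x k) and B (k x n)
/// stored row-major. Hypothetical helper, for illustration only.
pub fn gemm_split_k(
    a: &[f32],
    b: &[f32],
    m: usize,
    n: usize,
    k: usize,
    splits: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert!(splits > 0);

    // Width of each K slice, rounded up so the slices cover all of K.
    let chunk = (k + splits - 1) / splits;

    // Phase 1: each slice computes its own partial m x n product.
    // On a GPU these slices would run as independent thread blocks.
    let mut partials: Vec<Vec<f32>> = Vec::with_capacity(splits);
    for s in 0..splits {
        let k0 = (s * chunk).min(k);
        let k1 = ((s + 1) * chunk).min(k);
        let mut part = vec![0.0f32; m * n];
        for i in 0..m {
            for p in k0..k1 {
                let a_ip = a[i * k + p];
                for j in 0..n {
                    part[i * n + j] += a_ip * b[p * n + j];
                }
            }
        }
        partials.push(part);
    }

    // Phase 2: reduce the partial products element-wise into C.
    let mut c = vec![0.0f32; m * n];
    for part in &partials {
        for (dst, src) in c.iter_mut().zip(part) {
            *dst += *src;
        }
    }
    c
}
```

Split-K trades an extra reduction for more parallelism, which is why it tends to help when M and N are small and K is large.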

DNN (Vol. 4). oxicuda-dnn is the cuDNN equivalent: convolution (implicit GEMM, im2col, Winograd 3×3, direct, fused Conv+BN+Act), FlashAttention v2 forward and backward, PagedAttention, MoE, normalization (BN/LN/RMSNorm/GroupNorm), pooling, resize, and speculative decoding, in 34,711 SLoC with 960 tests.
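FlashAttention-style kernels rely on an online softmax: scores are visited block by block while a running maximum and a rescaled running sum are maintained, so the full score row never has to be materialized at once. The sketch below shows that rescaling step on the CPU. The function online_softmax is a hypothetical illustration that assumes finite scores; it is not taken from oxicuda-dnn.

```rust
/// Blockwise (online) softmax over one row of finite scores.
/// Hypothetical CPU illustration of the running-max / running-sum update.
pub fn online_softmax(scores: &[f32], block: usize) -> Vec<f32> {
    assert!(block > 0);

    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;

    for chunk in scores.chunks(block) {
        let chunk_max = chunk.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(chunk_max);

        // Rescale the sum accumulated so far to the new maximum, then add
        // this block's contribution measured against the same maximum.
        running_sum = running_sum * (running_max - new_max).exp()
            + chunk.iter().map(|&s| (s - new_max).exp()).sum::<f32>();
        running_max = new_max;
    }

    // Final normalization uses the global max and sum found above.
    scores
        .iter()
        .map(|&s| (s - running_max).exp() / running_sum)
        .collect()
}
```

The result matches an ordinary two-pass softmax up to floating-point rounding, whatever block size is chosen.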

Scientific (Vol. 5, 4 crates). oxicuda-fft mirrors cuFFT (Stockham radix-2/4/8, mixed-radix, Bluestein, C2C/R2C/C2R, 1D/2D/3D). oxicuda-sparse mirrors cuSPARSE (CSR/CSC/COO/BSR/ELL/HYB/CSR5, SpMV/SpMM/SpGEMM/SDDMM, ILU(0)/IC(0), Krylov). oxicuda-solver mirrors cuSOLVER (dense LU/QR/SVD/Cholesky/eig, CG/BiCGSTAB/GMRES, matrix functions). oxicuda-rand mirrors cuRAND (Philox / MRG32k3a / XORWOW / Sobol, with uniform/normal/Poisson/exponential/gamma distributions).
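As a reminder of what the CSR format and SpMV mean in practice, here is a small CPU reference. CsrMatrix and spmv are hypothetical names for this sketch and are not the oxicuda-sparse API; the layout (row_ptr, col_idx, values) is the standard CSR convention.

```rust
/// Compressed Sparse Row matrix. Hypothetical type for illustration.
pub struct CsrMatrix {
    pub rows: usize,
    /// row_ptr[r]..row_ptr[r + 1] indexes the nonzeros of row r.
    pub row_ptr: Vec<usize>,
    /// Column index of each stored nonzero.
    pub col_idx: Vec<usize>,
    /// Value of each stored nonzero.
    pub values: Vec<f32>,
}

/// Sparse matrix-vector product y = A * x, one row at a time.
pub fn spmv(a: &CsrMatrix, x: &[f32]) -> Vec<f32> {
    assert_eq!(a.row_ptr.len(), a.rows + 1);
    assert_eq!(a.col_idx.len(), a.values.len());

    let mut y = vec![0.0f32; a.rows];
    for row in 0..a.rows {
        let (start, end) = (a.row_ptr[row], a.row_ptr[row + 1]);
        let mut acc = 0.0f32;
        for idx in start..end {
            acc += a.values[idx] * x[a.col_idx[idx]];
        }
        y[row] = acc;
    }
    y
}
```

Each output row depends only on its own slice of the arrays, which is what makes the row loop a natural unit of GPU parallelism.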

Signal, Graph, Training, Inference, RL (Vols. 6–10). oxicuda-signal (audio MFCC/STFT/Mel, image filters, DCT, DWT, IIR/FIR), oxicuda-graph (CUDA Graph capture and dependency-sorted execution), oxicuda-train + oxicuda-quant (AMP, fused optimizers, INT8/INT4/FP8 quantization), oxicuda-infer + oxicuda-dist-infer + oxicuda-lm (paged KV cache, tensor/pipeline parallelism, tokenizer + sampling), and oxicuda-rl (replay buffers, policy distributions, PPO/DQN/SAC/TD3 losses).

Backends (7 crates). A ComputeBackend trait (oxicuda-backend), CUB-equivalent primitives (oxicuda-primitives), and portability targets for Metal, Vulkan Compute, WebGPU, AMD ROCm/HIP, and Intel Level Zero.
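The value of a backend trait is that calling code is written once and the target is chosen at the call site. The sketch below illustrates that pattern only. The trait Backend, its methods, and CpuReference are hypothetical; the real ComputeBackend trait in oxicuda-backend may look quite different.

```rust
/// Hypothetical backend abstraction; NOT the oxicuda-backend trait.
pub trait Backend {
    fn name(&self) -> &'static str;
    /// y <- alpha * x + y
    fn axpy(&self, alpha: f32, x: &[f32], y: &mut [f32]);
}

/// Plain CPU implementation used as a reference target.
pub struct CpuReference;

impl Backend for CpuReference {
    fn name(&self) -> &'static str {
        "cpu-reference"
    }

    fn axpy(&self, alpha: f32, x: &[f32], y: &mut [f32]) {
        assert_eq!(x.len(), y.len());
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += alpha * *xi;
        }
    }
}

/// Caller code depends only on the trait, so swapping the backend does not
/// change this function.
pub fn scaled_add(backend: &dyn Backend, alpha: f32, x: &[f32], y: &mut [f32]) {
    backend.axpy(alpha, x, y);
}
```

A GPU-backed type would implement the same trait, and scaled_add would run on it unchanged.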

Tying it together, the umbrella oxicuda crate re-exports every subcrate and exposes the ComputeBackend and CudaBackend entry points along with global init and the device pool.

Getting Started

Add the umbrella crate:

cargo add oxicuda

A minimal SGEMM, end to end:

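use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Square 32 x 32 problem: each matrix holds 32 * 32 = 1024 elements and
    // every leading dimension is 32, whichever storage order is used.
    let (m, n, k) = (32, 32, 32);
    let (lda, ldb, ldc) = (32, 32, 32);

    // Host inputs, filled with placeholder values.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![2.0f32; k * n];

    // Typed device allocations sized to match the matrices.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;

    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the GEMM to finish before reading the result back.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}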

Default features are driver, memory, and launch — the core needed to move data and launch work. The larger subsystems (blas, dnn, fft, sparse, solver, rand, and the rest) are opt-in feature flags, so you compile only what you use.
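As a sketch of what opting in looks like, a Cargo.toml entry that keeps the defaults and adds two subsystems might read as follows. The feature names blas and dnn come from the list above; the choice of those two is arbitrary.

```toml
[dependencies]
# Default features (driver, memory, launch) stay enabled; blas and dnn are added.
oxicuda = { version = "0.1.0", features = ["blas", "dnn"] }
```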

What’s inside

This is the first release, so here is the whole shape of it in plain terms:

The foundation of a sovereign GPU stack

OxiCUDA is the GPU layer of the COOLJAPAN ecosystem. Its architecture diagram places SciRS2, OxiONNX, TrustformeRS, and ToRSh directly above it as consumers — scientific computing, ONNX inference, transformers, and tensors that run on top of the GPU layer — while OxiCUDA itself sits on libcuda.so, the NVIDIA driver, at runtime only. Alongside it stand siblings like OxiBLAS and OxiFFT for linear algebra and FFT. The goal is a GPU stack that the whole ecosystem can stand on without reaching for the vendor SDK.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a pure-Rust GPU stack is something you’d build on; it is the clearest signal we have for deciding what to prioritize.

Pure Rust GPU computing is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, April 13, 2026
