
OxiCUDA 0.1.5 Released — Nine New GPU Deep-Learning Crates (GenAI, GNN, Mamba, ViT, Audio, Time-Series, Bayesian, Federated, NAS)

The pure-Rust NVIDIA CUDA Toolkit replacement adds nine new GPU deep-learning crates — generative diffusion, graph neural nets, Mamba SSMs, vision transformers, audio/speech, time-series, Bayesian DL, federated learning, and NAS — growing to ~320K lines across 37 crates with 9,568 passing tests. No CUDA SDK, no nvcc.

release oxicuda cuda gpu-computing pure-rust deep-learning diffusion graph-neural-networks mamba vision-transformer

The CUDA Toolkit, rewritten in safe Rust — and today it grows a whole deep-learning library, from diffusion models to graph nets to Mamba.

Today we released OxiCUDA 0.1.5 — nine new GPU deep-learning crates land in a single release, carrying generative diffusion, graph neural networks, Mamba state-space models, vision transformers, audio and speech ML, time-series forecasting, Bayesian deep learning, federated learning, and neural-architecture search into the same Pure Rust GPU stack.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. OxiCUDA remains a type-safe, memory-safe replacement for the entire NVIDIA CUDA Toolkit software stack (cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more), and the only thing it needs at runtime is the NVIDIA driver (libcuda.so / nvcuda.dll). The very PTX assembly that runs on the GPU is generated directly from Rust data structures. OxiCUDA compiles into a single static binary — or a WASM module — and the same code runs on Turing through Blackwell.
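To make "PTX generated directly from Rust data structures" concrete, here is a toy sketch of the idea: a kernel is described with plain Rust structs and enums, then rendered to PTX text. The names `Param`, `Instr`, `Kernel`, and `emit_ptx` are hypothetical and chosen for illustration only; they are not OxiCUDA's actual codegen types, and the register declaration is hard-coded to keep the example short.

```rust
use std::fmt::Write;

/// A kernel parameter: PTX type plus name (toy model, not OxiCUDA's types).
struct Param {
    ty: &'static str,
    name: &'static str,
}

/// A tiny instruction set, just enough for a trivial kernel body.
enum Instr {
    /// Load a kernel parameter into a register.
    LdParam { ty: &'static str, dst: &'static str, param: &'static str },
    /// Return from the kernel.
    Ret,
}

struct Kernel {
    name: &'static str,
    params: Vec<Param>,
    body: Vec<Instr>,
}

/// Render the kernel as PTX text targeting sm_75 (Turing).
fn emit_ptx(k: &Kernel) -> String {
    let mut out = String::new();
    out.push_str(".version 6.4\n.target sm_75\n.address_size 64\n\n");

    // Parameter list, one `.param` declaration per line.
    let params: Vec<String> = k
        .params
        .iter()
        .map(|p| format!("    .param .{} {}", p.ty, p.name))
        .collect();
    writeln!(out, ".visible .entry {}(\n{}\n)", k.name, params.join(",\n")).unwrap();

    out.push_str("{\n");
    // Hard-coded register file for the sketch: declares %rd0 and %rd1.
    out.push_str("    .reg .b64 %rd<2>;\n");
    for instr in &k.body {
        match instr {
            Instr::LdParam { ty, dst, param } => {
                writeln!(out, "    ld.param.{} {}, [{}];", ty, dst, param).unwrap()
            }
            Instr::Ret => out.push_str("    ret;\n"),
        }
    }
    out.push_str("}\n");
    out
}
```

Because the kernel is ordinary data, it can be built, inspected, and transformed with normal Rust code before any text is emitted, which is the property the paragraph above relies on.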

Why OxiCUDA 0.1.5 is a game changer

OxiCUDA started life as the low-level CUDA-Toolkit replacement: the driver wrappers, the PTX codegen, the autotuner, the BLAS and cuDNN-equivalent kernels, the scientific-computing suite. 0.1.5 keeps every bit of that foundation and expands the high-level deep-learning surface dramatically. The headline numbers tell the story: ~320K lines across 37 crates, with 9,568 passing tests.

The nine newcomers are standalone leaf crates: each one carries only thiserror as a dependency, so the deep-learning domains are 100% Pure Rust and lightweight to pull in on their own. Alongside the new surface, this release also lands a round of engineering-quality work that makes the whole workspace more dependable.

Technical Deep Dive: Nine New Deep-Learning Crates

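Each new crate is a focused leaf crate sitting on the existing foundation. Together they cover generative diffusion, graph neural networks, Mamba state-space models, vision transformers, audio and speech ML, time-series forecasting, Bayesian deep learning, federated learning, and neural-architecture search.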

All nine sit on top of the existing 10-Volume foundation — the PTX codegen and autotuner, the BLAS and DNN kernels, and the FFT / Sparse / Solver / Rand scientific suite — and the 7 GPU backends (the ComputeBackend trait, CUB-equivalent primitives, plus Metal, Vulkan, WebGPU, ROCm, and Level Zero) that keep the whole stack portable beyond NVIDIA hardware.
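The portability claim rests on writing kernels and layers against a backend abstraction rather than a specific device. The sketch below shows that pattern in miniature. The trait here is a hypothetical `Backend` with a single `axpy` method and a host-CPU implementation; the real `ComputeBackend` trait may expose a very different surface.

```rust
/// Hypothetical backend abstraction for illustration only.
/// This is not the definition of OxiCUDA's `ComputeBackend`.
trait Backend {
    /// Human-readable backend name.
    fn name(&self) -> &'static str;
    /// Computes y[i] = a * x[i] + y[i] for every element.
    fn axpy(&self, a: f32, x: &[f32], y: &mut [f32]);
}

/// Reference implementation that runs on the host CPU.
struct CpuBackend;

impl Backend for CpuBackend {
    fn name(&self) -> &'static str {
        "cpu"
    }

    fn axpy(&self, a: f32, x: &[f32], y: &mut [f32]) {
        assert_eq!(x.len(), y.len(), "axpy operands must have equal length");
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += a * xi;
        }
    }
}

/// Code written against the trait does not care which backend it receives.
fn scaled_sum(backend: &dyn Backend, a: f32, x: &[f32], y: &[f32]) -> Vec<f32> {
    let mut out = y.to_vec();
    backend.axpy(a, x, &mut out);
    out
}
```

A GPU backend would implement the same trait by launching a kernel instead of looping on the host, and `scaled_sum` would not need to change.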

Getting Started

Add the umbrella crate:

cargo add oxicuda

A minimal GEMM end to end looks like this:

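use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Square 32x32 matrices, so every buffer holds 32 * 32 = 1024 elements.
    let (m, n, k) = (32, 32, 32);
    // All operands are 32x32, so every leading dimension is 32
    // regardless of storage order.
    let (lda, ldb, ldc) = (32, 32, 32);

    // Host-side inputs: A is all ones, B is all twos.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![2.0f32; k * n];

    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Device buffers for A (m x k), B (k x n), and C (m x n).
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;

    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the kernel to finish before reading the result back.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}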
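When trying a GEMM like the one above for the first time, it helps to compare the device result against a plain host implementation. The following is a minimal reference sketch of C = alpha * A * B + beta * C with no transposes. It assumes column-major storage in the BLAS tradition, which is an assumption on our part rather than a documented property of the OxiCUDA API, and the function names `gemm_ref` and `max_abs_diff` are illustrative.

```rust
/// Reference GEMM on the host: C = alpha * A * B + beta * C (no transposes).
/// Assumes column-major storage with leading dimensions lda, ldb, ldc,
/// mirroring the classic BLAS convention.
#[allow(clippy::too_many_arguments)]
fn gemm_ref(
    m: usize, n: usize, k: usize,
    alpha: f32,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    beta: f32,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}

/// Largest absolute element-wise difference between two slices.
fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(a, b)| (a - b).abs()).fold(0.0, f32::max)
}
```

Running `gemm_ref` on the same host inputs and checking that `max_abs_diff` against the copied-back `result` stays within a small tolerance is a quick sanity check that the dimensions and leading dimensions were passed correctly.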

The driver, memory, and launch features are on by default; everything above them — blas, dnn, fft, sparse, solver, rand, ptx, autotune, the alternate backends, and more — is opt-in. The nine new deep-learning domains are available as standalone leaf crates behind their own feature flags, so you can reach for the oxicuda-gen DDIM scheduler, an oxicuda-gnn GCN layer, or the oxicuda-mamba block without dragging in the rest of the workspace.
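For readers who have not met a DDIM scheduler before, the sketch below shows the arithmetic of one deterministic DDIM update (eta = 0) on the host. It is a plain-Rust illustration of the general technique and does not use or represent the oxicuda-gen API; the function name `ddim_step` and its parameters are hypothetical. Here `abar_t` and `abar_prev` are the cumulative alpha products at the current and previous timestep, and `eps` is the noise predicted by the model.

```rust
/// One deterministic DDIM update (eta = 0) on a flat sample.
/// `abar_t` and `abar_prev` are cumulative alpha products in (0, 1].
/// Illustrative host-side arithmetic, not the oxicuda-gen API.
fn ddim_step(x_t: &[f32], eps: &[f32], abar_t: f32, abar_prev: f32) -> Vec<f32> {
    assert_eq!(x_t.len(), eps.len(), "sample and noise must match in length");
    let (sqrt_a_t, sqrt_1m_a_t) = (abar_t.sqrt(), (1.0 - abar_t).sqrt());
    let (sqrt_a_p, sqrt_1m_a_p) = (abar_prev.sqrt(), (1.0 - abar_prev).sqrt());

    x_t.iter()
        .zip(eps)
        .map(|(&x, &e)| {
            // Estimate the clean sample from the noisy one.
            let x0 = (x - sqrt_1m_a_t * e) / sqrt_a_t;
            // Re-noise the estimate to the previous timestep's level.
            sqrt_a_p * x0 + sqrt_1m_a_p * e
        })
        .collect()
}
```

If the predicted noise equals the true noise and `abar_prev` is 1.0, the step returns the clean sample exactly (up to rounding), which makes a convenient unit test for any scheduler implementation.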

What’s New in 0.1.5

Tips

Part of a sovereign GPU stack

OxiCUDA is the GPU compute layer beneath the rest of the COOLJAPAN ecosystem. Above it, SciRS2, OxiONNX, TrustformeRS, and ToRSh consume OxiCUDA directly as their GPU backend. Alongside it, OxiBLAS and OxiFFT serve as pure-Rust linear-algebra and FFT siblings, OxiLLaMa builds LLM inference on this foundation, OptiRS handles optimization and training, and OxiEML rounds out the applied-ML neighborhood. OxiRouter ships today as well, joining the same family. The whole stack rests on one runtime dependency — the NVIDIA driver — with no proprietary toolkit underneath.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a CUDA stack you can build with nothing but a Rust compiler — now with a full GPU deep-learning library on top — sounds like the future you want.

Pure Rust GPU deep learning is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, May 3, 2026
