OxiCUDA 0.1.5 Released — Nine New GPU Deep-Learning Crates (GenAI, GNN, Mamba, ViT, Audio, Time-Series, Bayesian, Federated, NAS)

The CUDA Toolkit, rewritten in safe Rust — and today it grows a whole deep-learning library, from diffusion models to graph nets to Mamba.

Today we released OxiCUDA 0.1.5 — nine new GPU deep-learning crates land in a single release, carrying generative diffusion, graph neural networks, Mamba state-space models, vision transformers, audio and speech ML, time-series forecasting, Bayesian deep learning, federated learning, and neural-architecture search into the same Pure Rust GPU stack.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. OxiCUDA remains a type-safe, memory-safe replacement for the entire NVIDIA CUDA Toolkit software stack (cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more), and the only thing it needs at runtime is the NVIDIA driver (libcuda.so / nvcuda.dll). The very PTX assembly that runs on the GPU is generated directly from Rust data structures. OxiCUDA compiles into a single static binary — or a WASM module — and the same code runs on Turing through Blackwell.

Why OxiCUDA 0.1.5 is a game changer

OxiCUDA started life as the low-level CUDA-Toolkit replacement: the driver wrappers, the PTX codegen, the autotuner, the BLAS and cuDNN-equivalent kernels, the scientific-computing suite. 0.1.5 keeps every bit of that foundation and expands the high-level deep-learning surface dramatically. The headline numbers tell the story:

9 new crates join the workspace, taking it from 28 to 37 crates.
~320K lines of safe Rust now make up the codebase, up from ~260K at 0.1.4.
9,568 passing tests across the workspace (up from roughly 9,000), with 2 GPU-gated tests skipped on macOS where there is no NVIDIA device to run them.

The nine newcomers are standalone leaf crates: each one carries only thiserror as a dependency, so the deep-learning domains are 100% Pure Rust and lightweight to pull in on their own. Alongside the new surface, this release also lands a round of engineering-quality work that makes the whole workspace more dependable:

A macOS stub integration test suite — 9 tests asserting that the GPU paths return UnsupportedPlatform / NotInitialized on a machine with no NVIDIA hardware, so the no-GPU contract is verified rather than assumed.
[package.metadata.docs.rs] added to all 34 subcrate Cargo.toml files, so cargo doc --all-features builds cleanly and every crate renders properly on docs.rs.
22 clippy warnings repaired without a single #[allow] — the lints were fixed, not silenced.
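
To make the no-GPU contract concrete, here is a minimal sketch of the kind of check such a stub test could perform. Only the two error names, UnsupportedPlatform and NotInitialized, come from this release; the GpuError enum, probe_gpu, and no_gpu_contract_holds are hypothetical stand-ins written for illustration and are not OxiCUDA's actual API.

```rust
// Hypothetical stand-in types: only the two variant names come from the
// release notes; everything else is an assumption made for illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum GpuError {
    UnsupportedPlatform,
    NotInitialized,
}

/// Hypothetical probe that reports why no GPU context is available.
/// `driver_loaded` models whether the NVIDIA driver library could be opened.
pub fn probe_gpu(is_macos: bool, driver_loaded: bool) -> Result<(), GpuError> {
    if is_macos {
        // There is no NVIDIA driver for this platform at all.
        return Err(GpuError::UnsupportedPlatform);
    }
    if !driver_loaded {
        // Supported platform, but the driver could not be initialized.
        return Err(GpuError::NotInitialized);
    }
    Ok(())
}

/// The shape of a stub test: on a host without a usable driver the call must
/// fail with one of the two documented errors, and must never succeed.
pub fn no_gpu_contract_holds(is_macos: bool) -> bool {
    matches!(
        probe_gpu(is_macos, false),
        Err(GpuError::UnsupportedPlatform) | Err(GpuError::NotInitialized)
    )
}
```

A real suite would call the library's own entry points instead of probe_gpu; the point of the sketch is that the test asserts a specific error value rather than merely "does not crash".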

Technical Deep Dive: Nine New Deep-Learning Crates

Each new crate is a focused leaf crate sitting on the existing foundation. Here is the surface they bring:

oxicuda-gen — Generative AI: DDPM, DDIM, DPM-Solver++, and Flow Matching schedulers, classifier-free guidance, a VAE codec, and LoRA adapters.
oxicuda-gnn — Graph neural networks: CSR/COO and heterogeneous graph representations, scatter/gather/aggregate primitives, GCN / GAT / GAT-v2 / GraphSAGE / GIN layers, global / Top-K / DiffPool pooling, and Set2Set.
oxicuda-mamba — State-space models: HiPPO-NPLR initialization, S4D and S5 selective scan, the Mamba block, RWKV channel-mixing, and a gated SSM.
oxicuda-vision — Vision transformers and CLIP: patch embedding, ViT encoder blocks, learnable positional embeddings, a CLS token, and CLIP image and text towers.
oxicuda-audio — Audio and speech ML: a Conformer encoder, a Wav2Vec2 feature extractor, CTC and RNN-T loss, a WaveNet causal stack, SpecAugment, and x-vector.
oxicuda-timeseries — Forecasting: TCN, NHiTS, PatchTST, TimesNet, iTransformer, and RevIN.
oxicuda-bayes — Bayesian deep learning: variational inference, Bayesian linear and conv layers, Flipout, ELBO/IWAE objectives, normalizing flows, MC Dropout, Deep Ensembles, SWAG, Laplace approximation, and calibration / ECE.
oxicuda-federated — Federated learning: FedAvg / FedProx / SCAFFOLD / FedAdam aggregation, PowerSGD / QSGD / Top-K / Random-K gradient compression, Gaussian / Laplacian / Moments / RDP / PATE differential privacy, and Shamir secure aggregation.
oxicuda-nas — Neural-architecture search: the DARTS bilevel optimizer, one-shot weight-shared Supernet / Slimmable training, evolutionary NSGA-II, and a hardware-aware FLOPs predictor.
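
As one illustration of what sits behind these lists, the sketch below shows a CPU reference for mean aggregation over a CSR graph, the kind of gather-and-aggregate step that GCN and GraphSAGE style layers build on. It is a plain-Rust sketch under stated assumptions: CsrGraph and mean_aggregate are hypothetical names chosen here, not the oxicuda-gnn API, and a GPU implementation would run this loop as a kernel.

```rust
/// A graph in compressed sparse row (CSR) form: the neighbors of node `i`
/// are `col_idx[row_ptr[i]..row_ptr[i + 1]]`.
/// (Hypothetical type for illustration, not the oxicuda-gnn API.)
pub struct CsrGraph {
    pub row_ptr: Vec<usize>,
    pub col_idx: Vec<usize>,
}

/// Mean-aggregate neighbor features on the CPU.
/// `features` is row-major with `dim` values per node; nodes without
/// neighbors keep an all-zero output row.
pub fn mean_aggregate(graph: &CsrGraph, features: &[f32], dim: usize) -> Vec<f32> {
    let num_nodes = graph.row_ptr.len().saturating_sub(1);
    let mut out = vec![0.0f32; num_nodes * dim];
    for node in 0..num_nodes {
        let neighbors = &graph.col_idx[graph.row_ptr[node]..graph.row_ptr[node + 1]];
        if neighbors.is_empty() {
            continue;
        }
        // Gather: sum the feature rows of all neighbors.
        for &nb in neighbors {
            for d in 0..dim {
                out[node * dim + d] += features[nb * dim + d];
            }
        }
        // Aggregate: divide by the neighbor count to get the mean.
        let inv = 1.0 / neighbors.len() as f32;
        for d in 0..dim {
            out[node * dim + d] *= inv;
        }
    }
    out
}
```

For a three-node graph where node 0 points at nodes 1 and 2, node 1 points at node 0, and node 2 has no outgoing edges, row_ptr is [0, 2, 3, 3] and col_idx is [1, 2, 0].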

All nine sit on top of the existing 10-Volume foundation — the PTX codegen and autotuner, the BLAS and DNN kernels, and the FFT / Sparse / Solver / Rand scientific suite — and the 7 GPU backends (the ComputeBackend trait, CUB-equivalent primitives, plus Metal, Vulkan, WebGPU, ROCm, and Level Zero) that keep the whole stack portable beyond NVIDIA hardware.

Getting Started

Add the umbrella crate:

cargo add oxicuda

A minimal GEMM end to end looks like this:

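use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Square 32x32 problem: each matrix holds 32 * 32 = 1024 elements.
    let (m, n, k) = (32usize, 32usize, 32usize);
    // With m = n = k and no padding, every leading dimension is 32.
    let (lda, ldb, ldc) = (m, k, m);
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![1.0f32; k * n];

    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Allocate zero-initialized device buffers for A, B, and C.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;

    // Upload the inputs.
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the kernel to finish, then download the result.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}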
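
When trying a GEMM like the one above for the first time, it helps to compare the downloaded result against a CPU reference. The sketch below is one way to do that in plain Rust. It assumes tightly packed column-major storage (the cuBLAS convention), which this announcement does not state for OxiCUDA, and it covers only the alpha = 1, beta = 0 case used in the example. The names matmul_reference and max_abs_diff are hypothetical helpers, not part of the library.

```rust
/// CPU reference for C = A * B with no transposes.
/// Assumption: all matrices are tightly packed in column-major order,
/// so A is m x k, B is k x n, and the returned C is m x n.
pub fn matmul_reference(m: usize, n: usize, k: usize, a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), m * k, "A must hold m * k elements");
    assert_eq!(b.len(), k * n, "B must hold k * n elements");
    let mut c = vec![0.0f32; m * n];
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                // A(i, p) lives at i + p * m, B(p, j) at p + j * k.
                acc += a[i + p * m] * b[p + j * k];
            }
            c[i + j * m] = acc;
        }
    }
    c
}

/// Largest absolute element-wise difference between two equally long slices.
pub fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len(), "slices must have equal length");
    x.iter()
        .zip(y)
        .map(|(a, b)| (a - b).abs())
        .fold(0.0f32, f32::max)
}
```

In practice you would compute matmul_reference on the host inputs and check that max_abs_diff against the GPU result stays below a small tolerance, since floating-point summation order can differ between the CPU loop and a GPU kernel.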

The driver, memory, and launch features are on by default; everything above them — blas, dnn, fft, sparse, solver, rand, ptx, autotune, the alternate backends, and more — is opt-in. The nine new deep-learning domains are available as standalone leaf crates behind their own feature flags, so you can reach for the oxicuda-gen DDIM scheduler, an oxicuda-gnn GCN layer, or the oxicuda-mamba block without dragging in the rest of the workspace.

What’s New in 0.1.5

Nine new Pure Rust deep-learning crates (Vol.17–25): oxicuda-gen (generative), oxicuda-gnn (graphs), oxicuda-mamba (state-space), oxicuda-vision (transformers), oxicuda-audio (speech), oxicuda-timeseries (forecasting), oxicuda-bayes (Bayesian), oxicuda-federated (federated), and oxicuda-nas (architecture search). Each carries only thiserror and is a standalone leaf crate.
macOS stub integration test suite — 9 tests asserting UnsupportedPlatform / NotInitialized on machines with no NVIDIA device.
docs.rs metadata across all 34 subcrates — [package.metadata.docs.rs] added everywhere so cargo doc --all-features builds clean and renders correctly.
8 missing per-crate README.md files added, so every crate documents itself.
5 near-cap source files split preemptively with splitrs, keeping them under the size limit before they become a problem.
Internal dependency versions bumped to 0.1.5 for a fully synchronized workspace.
9,568 passing tests (2 skipped, GPU-gated on macOS), up from roughly 9,000.
22 clippy warnings fixed without #[allow] — repaired, not suppressed.
6 pre-existing compile errors fixed.
A statistical-test flake retuned by raising n_trials from 500 to 5000 for a stable result.
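
Why does raising n_trials help? The release notes do not say which statistic the flaky test checks, so the sketch below assumes the common case: a test that estimates a proportion from n_trials independent samples and compares it to an expected value. Under that assumption the sampling noise shrinks with the square root of the trial count, so going from 500 to 5000 trials narrows it by a factor of sqrt(10), roughly 3.16. The function names are hypothetical.

```rust
/// Standard error of a proportion `p` estimated from `n_trials`
/// independent trials: sqrt(p * (1 - p) / n).
pub fn standard_error(p: f64, n_trials: u32) -> f64 {
    (p * (1.0 - p) / f64::from(n_trials)).sqrt()
}

/// Half-width of a `z`-sigma acceptance band around the expected proportion.
/// A test that asserts |observed - p| <= tolerance fails by chance less often
/// when the band is wide relative to the standard error.
pub fn tolerance(p: f64, n_trials: u32, z: f64) -> f64 {
    z * standard_error(p, n_trials)
}

/// How much the noise shrinks when the trial count goes from `before` to `after`.
pub fn noise_reduction(p: f64, before: u32, after: u32) -> f64 {
    standard_error(p, before) / standard_error(p, after)
}
```

With a fixed absolute tolerance in the assertion, a tenfold increase in trials therefore leaves much more headroom between the expected noise and the failure threshold, at the cost of a longer-running test.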

Tips

Pull the new crates standalone. Each of the nine deep-learning crates depends only on thiserror, so adding one — say oxicuda-gnn for a graph model — is lightweight and won’t drag the whole toolkit into your build.
Pick the right crate for the task. Reach for oxicuda-gen for diffusion, oxicuda-gnn for graphs, oxicuda-mamba for long-sequence state-space models, oxicuda-vision for ViT and CLIP, oxicuda-bayes for uncertainty and calibration, oxicuda-federated for privacy-preserving training, and oxicuda-nas for architecture search.
Browse the docs with --all-features. With the new docs.rs metadata in place, every crate now renders cleanly on docs.rs, so cargo doc --all-features is the fastest way to find the entry point you need.
Develop on macOS with confidence. The GPU paths return UnsupportedPlatform or NotInitialized on Apple machines, and the new stub test suite verifies exactly that behavior, so you can build and unit-test the host-side code without an NVIDIA device, then deploy to Linux or Windows for the GPU run.
Budget your privacy with oxicuda-federated. When you need privacy-preserving training, the crate’s differential-privacy primitives — RDP, PATE, and the Moments accountant — let you track and bound a privacy budget alongside the FedAvg / FedProx / SCAFFOLD aggregators.
Build with cargo build alone. There is no CUDA SDK, nvcc, or C/C++ toolchain to install; the only runtime requirement is the NVIDIA driver.
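
To ground the federated-learning tip, here is a small plain-Rust sketch of two ideas it mentions: FedAvg aggregation (a sample-count-weighted average of client parameters) and a privacy budget that is tracked and bounded. The budget uses basic sequential composition, where the epsilons of successive releases simply add up; that is the simplest accountant and a looser bound than the RDP or Moments accountants named above. fed_avg and PrivacyBudget are hypothetical names for illustration, not the oxicuda-federated API.

```rust
/// FedAvg: average client parameter vectors, weighted by each client's
/// number of training samples. Returns None for empty input, mismatched
/// vector lengths, or a total sample count of zero.
pub fn fed_avg(updates: &[(Vec<f32>, usize)]) -> Option<Vec<f32>> {
    let dim = updates.first()?.0.len();
    let total: usize = updates.iter().map(|(_, n)| *n).sum();
    if total == 0 || updates.iter().any(|(w, _)| w.len() != dim) {
        return None;
    }
    let mut global = vec![0.0f32; dim];
    for (weights, n) in updates {
        let share = *n as f32 / total as f32;
        for (g, x) in global.iter_mut().zip(weights) {
            *g += share * x;
        }
    }
    Some(global)
}

/// Simplest possible privacy accountant: epsilons add up across rounds
/// (basic sequential composition) and must stay within a fixed limit.
pub struct PrivacyBudget {
    limit: f64,
    spent: f64,
}

impl PrivacyBudget {
    pub fn new(limit: f64) -> Self {
        Self { limit, spent: 0.0 }
    }

    /// Record one round costing `epsilon`. Returns the remaining budget, or
    /// Err with the remaining budget if the round would exceed the limit
    /// (in which case nothing is spent).
    pub fn spend(&mut self, epsilon: f64) -> Result<f64, f64> {
        if self.spent + epsilon > self.limit {
            return Err(self.limit - self.spent);
        }
        self.spent += epsilon;
        Ok(self.limit - self.spent)
    }
}
```

A training loop would call spend before each noisy aggregation round and stop, or reduce the noise budget per round, once it returns Err.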

Part of a sovereign GPU stack

OxiCUDA is the GPU compute layer beneath the rest of the COOLJAPAN ecosystem. Above it, SciRS2, OxiONNX, TrustformeRS, and ToRSh consume OxiCUDA directly as their GPU backend. Alongside it, OxiBLAS and OxiFFT serve as pure-Rust linear-algebra and FFT siblings, OxiLLaMa builds LLM inference on this foundation, OptiRS handles optimization and training, and OxiEML rounds out the applied-ML neighborhood. OxiRouter ships today as well, joining the same family. The whole stack rests on one runtime dependency — the NVIDIA driver — with no proprietary toolkit underneath.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a CUDA stack you can build with nothing but a Rust compiler — now with a full GPU deep-learning library on top — sounds like the future you want.

Pure Rust GPU deep learning is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, May 3, 2026