The CUDA Toolkit, rewritten in safe Rust — and today it grows a whole deep-learning library, from diffusion models to graph nets to Mamba.
Today we released OxiCUDA 0.1.5 — nine new GPU deep-learning crates land in a single release, carrying generative diffusion, graph neural networks, Mamba state-space models, vision transformers, audio and speech ML, time-series forecasting, Bayesian deep learning, federated learning, and neural-architecture search into the same Pure Rust GPU stack.
No CUDA SDK. No nvcc. No C/C++ toolchain at build time. OxiCUDA remains a type-safe, memory-safe replacement for the entire NVIDIA CUDA Toolkit software stack (cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more), and the only thing it needs at runtime is the NVIDIA driver (libcuda.so / nvcuda.dll). The very PTX assembly that runs on the GPU is generated directly from Rust data structures. OxiCUDA compiles into a single static binary — or a WASM module — and the same code runs on Turing through Blackwell.
Why OxiCUDA 0.1.5 is a game changer
OxiCUDA started life as the low-level CUDA-Toolkit replacement: the driver wrappers, the PTX codegen, the autotuner, the BLAS and cuDNN-equivalent kernels, the scientific-computing suite. 0.1.5 keeps every bit of that foundation and expands the high-level deep-learning surface dramatically. The headline numbers tell the story:
- 9 new crates join the workspace, taking it from 28 to 37 crates.
- ~320K lines of safe Rust now make up the codebase, up from ~260K at 0.1.4.
- 9,568 passing tests across the workspace (up from roughly 9,000), with 2 GPU-gated tests skipped on macOS where there is no NVIDIA device to run them.
The nine newcomers are standalone leaf crates: each one carries only thiserror as a dependency, so the deep-learning domains are 100% Pure Rust and lightweight to pull in on their own. Alongside the new surface, this release also lands a round of engineering-quality work that makes the whole workspace more dependable:
- A macOS stub integration test suite: 9 tests asserting that the GPU paths return UnsupportedPlatform / NotInitialized on a machine with no NVIDIA hardware, so the no-GPU contract is verified rather than assumed.
- [package.metadata.docs.rs] added to all 34 subcrate Cargo.toml files, so cargo doc --all-features builds cleanly and every crate renders properly on docs.rs.
- 22 clippy warnings repaired without a single #[allow]: the lints were fixed, not silenced.
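To make the no-GPU contract concrete, here is a minimal sketch of the kind of check such a stub test might perform. Only the two variant names, UnsupportedPlatform and NotInitialized, come from the release notes; the GpuError enum, the device_count function, and its two boolean parameters are hypothetical stand-ins, not OxiCUDA's actual API.

```rust
/// Hypothetical error type. Only the two variant names are taken from the
/// release notes; the enum itself is an illustration.
#[derive(Debug, PartialEq, Eq)]
pub enum GpuError {
    /// The host platform cannot run the CUDA driver at all (e.g. macOS).
    UnsupportedPlatform,
    /// The platform is supported, but the driver has not been initialized.
    NotInitialized,
}

/// Hypothetical stand-in for a GPU entry point. The two flags replace the
/// real platform and driver probes so the contract can be checked anywhere.
pub fn device_count(platform_has_cuda: bool, driver_initialized: bool) -> Result<u32, GpuError> {
    if !platform_has_cuda {
        return Err(GpuError::UnsupportedPlatform);
    }
    if !driver_initialized {
        return Err(GpuError::NotInitialized);
    }
    // Pretend exactly one device is present once both checks pass.
    Ok(1)
}
```

A stub test then asserts the error value itself, for example that device_count(false, false) equals Err(GpuError::UnsupportedPlatform), rather than merely checking that the call did not panic.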
Technical Deep Dive: Nine New Deep-Learning Crates
Each new crate is a focused leaf crate sitting on the existing foundation. Here is the surface they bring:
- oxicuda-gen — Generative AI: DDPM, DDIM, DPM-Solver++, and Flow Matching schedulers, classifier-free guidance, a VAE codec, and LoRA adapters.
- oxicuda-gnn — Graph neural networks: CSR/COO and heterogeneous graph representations, scatter/gather/aggregate primitives, GCN / GAT / GAT-v2 / GraphSAGE / GIN layers, global / Top-K / DiffPool pooling, and Set2Set.
- oxicuda-mamba — State-space models: HiPPO-NPLR initialization, S4D and S5 selective scan, the Mamba block, RWKV channel-mixing, and a gated SSM.
- oxicuda-vision — Vision transformers and CLIP: patch embedding, ViT encoder blocks, learnable positional embeddings, a CLS token, and CLIP image and text towers.
- oxicuda-audio — Audio and speech ML: a Conformer encoder, a Wav2Vec2 feature extractor, CTC and RNN-T loss, a WaveNet causal stack, SpecAugment, and x-vector.
- oxicuda-timeseries — Forecasting: TCN, NHiTS, PatchTST, TimesNet, iTransformer, and RevIN.
- oxicuda-bayes — Bayesian deep learning: variational inference, Bayesian linear and conv layers, Flipout, ELBO/IWAE objectives, normalizing flows, MC Dropout, Deep Ensembles, SWAG, Laplace approximation, and calibration / ECE.
- oxicuda-federated — Federated learning: FedAvg / FedProx / SCAFFOLD / FedAdam aggregation, PowerSGD / QSGD / Top-K / Random-K gradient compression, Gaussian / Laplacian / Moments / RDP / PATE differential privacy, and Shamir secure aggregation.
- oxicuda-nas — Neural-architecture search: the DARTS bilevel optimizer, one-shot weight-shared Supernet / Slimmable training, evolutionary NSGA-II, and a hardware-aware FLOPs predictor.
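As one illustration of the primitives named above, the following sketch shows mean neighborhood aggregation over a CSR graph, the gather-then-reduce step that GraphSAGE-style layers build on. It is plain CPU Rust written for this post, not the oxicuda-gnn API: CsrGraph, mean_aggregate, and the row-major feature layout are all assumptions.

```rust
/// Hypothetical CSR adjacency: the neighbors of node `v` are
/// `col_idx[row_ptr[v]..row_ptr[v + 1]]`, so `row_ptr` has `n + 1` entries.
pub struct CsrGraph {
    pub row_ptr: Vec<usize>,
    pub col_idx: Vec<usize>,
}

/// Mean-aggregates neighbor features. `features` is assumed to be row-major
/// with `dim` values per node; nodes without neighbors keep a zero vector.
pub fn mean_aggregate(graph: &CsrGraph, features: &[f32], dim: usize) -> Vec<f32> {
    let n = graph.row_ptr.len().saturating_sub(1);
    let mut out = vec![0.0f32; n * dim];
    for v in 0..n {
        let neighbors = &graph.col_idx[graph.row_ptr[v]..graph.row_ptr[v + 1]];
        if neighbors.is_empty() {
            continue;
        }
        // Gather: sum the feature rows of every neighbor of `v`.
        for &u in neighbors {
            for d in 0..dim {
                out[v * dim + d] += features[u * dim + d];
            }
        }
        // Reduce: divide by the neighbor count to get the mean.
        let inv = 1.0 / neighbors.len() as f32;
        for d in 0..dim {
            out[v * dim + d] *= inv;
        }
    }
    out
}
```

On the path graph 0-1-2 with scalar features 1.0, 2.0, 3.0, every node aggregates to 2.0. A GPU implementation would typically parallelize the outer loop over nodes or edges, but the indexing pattern is the same.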
All nine sit on top of the existing 10-Volume foundation — the PTX codegen and autotuner, the BLAS and DNN kernels, and the FFT / Sparse / Solver / Rand scientific suite — and the 7 GPU backends (the ComputeBackend trait, CUB-equivalent primitives, plus Metal, Vulkan, WebGPU, ROCm, and Level Zero) that keep the whole stack portable beyond NVIDIA hardware.
Getting Started
Add the umbrella crate:
cargo add oxicuda
A minimal GEMM end to end looks like this:
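use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Illustrative problem size (not from the original snippet): 32 x 32 matrices.
    let (m, n, k) = (32usize, 32usize, 32usize);
    // Assumed leading dimensions; every dimension is 32 in this example.
    let (lda, ldb, ldc) = (m, k, m);
    // Placeholder host inputs.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![1.0f32; k * n];

    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // 32 * 32 = 1024 elements per buffer.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32, // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32, // beta
        &mut d_c, ldc,
    )?;

    // Wait for the kernel to finish before reading the result back.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}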
The driver, memory, and launch features are on by default; everything above them — blas, dnn, fft, sparse, solver, rand, ptx, autotune, the alternate backends, and more — is opt-in. The nine new deep-learning domains are available as standalone leaf crates behind their own feature flags, so you can reach for the oxicuda-gen DDIM scheduler, an oxicuda-gnn GCN layer, or the oxicuda-mamba block without dragging in the rest of the workspace.
What’s New in 0.1.5
- Nine new Pure Rust deep-learning crates (Vol.17–25): oxicuda-gen (generative), oxicuda-gnn (graphs), oxicuda-mamba (state-space), oxicuda-vision (transformers), oxicuda-audio (speech), oxicuda-timeseries (forecasting), oxicuda-bayes (Bayesian), oxicuda-federated (federated), and oxicuda-nas (architecture search). Each carries only thiserror and is a standalone leaf crate.
- macOS stub integration test suite: 9 tests asserting UnsupportedPlatform / NotInitialized on machines with no NVIDIA device.
- docs.rs metadata across all 34 subcrates: [package.metadata.docs.rs] added everywhere so cargo doc --all-features builds clean and renders correctly.
- 8 missing per-crate README.md files added, so every crate documents itself.
- Preemptive splitrs of 5 near-cap files, keeping source files under the size limit before they become a problem.
- Internal dependency versions bumped to 0.1.5 for a fully synchronized workspace.
- 9,568 passing tests (2 skipped, GPU-gated on macOS), up from roughly 9,000.
- 22 clippy warnings fixed without #[allow]: repaired, not suppressed.
- 6 pre-existing compile errors fixed.
- A statistical-test flake retuned by raising n_trials from 500 to 5000 for a stable result.
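Why a tenfold increase in n_trials steadies a statistical test can be sketched with the standard error of a sample mean, which shrinks as one over the square root of the trial count. The release notes do not say which statistic the flaky test checks, so the sample-mean model below is an assumption, and standard_error and noise_reduction are hypothetical helper names.

```rust
/// Standard error of a sample mean over `n_trials` independent draws with
/// per-draw standard deviation `sigma`: sigma / sqrt(n_trials).
pub fn standard_error(sigma: f64, n_trials: usize) -> f64 {
    sigma / (n_trials as f64).sqrt()
}

/// Factor by which the standard error shrinks when the trial count grows
/// from `old_trials` to `new_trials`: sqrt(new_trials / old_trials).
pub fn noise_reduction(old_trials: usize, new_trials: usize) -> f64 {
    (new_trials as f64 / old_trials as f64).sqrt()
}
```

Under that assumption, noise_reduction(500, 5000) is the square root of 10, roughly 3.16, so the estimate sits about three times closer to its expected value and a fixed tolerance is far less likely to be tripped by chance.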
Tips
- Pull the new crates standalone. Each of the nine deep-learning crates depends only on thiserror, so adding one (say, oxicuda-gnn for a graph model) is lightweight and won’t drag the whole toolkit into your build.
- Pick the right crate for the task. Reach for oxicuda-gen for diffusion, oxicuda-gnn for graphs, oxicuda-mamba for long-sequence state-space models, oxicuda-vision for ViT and CLIP, oxicuda-bayes for uncertainty and calibration, oxicuda-federated for privacy-preserving training, and oxicuda-nas for architecture search.
- Browse the docs with --all-features. With the new docs.rs metadata in place, every crate now renders cleanly on docs.rs, so cargo doc --all-features is the fastest way to find the entry point you need.
- Develop on macOS with confidence. The GPU paths return UnsupportedPlatform on Apple machines, and the new stub test suite guarantees exactly that behavior, so you can build and unit-test the host-side code without an NVIDIA device, then deploy to Linux or Windows for the GPU run.
- Budget your privacy with oxicuda-federated. When you need privacy-preserving training, the crate’s differential-privacy primitives (RDP, PATE, and the Moments accountant) let you track and bound a privacy budget alongside the FedAvg / FedProx / SCAFFOLD aggregators.
- Build with cargo build alone. There is no CUDA SDK, nvcc, or C/C++ toolchain to install; the only runtime requirement is the NVIDIA driver.
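The idea of tracking and bounding a privacy budget can be sketched with the simplest accountant, basic sequential composition, where the epsilon values of successive releases add up. This is not the oxicuda-federated API: PrivacyBudget and its methods are hypothetical, delta is left out for brevity, and RDP or Moments accountants give tighter bounds than this plain sum.

```rust
/// Hypothetical accountant using basic sequential composition:
/// the total privacy loss is the sum of the per-round epsilons.
#[derive(Debug, Clone, PartialEq)]
pub struct PrivacyBudget {
    epsilon_limit: f64,
    epsilon_spent: f64,
}

impl PrivacyBudget {
    /// Creates a budget with nothing spent yet.
    pub fn new(epsilon_limit: f64) -> Self {
        Self { epsilon_limit, epsilon_spent: 0.0 }
    }

    /// Records a round that costs `epsilon`. Returns `false`, leaving the
    /// budget unchanged, if the cost is invalid or would exceed the limit.
    pub fn try_spend(&mut self, epsilon: f64) -> bool {
        if epsilon.is_nan() || epsilon < 0.0 {
            return false;
        }
        if self.epsilon_spent + epsilon > self.epsilon_limit {
            return false;
        }
        self.epsilon_spent += epsilon;
        true
    }

    /// Budget still available for future rounds.
    pub fn remaining(&self) -> f64 {
        self.epsilon_limit - self.epsilon_spent
    }
}
```

In a federated loop, each aggregation round would call try_spend before releasing a noised update and stop training once it returns false.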
Part of a sovereign GPU stack
OxiCUDA is the GPU compute layer beneath the rest of the COOLJAPAN ecosystem. Above it, SciRS2, OxiONNX, TrustformeRS, and ToRSh consume OxiCUDA directly as their GPU backend. Alongside it, OxiBLAS and OxiFFT serve as pure-Rust linear-algebra and FFT siblings, OxiLLaMa builds LLM inference on this foundation, OptiRS handles optimization and training, and OxiEML rounds out the applied-ML neighborhood. OxiRouter ships today as well, joining the same family. The whole stack rests on one runtime dependency — the NVIDIA driver — with no proprietary toolkit underneath.
Repository: https://github.com/cool-japan/oxicuda
Star the repo if a CUDA stack you can build with nothing but a Rust compiler — now with a full GPU deep-learning library on top — sounds like the future you want.
Pure Rust GPU deep learning is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ May 3, 2026