The entire CUDA Toolkit, rewritten in safe Rust — and today it grows up.
Today we released OxiCUDA 0.2.0 — the “Wave AAA+64” feature expansion, bringing adaptive RK45 numerical integration, topological data analysis, and Parametric UMAP to the GPU, all on top of a workspace-wide zero-unwrap() reliability pass.
No CUDA SDK. No nvcc. No C/C++ toolchain. OxiCUDA is a type-safe, memory-safe, pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack — cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more. The only thing it needs at runtime is the NVIDIA driver itself (libcuda.so / nvcuda.dll). PTX is generated directly from Rust, a built-in autotuner specializes kernels per GPU architecture from Turing through Blackwell, and the whole thing compiles to a single static binary — or to WASM, or onto multi-vendor backends — without a single line of C++ in your build.
Why OxiCUDA 0.2.0 is a game changer
The classic CUDA Toolkit is a marvel of engineering wrapped in decades of pain. To use it you accept C and C++ in your hot path — which means segfaults, dangling device pointers, and silent buffer overruns that only show up as corrupted results three layers downstream. You accept nvcc as a build dependency, dragging a heavyweight C/C++ toolchain into every CI pipeline. And you accept lock-in: your kernels are welded to one vendor’s hardware, with no portable escape hatch.
OxiCUDA 0.2.0 keeps the speed and throws the pain away. The concrete wins in this release:
- Adaptive RK45 with Richardson extrapolation. Embedded Runge–Kutta 4(5) with adaptive step-size control, paired with Richardson extrapolation for higher-order accuracy and a free error estimate. Stiff or variable-curvature ODE/PDE systems no longer force you to pick between accuracy and speed — the integrator finds the step size for you.
- Extended Persistence and Discrete Morse theory. oxicuda-tda brings real topological data analysis to the GPU (homology, persistence, Morse-theoretic simplification, and mapper) for shape and feature analysis of point clouds.
- Parametric UMAP. oxicuda-manifold gains a parametric, out-of-sample-capable UMAP embedding, going beyond the transductive original.
- Fisher Information estimation. Information-geometry tooling for Bayesian and curvature-aware workflows, estimating the Fisher Information directly on device.
- A zero-unwrap() workspace under -D warnings. Every .unwrap() is gone from all crates/*/src/: production code and test modules alike now use descriptive .expect(...), with zero clippy warnings. Library calls return Result, not landmines.
- 32,320 passing tests, up from 23,535 at 0.1.8. ~783K lines of safe Rust across 73 crates.
And the tests earn their keep. While hardening for 0.2.0, the suite surfaced a genuine numerical bug in oxicuda-geometry3d: the symmetric-3×3 Jacobi eigensolver used a_pp - a_qq where it needed a_qq - a_pp, which doubled the off-diagonal element on every sweep instead of annihilating it. Obb::fit_pca had been returning principal axes tilted off true and producing loose oriented bounding boxes. Memory-safe Rust plus a real test suite caught it; 0.2.0 fixes it.
Technical Deep Dive: how 0.2.0 is built
Numerical solvers. The adaptive integrator lives in oxicuda-numeric — src/ode/rk45.rs for the embedded RK4(5) stepper and src/diff/richardson_extrapolation.rs for the extrapolation layer. The pair gives you both step-size control and a principled error estimate, and it feeds directly into the heavier ODE/PDE machinery in oxicuda-pde. Instead of hand-tuning a fixed step and praying, you hand the solver a tolerance and let it adapt across regions of high and low curvature.
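To make the "hand the solver a tolerance" idea concrete, here is a minimal CPU-only sketch of an embedded RK4(5) step with adaptive step-size control. It is not the oxicuda-numeric API, which this post does not show: the function names rkf45_step and integrate are hypothetical, the classic Fehlberg coefficients are an assumption (the crate may use a different embedded pair), and the safety factor 0.9 with growth limits 0.2 to 5.0 is a common textbook choice rather than anything taken from the release.

```rust
/// One Fehlberg RK4(5) step for the scalar ODE y' = f(t, y).
/// Returns the 5th-order estimate and |y5 - y4| as the local error estimate.
/// (Hypothetical helper; the Fehlberg tableau is an assumption.)
fn rkf45_step<F: Fn(f64, f64) -> f64>(f: &F, t: f64, y: f64, h: f64) -> (f64, f64) {
    let k1 = f(t, y);
    let k2 = f(t + h / 4.0, y + h * (k1 / 4.0));
    let k3 = f(t + 3.0 * h / 8.0, y + h * (3.0 / 32.0 * k1 + 9.0 / 32.0 * k2));
    let k4 = f(
        t + 12.0 * h / 13.0,
        y + h * (1932.0 / 2197.0 * k1 - 7200.0 / 2197.0 * k2 + 7296.0 / 2197.0 * k3),
    );
    let k5 = f(
        t + h,
        y + h * (439.0 / 216.0 * k1 - 8.0 * k2 + 3680.0 / 513.0 * k3 - 845.0 / 4104.0 * k4),
    );
    let k6 = f(
        t + h / 2.0,
        y + h
            * (-8.0 / 27.0 * k1 + 2.0 * k2 - 3544.0 / 2565.0 * k3 + 1859.0 / 4104.0 * k4
                - 11.0 / 40.0 * k5),
    );
    // 4th- and 5th-order solutions built from the same six stage evaluations.
    let y4 = y + h * (25.0 / 216.0 * k1 + 1408.0 / 2565.0 * k3 + 2197.0 / 4104.0 * k4 - k5 / 5.0);
    let y5 = y
        + h * (16.0 / 135.0 * k1 + 6656.0 / 12825.0 * k3 + 28561.0 / 56430.0 * k4
            - 9.0 / 50.0 * k5
            + 2.0 / 55.0 * k6);
    (y5, (y5 - y4).abs())
}

/// Integrate from t0 to t_end, accepting a step only when its error
/// estimate is within `tol`. Returns the final value and accepted-step count.
fn integrate<F: Fn(f64, f64) -> f64>(f: F, t0: f64, t_end: f64, y0: f64, tol: f64) -> (f64, usize) {
    let (mut t, mut y) = (t0, y0);
    let mut h = (t_end - t0) / 10.0; // initial guess; the controller corrects it
    let mut accepted = 0;
    while t < t_end {
        let h_try = h.min(t_end - t); // never step past the end point
        let (y_new, err) = rkf45_step(&f, t, y, h_try);
        if err <= tol {
            t += h_try;
            y = y_new;
            accepted += 1;
        }
        // Grow or shrink the step from the error estimate (exponent 1/5),
        // with a safety factor and limits on how fast the step may change.
        let factor = if err > 0.0 { 0.9 * (tol / err).powf(0.2) } else { 5.0 };
        h = h_try * factor.clamp(0.2, 5.0);
    }
    (y, accepted)
}

fn main() {
    // y' = -y with y(0) = 1 has the exact solution exp(-t).
    let (y, steps) = integrate(|_t, y| -y, 0.0, 1.0, 1.0, 1e-8);
    println!("y(1) = {y}, exact = {}, accepted steps = {steps}", (-1.0f64).exp());
}
```

A rejected step costs six wasted evaluations, which is why the controller shrinks conservatively and caps growth. A production integrator would also mix absolute and relative tolerances and handle vector-valued state.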
Scientific and topology layer. oxicuda-tda carries the homology, persistence, Morse, and mapper modules behind Extended Persistence and Discrete Morse theory. oxicuda-manifold carries Parametric UMAP for dimensionality reduction and the Fisher Information estimation used by the Bayesian and information-geometry tooling. These are not toys bolted on — they ride the same device buffers, streams, and BLAS primitives as the rest of the stack.
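For readers new to persistence, the simplest case is worth seeing in code. The sketch below computes ordinary 0-dimensional persistence (connected components) of a small 2D point cloud on the CPU by sorting pairwise distances and merging components with a union-find. It is not extended persistence and it is not the oxicuda-tda API, which this post does not show. DisjointSet and h0_death_times are hypothetical names, and reporting a death at the full edge length (rather than half of it) is a convention chosen for the example.

```rust
/// Minimal union-find with path halving (hypothetical helper).
struct DisjointSet {
    parent: Vec<usize>,
}

impl DisjointSet {
    fn new(n: usize) -> Self {
        DisjointSet { parent: (0..n).collect() }
    }

    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            let grandparent = self.parent[self.parent[x]];
            self.parent[x] = grandparent;
            x = grandparent;
        }
        x
    }

    /// Merge the sets containing a and b; returns true if they were distinct.
    fn union(&mut self, a: usize, b: usize) -> bool {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            return false;
        }
        self.parent[ra] = rb;
        true
    }
}

/// Death times of H0 classes in the distance filtration of a point cloud.
/// Every point is born at scale 0; a component dies when an edge first
/// connects it to another component. One component never dies and is omitted.
fn h0_death_times(points: &[[f64; 2]]) -> Vec<f64> {
    let n = points.len();
    let mut edges: Vec<(f64, usize, usize)> = Vec::new();
    for i in 0..n {
        for j in (i + 1)..n {
            let dx = points[i][0] - points[j][0];
            let dy = points[i][1] - points[j][1];
            edges.push(((dx * dx + dy * dy).sqrt(), i, j));
        }
    }
    // Process edges from shortest to longest, as the filtration grows.
    edges.sort_by(|a, b| a.0.total_cmp(&b.0));

    let mut sets = DisjointSet::new(n);
    let mut deaths = Vec::new();
    for (dist, i, j) in edges {
        if sets.union(i, j) {
            deaths.push(dist);
        }
    }
    deaths
}

fn main() {
    // Two tight clusters far apart: three short-lived classes, one long-lived.
    let cloud = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]];
    println!("H0 death times: {:?}", h0_death_times(&cloud));
}
```

A large gap between the small death times and the last one is the persistence signal that the cloud has two clusters, with no hand-picked threshold involved.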
The reliability and quality pass. This release made a workspace-wide sweep across every crate: no .unwrap() anywhere under crates/*/src, descriptive .expect(...) messages where a failure is genuinely unrecoverable, and zero clippy warnings under -D warnings. The OBB Jacobi-eigensolver fix in crates/oxicuda-geometry3d/src/mesh/obb.rs is the concrete payoff: the kind of subtle sign error that hides for years in a C++ codebase, caught and corrected here.
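To see where that sign lives, here is a small CPU sketch of a cyclic Jacobi eigensolver for a symmetric 3×3 matrix. It is not the code in obb.rs, which this post does not reproduce: jacobi_eigen_sym3 and the helper names are hypothetical, and the rotation is applied with plain matrix products for clarity rather than speed. The rotation angle uses the textbook form theta = (a_qq - a_pp) / (2 a_pq).

```rust
type Mat3 = [[f64; 3]; 3];

const IDENTITY: Mat3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]];

fn matmul(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut out = [[0.0; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn transpose(a: &Mat3) -> Mat3 {
    let mut out = [[0.0; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            out[i][j] = a[j][i];
        }
    }
    out
}

/// Cyclic Jacobi iteration for a symmetric 3x3 matrix (hypothetical sketch).
/// Returns the eigenvalues and a matrix whose columns are the eigenvectors.
fn jacobi_eigen_sym3(mut a: Mat3) -> ([f64; 3], Mat3) {
    let mut v = IDENTITY;
    for _sweep in 0..50 {
        // Stop once the off-diagonal mass is negligible.
        let off = a[0][1] * a[0][1] + a[0][2] * a[0][2] + a[1][2] * a[1][2];
        if off < 1e-30 {
            break;
        }
        for (p, q) in [(0usize, 1usize), (0, 2), (1, 2)] {
            if a[p][q] == 0.0 {
                continue;
            }
            // The numerator must be a_qq - a_pp. With a_pp - a_qq the
            // rotation no longer zeroes a_pq.
            let theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q]);
            // Smaller root of t^2 + 2*theta*t - 1 = 0, where t = tan(angle).
            let t = theta.signum() / (theta.abs() + (theta * theta + 1.0).sqrt());
            let c = 1.0 / (t * t + 1.0).sqrt();
            let s = t * c;
            // Givens rotation in the (p, q) plane.
            let mut j = IDENTITY;
            j[p][p] = c;
            j[q][q] = c;
            j[p][q] = s;
            j[q][p] = -s;
            a = matmul(&transpose(&j), &matmul(&a, &j)); // A <- J^T A J
            v = matmul(&v, &j); // accumulate eigenvectors
        }
    }
    ([a[0][0], a[1][1], a[2][2]], v)
}

fn main() {
    let a: Mat3 = [[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 3.0]];
    let (eigenvalues, _vectors) = jacobi_eigen_sym3(a);
    println!("eigenvalues (unsorted): {eigenvalues:?}");
}
```

A residual check of the form "A v = lambda v for every returned pair" is the kind of test that exposes a flipped numerator immediately, because the returned axes stop being eigenvectors.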
The GPU foundation it rides on. Underneath it all sits the foundation that makes any of this possible — oxicuda-driver, oxicuda-memory, and oxicuda-launch for the core runtime, a PTX DSL in oxicuda-ptx targeting SM 7.5 through 10.0 (Turing to Blackwell) with Tensor Core generations handled per arch, and the oxicuda-autotune autotuner that specializes kernels to whatever GPU it finds. On top, oxicuda-blas and oxicuda-dnn provide the dense-linear-algebra and neural-network primitives the science crates lean on. 0.2.0 also expands raw CUDA kernel coverage across the driver, memory, launch, and backend layers.
Getting Started
Add OxiCUDA and opt into the subsystems you need:
    cargo add oxicuda --features blas
Default features are driver, memory, and launch — the foundation you almost always want. Each subsystem is a feature flag (blas, dnn, fft, sparse, solver, rand, autotune, ptx, and full for everything), so you pull in only what your binary actually uses.
A complete GEMM, end to end:
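    use oxicuda::prelude::*;

    fn main() -> Result<(), oxicuda::Error> {
        // Illustrative problem size: 32x32 square matrices, so each buffer
        // holds 1024 elements and every leading dimension is 32.
        let (m, n, k) = (32, 32, 32);
        let (lda, ldb, ldc) = (32, 32, 32);
        let host_a = vec![1.0f32; m * k];
        let host_b = vec![1.0f32; k * n];

        // Pick GPU 0 and create a context and a stream on it.
        let device = Device::get(0)?;
        let ctx = Context::new(device)?;
        let stream = Stream::new(&ctx)?;

        // Allocate device buffers and upload the inputs.
        let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
        let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
        let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;
        d_a.copy_from_host(&host_a)?;
        d_b.copy_from_host(&host_b)?;

        // C = 1.0 * A * B + 0.0 * C, with neither input transposed.
        let handle = BlasHandle::new(&stream)?;
        handle.gemm(
            Transpose::None, Transpose::None,
            m, n, k,
            1.0f32, &d_a, lda, &d_b, ldb,
            0.0f32, &mut d_c, ldc,
        )?;

        // Wait for the kernel to finish, then download the result.
        stream.synchronize()?;
        let mut result = vec![0.0f32; m * n];
        d_c.copy_to_host(&mut result)?;
        Ok(())
    }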
No nvcc invocation, no .cu files, no linker flags chasing a CUDA install. cargo build, and you ship.
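When trying a GEMM like the one above, it helps to have a host-side reference to compare the downloaded result against. The sketch below is plain Rust with no OxiCUDA calls. gemm_ref and max_abs_diff are hypothetical helpers, and the column-major layout is an assumption borrowed from the BLAS convention; this post does not state which layout oxicuda-blas expects.

```rust
/// Reference GEMM on the host: C = alpha * A * B + beta * C.
/// Assumes column-major storage and no transposition, so element (i, j)
/// of a matrix with leading dimension ld lives at index i + j * ld.
fn gemm_ref(
    m: usize, n: usize, k: usize,
    alpha: f32, a: &[f32], lda: usize, b: &[f32], ldb: usize,
    beta: f32, c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}

/// Largest element-wise difference between two result buffers.
fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(a, b)| (a - b).abs()).fold(0.0, f32::max)
}

fn main() {
    // A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]], stored column by column.
    let a = [1.0f32, 3.0, 2.0, 4.0];
    let b = [5.0f32, 7.0, 6.0, 8.0];
    let mut c = [0.0f32; 4];
    gemm_ref(2, 2, 2, 1.0, &a, 2, &b, 2, 0.0, &mut c, 2);
    println!("C (column-major) = {c:?}");
    // In a real check, the second argument would be the buffer copied back
    // from the device; comparing against itself here only shows the call.
    println!("max |diff| = {}", max_abs_diff(&c, &c));
}
```

Because floating-point summation order differs between a GPU kernel and this triple loop, compare with a small tolerance rather than exact equality.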
What’s New in 0.2.0
The “Wave AAA+64” expansion:
- Adaptive RK45 + Richardson extrapolation: embedded Runge–Kutta 4(5) with adaptive step-size control and an extrapolation layer for higher-order accuracy and error estimation (oxicuda-numeric).
- Extended Persistence + Discrete Morse theory: topological data analysis on the GPU via oxicuda-tda.
- Parametric UMAP: out-of-sample-capable dimensionality reduction in oxicuda-manifold.
- Fisher Information estimation: information-geometry tooling for Bayesian and curvature-aware workflows.
- Expanded CUDA kernel coverage: broader kernels across the driver, memory, launch, and backend layers.
- Zero-unwrap() reliability pass: every .unwrap() removed from all crates/*/src (production and tests), replaced with descriptive .expect(...), zero clippy warnings under -D warnings.
- Geometry3D fix: corrected a sign error in the symmetric-3×3 Jacobi eigensolver so Obb::fit_pca once again returns tight, correctly-oriented bounding boxes.
- Test suite grew to 32,320 passing tests, up from 23,535 at 0.1.8.
Tips
- Reach for adaptive RK45 instead of fixed-step integration. For stiff or variable-curvature ODEs, hand the embedded RK4(5) integrator a tolerance and let it size each step, so you stop trading accuracy against runtime by hand.
- Let Richardson extrapolation work for you. It buys roughly an extra order of accuracy and a free error estimate from the same evaluations; use that estimate to decide when a result has actually converged.
- Use Extended Persistence for point-cloud structure. When you need to characterize the shape of data (loops, voids, connected components), oxicuda-tda’s Extended Persistence and Discrete Morse tooling beats ad-hoc thresholding.
- Prefer Parametric UMAP when you need to embed new points. Unlike transductive UMAP, the parametric variant learns a mapping you can apply to out-of-sample data later.
- Lean on the zero-unwrap() guarantee. Library calls return Result, so you can ?-propagate cleanly rather than guarding against hidden panics:
      let device = Device::get(0)?;
      let ctx = Context::new(device)?;
- Enable only the features you need. Defaults give you driver/memory/launch; add blas, dnn, fft and friends à la carte to keep build times and binary size down.
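The Richardson tip is easiest to appreciate on a tiny example. The sketch below applies one Richardson step to a central-difference derivative on the CPU and uses the resulting error estimate as a stopping test. It is not the oxicuda-numeric API: central_diff, richardson_derivative, and derivative_to_tol are hypothetical names, and the choice of numerical differentiation as the target is an assumption made only because it keeps the example short.

```rust
/// Central-difference approximation of f'(x); its error is O(h^2).
fn central_diff<F: Fn(f64) -> f64>(f: &F, x: f64, h: f64) -> f64 {
    (f(x + h) - f(x - h)) / (2.0 * h)
}

/// One Richardson step: combine D(h) and D(h/2) so the h^2 error terms
/// cancel. Returns the extrapolated value and an estimate of the error
/// that was left in D(h/2) before extrapolation.
fn richardson_derivative<F: Fn(f64) -> f64>(f: &F, x: f64, h: f64) -> (f64, f64) {
    let coarse = central_diff(f, x, h);
    let fine = central_diff(f, x, h / 2.0);
    // D(h) = f' + c*h^2 + ... and D(h/2) = f' + c*h^2/4 + ...,
    // so (fine - coarse) / 3 estimates the correction fine still needs.
    let correction = (fine - coarse) / 3.0;
    (fine + correction, correction.abs())
}

/// Halve h until the error estimate drops below `tol` (at most 20 times).
/// Returns the extrapolated derivative and the last error estimate.
fn derivative_to_tol<F: Fn(f64) -> f64>(f: &F, x: f64, mut h: f64, tol: f64) -> (f64, f64) {
    let (mut value, mut err) = richardson_derivative(f, x, h);
    for _ in 0..20 {
        if err < tol {
            break;
        }
        h /= 2.0;
        let next = richardson_derivative(f, x, h);
        value = next.0;
        err = next.1;
    }
    (value, err)
}

fn main() {
    let f = |x: f64| x.sin();
    let exact = 1.0f64.cos();
    let (value, err) = richardson_derivative(&f, 1.0, 0.1);
    println!("extrapolated = {value}, estimate = {err:e}, true error = {:e}", (value - exact).abs());
    let (converged, last_err) = derivative_to_tol(&f, 1.0, 0.1, 1e-8);
    println!("converged = {converged}, last estimate = {last_err:e}");
}
```

The estimate describes the unextrapolated value, so it is conservative for the extrapolated one. That is the safe direction for a convergence test.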
This is the foundation
OxiCUDA is the mature GPU layer beneath the rest of the COOLJAPAN ecosystem. When SciRS2 and NumRS2 crunch numbers, when ToRSh and TrustformeRS train and run models, when OxiLLaMa serves language models, when OxiONNX executes graphs, when OxiBLAS and OxiFFT provide linear algebra and transforms, and when OxiPhysics and OptiRS simulate and optimize — OxiCUDA is what carries the work to the GPU, in pure Rust, from Turing to Blackwell and onto multi-vendor backends besides. 0.2.0 makes that foundation deeper, broader, and safer.
Repository: https://github.com/cool-japan/oxicuda
Star the repo if you believe GPU computing should be safe, portable, and free of vendor toolchains. Every star tells us to keep building.
Pure Rust GPU computing is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ June 17, 2026