
OxiCUDA 0.2.0 Released — Adaptive RK45, Topological Data Analysis, and a Zero-Unwrap Workspace

OxiCUDA 0.2.0, the pure-Rust replacement for the NVIDIA CUDA Toolkit, lands the 'Wave AAA+64' expansion: adaptive RK45 with Richardson extrapolation, Extended Persistence and Discrete Morse theory (oxicuda-tda), Parametric UMAP, and Fisher Information estimation — plus a workspace-wide zero-unwrap reliability pass and 32,320 passing tests. ~783K lines across 73 crates. No CUDA SDK, no nvcc.


The entire CUDA Toolkit, rewritten in safe Rust — and today it grows up.

Today we released OxiCUDA 0.2.0 — the “Wave AAA+64” feature expansion, bringing adaptive RK45 numerical integration, topological data analysis, and Parametric UMAP to the GPU, all on top of a workspace-wide zero-unwrap() reliability pass.

No CUDA SDK. No nvcc. No C/C++ toolchain. OxiCUDA is a type-safe, memory-safe, pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack — cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more. The only thing it needs at runtime is the NVIDIA driver itself (libcuda.so / nvcuda.dll). PTX is generated directly from Rust, a built-in autotuner specializes kernels per GPU architecture from Turing through Blackwell, and the whole thing compiles to a single static binary — or to WASM, or onto multi-vendor backends — without a single line of C++ in your build.

Why OxiCUDA 0.2.0 is a game changer

The classic CUDA Toolkit is a marvel of engineering wrapped in decades of pain. To use it you accept C and C++ in your hot path — which means segfaults, dangling device pointers, and silent buffer overruns that only show up as corrupted results three layers downstream. You accept nvcc as a build dependency, dragging a heavyweight C/C++ toolchain into every CI pipeline. And you accept lock-in: your kernels are welded to one vendor’s hardware, with no portable escape hatch.

OxiCUDA 0.2.0 keeps the speed and throws the pain away. The concrete wins in this release:

And the tests earn their keep. While hardening for 0.2.0, the suite surfaced a genuine numerical bug in oxicuda-geometry3d: the symmetric-3×3 Jacobi eigensolver used app - aqq where it needed aqq - app, which doubled the off-diagonal element on every sweep instead of annihilating it. Obb::fit_pca had been returning principal axes tilted off true and producing loose oriented bounding boxes. Memory-safe Rust plus a real test suite caught it; 0.2.0 fixes it.
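To make the sign issue concrete, here is a minimal CPU sketch of a textbook Jacobi rotation for a symmetric 3×3 matrix. It is not the code from oxicuda-geometry3d: the names Mat3, jacobi_rotate, and jacobi_eigenvalues are hypothetical, and the flip_sign switch exists only to reproduce the class of bug described above. With the rotation matrix used here, the new off-diagonal entry is (c² - s²)·a_pq + s·c·(a_pp - a_qq), which is zero only when the rotation parameter is built from a_qq - a_pp. Flip that difference and the entry becomes 2(c² - s²)·a_pq, close to double for small rotation angles.

```rust
type Mat3 = [[f64; 3]; 3];

fn mat_mul(a: &Mat3, b: &Mat3) -> Mat3 {
    let mut out = [[0.0; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn transpose(a: &Mat3) -> Mat3 {
    let mut out = [[0.0; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            out[j][i] = a[i][j];
        }
    }
    out
}

/// Apply one Jacobi rotation J^T * A * J chosen to zero a[p][q].
/// `flip_sign = true` uses a_pp - a_qq instead of a_qq - a_pp (the wrong sign).
fn jacobi_rotate(a: &Mat3, p: usize, q: usize, flip_sign: bool) -> Mat3 {
    if a[p][q] == 0.0 {
        return *a; // already annihilated, nothing to rotate
    }
    let diff = if flip_sign { a[p][p] - a[q][q] } else { a[q][q] - a[p][p] };
    let theta = diff / (2.0 * a[p][q]);
    // Smaller-magnitude root of t^2 + 2*theta*t - 1 = 0.
    let t = theta.signum() / (theta.abs() + (theta * theta + 1.0).sqrt());
    let c = 1.0 / (t * t + 1.0).sqrt();
    let s = t * c;
    let mut j: Mat3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]];
    j[p][p] = c;
    j[q][q] = c;
    j[p][q] = s;
    j[q][p] = -s;
    mat_mul(&transpose(&j), &mat_mul(a, &j))
}

/// Cyclic Jacobi sweeps over the three off-diagonal pairs; returns the diagonal.
fn jacobi_eigenvalues(mut a: Mat3, sweeps: usize) -> [f64; 3] {
    for _ in 0..sweeps {
        for &(p, q) in [(0usize, 1usize), (0, 2), (1, 2)].iter() {
            a = jacobi_rotate(&a, p, q, false);
        }
    }
    [a[0][0], a[1][1], a[2][2]]
}
```

A regression test for this kind of bug can be as small as asserting that a single rotation drives the targeted off-diagonal entry to zero.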

Technical Deep Dive: how 0.2.0 is built

Numerical solvers. The adaptive integrator lives in oxicuda-numeric: src/ode/rk45.rs for the embedded RK4(5) stepper and src/diff/richardson_extrapolation.rs for the extrapolation layer. The pair gives you both step-size control and a principled error estimate, and it feeds directly into the heavier ODE/PDE machinery in oxicuda-pde. Instead of hand-tuning a fixed step and praying, you hand the solver a tolerance and let it adapt across regions of high and low curvature.
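For readers who have not met embedded Runge-Kutta pairs before, the idea can be sketched on the CPU in a few lines. The example below uses the classic Runge-Kutta-Fehlberg 4(5) coefficients for a scalar ODE. It is an illustration of the technique only: rkf45_step and integrate are hypothetical names, and the tableau and step-size policy that rk45.rs actually uses may differ.

```rust
/// One embedded Runge-Kutta-Fehlberg 4(5) step for the scalar ODE y' = f(t, y).
/// Returns the 5th-order solution and |y5 - y4| as the local error estimate.
fn rkf45_step(f: &dyn Fn(f64, f64) -> f64, t: f64, y: f64, h: f64) -> (f64, f64) {
    let k1 = h * f(t, y);
    let k2 = h * f(t + h / 4.0, y + k1 / 4.0);
    let k3 = h * f(t + 3.0 * h / 8.0, y + 3.0 / 32.0 * k1 + 9.0 / 32.0 * k2);
    let k4 = h * f(
        t + 12.0 * h / 13.0,
        y + 1932.0 / 2197.0 * k1 - 7200.0 / 2197.0 * k2 + 7296.0 / 2197.0 * k3,
    );
    let k5 = h * f(
        t + h,
        y + 439.0 / 216.0 * k1 - 8.0 * k2 + 3680.0 / 513.0 * k3 - 845.0 / 4104.0 * k4,
    );
    let k6 = h * f(
        t + h / 2.0,
        y - 8.0 / 27.0 * k1 + 2.0 * k2 - 3544.0 / 2565.0 * k3 + 1859.0 / 4104.0 * k4
            - 11.0 / 40.0 * k5,
    );
    let y4 = y + 25.0 / 216.0 * k1 + 1408.0 / 2565.0 * k3 + 2197.0 / 4104.0 * k4 - k5 / 5.0;
    let y5 = y + 16.0 / 135.0 * k1 + 6656.0 / 12825.0 * k3 + 28561.0 / 56430.0 * k4
        - 9.0 / 50.0 * k5
        + 2.0 / 55.0 * k6;
    (y5, (y5 - y4).abs())
}

/// Integrate from t0 to t_end, accepting a step only when the error estimate
/// is within `tol`. Returns the final value and the number of accepted steps.
fn integrate(f: &dyn Fn(f64, f64) -> f64, t0: f64, y0: f64, t_end: f64, tol: f64) -> (f64, usize) {
    let (mut t, mut y) = (t0, y0);
    let mut h = (t_end - t0) / 10.0; // initial guess; the controller corrects it
    let mut accepted = 0;
    while t < t_end {
        let last = t + h >= t_end;
        if last {
            h = t_end - t; // clip the final step to land on t_end
        }
        let (y_new, err) = rkf45_step(f, t, y, h);
        if err <= tol {
            y = y_new;
            t = if last { t_end } else { t + h };
            accepted += 1;
        }
        // Standard controller: scale h by (tol / err)^(1/5) with a safety factor,
        // clamped so the step never changes too abruptly.
        let factor = if err > 0.0 { 0.9 * (tol / err).powf(0.2) } else { 5.0 };
        h *= factor.clamp(0.2, 5.0);
    }
    (y, accepted)
}
```

Calling integrate(&|_t, y| -y, 0.0, 1.0, 1.0, 1e-9) approximates exp(-1). Rejected steps cost six function evaluations each, which is why the controller shrinks the step conservatively rather than retrying at the same size.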

Scientific and topology layer. oxicuda-tda carries the homology, persistence, Morse, and mapper modules behind Extended Persistence and Discrete Morse theory. oxicuda-manifold carries Parametric UMAP for dimensionality reduction and the Fisher Information estimation used by the Bayesian and information-geometry tooling. These are not toys bolted on — they ride the same device buffers, streams, and BLAS primitives as the rest of the stack.
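Persistence is easiest to see in dimension zero, where it tracks connected components merging as a distance threshold grows. The sketch below computes those merge ("death") scales for points on a line with a union-find structure. It is a deliberately small CPU illustration: Dsu and h0_deaths are hypothetical names, the convention that an edge enters the filtration at its full length is an assumption, and Extended Persistence and Discrete Morse theory in oxicuda-tda go well beyond this.

```rust
/// Minimal union-find with path halving.
struct Dsu {
    parent: Vec<usize>,
}

impl Dsu {
    fn new(n: usize) -> Self {
        Dsu { parent: (0..n).collect() }
    }

    fn find(&mut self, mut x: usize) -> usize {
        while self.parent[x] != x {
            self.parent[x] = self.parent[self.parent[x]];
            x = self.parent[x];
        }
        x
    }

    /// Returns true if the two elements were in different components.
    fn union(&mut self, a: usize, b: usize) -> bool {
        let (ra, rb) = (self.find(a), self.find(b));
        if ra == rb {
            false
        } else {
            self.parent[ra] = rb;
            true
        }
    }
}

/// 0-dimensional persistence for 1-D points: every component is born at
/// scale 0, and each merge kills one component at the length of the edge
/// that caused it. One component never dies and is not listed.
fn h0_deaths(points: &[f64]) -> Vec<f64> {
    let n = points.len();
    let mut edges = Vec::new();
    for i in 0..n {
        for j in (i + 1)..n {
            edges.push(((points[i] - points[j]).abs(), i, j));
        }
    }
    edges.sort_by(|a, b| a.0.total_cmp(&b.0)); // process edges shortest first
    let mut dsu = Dsu::new(n);
    let mut deaths = Vec::new();
    for (len, i, j) in edges {
        if dsu.union(i, j) {
            deaths.push(len);
        }
    }
    deaths
}
```

For the points 0, 1, 3, 10 this yields deaths at 1, 2, and 7: the long-lived bar ending at 7 is the signal that the last point sits far from the rest.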

The reliability and quality pass. This release made a workspace-wide sweep across every crate: no .unwrap() anywhere under crates/*/src, descriptive .expect(...) messages where a failure is genuinely unrecoverable, and zero clippy warnings under -D warnings. The OBB Jacobi-eigensolver fix in crates/oxicuda-geometry3d/src/mesh/obb.rs is the concrete payoff: the kind of subtle sign error that hides for years in a C++ codebase, caught and corrected here.
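As a rough picture of what a zero-unwrap pass looks like at a single call site, consider the sketch below. The ConfigError type and both functions are hypothetical and not taken from the OxiCUDA source. The pattern is the point: recoverable failures travel through Result and ?, and .expect(...) is kept only for invariants that cannot fail, with a message that says why.

```rust
use std::num::NonZeroU32;

#[derive(Debug, PartialEq)]
enum ConfigError {
    Missing(&'static str),
    BadNumber(String),
}

// Crate-wide, clippy's `unwrap_used` lint (for example via
// `#![deny(clippy::unwrap_used)]` at the crate root) can reject new `.unwrap()` calls.

/// Before the pass this might have been `raw.unwrap().parse().unwrap()`.
/// Both failure modes now surface as typed errors the caller can handle.
fn parse_block_size(raw: Option<&str>) -> Result<u32, ConfigError> {
    let text = raw.ok_or(ConfigError::Missing("block_size"))?;
    text.trim()
        .parse::<u32>()
        .map_err(|_| ConfigError::BadNumber(text.to_string()))
}

/// A genuine invariant: the literal is non-zero, so failure would be a
/// programming error. The message documents that reasoning.
fn default_block_size() -> NonZeroU32 {
    NonZeroU32::new(256).expect("256 is a non-zero literal")
}
```

The difference shows up in production: a malformed value becomes an error the caller can report, not a panic somewhere deep in a kernel launch path.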

The GPU foundation it rides on. Underneath it all sits the foundation that makes any of this possible — oxicuda-driver, oxicuda-memory, and oxicuda-launch for the core runtime, a PTX DSL in oxicuda-ptx targeting SM 7.5 through 10.0 (Turing to Blackwell) with Tensor Core generations handled per arch, and the oxicuda-autotune autotuner that specializes kernels to whatever GPU it finds. On top, oxicuda-blas and oxicuda-dnn provide the dense-linear-algebra and neural-network primitives the science crates lean on. 0.2.0 also expands raw CUDA kernel coverage across the driver, memory, launch, and backend layers.
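The SM range quoted above maps onto NVIDIA's architecture names as follows. This is a sketch with hypothetical names (GpuArch, arch_for_sm, ptx_target), not the oxicuda-ptx or oxicuda-autotune API. It only shows the kind of compute-capability dispatch a per-architecture code generator needs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum GpuArch {
    Turing,
    Ampere,
    Ada,
    Hopper,
    Blackwell,
}

/// Map a compute capability (major, minor) to its architecture family.
/// Anything outside the SM 7.5 to 10.0 range named in the post yields None.
fn arch_for_sm(major: u32, minor: u32) -> Option<GpuArch> {
    match (major, minor) {
        (7, 5) => Some(GpuArch::Turing),
        (8, 0) | (8, 6) | (8, 7) => Some(GpuArch::Ampere),
        (8, 9) => Some(GpuArch::Ada),
        (9, 0) => Some(GpuArch::Hopper),
        (10, 0) => Some(GpuArch::Blackwell),
        _ => None,
    }
}

/// PTX target names follow the `sm_XY` pattern, e.g. `sm_75` or `sm_100`.
fn ptx_target(major: u32, minor: u32) -> String {
    format!("sm_{}{}", major, minor)
}
```

Returning Option keeps unsupported devices an explicit case for the caller to handle, in the same spirit as the zero-unwrap pass.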

Getting Started

Add OxiCUDA and opt into the subsystems you need:

cargo add oxicuda --features blas

Default features are driver, memory, and launch — the foundation you almost always want. Each subsystem is a feature flag (blas, dnn, fft, sparse, solver, rand, autotune, ptx, and full for everything), so you pull in only what your binary actually uses.

A complete GEMM, end to end:

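use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Square 32x32 problem: each matrix holds 1024 elements and every leading
    // dimension is 32, whichever storage layout gemm expects.
    // usize is assumed for the dimension arguments.
    let (m, n, k) = (32usize, 32usize, 32usize);
    let (lda, ldb, ldc) = (m, k, m);
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![2.0f32; k * n];

    // Driver-level setup: device 0, a context on it, and a stream for the work.
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Allocate device buffers and upload the inputs.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = 1.0 * A * B + 0.0 * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32, &d_a, lda, &d_b, ldb,
        0.0f32, &mut d_c, ldc,
    )?;
    // Wait for the stream to finish before reading the result back.
    stream.synchronize()?;

    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}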

No nvcc invocation, no .cu files, no linker flags chasing a CUDA install. cargo build, and you ship.
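When you first wire up a GPU GEMM, it helps to compare the downloaded result against a naive CPU reference. The sketch below is plain Rust with no OxiCUDA dependency. The names gemm_ref and max_abs_diff are hypothetical, and row-major storage is an assumption made for readability (cuBLAS itself is column-major, and the layout expected by oxicuda-blas is not stated here).

```rust
/// Naive row-major reference: C = alpha * A * B + beta * C,
/// with A of shape m x k, B of shape k x n, and C of shape m x n.
fn gemm_ref(
    m: usize, n: usize, k: usize,
    alpha: f32, a: &[f32], b: &[f32],
    beta: f32, c: &mut [f32],
) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            // BLAS convention: with beta == 0 the old contents of C are ignored,
            // so stale NaNs in the output buffer do not leak into the result.
            c[i * n + j] = if beta == 0.0 {
                alpha * acc
            } else {
                alpha * acc + beta * c[i * n + j]
            };
        }
    }
}

/// Largest element-wise absolute difference between two equally sized buffers.
fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    x.iter().zip(y).map(|(a, b)| (a - b).abs()).fold(0.0, f32::max)
}
```

Comparing max_abs_diff against a small tolerance, not exact equality, is the safer check, because a GPU kernel may accumulate the products in a different order than this triple loop.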

What’s New in 0.2.0

The “Wave AAA+64” expansion:

Tips

This is the foundation

OxiCUDA is the mature GPU layer beneath the rest of the COOLJAPAN ecosystem. When SciRS2 and NumRS2 crunch numbers, when ToRSh and TrustformeRS train and run models, when OxiLLaMa serves language models, when OxiONNX executes graphs, when OxiBLAS and OxiFFT provide linear algebra and transforms, and when OxiPhysics and OptiRS simulate and optimize — OxiCUDA is what carries the work to the GPU, in pure Rust, from Turing to Blackwell and onto multi-vendor backends besides. 0.2.0 makes that foundation deeper, broader, and safer.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if you believe GPU computing should be safe, portable, and free of vendor toolchains. Every star tells us to keep building.

Pure Rust GPU computing is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ June 17, 2026
