
OxiCUDA 0.1.3 Released — Documentation and Quality Hardening Across All Crates

A quality-and-docs maintenance release for the pure-Rust NVIDIA CUDA Toolkit replacement — workspace-wide polish, internal version alignment to 0.1.3, and continued growth to ~260K lines of safe Rust across 28 crates. The only runtime dependency is still the NVIDIA driver.


Steady cadence beats a flashy changelog: the days after a debut are for hardening, not new surface area.

Today we released OxiCUDA 0.1.3 — a documentation-and-quality maintenance release that polishes the docs, refines internals, and synchronizes versions across all crates.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. OxiCUDA remains a pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack, and its only runtime dependency is the NVIDIA driver (libcuda.so / nvcuda.dll). The same workspace covers Turing through Blackwell, all from type-safe, memory-safe Rust.

Why 0.1.3 matters

A debut answers “does it build and run?” The releases that follow answer “is it dependable?” That is what 0.1.3 is about. In the short window since the 0.1.0 debut on April 13, the focus has been production hardening rather than expanding the API surface. Concretely, this release brings polished documentation, refined internals, and version numbers synchronized to 0.1.3 across all crates.

Hardening also protects a performance bar. OxiCUDA’s standing performance targets are the contract a maintenance release exists to keep intact while the code matures: SGEMM FP32 at ≥95% of cuBLAS, HGEMM FP16 at ≥95% of cuBLAS via Tensor Core WMMA/MMA, batched GEMM at ≥95% (Stream-K), Conv FP16 at ≥90% of cuDNN, FlashAttention at ≥90% of FA2, power-of-two FFT at ≥90% of cuFFT, SpMV CSR at ≥85% of cuSPARSE, and dense LU/QR/SVD at ≥85% of cuSOLVER.
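To make a target such as "SGEMM at ≥95% of cuBLAS" concrete, the sketch below converts a kernel time into GFLOP/s using the standard 2·m·n·k operation count for GEMM and compares it against a baseline. It is only an illustration: the function names are hypothetical (not OxiCUDA APIs) and the timings in main are placeholder values, not measured results.

```rust
/// Floating-point operations in one GEMM: each of the m*n outputs
/// needs k multiplies and k adds.
fn gemm_flops(m: u64, n: u64, k: u64) -> u64 {
    2 * m * n * k
}

/// Throughput in GFLOP/s for a GEMM that took `seconds` to run.
fn gflops(m: u64, n: u64, k: u64, seconds: f64) -> f64 {
    gemm_flops(m, n, k) as f64 / seconds / 1e9
}

/// True when `ours` reaches at least `target` (e.g. 0.95) of `baseline`.
fn meets_target(ours: f64, baseline: f64, target: f64) -> bool {
    ours >= target * baseline
}

fn main() {
    // Placeholder timings for a 4096 x 4096 x 4096 SGEMM.
    // Replace both with real measurements from the same GPU.
    let (m, n, k) = (4096, 4096, 4096);
    let ours = gflops(m, n, k, 0.0105);
    let baseline = gflops(m, n, k, 0.0100);

    println!("ours = {ours:.0} GFLOP/s, baseline = {baseline:.0} GFLOP/s");
    println!(
        "ratio = {:.3}, meets 95% target: {}",
        ours / baseline,
        meets_target(ours, baseline, 0.95)
    );
}
```

Because both runs perform the same number of operations, the throughput ratio reduces to the inverse ratio of the two timings.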

Technical Deep Dive

Nothing structural changed in 0.1.3, which is the point. The stable 10-Volume, 28-crate architecture keeps building cleanly with synchronized versions and refreshed docs:

Foundation and codegen. Volume 1 (driver, memory, launch, runtime) wraps the CUDA Driver API through libloading, manages DeviceBuffer<T> and PinnedBuffer<T>, and provides the launch! macro. Volume 2 generates optimized PTX directly from Rust data structures (SM 7.5–10.0, Tensor Core WMMA/MMA/WGMMA) and ships the autotuner, which benchmarks kernel variants per GPU architecture and dispatches through three tiers: cached, tuned, and default.
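One way to picture a three-tier cached/tuned/default dispatch is as a lookup chain. The sketch below is an illustration under stated assumptions, not OxiCUDA's implementation: KernelConfig, Tier, Dispatcher, and the tile sizes are hypothetical, and it assumes "cached" means a choice already resolved in memory while "tuned" means a stored benchmark result for that architecture, which this post does not spell out.

```rust
use std::collections::HashMap;

/// Hypothetical kernel parameters an autotuner might choose between.
#[derive(Clone, Copy, Debug, PartialEq)]
struct KernelConfig {
    tile_m: u32,
    tile_n: u32,
}

/// Which tier answered the lookup.
#[derive(Debug, PartialEq)]
enum Tier {
    Cached,
    Tuned,
    Default,
}

/// Lookup key: (SM version, kernel name).
type Key = (u32, String);

struct Dispatcher {
    cache: HashMap<Key, KernelConfig>, // tier 1: already resolved in this process
    tuned: HashMap<Key, KernelConfig>, // tier 2: stored benchmark results
    fallback: KernelConfig,            // tier 3: safe default
}

impl Dispatcher {
    fn new(fallback: KernelConfig) -> Self {
        Self { cache: HashMap::new(), tuned: HashMap::new(), fallback }
    }

    fn add_tuned(&mut self, sm: u32, kernel: &str, cfg: KernelConfig) {
        self.tuned.insert((sm, kernel.to_string()), cfg);
    }

    /// Try the cache, then tuned results, then fall back to the default.
    fn select(&mut self, sm: u32, kernel: &str) -> (Tier, KernelConfig) {
        let key = (sm, kernel.to_string());
        if let Some(cfg) = self.cache.get(&key) {
            return (Tier::Cached, *cfg);
        }
        if let Some(cfg) = self.tuned.get(&key).copied() {
            // Promote the tuned result so the next lookup hits the cache.
            self.cache.insert(key, cfg);
            return (Tier::Tuned, cfg);
        }
        (Tier::Default, self.fallback)
    }
}

fn main() {
    let mut d = Dispatcher::new(KernelConfig { tile_m: 64, tile_n: 64 });
    d.add_tuned(90, "sgemm", KernelConfig { tile_m: 128, tile_n: 256 });

    println!("{:?}", d.select(90, "sgemm")); // served from tuned results
    println!("{:?}", d.select(90, "sgemm")); // served from the cache
    println!("{:?}", d.select(75, "sgemm")); // no data, uses the default
}
```

The fallback tier is what keeps a kernel launchable on a GPU that has never been benchmarked.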

Compute libraries. BLAS (Volume 3, full cuBLAS equivalent) and DNN (Volume 4, full cuDNN equivalent with FlashAttention v2, PagedAttention, MoE) sit on top, alongside the Scientific suite of Volume 5 — FFT, Sparse, Solver, and Rand — covering the cuFFT / cuSPARSE / cuSOLVER / cuRAND surface.

Higher-level and portable. Volumes 6–10 add Signal, Computational Graph, Training, Inference, and RL, while the 7 backends (the ComputeBackend trait, CUB-equivalent primitives, plus Metal, Vulkan, WebGPU, ROCm, and Level Zero) keep the stack portable. A maintenance release like this one keeps every piece of that compiling cleanly together with version numbers in lockstep and documentation refreshed.
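The portability idea, one trait with several interchangeable backends, can be sketched in a few lines. Everything below is hypothetical: this post does not show the methods of the real ComputeBackend trait, so the sketch uses a stand-in trait named SketchBackend with a single axpy operation and two toy implementations.

```rust
/// Stand-in for a backend abstraction (NOT the real ComputeBackend trait).
trait SketchBackend {
    fn name(&self) -> &'static str;
    fn is_available(&self) -> bool;
    /// Computes y = alpha * x + y element-wise.
    fn axpy(&self, alpha: f32, x: &[f32], y: &mut [f32]);
}

/// Toy backend that pretends no GPU is present.
struct FakeGpu;

/// Toy backend that always works, running on the CPU.
struct CpuFallback;

impl SketchBackend for FakeGpu {
    fn name(&self) -> &'static str {
        "fake-gpu"
    }
    fn is_available(&self) -> bool {
        false
    }
    fn axpy(&self, _alpha: f32, _x: &[f32], _y: &mut [f32]) {
        unreachable!("never selected because is_available() returns false");
    }
}

impl SketchBackend for CpuFallback {
    fn name(&self) -> &'static str {
        "cpu"
    }
    fn is_available(&self) -> bool {
        true
    }
    fn axpy(&self, alpha: f32, x: &[f32], y: &mut [f32]) {
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += alpha * xi;
        }
    }
}

/// Returns the first backend in priority order that reports itself available.
fn pick(backends: &[Box<dyn SketchBackend>]) -> Option<&dyn SketchBackend> {
    for b in backends {
        if b.is_available() {
            return Some(b.as_ref());
        }
    }
    None
}

fn main() {
    let backends: Vec<Box<dyn SketchBackend>> =
        vec![Box::new(FakeGpu), Box::new(CpuFallback)];
    let backend = pick(&backends).expect("at least one backend is available");

    let x = [1.0f32, 2.0, 3.0];
    let mut y = [10.0f32, 10.0, 10.0];
    backend.axpy(2.0, &x, &mut y);
    println!("{} computed {:?}", backend.name(), y);
}
```

Calling code depends only on the trait, so adding a backend in this sketch means adding one more implementation to the priority list.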

Getting Started

Add the umbrella crate. The GEMM example below uses BlasHandle, so enable the opt-in blas feature as well:

cargo add oxicuda --features blas

A minimal GEMM looks like this:

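use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Square 32x32 matrices, so each buffer holds 32 * 32 = 1024 elements.
    let (m, n, k) = (32usize, 32usize, 32usize);
    // With square matrices and no transposes, every leading dimension is 32.
    let (lda, ldb, ldc) = (32, 32, 32);

    // Host-side inputs; replace the constant fill with real data.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![1.0f32; k * n];

    // Select GPU 0 and create a context and a stream on it.
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Allocate zero-initialized device buffers for A, B, and C.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;

    // Upload the inputs.
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the kernel to finish, then download the result.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}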
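When trying the example, it can help to compare the downloaded result against a plain CPU computation of C = alpha·A·B + beta·C. The sketch below is independent of OxiCUDA: gemm_ref and max_abs_diff are hypothetical helpers, and the code assumes column-major storage (the cuBLAS convention), which this post does not confirm for OxiCUDA.

```rust
/// Reference GEMM without transposes: C = alpha * A * B + beta * C.
/// Assumes column-major storage: element (row, col) of a matrix with
/// leading dimension `ld` lives at index `row + col * ld`.
#[allow(clippy::too_many_arguments)]
fn gemm_ref(
    m: usize, n: usize, k: usize,
    alpha: f32,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    beta: f32,
    c: &mut [f32], ldc: usize,
) {
    for col in 0..n {
        for row in 0..m {
            // Dot product of row `row` of A with column `col` of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[row + p * lda] * b[p + col * ldb];
            }
            let idx = row + col * ldc;
            c[idx] = alpha * acc + beta * c[idx];
        }
    }
}

/// Largest element-wise absolute difference between two slices.
fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(a, b)| (a - b).abs()).fold(0.0, f32::max)
}

fn main() {
    // A = [[1, 2], [3, 4]] stored column by column; B is the identity,
    // so the product should reproduce A.
    let a = [1.0f32, 3.0, 2.0, 4.0];
    let b = [1.0f32, 0.0, 0.0, 1.0];
    let mut c = [0.0f32; 4];

    gemm_ref(2, 2, 2, 1.0, &a, 2, &b, 2, 0.0, &mut c, 2);
    println!("max abs diff vs expected: {}", max_abs_diff(&c, &a));
}
```

For a real comparison, pass the same host inputs to gemm_ref and check max_abs_diff against a small tolerance rather than exact equality, since GPU kernels may accumulate in a different order.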

The driver, memory, and launch features are on by default. Everything else — blas, dnn, fft, sparse, solver, rand, ptx, autotune, the alternate backends, and more — is opt-in, so you only pull in what you use.

What’s New in 0.1.3

Tips

Part of a sovereign GPU stack

OxiCUDA is the GPU compute layer of the COOLJAPAN ecosystem. Above it, SciRS2, OxiONNX, TrustformeRS, and ToRSh consume OxiCUDA directly as their GPU backend. Alongside it, OxiBLAS and OxiFFT serve as pure-Rust linear-algebra and FFT siblings, while OxiLLaMa (shipped April 15) builds LLM inference on this kind of foundation and OxiEML (April 14) rounds out the applied-ML neighborhood. The whole stack rests on one runtime dependency — the NVIDIA driver — with no proprietary toolkit underneath.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a pure-Rust CUDA stack with a steady, trustworthy release cadence is something you want to follow. Maintenance releases like this one are the quiet work that makes the big ones safe.

KitaSan at COOLJAPAN OÜ, April 17, 2026
