OxiCUDA 0.1.3 Released — Documentation and Quality Hardening Across All Crates

Steady cadence beats a flashy changelog: the days after a debut are for hardening, not new surface area.

Today we released OxiCUDA 0.1.3 — a documentation-and-quality maintenance release that polishes the docs, refines internals, and synchronizes versions across all crates.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. OxiCUDA remains a pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack, and its only runtime dependency is the NVIDIA driver (libcuda.so / nvcuda.dll). The same workspace covers Turing through Blackwell, all from type-safe, memory-safe Rust.

Why 0.1.3 matters

A debut answers “does it build and run?” The releases that follow answer “is it dependable?” That is what 0.1.3 is about. In the short window since the 0.1.0 debut on April 13, the focus has been production hardening rather than expanding the API surface. Concretely, this release brings:

Documentation improvements across all 28 crates — clearer per-crate docs so each Volume reads cleanly on its own.
Internal quality refinements — continued tightening of the existing code paths, not new features bolted on top.
Version alignment — every internal dependency is now pinned to 0.1.3, so the whole workspace moves together.
Continued growth — the codebase reached ~260K lines of safe Rust across 28 crates, up from ~248K at 0.1.1. That delta is hardening and polish, not a new Volume.

These are the standing performance targets that the hardening protects: SGEMM FP32 at ≥95% of cuBLAS, HGEMM FP16 at ≥95% of cuBLAS via Tensor Core WMMA/MMA, batched GEMM at ≥95% (Stream-K), Conv FP16 at ≥90% of cuDNN, FlashAttention at ≥90% of FA2, power-of-two FFT at ≥90% of cuFFT, SpMV CSR at ≥85% of cuSPARSE, and dense LU/QR/SVD at ≥85% of cuSOLVER. They are the contract a maintenance release exists to keep intact while the code matures.
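To make a target such as "≥95% of cuBLAS" concrete, the arithmetic can be sketched as below. This is only an illustration: the function names are hypothetical, the timings would come from your own benchmark runs, and it relies on the usual convention that a GEMM with an m x k and a k x n operand costs about 2*m*n*k floating-point operations.

```rust
/// Approximate floating-point operations for C = alpha*A*B + beta*C,
/// where A is m x k and B is k x n: one multiply and one add per inner term.
fn gemm_flops(m: u64, n: u64, k: u64) -> f64 {
    2.0 * m as f64 * n as f64 * k as f64
}

/// Throughput in GFLOP/s for a run that took `seconds`.
fn gflops(flops: f64, seconds: f64) -> f64 {
    flops / seconds / 1e9
}

/// True when the candidate reaches at least `target` (for example 0.95)
/// of the reference throughput.
fn meets_target(candidate_gflops: f64, reference_gflops: f64, target: f64) -> bool {
    candidate_gflops >= target * reference_gflops
}
```

For example, if a reference library sustains 100 GFLOP/s on some problem, a candidate needs at least 95 GFLOP/s on the same problem and hardware to meet a 0.95 target.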

Technical Deep Dive

Nothing structural changed in 0.1.3, which is the point. The stable 10-Volume, 28-crate architecture keeps building cleanly with synchronized versions and refreshed docs:

Foundation and codegen. Volume 1 (driver, memory, launch, runtime) wraps the CUDA Driver API through libloading, manages DeviceBuffer<T> / PinnedBuffer<T>, and provides the launch! macro. Volume 2 generates optimized PTX directly from Rust data structures (SM 7.5–10.0, Tensor Core WMMA/MMA/WGMMA) and ships the autotuner, which benchmarks kernel variants per GPU architecture and dispatches through three tiers: cached, tuned, and default.
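The three-tier dispatch can be pictured with a small, self-contained sketch. Everything here is hypothetical (the types, the key shape, and the meaning assigned to each tier are assumptions for illustration, not OxiCUDA's actual autotuner API): an in-process cache is consulted first, then a table of previously tuned results, and finally a safe default.

```rust
use std::collections::HashMap;

/// Hypothetical kernel configuration; the fields are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq)]
struct KernelConfig {
    tile_m: u32,
    tile_n: u32,
}

/// Which tier supplied the configuration.
#[derive(Debug, PartialEq)]
enum Tier {
    Cached,
    Tuned,
    Default,
}

/// Assumed lookup key: (SM architecture, problem-size bucket).
type Key = (u32, u32);

struct Dispatcher {
    cache: HashMap<Key, KernelConfig>, // tier 1: lookups already resolved in this process
    tuned: HashMap<Key, KernelConfig>, // tier 2: winners recorded by an earlier tuning run
    fallback: KernelConfig,            // tier 3: default used when nothing is known
}

impl Dispatcher {
    /// Resolve a configuration, promoting tuned hits into the cache.
    fn select(&mut self, key: Key) -> (KernelConfig, Tier) {
        if let Some(cfg) = self.cache.get(&key).copied() {
            return (cfg, Tier::Cached);
        }
        if let Some(cfg) = self.tuned.get(&key).copied() {
            self.cache.insert(key, cfg);
            return (cfg, Tier::Tuned);
        }
        (self.fallback, Tier::Default)
    }
}
```

In this sketch the first lookup for a tuned key reports the tuned tier, repeat lookups report the cached tier, and an unknown key falls through to the default.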

Compute libraries. BLAS (Volume 3, full cuBLAS equivalent) and DNN (Volume 4, full cuDNN equivalent with FlashAttention v2, PagedAttention, MoE) sit on top, alongside the Scientific suite of Volume 5 — FFT, Sparse, Solver, and Rand — covering the cuFFT / cuSPARSE / cuSOLVER / cuRAND surface.

Higher-level and portable. Volumes 6–10 add Signal, Computational Graph, Training, Inference, and RL, while the 7 backends (the ComputeBackend trait, CUB-equivalent primitives, plus Metal, Vulkan, WebGPU, ROCm, and Level Zero) keep the stack portable. A maintenance release like this one keeps every piece of that compiling cleanly together with version numbers in lockstep and documentation refreshed.
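The portability idea behind a backend trait can be sketched in a few lines. The trait below is a hypothetical stand-in written for this post, not the real ComputeBackend definition, and its method set is an assumption; it only shows how caller code that depends on a trait stays unchanged when the backend behind it is swapped.

```rust
/// Hypothetical stand-in for a backend abstraction; not OxiCUDA's real trait.
trait SketchBackend {
    fn name(&self) -> &'static str;
    /// Computes y <- alpha * x + y element-wise.
    fn axpy(&self, alpha: f32, x: &[f32], y: &mut [f32]);
}

/// A plain CPU implementation used as the reference backend in this sketch.
struct CpuReference;

impl SketchBackend for CpuReference {
    fn name(&self) -> &'static str {
        "cpu-reference"
    }

    fn axpy(&self, alpha: f32, x: &[f32], y: &mut [f32]) {
        assert_eq!(x.len(), y.len(), "x and y must have the same length");
        for (yi, xi) in y.iter_mut().zip(x) {
            *yi += alpha * xi;
        }
    }
}

/// Caller code sees only the trait, so any backend implementing it can be passed in.
fn scale_and_add(backend: &dyn SketchBackend, alpha: f32, x: &[f32], y: &mut [f32]) {
    backend.axpy(alpha, x, y);
}
```

A GPU-backed implementation would provide the same two methods, and scale_and_add would not need to change.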

Getting Started

Add the umbrella crate:

cargo add oxicuda

A minimal GEMM looks like this. BLAS is an opt-in feature, so enable it first with cargo add oxicuda --features blas:

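use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Problem size: C (m x n) = A (m x k) * B (k x n).
    // 32 x 32 matrices give 1024 elements each. usize is assumed here;
    // check the gemm signature for the exact integer type.
    let (m, n, k) = (32usize, 32usize, 32usize);
    // Leading dimensions, assuming column-major storage with no padding.
    let (lda, ldb, ldc) = (m, k, m);

    // Host inputs (placeholder data).
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![1.0f32; k * n];

    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Device buffers sized to match each matrix.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;

    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the GEMM to finish before reading the result back.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}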
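One way to sanity-check the values copied back from the device is to compare them with a naive CPU GEMM. The sketch below is plain Rust with no OxiCUDA dependency; the helper names are hypothetical, and it assumes column-major storage with leading dimensions in the cuBLAS style, which may not match the layout the BLAS handle expects.

```rust
/// Reference C = alpha * A * B + beta * C with no transposes.
/// Assumes column-major storage: element (i, j) of A lives at a[i + j * lda].
fn gemm_ref(
    m: usize, n: usize, k: usize,
    alpha: f32,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    beta: f32,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            let idx = i + j * ldc;
            // With beta == 0 the old contents of C are ignored entirely.
            c[idx] = if beta == 0.0 {
                alpha * acc
            } else {
                alpha * acc + beta * c[idx]
            };
        }
    }
}

/// Largest absolute element-wise difference between two equally sized slices.
fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len(), "slices must have the same length");
    x.iter().zip(y).map(|(a, b)| (a - b).abs()).fold(0.0f32, f32::max)
}
```

Running gemm_ref on the same host inputs and checking that max_abs_diff against the device result stays within a small tolerance gives a quick correctness check; the right tolerance depends on the problem size and precision.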

The driver, memory, and launch features are on by default. Everything else — blas, dnn, fft, sparse, solver, rand, ptx, autotune, the alternate backends, and more — is opt-in, so you only pull in what you use.

What’s New in 0.1.3

Documentation and quality improvements across all crates.
All internal dependency versions bumped to 0.1.3 for a fully synchronized workspace.
Continued codebase growth to ~260K lines of safe Rust across 28 crates.

Tips

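Require 0.1.3 to pick up the latest documentation and quality work: oxicuda = "0.1.3". Cargo reads this as a caret requirement (>=0.1.3, <0.2.0); write "=0.1.3" if you need an exact pin.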
Enable only the feature flags you need. Defaults stay lean (driver, memory, launch); add blas or dnn explicitly when you need them to keep build times and surface area small.
Turn on autotune for per-GPU tuned kernels — the autotuner benchmarks variants for your arch and caches the winners, so production runs hit the tuned path.
Build with cargo build alone. There is no CUDA SDK, nvcc, or C/C++ toolchain to install; the only runtime requirement is the NVIDIA driver.
Browse the refreshed per-crate docs on docs.rs — each Volume now reads cleanly as a standalone crate, which makes finding the right entry point easier.

Part of a sovereign GPU stack

OxiCUDA is the GPU compute layer of the COOLJAPAN ecosystem. Above it, SciRS2, OxiONNX, TrustformeRS, and ToRSh consume OxiCUDA directly as their GPU backend. Alongside it, OxiBLAS and OxiFFT serve as pure-Rust linear-algebra and FFT siblings, while OxiLLaMa (shipped April 15) builds LLM inference on this kind of foundation and OxiEML (April 14) rounds out the applied-ML neighborhood. The whole stack rests on one runtime dependency — the NVIDIA driver — with no proprietary toolkit underneath.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a pure-Rust CUDA stack with a steady, trustworthy release cadence is something you want to follow. Maintenance releases like this one are the quiet work that makes the big ones safe.

— KitaSan at COOLJAPAN OÜ April 17, 2026