OxiCUDA 0.1.4 Released — Continued Quality and Documentation Polish

The polish continues — one tidy step a day keeps a 28-crate workspace honest.

Today we released OxiCUDA 0.1.4 — a maintenance release with documentation and quality improvements across all crates.

OxiCUDA replaces the entire NVIDIA CUDA Toolkit software stack with type-safe, memory-safe Rust. No CUDA SDK. No nvcc. No C/C++ toolchain at build time: cargo build is the whole story. The only runtime dependency is the NVIDIA driver (libcuda.so / nvcuda.dll). PTX is generated and autotuned per architecture, from Turing through Blackwell, with near-peak throughput as the goal.

Why 0.1.4 matters

Let’s be candid: this is a small, steady release that lands the day after 0.1.3, squarely in the early-life hardening phase. There is no new public surface here. What it offers is discipline, the unglamorous kind of work that makes a 28-crate workspace dependable.

Documentation and quality improvements across all crates. More of the same daily polish: clearer docs, cleaner internals.
All internal dependency versions aligned to 0.1.4. The whole workspace now ships in lockstep — every crate references its siblings at exactly the same version, so there is no drift to reason about when you pull OxiCUDA into a project.
The architecture is unchanged. It remains the stable ~260K-line, 28-crate, 10-Volume + 7-backend stack you got in 0.1.3.

If 0.1.3 was about completing the version sync, 0.1.4 is about keeping the cadence: one tidy release a day, tighter release hygiene, nothing surprising.

What’s stable

Since there is little new to deep-dive, here is the architecture you can rely on today:

Foundation — oxicuda-driver (device/context/stream/event/module, multi-GPU pool), oxicuda-memory (DeviceBuffer<T>, PinnedBuffer<T>, unified memory, async pool), oxicuda-launch (Dim3, LaunchParams, the launch! macro), oxicuda-runtime (high-level streams/events/texture/surface).
Codegen — oxicuda-ptx (full PTX IR + Rust DSL, Tensor Core WMMA/MMA/WGMMA, GEMM/attention/reduction templates) and oxicuda-autotune (per-arch kernel benchmarking, Bayesian/SA/GA search, cached PTX dispatch).
BLAS + DNN — oxicuda-blas (cuBLAS-equivalent L1/L2/L3, Tensor Core GEMM, batched, F16/BF16/TF32/F32/F64/FP8) and oxicuda-dnn (cuDNN-equivalent conv, FlashAttention v2, PagedAttention, MoE, norms).
Scientific — oxicuda-fft (cuFFT), oxicuda-sparse (cuSPARSE), oxicuda-solver (cuSOLVER), oxicuda-rand (cuRAND).
Signal / Graph / Training / Inference / RL — oxicuda-signal, oxicuda-graph, oxicuda-train + oxicuda-quant, oxicuda-infer + oxicuda-dist-infer + oxicuda-lm, and oxicuda-rl.
7 GPU backends — NVIDIA plus oxicuda-metal, oxicuda-vulkan, oxicuda-webgpu, oxicuda-rocm, and oxicuda-levelzero.
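
To make the "cached PTX dispatch" idea behind oxicuda-autotune concrete, here is a minimal, self-contained sketch of a tuning cache. It is not OxiCUDA's implementation: `Key`, `Tuner`, and `pick` are hypothetical names, the timing closure stands in for real kernel benchmarking, and a plain exhaustive scan replaces the Bayesian/SA/GA search.

```rust
use std::collections::HashMap;

/// Hypothetical cache key: one tuned choice per (architecture, kernel, shape).
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct Key {
    arch: u32, // compute capability, e.g. 90 for sm_90
    kernel: &'static str,
    shape: (usize, usize, usize),
}

/// Hypothetical tuner: remembers the fastest variant index for each key.
struct Tuner {
    cache: HashMap<Key, usize>,
    benchmarks_run: usize,
}

impl Tuner {
    fn new() -> Self {
        Tuner { cache: HashMap::new(), benchmarks_run: 0 }
    }

    /// Returns the index of the fastest variant in `0..candidates`.
    /// `time_variant` stands in for launching and timing a real kernel.
    fn pick<F: FnMut(usize) -> f64>(
        &mut self,
        key: Key,
        candidates: usize,
        mut time_variant: F,
    ) -> usize {
        if let Some(&best) = self.cache.get(&key) {
            return best; // cache hit: nothing is benchmarked again
        }
        let mut best = 0;
        let mut best_time = f64::INFINITY;
        for variant in 0..candidates {
            let t = time_variant(variant);
            self.benchmarks_run += 1;
            if t < best_time {
                best = variant;
                best_time = t;
            }
        }
        self.cache.insert(key, best);
        best
    }
}

fn main() {
    // Made-up timings in milliseconds for three kernel variants.
    let fake_times_ms = [3.0, 1.5, 2.2];
    let key = Key { arch: 90, kernel: "sgemm", shape: (4096, 4096, 4096) };
    let mut tuner = Tuner::new();
    let first = tuner.pick(key.clone(), fake_times_ms.len(), |v| fake_times_ms[v]);
    let second = tuner.pick(key, fake_times_ms.len(), |v| fake_times_ms[v]);
    println!(
        "picked variant {first}, then {second}; benchmarks run: {}",
        tuner.benchmarks_run
    );
}
```

The second `pick` call returns the cached choice without invoking the timing closure, which is the property a dispatch cache is meant to provide.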

The standing performance targets are unchanged too — SGEMM ≥95% of cuBLAS, HGEMM ≥95% (Tensor Core), FFT pow2 ≥90% of cuFFT, SpMV CSR ≥85% of cuSPARSE, LU/QR/SVD ≥85% of cuSOLVER. These are targets we build toward, not a benchmark sheet.
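
For readers who want to track these targets in their own measurements, the following sketch shows one way to express them as ratios against a reference library. The `Target` struct, `gemm_gflops`, and `meets_target` are hypothetical helpers, and the timings in `main` are invented for illustration, not benchmark results.

```rust
/// One performance target from the list above: a name and the minimum
/// fraction of the reference library's throughput.
#[derive(Debug, Clone, Copy)]
struct Target {
    name: &'static str,
    min_ratio: f64,
}

const TARGETS: [Target; 5] = [
    Target { name: "SGEMM", min_ratio: 0.95 },
    Target { name: "HGEMM (Tensor Core)", min_ratio: 0.95 },
    Target { name: "FFT pow2", min_ratio: 0.90 },
    Target { name: "SpMV CSR", min_ratio: 0.85 },
    Target { name: "LU/QR/SVD", min_ratio: 0.85 },
];

/// GEMM throughput in GFLOP/s, using the usual 2*m*n*k operation count.
fn gemm_gflops(m: usize, n: usize, k: usize, seconds: f64) -> f64 {
    2.0 * m as f64 * n as f64 * k as f64 / seconds / 1e9
}

/// True when our throughput reaches the target fraction of the reference.
fn meets_target(target: &Target, ours: f64, reference: f64) -> bool {
    ours / reference >= target.min_ratio
}

fn main() {
    // Hypothetical timings for a 4096 x 4096 x 4096 SGEMM; not measured results.
    let ours = gemm_gflops(4096, 4096, 4096, 0.0100);
    let reference = gemm_gflops(4096, 4096, 4096, 0.0096);
    let sgemm = &TARGETS[0];
    println!(
        "{}: {:.1}% of reference, target met: {}",
        sgemm.name,
        100.0 * ours / reference,
        meets_target(sgemm, ours, reference)
    );
}
```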

Getting Started

cargo add oxicuda --features blas

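use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Square 32 x 32 problem, so each matrix holds 32 * 32 = 1024 elements.
    let (m, n, k) = (32usize, 32usize, 32usize);
    // With square matrices every leading dimension is 32 in either layout.
    let (lda, ldb, ldc) = (m, k, m);
    // Sample host inputs: A is all ones, B is all twos.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![2.0f32; k * n];

    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;

    // Upload the inputs to the device.
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;
    // Wait for the GEMM to finish before reading the result back.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}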

The default features (driver, memory, launch) give you device init and buffers out of the box. Everything heavier — blas, dnn, fft, sparse, and friends — is opt-in, so you pull in only what you use.
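
To sanity-check what comes back in `result`, the GPU output can be compared against a CPU reference. The sketch below is plain Rust with no OxiCUDA calls. `gemm_ref` and `max_abs_diff` are hypothetical helpers, and column-major storage (the BLAS convention) is an assumption, since the example above does not state which layout `gemm` expects.

```rust
/// Reference C = alpha * A * B + beta * C with no transposes.
/// Assumes column-major storage: element (i, j) of A is a[i + j * lda].
fn gemm_ref(
    m: usize, n: usize, k: usize,
    alpha: f32, a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    beta: f32, c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for l in 0..k {
                acc += a[i + l * lda] * b[l + j * ldb];
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}

/// Largest element-wise absolute difference between two slices.
fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    x.iter().zip(y).map(|(p, q)| (p - q).abs()).fold(0.0, f32::max)
}

fn main() {
    let (m, n, k) = (32usize, 32usize, 32usize);
    let a = vec![1.0f32; m * k];
    let b = vec![2.0f32; k * n];
    let mut expected = vec![0.0f32; m * n];
    gemm_ref(m, n, k, 1.0, &a, m, &b, k, 0.0, &mut expected, m);

    // Each entry is a sum of k products 1.0 * 2.0, so 64.0 for k = 32.
    // In a real check, `gpu_result` would be the vector copied back from the device.
    let gpu_result = vec![64.0f32; m * n]; // placeholder, not real GPU output
    println!("max abs diff: {}", max_abs_diff(&expected, &gpu_result));
}
```

For reduced-precision types such as F16 or BF16, an exact match is unlikely, so a tolerance on `max_abs_diff` is the more realistic check.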

What’s New in 0.1.4

Documentation and quality improvements across all crates.
All internal dependency versions bumped to 0.1.4.

That’s the whole list. No new APIs — just polish and version alignment.

Tips

Pin one consistent OxiCUDA version across your dependencies. The workspace is version-locked, so matching the umbrella crate and any direct subcrate references to the same 0.1.4 keeps resolution clean.
Enable only the feature flags you need. Compile times and binary size both thank you.
```
oxicuda = { version = "0.1.4", features = ["blas", "dnn"] }
```
Reach for full when you’re experimenting. It pulls everything in so you can poke at the whole stack without curating flags.
```
oxicuda = { version = "0.1.4", features = ["full"] }
```
Build without the CUDA SDK. cargo build is all you need — no nvcc, no C toolchain, no pkg-config. The driver is only required at runtime.
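
For the version-pinning tip, one way to keep the umbrella crate and a directly referenced subcrate in step is an exact-version requirement in Cargo.toml. This is only a sketch: whether you need a direct oxicuda-blas dependency at all depends on your project, and the `=` pin is an optional, stricter choice than Cargo's default caret requirement.

```toml
[dependencies]
# Exact pins keep the umbrella crate and the subcrate on the same release.
oxicuda = { version = "=0.1.4", features = ["blas"] }
oxicuda-blas = "=0.1.4"
```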

Part of a sovereign GPU stack

OxiCUDA does not stand alone. SciRS2, OxiONNX, TrustformeRS, and ToRSh all consume it as their GPU layer, while OxiBLAS and OxiFFT are sibling libraries for dense linear algebra and FFT. For LLM workloads, OxiLLaMa builds on top of the same foundation, and OxiEML rounds out the early-life ML tooling — all pure Rust, all part of the same C/C++/Fortran-free push.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a CUDA stack you can build with nothing but cargo sounds like your kind of thing — and follow along, because the daily polish keeps rolling.

— KitaSan at COOLJAPAN OÜ April 18, 2026