OxiCUDA 0.1.4 Released — Continued Quality and Documentation Polish

A small maintenance release for OxiCUDA, the pure-Rust replacement for the NVIDIA CUDA Toolkit. It brings workspace-wide documentation and quality improvements, with all 28 crates aligned to 0.1.4 so the stack ships in lockstep. The only runtime dependency is the NVIDIA driver.

The polish continues — one tidy step a day keeps a 28-crate workspace honest.

Today we released OxiCUDA 0.1.4 — a maintenance release with documentation and quality improvements across all crates.

OxiCUDA replaces the entire NVIDIA CUDA Toolkit software stack with type-safe, memory-safe Rust. No CUDA SDK. No nvcc. No C/C++ toolchain at build time: cargo build is the whole story. The only runtime dependency is the NVIDIA driver (libcuda.so / nvcuda.dll). PTX is generated and autotuned with the aim of running near peak on GPU architectures from Turing through Blackwell.

Why 0.1.4 matters

Let’s be candid: this is a small, steady release, the day after 0.1.3, squarely in the early-life hardening phase. There is no new public surface here. What it does bring is discipline: the unglamorous kind of work that makes a 28-crate workspace dependable.

If 0.1.3 was about closing the version sync, 0.1.4 is about keeping the cadence: one tidy release a day, release hygiene tightened, nothing surprising.

What’s stable

Since there is little new to deep-dive, here is the architecture you can rely on today:

The standing performance targets are unchanged too: SGEMM ≥95% of cuBLAS, HGEMM ≥95% of cuBLAS (Tensor Core), power-of-two FFT ≥90% of cuFFT, SpMV CSR ≥85% of cuSPARSE, and LU/QR/SVD ≥85% of cuSOLVER. These are targets we build toward, not a benchmark sheet.
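To make a target such as "SGEMM ≥95% of cuBLAS" concrete, the arithmetic behind it can be sketched in a few lines of plain Rust. This is an illustration only and not part of OxiCUDA: the function names gemm_flops, gflops, and meets_target are hypothetical, and the 2·m·n·k operation count is the conventional estimate for GEMM rather than something this post specifies.

```rust
/// Conventional floating-point operation count for one GEMM
/// (C = alpha * A * B + beta * C, with A being m x k and B being k x n):
/// one multiply and one add per inner-product step.
pub fn gemm_flops(m: u64, n: u64, k: u64) -> f64 {
    2.0 * (m as f64) * (n as f64) * (k as f64)
}

/// Throughput in GFLOP/s for `flops` operations completed in `seconds`.
pub fn gflops(flops: f64, seconds: f64) -> f64 {
    flops / seconds / 1e9
}

/// True when `measured` reaches at least `target_fraction` of `reference`.
/// Both throughputs must be in the same unit, for example GFLOP/s.
pub fn meets_target(measured: f64, reference: f64, target_fraction: f64) -> bool {
    measured >= target_fraction * reference
}
```

For a 4096 × 4096 × 4096 SGEMM this counts roughly 137.4 GFLOP of work. If a reference library sustained a throughput R on the same problem, the 95% target would be met at 0.95 · R or above.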

Getting Started

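cargo add oxicuda --features blas

use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // 32x32 matrices, so each buffer holds 32 * 32 = 1024 elements.
    let (m, n, k) = (32, 32, 32);
    // Leading dimensions of A, B, and C.
    let (lda, ldb, ldc) = (32, 32, 32);
    // Example host data to upload.
    let host_a = vec![1.0f32; 1024];
    let host_b = vec![2.0f32; 1024];

    // Pick the first GPU and create a context and a stream on it.
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Allocate zero-initialized device buffers for A, B, and C.
    let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;

    // Upload the inputs.
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C, with neither operand transposed.
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the GEMM to finish, then copy the result back to the host.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}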
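One way to sanity-check the result copied back from the device is a naive host-side reference. The sketch below is plain Rust with no OxiCUDA calls. The helper names gemm_reference and max_abs_diff are hypothetical, and it assumes the column-major, leading-dimension convention of classic BLAS with no transposition, which this post does not confirm for OxiCUDA's gemm.

```rust
/// Naive host reference for C = alpha * A * B + beta * C.
/// Assumption: column-major storage, so element (row, col) of a matrix
/// with leading dimension `ld` lives at index `row + col * ld`.
pub fn gemm_reference(
    m: usize, n: usize, k: usize,
    alpha: f32,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    beta: f32,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            // Inner product of row i of A with column j of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}

/// Largest absolute element-wise difference between two slices of equal length.
pub fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    x.iter().zip(y).map(|(p, q)| (p - q).abs()).fold(0.0, f32::max)
}
```

In practice the GPU result and the reference would be compared with max_abs_diff against a small tolerance instead of exact equality, since the two sides may accumulate in a different order.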

The default features (driver, memory, launch) give you device init and buffers out of the box. Everything heavier — blas, dnn, fft, sparse, and friends — is opt-in, so you pull in only what you use.

What’s New in 0.1.4

That’s the whole list. No new APIs — just polish and version alignment.

Tips

Part of a sovereign GPU stack

OxiCUDA does not stand alone. SciRS2, OxiONNX, TrustformeRS, and ToRSh all consume it as their GPU layer, while OxiBLAS and OxiFFT are sibling libraries for dense linear algebra and FFT. For LLM workloads, OxiLLaMa builds on top of the same foundation, and OxiEML rounds out the early-life ML tooling — all pure Rust, all part of the same C/C++/Fortran-free push.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a CUDA stack you can build with nothing but cargo sounds like your kind of thing — and follow along, because the daily polish keeps rolling.

KitaSan at COOLJAPAN OÜ, April 18, 2026
