OxiCUDA 0.1.1 Released — New BLAS Activations and Hardened GPU Backends

The pure-Rust CUDA stack gets its first hardening pass — fresh BLAS activations on top, and the multi-vendor backends quietly grow underneath.

Today we released OxiCUDA 0.1.1 — the first incremental patch on our type-safe, memory-safe replacement for the NVIDIA CUDA Toolkit, adding six new oxicuda-blas elementwise activations and substantially expanding the ROCm, Vulkan, and WebGPU backends.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. The only runtime dependency remains the NVIDIA driver (libcuda.so / nvcuda.dll); PTX is generated directly from Rust, autotuned per architecture, and runs at near-peak throughput from Turing all the way to Blackwell. Yesterday we shipped OxiCUDA 0.1.0; today’s 0.1.1 is the first follow-up on that foundation — still early, but sharpening fast.

Why 0.1.1 matters

This is the first hardening pass after the debut, and it pulls in two directions at once — up into the high-level numerics, and out across hardware vendors.

On top: the BLAS activation surface fills in. oxicuda-blas already shipped the full cuBLAS-equivalent L1/L2/L3 surface plus elementwise ops and epilogue fusion. 0.1.1 rounds out the elementwise set with six new operations: four activations (HardSigmoid, HardSwish, Softplus, LeakyRelu) and two rounding ops (Ceil, Floor). These complete the epilogue surface that DNN layers fuse directly into GEMM, so the building blocks for modern neural-network fusion are all present in the BLAS layer now rather than bolted on later.
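For readers who want the math behind those six names, here is a minimal CPU reference sketch. It is not OxiCUDA code: the function names are hypothetical, and the exact conventions the kernels use are not stated in this post. In particular, HardSigmoid is assumed here to be the `relu6(x + 3) / 6` variant (other frameworks define it as `clamp(0.2 * x + 0.5, 0, 1)`), and the LeakyRelu slope is passed explicitly as an assumed `negative_slope` parameter.

```rust
// CPU reference formulas for the six elementwise ops.
// Assumed conventions only; not the OxiCUDA implementation.

/// HardSigmoid, assumed variant: clamp((x + 3) / 6, 0, 1).
fn hard_sigmoid(x: f32) -> f32 {
    ((x + 3.0) / 6.0).clamp(0.0, 1.0)
}

/// HardSwish: x * hard_sigmoid(x).
fn hard_swish(x: f32) -> f32 {
    x * hard_sigmoid(x)
}

/// Softplus: ln(1 + e^x), rewritten as max(x, 0) + ln(1 + e^-|x|)
/// so that large positive inputs do not overflow.
fn softplus(x: f32) -> f32 {
    x.max(0.0) + (-x.abs()).exp().ln_1p()
}

/// LeakyRelu with an explicit slope for negative inputs.
fn leaky_relu(x: f32, negative_slope: f32) -> f32 {
    if x >= 0.0 { x } else { negative_slope * x }
}

/// Apply a scalar function to every element of a slice.
fn apply(f: impl Fn(f32) -> f32, xs: &[f32]) -> Vec<f32> {
    xs.iter().map(|&x| f(x)).collect()
}

fn main() {
    let xs = [-4.0f32, -1.5, 0.0, 1.5, 4.0];
    println!("hard_sigmoid: {:?}", apply(hard_sigmoid, &xs));
    println!("hard_swish:   {:?}", apply(hard_swish, &xs));
    println!("softplus:     {:?}", apply(softplus, &xs));
    println!("leaky_relu:   {:?}", apply(|x| leaky_relu(x, 0.01), &xs));
    // Ceil and Floor map directly onto the standard library.
    println!("ceil:         {:?}", apply(f32::ceil, &xs));
    println!("floor:        {:?}", apply(f32::floor, &xs));
}
```

A reference like this is mainly useful for spot-checking GPU output against known values on the host.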

Underneath: the portability story deepens. A large share of this release lands in the multi-vendor backends. The diff totals 13,392 insertions across 82 files, concentrated in the non-NVIDIA paths: ROCm HIP kernels, Vulkan SPIR-V codegen, and WebGPU compute shaders all grew substantially, with a matching PTX elementwise template added on the CUDA side. OxiCUDA was built so the same high-level operations can run beyond NVIDIA silicon, and 0.1.1 is the first release where that promise visibly fills out.

Alongside both, this patch carries general robustness, performance, and internal code-quality improvements across the crates (the CHANGELOG’s “Changed” line) — the unglamorous but necessary work of a first follow-up.

Technical Deep Dive

Three layers moved in this release.

1. The numeric layer — new activations in oxicuda-blas. The crate provides the full cuBLAS-equivalent surface: Level 1/2/3 routines, GEMM (SIMT, Tensor Core, Split-K), batched variants, the F16/BF16/TF32/F32/F64/FP8 precision matrix, reductions, and epilogue fusion. The six new elementwise activations slot into that elementwise/epilogue surface, and a matching elementwise template lands in oxicuda-ptx so they lower to the same generated-PTX path as the rest of the kernel library.

2. The backend layer — the same ops, more silicon. OxiCUDA’s seven backends sit behind a single ComputeBackend trait (oxicuda-backend). This release pushes hard on three of them:

oxicuda-rocm — AMD HIP kernels for the ROCm path.
oxicuda-vulkan — Vulkan compute via SPIR-V codegen.
oxicuda-webgpu — WebGPU compute shaders for the browser / WASM target.

Because they implement a shared trait, the same high-level operations dispatch across vendors without rewriting the call sites above them — the portability that makes “pure-Rust CUDA stack” mean more than “NVIDIA only.”

3. The foundation it all rests on. None of this is new architecture — it builds on the 10-Volume, 28-crate structure shipped at 0.1.0: the Driver/Memory/Launch/Runtime foundation, PTX codegen with a built-in autotuner, and the BLAS/DNN/scientific/training/inference volumes on top. 0.1.1 thickens the existing layers rather than adding new ones.
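To make the shared-trait dispatch from layer 2 concrete, here is a small self-contained sketch of the pattern. Everything in it is hypothetical: `BackendSketch`, `ScalarBackend`, `ChunkedBackend`, and `run_layer` are illustration-only names, and the real `ComputeBackend` trait's methods are not shown in this post. Both "backends" run on the CPU; the point is only that the call site stays the same while the implementation behind the trait changes.

```rust
/// Hypothetical stand-in for a shared backend trait.
/// The real `ComputeBackend` API may look quite different.
trait BackendSketch {
    fn name(&self) -> &'static str;
    /// Apply LeakyRelu elementwise, in place.
    fn leaky_relu(&self, data: &mut [f32], negative_slope: f32);
}

/// Processes one element at a time.
struct ScalarBackend;

/// Processes fixed-size chunks, loosely mimicking workgroups.
struct ChunkedBackend {
    chunk: usize,
}

fn leaky(x: f32, slope: f32) -> f32 {
    if x >= 0.0 { x } else { slope * x }
}

impl BackendSketch for ScalarBackend {
    fn name(&self) -> &'static str {
        "scalar"
    }
    fn leaky_relu(&self, data: &mut [f32], negative_slope: f32) {
        for v in data.iter_mut() {
            *v = leaky(*v, negative_slope);
        }
    }
}

impl BackendSketch for ChunkedBackend {
    fn name(&self) -> &'static str {
        "chunked"
    }
    fn leaky_relu(&self, data: &mut [f32], negative_slope: f32) {
        // max(1) guards against a zero chunk size, which would panic.
        for block in data.chunks_mut(self.chunk.max(1)) {
            for v in block.iter_mut() {
                *v = leaky(*v, negative_slope);
            }
        }
    }
}

/// The call site depends only on the trait, not on a concrete backend.
fn run_layer(backend: &dyn BackendSketch, data: &mut [f32]) {
    backend.leaky_relu(data, 0.1);
}

fn main() {
    let backends: Vec<Box<dyn BackendSketch>> = vec![
        Box::new(ScalarBackend),
        Box::new(ChunkedBackend { chunk: 2 }),
    ];
    for backend in &backends {
        let mut data = vec![-2.0f32, -1.0, 0.0, 1.0, 2.0];
        run_layer(backend.as_ref(), &mut data);
        println!("{}: {:?}", backend.name(), data);
    }
}
```

Swapping the backend here means changing which value is constructed, while `run_layer` is untouched. That is the property the post attributes to the shared trait.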

Getting Started

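Add the umbrella crate with the blas feature enabled (it is off by default, and the GEMM example below needs it):

cargo add oxicuda --features blas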

A minimal GEMM looks like this — and as of 0.1.1, activations such as HardSwish and Softplus are available in the BLAS layer to fuse into the epilogue:

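use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Problem size: C (m x n) = A (m x k) * B (k x n).
    // 32 x 32 matrices give 1024 elements per buffer.
    let (m, n, k) = (32usize, 32usize, 32usize);
    // Leading dimensions; for square, non-transposed matrices each is 32.
    let (lda, ldb, ldc) = (m, k, m);

    // Placeholder host inputs; substitute your own data.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![1.0f32; k * n];

    // Set up the device, context, and stream.
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Allocate device buffers and upload the inputs.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;

    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the kernel, then copy the result back to the host.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}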

The new HardSigmoid / HardSwish / Softplus / LeakyRelu / Ceil / Floor elementwise ops live in the blas feature, alongside the rest of the cuBLAS-equivalent surface.

What’s New in 0.1.1

Six new oxicuda-blas elementwise activations: HardSigmoid, HardSwish, Softplus, LeakyRelu, Ceil, Floor.
Substantial multi-vendor backend growth: expanded ROCm HIP kernels, Vulkan SPIR-V codegen, and WebGPU compute shaders (13,392 insertions across 82 files), plus a new PTX elementwise template.
General enhancements across crates: improved robustness, performance, and internal code quality.
Scale: now roughly 248K lines of safe Rust across 28 crates.

Tips

Enable blas for the new activations. The six elementwise ops ship in the blas feature, which is off by default — turn it on:
```
cargo add oxicuda --features blas
```
Not on NVIDIA? Try the other backends. For AMD GPUs, other Vulkan-capable hardware, or browser targets, reach for the rocm, vulkan, or webgpu feature flags. The same high-level ops dispatch through the shared ComputeBackend trait.
HardSwish / HardSigmoid are mobile-friendly. These piecewise-linear activations are cheap and quantization-friendly — and now available GPU-side in OxiCUDA.
Fuse activations into GEMM epilogues. Rather than running a separate elementwise pass, apply the activation as part of the GEMM epilogue to save a kernel launch and a round-trip through memory.
Reminder on macOS: OxiCUDA compiles there but returns UnsupportedPlatform at runtime. Full execution is supported on Linux x86_64 and Windows x86_64.
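The epilogue-fusion tip above is easy to see in a CPU-only sketch. The code below is an illustration under stated assumptions, not the OxiCUDA API: `gemm_with_epilogue` is a hypothetical row-major reference GEMM, and HardSwish is assumed to be `x * clamp((x + 3) / 6, 0, 1)`. The unfused path writes C and then makes a second pass over it; the fused path applies the activation while each output element is still in a register-like local.

```rust
/// Assumed HardSwish definition: x * clamp((x + 3) / 6, 0, 1).
fn hard_swish(x: f32) -> f32 {
    x * ((x + 3.0) / 6.0).clamp(0.0, 1.0)
}

/// Hypothetical row-major reference GEMM with an epilogue hook:
/// C[i][j] = epilogue(alpha * sum_p A[i][p] * B[p][j] + beta * C[i][j]).
fn gemm_with_epilogue(
    m: usize,
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    b: &[f32],
    beta: f32,
    c: &mut [f32],
    epilogue: impl Fn(f32) -> f32,
) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            let idx = i * n + j;
            // The activation runs here, before the value is stored.
            c[idx] = epilogue(alpha * acc + beta * c[idx]);
        }
    }
}

fn main() {
    let (m, n, k) = (2, 2, 3);
    let a = [1.0f32, 2.0, 3.0, -1.0, -2.0, -3.0]; // 2 x 3
    let b = [1.0f32, 0.0, 0.0, 1.0, 1.0, 1.0]; // 3 x 2

    // Unfused: plain GEMM, then a second pass over C.
    let mut unfused = vec![0.0f32; m * n];
    gemm_with_epilogue(m, n, k, 0.5, &a, &b, 0.0, &mut unfused, |x| x);
    for v in unfused.iter_mut() {
        *v = hard_swish(*v);
    }

    // Fused: the activation is applied inside the GEMM's store step.
    let mut fused = vec![0.0f32; m * n];
    gemm_with_epilogue(m, n, k, 0.5, &a, &b, 0.0, &mut fused, hard_swish);

    assert_eq!(fused, unfused);
    println!("fused result: {:?}", fused);
}
```

On a GPU the second pass would be a separate kernel launch plus an extra read and write of C, which is the cost the fused form avoids.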

Part of a sovereign GPU stack

OxiCUDA is the GPU compute floor of the COOLJAPAN ecosystem. SciRS2, OxiONNX, TrustformeRS, and ToRSh all sit above it and consume its kernels; OxiBLAS and OxiFFT are the linear-algebra and FFT siblings alongside it. And shipping today, OxiEML joins the family — more pure-Rust ML tooling built on the same sovereign foundation. The whole stack rests on a single runtime dependency: the NVIDIA driver, nothing more.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a C/C++-free CUDA stack is something you want to see grow — every star helps. Thanks for following along on day two.

— KitaSan at COOLJAPAN OÜ April 14, 2026