OxiCUDA 0.1.1 Released — New BLAS Activations and Hardened GPU Backends

First patch on the pure-Rust NVIDIA CUDA Toolkit replacement: six new oxicuda-blas elementwise activations (HardSigmoid, HardSwish, Softplus, LeakyRelu, Ceil, Floor) plus substantial ROCm/Vulkan/WebGPU backend growth. ~248K lines across 28 crates.

release oxicuda cuda gpu-computing pure-rust blas rocm vulkan webgpu

The pure-Rust CUDA stack gets its first hardening pass — fresh BLAS activations on top, and the multi-vendor backends quietly grow underneath.

Today we released OxiCUDA 0.1.1 — the first incremental patch on our type-safe, memory-safe replacement for the NVIDIA CUDA Toolkit, adding six new oxicuda-blas elementwise activations and substantially expanding the ROCm, Vulkan, and WebGPU backends.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. The only runtime dependency remains the NVIDIA driver (libcuda.so / nvcuda.dll); PTX is generated directly from Rust, autotuned per architecture, and runs at near-peak throughput from Turing all the way to Blackwell. Yesterday we shipped OxiCUDA 0.1.0; today’s 0.1.1 is the first follow-up on that foundation — still early, but sharpening fast.

Why 0.1.1 matters

This is the first hardening pass after the debut, and it pulls in two directions at once — up into the high-level numerics, and out across hardware vendors.

On top: the BLAS activation surface fills in. oxicuda-blas already shipped the full cuBLAS-equivalent L1/L2/L3 surface plus elementwise ops and epilogue fusion. 0.1.1 rounds out the activation set with six new elementwise operations — HardSigmoid, HardSwish, Softplus, LeakyRelu, Ceil, and Floor. These complete the activation epilogue surface that DNN layers fuse directly into GEMM, so the building blocks for modern neural-network fusion are all present in the BLAS layer now rather than bolted on later.

Underneath: the portability story deepens. A large share of this release lands in the multi-vendor backends. The diff comes to 13,392 insertions across 82 files, concentrated in the non-NVIDIA paths: ROCm HIP kernels, Vulkan SPIR-V codegen, and WebGPU compute shaders all grew substantially, with a matching PTX elementwise template added on the CUDA side. OxiCUDA was built so the same high-level operations can run beyond NVIDIA silicon, and 0.1.1 is the first release where that promise visibly fills out.

Alongside both, this patch carries general robustness, performance, and internal code-quality improvements across the crates (the CHANGELOG’s “Changed” line) — the unglamorous but necessary work of a first follow-up.

Technical Deep Dive

Three layers moved in this release.

1. The numeric layer — new activations in oxicuda-blas. The crate provides the full cuBLAS-equivalent surface: Level 1/2/3 routines, GEMM (SIMT, Tensor Core, Split-K), batched variants, the F16/BF16/TF32/F32/F64/FP8 precision matrix, reductions, and epilogue fusion. The six new elementwise activations slot into that elementwise/epilogue surface, and a matching elementwise template lands in oxicuda-ptx so they lower to the same generated-PTX path as the rest of the kernel library.
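For readers who want the math behind the six names, here is a CPU-side reference sketch in plain Rust. It does not use the oxicuda-blas API. The function names are illustrative, and the HardSigmoid/HardSwish formulas assume the common relu6(x + 3) / 6 convention; other frameworks define HardSigmoid with different slope and offset constants, and this post does not state which one OxiCUDA follows.

```rust
// Reference (CPU) definitions of the six elementwise ops named above.
// These are illustrative helpers, not OxiCUDA functions.

/// HardSigmoid, assuming the relu6(x + 3) / 6 convention.
fn hard_sigmoid(x: f32) -> f32 {
    ((x + 3.0) / 6.0).clamp(0.0, 1.0)
}

/// HardSwish: x scaled by its HardSigmoid gate (same assumption as above).
fn hard_swish(x: f32) -> f32 {
    x * hard_sigmoid(x)
}

/// Softplus: ln(1 + e^x), rewritten as max(x, 0) + ln(1 + e^-|x|)
/// so that large positive inputs do not overflow the exponential.
fn softplus(x: f32) -> f32 {
    x.max(0.0) + (-x.abs()).exp().ln_1p()
}

/// LeakyRelu with a caller-supplied slope for negative inputs.
fn leaky_relu(x: f32, negative_slope: f32) -> f32 {
    if x >= 0.0 { x } else { negative_slope * x }
}

/// Ceil and Floor round toward +infinity and -infinity respectively.
fn ceil(x: f32) -> f32 {
    x.ceil()
}

fn floor(x: f32) -> f32 {
    x.floor()
}

/// Applies a unary op to every element in place, the way an
/// elementwise kernel would across a device buffer.
fn apply_elementwise(data: &mut [f32], op: impl Fn(f32) -> f32) {
    for v in data.iter_mut() {
        *v = op(*v);
    }
}
```

A host-side reference like this is mostly useful for checking GPU results element by element within a small tolerance.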

2. The backend layer: the same ops, more silicon. OxiCUDA’s seven backends sit behind a single ComputeBackend trait (oxicuda-backend). This release pushes hard on three of them: ROCm, Vulkan, and WebGPU.

Because they implement a shared trait, the same high-level operations dispatch across vendors without rewriting the call sites above them — the portability that makes “pure-Rust CUDA stack” mean more than “NVIDIA only.”
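The shared-trait idea can be sketched in a few lines of plain Rust. Everything below is hypothetical: the trait is deliberately named ElementwiseBackend rather than ComputeBackend because the real trait's methods are not shown in this post, and both "backends" are CPU stand-ins. The point is only that the call site takes a trait object and never names a vendor.

```rust
// Hypothetical sketch of trait-based backend dispatch.
// None of these types are OxiCUDA's; they only illustrate the pattern.

#[derive(Clone, Copy, Debug)]
enum UnaryOp {
    Ceil,
    Floor,
    LeakyRelu { negative_slope: f32 },
}

impl UnaryOp {
    /// Scalar definition shared by every backend in this sketch.
    fn eval(self, x: f32) -> f32 {
        match self {
            UnaryOp::Ceil => x.ceil(),
            UnaryOp::Floor => x.floor(),
            UnaryOp::LeakyRelu { negative_slope } => {
                if x >= 0.0 { x } else { negative_slope * x }
            }
        }
    }
}

/// The single interface that call sites program against.
trait ElementwiseBackend {
    fn name(&self) -> &'static str;
    fn map_unary(&self, op: UnaryOp, data: &mut [f32]);
}

/// Stand-in backend 1: a straightforward loop.
struct CpuReference;

impl ElementwiseBackend for CpuReference {
    fn name(&self) -> &'static str {
        "cpu-reference"
    }
    fn map_unary(&self, op: UnaryOp, data: &mut [f32]) {
        for v in data.iter_mut() {
            *v = op.eval(*v);
        }
    }
}

/// Stand-in backend 2: processes the buffer in fixed-size blocks,
/// loosely mimicking how a device splits work into workgroups.
struct ChunkedCpu {
    chunk: usize,
}

impl ElementwiseBackend for ChunkedCpu {
    fn name(&self) -> &'static str {
        "chunked-cpu"
    }
    fn map_unary(&self, op: UnaryOp, data: &mut [f32]) {
        // max(1) guards against a zero chunk size, which chunks_mut rejects.
        for block in data.chunks_mut(self.chunk.max(1)) {
            for v in block.iter_mut() {
                *v = op.eval(*v);
            }
        }
    }
}

/// Call site: identical for every backend.
fn round_up_then_leaky(backend: &dyn ElementwiseBackend, data: &mut [f32]) {
    backend.map_unary(UnaryOp::Ceil, data);
    backend.map_unary(UnaryOp::LeakyRelu { negative_slope: 0.1 }, data);
}
```

Swapping CpuReference for ChunkedCpu changes how the work is executed but not the function that requests it, which is the property the paragraph above describes.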

3. The foundation it all rests on. None of this is new architecture — it builds on the 10-Volume, 28-crate structure shipped at 0.1.0: the Driver/Memory/Launch/Runtime foundation, PTX codegen with a built-in autotuner, and the BLAS/DNN/scientific/training/inference volumes on top. 0.1.1 thickens the existing layers rather than adding new ones.

Getting Started

Add the umbrella crate:

cargo add oxicuda

A minimal GEMM looks like this — and as of 0.1.1, activations such as HardSwish and Softplus are available in the BLAS layer to fuse into the epilogue:

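use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Example problem size: square 32x32 matrices, so every buffer holds
    // 32 * 32 = 1024 elements and each leading dimension is 32.
    let (m, n, k) = (32, 32, 32);
    let (lda, ldb, ldc) = (32, 32, 32);

    // Placeholder host inputs; replace with real data.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![1.0f32; k * n];

    // Device, context, and stream setup.
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Device buffers for A, B, and C.
    let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;

    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = alpha * A * B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    // Wait for the kernel to finish, then copy the result back to the host.
    stream.synchronize()?;
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}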

The new HardSigmoid / HardSwish / Softplus / LeakyRelu / Ceil / Floor elementwise ops live in the blas feature, alongside the rest of the cuBLAS-equivalent surface.
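To make "fusing an activation into the epilogue" concrete, here is a CPU reference sketch of what a fused GEMM computes: the activation is applied to each output element as it is written, instead of in a second pass over C. This is plain Rust, not the oxicuda-blas API. The function name, the row-major layout, and the closure-based activation parameter are assumptions made for illustration.

```rust
/// Reference GEMM with a fused epilogue, row-major storage:
/// C = activation(alpha * A * B + beta * C)
/// A is m x k, B is k x n, C is m x n.
/// Illustrative helper only; not an OxiCUDA function.
fn gemm_fused_epilogue(
    m: usize,
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    b: &[f32],
    beta: f32,
    c: &mut [f32],
    activation: impl Fn(f32) -> f32,
) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);

    for i in 0..m {
        for j in 0..n {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            // Epilogue: scale, accumulate into C, and apply the
            // activation while the value is still "in registers".
            let idx = i * n + j;
            c[idx] = activation(alpha * acc + beta * c[idx]);
        }
    }
}

/// LeakyRelu with a caller-supplied slope for negative inputs.
fn leaky_relu(x: f32, negative_slope: f32) -> f32 {
    if x >= 0.0 { x } else { negative_slope * x }
}
```

Calling it with a closure such as `|x| leaky_relu(x, 0.1)` gives the fused result in one pass. On a GPU, the benefit of fusion is that C is written to memory once rather than written, re-read, and rewritten by a separate elementwise kernel.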

What’s New in 0.1.1

Tips

Part of a sovereign GPU stack

OxiCUDA is the GPU compute floor of the COOLJAPAN ecosystem. SciRS2, OxiONNX, TrustformeRS, and ToRSh all sit above it and consume its kernels; OxiBLAS and OxiFFT are the linear-algebra and FFT siblings alongside it. And shipping today, OxiEML joins the family — more pure-Rust ML tooling built on the same sovereign foundation. The whole stack rests on a single runtime dependency: the NVIDIA driver, nothing more.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a C/C++-free CUDA stack is something you want to see grow — every star helps. Thanks for following along on day two.

KitaSan at COOLJAPAN OÜ, April 14, 2026
