The pure-Rust CUDA stack gets its first hardening pass — fresh BLAS activations on top, and the multi-vendor backends quietly grow underneath.
Today we released OxiCUDA 0.1.1 — the first incremental patch on our type-safe, memory-safe replacement for the NVIDIA CUDA Toolkit, adding six new oxicuda-blas elementwise activations and substantially expanding the ROCm, Vulkan, and WebGPU backends.
No CUDA SDK. No nvcc. No C/C++ toolchain at build time. The only runtime dependency remains the NVIDIA driver (libcuda.so / nvcuda.dll); PTX is generated directly from Rust, autotuned per architecture, and runs at near-peak throughput from Turing all the way to Blackwell. Yesterday we shipped OxiCUDA 0.1.0; today’s 0.1.1 is the first follow-up on that foundation — still early, but sharpening fast.
Why 0.1.1 matters
This is the first hardening pass after the debut, and it pulls in two directions at once — up into the high-level numerics, and out across hardware vendors.
On top: the BLAS activation surface fills in. oxicuda-blas already shipped the full cuBLAS-equivalent L1/L2/L3 surface plus elementwise ops and epilogue fusion. 0.1.1 rounds out the activation set with six new elementwise operations — HardSigmoid, HardSwish, Softplus, LeakyRelu, Ceil, and Floor. These complete the activation epilogue surface that DNN layers fuse directly into GEMM, so the building blocks for modern neural-network fusion are all present in the BLAS layer now rather than bolted on later.
Underneath: the portability story deepens. A large share of this release lands in the multi-vendor backends. The real diff is roughly 13,392 insertions across 82 files, concentrated in the non-NVIDIA paths: ROCm HIP kernels, Vulkan SPIR-V codegen, and WebGPU compute shaders all grew substantially, with a matching PTX elementwise template added on the CUDA side. OxiCUDA was built so the same high-level operations can run beyond NVIDIA silicon, and 0.1.1 is the first release where that promise visibly fills out.
Alongside both, this patch carries general robustness, performance, and internal code-quality improvements across the crates (the CHANGELOG’s “Changed” line) — the unglamorous but necessary work of a first follow-up.
Technical Deep Dive
Three layers moved in this release.
1. The numeric layer — new activations in oxicuda-blas. The crate provides the full cuBLAS-equivalent surface: Level 1/2/3 routines, GEMM (SIMT, Tensor Core, Split-K), batched variants, the F16/BF16/TF32/F32/F64/FP8 precision matrix, reductions, and epilogue fusion. The six new elementwise activations slot into that elementwise/epilogue surface, and a matching elementwise template lands in oxicuda-ptx so they lower to the same generated-PTX path as the rest of the kernel library.
2. The backend layer — the same ops, more silicon. OxiCUDA’s seven backends sit behind a single ComputeBackend trait (oxicuda-backend). This release pushes hard on three of them:
- oxicuda-rocm: AMD HIP kernels for the ROCm path.
- oxicuda-vulkan: Vulkan compute via SPIR-V codegen.
- oxicuda-webgpu: WebGPU compute shaders for the browser / WASM target.
Because they implement a shared trait, the same high-level operations dispatch across vendors without rewriting the call sites above them — the portability that makes “pure-Rust CUDA stack” mean more than “NVIDIA only.”
3. The foundation it all rests on. None of this is new architecture — it builds on the 10-Volume, 28-crate structure shipped at 0.1.0: the Driver/Memory/Launch/Runtime foundation, PTX codegen with a built-in autotuner, and the BLAS/DNN/scientific/training/inference volumes on top. 0.1.1 thickens the existing layers rather than adding new ones.
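To make the shared-trait idea from item 2 concrete, here is a minimal CPU-only sketch of call sites written once against a trait and backends swapped underneath. Everything in it is hypothetical: `SketchBackend`, `ScalarCpu`, `ChunkedCpu`, and `activate` are illustration names, and the real `ComputeBackend` trait in oxicuda-backend will have a different, richer surface that this post does not list.

```rust
/// Hypothetical stand-in for a shared backend trait. This is NOT the real
/// `ComputeBackend` from oxicuda-backend; its methods are assumptions.
trait SketchBackend {
    fn name(&self) -> &'static str;
    /// Apply LeakyRelu in place with the given negative slope.
    fn leaky_relu(&self, data: &mut [f32], negative_slope: f32);
}

/// Toy backend: one sequential loop over the data.
struct ScalarCpu;

/// Toy backend: walks the data in fixed-size chunks (chunk must be > 0),
/// standing in for a different device path.
struct ChunkedCpu {
    chunk: usize,
}

impl SketchBackend for ScalarCpu {
    fn name(&self) -> &'static str {
        "scalar-cpu"
    }

    fn leaky_relu(&self, data: &mut [f32], negative_slope: f32) {
        for x in data.iter_mut() {
            if *x < 0.0 {
                *x *= negative_slope;
            }
        }
    }
}

impl SketchBackend for ChunkedCpu {
    fn name(&self) -> &'static str {
        "chunked-cpu"
    }

    fn leaky_relu(&self, data: &mut [f32], negative_slope: f32) {
        for block in data.chunks_mut(self.chunk) {
            for x in block.iter_mut() {
                if *x < 0.0 {
                    *x *= negative_slope;
                }
            }
        }
    }
}

/// Call site written once against the trait; the caller picks the backend.
fn activate(backend: &dyn SketchBackend, mut input: Vec<f32>) -> Vec<f32> {
    backend.leaky_relu(&mut input, 0.5);
    input
}
```

Calling `activate(&ScalarCpu, data)` and `activate(&ChunkedCpu { chunk: 2 }, data)` goes through the same function body, which is the property the backend layer relies on: adding a vendor means adding an implementation, not touching the code above it.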
Getting Started
Add the umbrella crate (append --features blas for the GEMM example below, since the blas feature is off by default):
cargo add oxicuda
A minimal GEMM looks like this — and as of 0.1.1, activations such as HardSwish and Softplus are available in the BLAS layer to fuse into the epilogue:
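    use oxicuda::prelude::*;

    fn main() -> Result<(), oxicuda::Error> {
        // Placeholder problem size: square 32x32 matrices, so each buffer
        // holds 32 * 32 = 1024 elements and every leading dimension is 32.
        let (m, n, k) = (32, 32, 32);
        let (lda, ldb, ldc) = (32, 32, 32);
        let host_a = vec![1.0f32; 1024];
        let host_b = vec![1.0f32; 1024];

        // Set up the device, context, and stream.
        let device = Device::get(0)?;
        let ctx = Context::new(device)?;
        let stream = Stream::new(&ctx)?;

        // Allocate device buffers and upload the inputs.
        let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
        let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
        let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;
        d_a.copy_from_host(&host_a)?;
        d_b.copy_from_host(&host_b)?;

        // C = alpha * A * B + beta * C
        let handle = BlasHandle::new(&stream)?;
        handle.gemm(
            Transpose::None, Transpose::None,
            m, n, k,
            1.0f32, // alpha
            &d_a, lda,
            &d_b, ldb,
            0.0f32, // beta
            &mut d_c, ldc,
        )?;
        stream.synchronize()?;

        // Copy the result back to the host.
        let mut result = vec![0.0f32; m * n];
        d_c.copy_to_host(&mut result)?;
        Ok(())
    }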
The new HardSigmoid / HardSwish / Softplus / LeakyRelu / Ceil / Floor elementwise ops live in the blas feature, alongside the rest of the cuBLAS-equivalent surface.
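For spot-checking GPU output, it can help to have CPU reference formulas for the six ops on hand. The sketch below uses commonly seen definitions and plain `std` only. The conventions are assumptions rather than OxiCUDA's documented behavior: HardSigmoid is taken as relu6(x + 3) / 6 (other frameworks use 0.2x + 0.5 clipped to [0, 1]), and LeakyRelu takes an explicit negative slope. The function names are illustrative and not part of the oxicuda API.

```rust
/// Clamp to [0, 6]; helper for the "hard" activations.
fn relu6(x: f32) -> f32 {
    x.clamp(0.0, 6.0)
}

/// HardSigmoid, assumed convention: relu6(x + 3) / 6.
fn hard_sigmoid(x: f32) -> f32 {
    relu6(x + 3.0) / 6.0
}

/// HardSwish: x * hard_sigmoid(x).
fn hard_swish(x: f32) -> f32 {
    x * hard_sigmoid(x)
}

/// Softplus: ln(1 + e^x), rearranged as max(x, 0) + ln(1 + e^-|x|)
/// so that large positive inputs do not overflow.
fn softplus(x: f32) -> f32 {
    x.max(0.0) + (-x.abs()).exp().ln_1p()
}

/// LeakyRelu with an explicit slope for negative inputs.
fn leaky_relu(x: f32, negative_slope: f32) -> f32 {
    if x >= 0.0 {
        x
    } else {
        negative_slope * x
    }
}

/// Ceil and Floor map directly onto the standard library.
fn ceil(x: f32) -> f32 {
    x.ceil()
}

fn floor(x: f32) -> f32 {
    x.floor()
}
```

Comparing a device result against these with a small tolerance is usually enough to catch a mismatched convention, for example a different HardSigmoid slope.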
What’s New in 0.1.1
- Six new oxicuda-blas elementwise activations: HardSigmoid, HardSwish, Softplus, LeakyRelu, Ceil, Floor.
- Substantial multi-vendor backend growth: expanded ROCm HIP kernels, Vulkan SPIR-V codegen, and WebGPU compute shaders (~13,392 insertions across 82 files), plus a new PTX elementwise template.
- General enhancements across crates: improved robustness, performance, and internal code quality.
- Scale: now roughly 248K lines of safe Rust across 28 crates.
Tips
- Enable blas for the new activations. The six elementwise ops ship in the blas feature, which is off by default. Turn it on: cargo add oxicuda --features blas
- Not on NVIDIA? Try the other backends. For AMD, Intel-adjacent, or browser targets, reach for the rocm, vulkan, or webgpu feature flags; the same high-level ops dispatch through the shared ComputeBackend trait.
- HardSwish/HardSigmoid are mobile-friendly. These piecewise-linear activations are cheap and quantization-friendly, and are now available GPU-side in OxiCUDA.
- Fuse activations into GEMM epilogues. Rather than running a separate elementwise pass, apply the activation as part of the GEMM epilogue to save a kernel launch and a round-trip through memory.
- Reminder on macOS: OxiCUDA compiles there but returns UnsupportedPlatform at runtime; full execution is limited to Linux x86_64 and Windows x86_64.
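The epilogue-fusion tip can be pictured with a small CPU analogy. This is only a sketch of the idea, not OxiCUDA's kernel or API: `gemm_with_epilogue` and `gemm_then_activation` are hypothetical names, the matrices are row-major, and HardSwish uses the common x * relu6(x + 3) / 6 definition as an assumption. The fused version applies the activation to each accumulator before it is stored, while the unfused version writes C and then walks it a second time.

```rust
/// HardSwish, assumed convention: x * relu6(x + 3) / 6.
fn hard_swish(x: f32) -> f32 {
    x * (x + 3.0).clamp(0.0, 6.0) / 6.0
}

/// Row-major C = A * B, with `epilogue` applied to each accumulator
/// before it is written out (the "fused" path).
fn gemm_with_epilogue(
    a: &[f32],
    b: &[f32],
    m: usize,
    n: usize,
    k: usize,
    epilogue: impl Fn(f32) -> f32,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            // Fused: activate while the value is still a local accumulator.
            c[i * n + j] = epilogue(acc);
        }
    }
    c
}

/// Unfused baseline: plain GEMM, then a second full pass over C.
fn gemm_then_activation(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    let mut c = gemm_with_epilogue(a, b, m, n, k, |x| x);
    for x in c.iter_mut() {
        *x = hard_swish(*x);
    }
    c
}
```

Both paths produce the same values; the difference is that the unfused one reads and writes every element of C an extra time, which on a GPU corresponds to the additional kernel launch and memory round-trip the tip refers to.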
Part of a sovereign GPU stack
OxiCUDA is the GPU compute floor of the COOLJAPAN ecosystem. SciRS2, OxiONNX, TrustformeRS, and ToRSh all sit above it and consume its kernels; OxiBLAS and OxiFFT are the linear-algebra and FFT siblings alongside it. And shipping today, OxiEML joins the family — more pure-Rust ML tooling built on the same sovereign foundation. The whole stack rests on a single runtime dependency: the NVIDIA driver, nothing more.
Repository: https://github.com/cool-japan/oxicuda
Star the repo if a C/C++-free CUDA stack is something you want to see grow — every star helps. Thanks for following along on day two.
— KitaSan at COOLJAPAN OÜ April 14, 2026