
OxiCUDA 0.1.6 Released — Tensor Core SYRK Fast Path and Sixteen New ML Crates

Pure-Rust replacement for the NVIDIA CUDA Toolkit. OxiCUDA 0.1.6 adds a Tensor Core fast path for SYRK in oxicuda-blas and sixteen new ML crates (adversarial, SSL, continual, multimodal, 3D geometry, PINN, ANN, anomaly, causal, meta, MoE, NeRF, quantum, recsys, RLHF, tabular). No CUDA SDK, no nvcc.

Tags: release, oxicuda, cuda, gpu-computing, pure-rust, tensor-core, cublas, ptx, machine-learning

The CUDA Toolkit, rewritten in safe Rust — and today it grows a Tensor Core SYRK kernel and sixteen new ML domains.

Today we released OxiCUDA 0.1.6 — a type-safe, memory-safe, pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack (cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more), now with a triangle-masked Tensor Core fast path for symmetric rank-k updates and sixteen new ML-domain crates spanning adversarial robustness to tabular learning.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. The only thing OxiCUDA needs at runtime is the NVIDIA driver (libcuda.so / nvcuda.dll) — everything above it, including the PTX assembly that runs on the GPU, is generated directly from Rust. OxiCUDA compiles into a single static binary (or a WASM module), and the same code runs on Turing through Blackwell, with multi-vendor backends reaching beyond NVIDIA hardware entirely.

Why OxiCUDA 0.1.6 matters

The CUDA Toolkit is the backbone of modern GPU computing, and it is also a wall of C and C++. That brings the familiar costs: undefined behavior and segfaults hiding behind raw pointers and hand-managed device memory; a build-time dependency on nvcc and the matching SDK that has to be installed, versioned, and reconciled on every machine; hard lock-in to NVIDIA hardware; and painful portability when you want the same kernels to run on Metal, Vulkan, WebGPU, or ROCm. On top of that, wiring any of it into a safe-Rust application means threading FFI through unsafe and hoping the lifetimes line up.

OxiCUDA removes that wall. Device memory is owned by RAII buffers, kernel launches are checked, and there is no FFI surface to the CUDA SDK because there is no CUDA SDK — the toolchain is Rust, top to bottom. Because the kernels are emitted as PTX from Rust, the very same numerical code retargets to the portability backends without a separate C/C++ build for each.

This release sharpens that story in two concrete ways. First, oxicuda-blas gains a Tensor Core fast path for SYRK: a triangle-masked GEMM kernel that skips the redundant writes a symmetric output would otherwise generate, while still feeding the Tensor Core hardware units. Symmetric rank-k updates — covariance matrices, Gram matrices, normal equations — stop paying for the half of the result they never needed. Second, the ML surface widens dramatically: sixteen new leaf crates bring adversarial robustness, self-supervised learning, continual learning, multimodal fusion, 3-D geometry, physics-informed networks, approximate nearest neighbour, anomaly detection, causal inference, meta-learning, mixtures of experts, neural radiance fields, quantum-state simulation, recommenders, RLHF, and tabular learning into the same Pure Rust GPU stack.

Technical Deep Dive: from the driver up to sixteen ML domains

Foundation. At the base sit oxicuda-driver, oxicuda-memory, oxicuda-launch, and oxicuda-runtime: the safe wrappers over the driver API, the RAII device-memory layer, the typed launch machinery, and the runtime glue. The first three correspond to the default features (driver, memory, launch); everything else is opt-in.

PTX codegen and the autotuner. oxicuda-ptx is a PTX DSL that covers SM 7.5 through 10.0 — Turing to Blackwell — and emits the WMMA, MMA, and WGMMA instruction families that drive the Tensor Cores directly. oxicuda-autotune benchmarks kernel variants per GPU architecture and resolves launches through a three-tier dispatch (cached → tuned → default), backed by an on-disk cache so the tuning cost is paid once. This is the layer that lets a single Rust source produce architecture-specific kernels without a C compiler in sight.
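The cached → tuned → default lookup can be pictured with a small host-side sketch. Everything here is hypothetical and is not the oxicuda-autotune API: the names KernelConfig, Tier, and Dispatcher are invented for illustration, an in-memory HashMap stands in for the on-disk cache, and the tuner is modelled as a closure that may decline to produce a result.

```rust
use std::collections::HashMap;

/// Hypothetical launch configuration for one kernel variant (illustration only).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct KernelConfig {
    pub tile_m: u32,
    pub tile_n: u32,
    pub threads: u32,
}

/// Which tier answered the lookup.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Cached,
    Tuned,
    Default,
}

/// Lookup key: (kernel name, SM architecture such as 75 or 90).
type Key = (String, u32);

pub struct Dispatcher {
    /// In-memory stand-in for the on-disk tuning cache.
    cache: HashMap<Key, KernelConfig>,
    /// Configuration used when nothing better is known.
    fallback: KernelConfig,
}

impl Dispatcher {
    pub fn new(fallback: KernelConfig) -> Self {
        Dispatcher { cache: HashMap::new(), fallback }
    }

    /// Resolve a configuration in three tiers:
    /// 1. a cache hit, 2. a fresh tuning run (stored for next time), 3. the default.
    /// `tuner` returns None when tuning is unavailable or disabled.
    pub fn resolve<F>(&mut self, kernel: &str, sm: u32, tuner: F) -> (KernelConfig, Tier)
    where
        F: FnOnce() -> Option<KernelConfig>,
    {
        let key = (kernel.to_string(), sm);
        if let Some(cfg) = self.cache.get(&key) {
            return (*cfg, Tier::Cached);
        }
        match tuner() {
            Some(cfg) => {
                // Pay the tuning cost once; later lookups hit the cache.
                self.cache.insert(key, cfg);
                (cfg, Tier::Tuned)
            }
            None => (self.fallback, Tier::Default),
        }
    }
}
```

In this sketch, a first call whose tuner returns None falls through to the default, a call with a working tuner records its result, and every later call for the same kernel and architecture is answered from the cache without running the tuner again.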

BLAS with the new SYRK Tensor Core kernel. oxicuda-blas is the cuBLAS-equivalent layer: Level 1/2/3 routines, GEMM in SIMT, Tensor-Core, and Split-K variants, batched execution, and the full precision ladder (FP16, BF16, TF32, F32, F64, FP8). New in 0.1.6 is the symmetric rank-k fast path implemented across crates/oxicuda-blas/src/level3/syrk.rs, syrk_tc.rs, and syr2k.rs: the _tc variant masks the lower (or upper) triangle so the kernel computes and stores only the meaningful half of the symmetric result, then routes the multiply through the Tensor Core path. The companion SYR2K kernel applies the same idea to the two-product symmetric update.
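To make the triangle masking concrete, here is a plain CPU reference for the symmetric rank-k update C := alpha * A * A^T + beta * C that touches only one triangle of C. It is a sketch of the mathematical idea, not the Tensor Core kernel in syrk_tc.rs: the function and type names are hypothetical, and row-major storage is assumed purely to keep the indexing simple.

```rust
/// Which triangle of the symmetric output is computed and stored.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Triangle {
    Lower,
    Upper,
}

/// CPU reference for C := alpha * A * A^T + beta * C.
/// `a` is n x k and `c` is n x n, both row-major (an assumption of this sketch).
/// Entries outside the selected triangle are neither computed nor written.
pub fn syrk_reference(
    tri: Triangle,
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    beta: f32,
    c: &mut [f32],
) {
    assert_eq!(a.len(), n * k);
    assert_eq!(c.len(), n * n);
    for i in 0..n {
        for j in 0..n {
            let keep = match tri {
                Triangle::Lower => j <= i,
                Triangle::Upper => j >= i,
            };
            if !keep {
                continue; // masked out: the mirrored entry carries the same value
            }
            // Dot product of row i and row j of A gives (A * A^T)[i][j].
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * a[j * k + p];
            }
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}

/// Number of output entries actually written: n(n+1)/2 instead of n^2.
pub fn stored_entries(n: usize) -> usize {
    n * (n + 1) / 2
}
```

For n = 1024 the masked update writes 524,800 entries rather than 1,048,576, which is where the saving described above comes from. A reference like this is also a convenient oracle when checking a GPU result on small inputs.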

The expanding ML-domain family (Vol.26–41). The new crates build on layers that were already in place: BLAS, the cuDNN-equivalent oxicuda-dnn (convolution including Winograd, FlashAttention forward and backward, PagedAttention, MoE, normalization, pooling, quantization), and the scientific-computing crates: oxicuda-fft (Stockham and Bluestein), oxicuda-sparse (CSR/CSC/COO/BSR/ELL with SpMV/SpMM/SpGEMM), oxicuda-solver (LU/QR/SVD/Cholesky/CG/GMRES), and oxicuda-rand (Philox/MRG32k3a/XORWOW/Sobol). On top of these, this release adds sixteen specialized crates. They range from oxicuda-ann (flat/IVF/IVFPQ/HNSW/LSH/PQ and k-NN-graph indexes, with Hamming/L2/inner-product distances and SQ4/SQ8 quantizers) and oxicuda-moe (top-k routing, expert dispatch, load-balancing loss) to oxicuda-nerf, oxicuda-pinn, oxicuda-quantum, and oxicuda-rlhf. Each is Pure Rust and sits on the same foundation.
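As one illustration of what these crates cover, top-k routing in a mixture-of-experts layer can be sketched on the CPU as follows. This is a generic formulation (softmax over the gate logits, keep the k most probable experts, renormalize their weights) and is an assumption about the technique in general, not a description of how oxicuda-moe implements it; the function name top_k_route is hypothetical.

```rust
/// Route one token to its top-k experts.
/// Returns (expert index, gate weight) pairs, highest weight first,
/// with the kept weights renormalized so they sum to 1.
pub fn top_k_route(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    assert!(k >= 1 && k <= logits.len());

    // Numerically stable softmax: subtract the maximum logit before exponentiating.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();

    // Pair each probability with its expert index and sort descending.
    // The sort is stable, so ties go to the lower expert index.
    let mut ranked: Vec<(usize, f32)> =
        exps.iter().map(|&e| e / sum).enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked.truncate(k);

    // Renormalize the surviving gate weights.
    let kept: f32 = ranked.iter().map(|&(_, p)| p).sum();
    for entry in ranked.iter_mut() {
        entry.1 /= kept;
    }
    ranked
}
```

With logits [0.0, 2.0, 1.0, -1.0] and k = 2, the token is routed to experts 1 and 2, and expert 1 receives the larger share of the weight. Expert dispatch and the load-balancing loss mentioned above would sit on top of a routing step like this one.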

All told, OxiCUDA 0.1.6 is roughly 320K lines of safe Rust across 37 crates.

Getting Started

Add OxiCUDA and turn on the BLAS layer:

cargo add oxicuda --features blas

The default features are driver, memory, and launch; every subsystem above that — blas, dnn, fft, sparse, solver, rand, autotune, ptx, or the full umbrella — is something you opt into. Here is a single-precision GEMM end to end:

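use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Pick the first GPU and set up a context and a stream on it.
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Problem size: C (m x n) = A (m x k) * B (k x n).
    // 32 x 32 matrices give the 1024-element buffers used below.
    let (m, n, k) = (32, 32, 32);
    // Leading dimensions, assuming cuBLAS-style column-major storage
    // (an assumption of this example).
    let (lda, ldb, ldc) = (m, k, m);

    // Example host inputs.
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![1.0f32; k * n];

    // Device buffers are RAII-owned and freed when they go out of scope.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C = 1.0 * A * B + 0.0 * C, with neither input transposed.
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32, &d_a, lda, &d_b, ldb,
        0.0f32, &mut d_c, ldc,
    )?;
    // Wait for the kernel to finish before reading the result back.
    stream.synchronize()?;

    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}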

The memory is owned by DeviceBuffer, the launch is checked, and there is no nvcc step — building the binary is all the toolchain you need.
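When trying a GEMM like this on small inputs, it can help to compare the downloaded result against a straightforward host-side reference. The sketch below is plain Rust and independent of OxiCUDA; it assumes column-major storage with leading dimensions in the cuBLAS convention, which the post does not state explicitly, and the helper names gemm_reference and max_abs_diff are hypothetical.

```rust
/// Host reference for C := alpha * A * B + beta * C with no transposes.
/// Matrices are column-major (an assumption of this sketch):
/// A is m x k with leading dimension lda, B is k x n with ldb, C is m x n with ldc.
pub fn gemm_reference(
    m: usize,
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    lda: usize,
    b: &[f32],
    ldb: usize,
    beta: f32,
    c: &mut [f32],
    ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            // Dot product of row i of A with column j of B.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = alpha * acc + beta * c[i + j * ldc];
        }
    }
}

/// Largest absolute element-wise difference between two equally sized slices.
pub fn max_abs_diff(x: &[f32], y: &[f32]) -> f32 {
    assert_eq!(x.len(), y.len());
    x.iter()
        .zip(y.iter())
        .map(|(p, q)| (p - q).abs())
        .fold(0.0f32, f32::max)
}
```

A typical check computes the reference into a second host vector and asserts that max_abs_diff against the GPU result stays below a tolerance suited to the precision in use; reduced-precision Tensor Core paths generally need a looser bound than F32.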

What’s New in 0.1.6

Tips

This is the foundation

OxiCUDA is the GPU layer beneath the rest of the COOLJAPAN ecosystem. When SciRS2 and NumRS2 run array math on device, when ToRSh and TrustformeRS train, when OxiLLaMa serves a model or OxiONNX executes a graph, and when OxiBLAS and OxiFFT need a GPU backend, OxiCUDA is what carries the kernels — safely, and without an NVIDIA SDK in the build.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a CUDA Toolkit you can build with nothing but a Rust compiler sounds like the future you want.

Pure Rust GPU computing is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, May 9, 2026
