OxiCUDA 0.1.6 Released — Tensor Core SYRK Fast Path and Sixteen New ML Crates

The CUDA Toolkit, rewritten in safe Rust — and today it grows a Tensor Core SYRK kernel and sixteen new ML domains.

Today we released OxiCUDA 0.1.6 — a type-safe, memory-safe, pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack (cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more), now with a triangle-masked Tensor Core fast path for symmetric rank-k updates and sixteen new ML-domain crates spanning adversarial robustness to tabular learning.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. The only thing OxiCUDA needs at runtime is the NVIDIA driver (libcuda.so / nvcuda.dll) — everything above it, including the PTX assembly that runs on the GPU, is generated directly from Rust. OxiCUDA compiles into a single static binary (or a WASM module), and the same code runs on Turing through Blackwell, with multi-vendor backends reaching beyond NVIDIA hardware entirely.

Why OxiCUDA 0.1.6 matters

The CUDA Toolkit is the backbone of modern GPU computing, and it is also a wall of C and C++. That brings the familiar costs: undefined behavior and segfaults hiding behind raw pointers and hand-managed device memory; a build-time dependency on nvcc and the matching SDK that has to be installed, versioned, and reconciled on every machine; hard lock-in to NVIDIA hardware; and painful portability when you want the same kernels to run on Metal, Vulkan, WebGPU, or ROCm. On top of that, wiring any of it into a safe-Rust application means threading FFI through unsafe and hoping the lifetimes line up.

OxiCUDA removes that wall. Device memory is owned by RAII buffers, kernel launches are checked, and there is no FFI surface to the CUDA SDK because there is no CUDA SDK: the toolchain is Rust, top to bottom. Because the kernels are generated from Rust rather than hand-written in CUDA C++, the same numerical code retargets to the portability backends without a separate C/C++ build for each.

This release sharpens that story in two concrete ways. First, oxicuda-blas gains a Tensor Core fast path for SYRK: a triangle-masked GEMM kernel that skips the redundant writes a symmetric output would otherwise generate, while still feeding the Tensor Core hardware units. Symmetric rank-k updates — covariance matrices, Gram matrices, normal equations — stop paying for the half of the result they never needed. Second, the ML surface widens dramatically: sixteen new leaf crates bring adversarial robustness, self-supervised learning, continual learning, multimodal fusion, 3-D geometry, physics-informed networks, approximate nearest neighbour, anomaly detection, causal inference, meta-learning, mixtures of experts, neural radiance fields, quantum-state simulation, recommenders, RLHF, and tabular learning into the same Pure Rust GPU stack.

Technical Deep Dive: from the driver up to sixteen ML domains

Foundation. At the base sit oxicuda-driver, oxicuda-memory, oxicuda-launch, and oxicuda-runtime: the safe wrappers over the driver API, the RAII device-memory layer, the typed launch machinery, and the runtime glue. Three of these (driver, memory, launch) are the default features; everything else is opt-in.

PTX codegen and the autotuner. oxicuda-ptx is a PTX DSL that covers SM 7.5 through 10.0 — Turing to Blackwell — and emits the WMMA, MMA, and WGMMA instruction families that drive the Tensor Cores directly. oxicuda-autotune benchmarks kernel variants per GPU architecture and resolves launches through a three-tier dispatch (cached → tuned → default), backed by an on-disk cache so the tuning cost is paid once. This is the layer that lets a single Rust source produce architecture-specific kernels without a C compiler in sight.
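To make the cached → tuned → default order concrete, here is a minimal CPU-only sketch of a three-tier lookup. It is not the oxicuda-autotune API: KernelConfig, Tier, Dispatcher, record_tuned, and the (SM version, size bucket) key are all names invented for this illustration, and an in-memory HashMap stands in for the on-disk cache. The meaning assigned to each tier here is also an assumption.

```rust
use std::collections::HashMap;

/// Hypothetical kernel configuration; the fields are illustrative only.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct KernelConfig {
    pub tile_m: u32,
    pub tile_n: u32,
    pub stages: u32,
}

/// Which tier answered a lookup.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Cached,
    Tuned,
    Default,
}

/// Assumed lookup key: (SM version, problem-size bucket).
pub type Key = (u32, u32);

pub struct Dispatcher {
    cache: HashMap<Key, KernelConfig>, // stands in for the on-disk cache
    tuned: HashMap<u32, KernelConfig>, // per-architecture tuning results
    default: KernelConfig,             // safe fallback for unknown hardware
}

impl Dispatcher {
    pub fn new(default: KernelConfig) -> Self {
        Self { cache: HashMap::new(), tuned: HashMap::new(), default }
    }

    /// Record the winner of a (hypothetical) benchmarking run for one SM version.
    pub fn record_tuned(&mut self, sm: u32, cfg: KernelConfig) {
        self.tuned.insert(sm, cfg);
    }

    /// Resolve a launch: exact cache hit first, then the tuned result for the
    /// architecture (which is promoted into the cache), then the default.
    pub fn resolve(&mut self, key: Key) -> (KernelConfig, Tier) {
        if let Some(cfg) = self.cache.get(&key).copied() {
            return (cfg, Tier::Cached);
        }
        if let Some(cfg) = self.tuned.get(&key.0).copied() {
            self.cache.insert(key, cfg);
            return (cfg, Tier::Tuned);
        }
        (self.default, Tier::Default)
    }
}
```

In this sketch the first lookup after tuning is answered by the tuned tier and every later lookup for the same key by the cache, which mirrors the "pay the tuning cost once" behaviour described above.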

BLAS with the new SYRK Tensor Core kernel. oxicuda-blas is the cuBLAS-equivalent layer: Level 1/2/3 routines; GEMM in SIMT, Tensor Core, and Split-K variants; batched execution; and the full precision ladder (FP16, BF16, TF32, F32, F64, FP8). New in 0.1.6 is the symmetric rank-k fast path, implemented in syrk.rs, syrk_tc.rs, and syr2k.rs under crates/oxicuda-blas/src/level3/. The _tc variant masks off one triangle (upper or lower), so the kernel computes and stores only the meaningful half of the symmetric result while routing the multiply through the Tensor Core path. The companion SYR2K kernel applies the same idea to the two-product symmetric update (αABᵀ + αBAᵀ + βC).
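The triangle mask is easiest to see in a plain CPU reference. The sketch below computes C ← αAAᵀ + βC for a row-major n × k matrix A while touching only one triangle of C. It says nothing about how syrk_tc.rs is actually written: Fill, syrk_reference, and symmetrize are hypothetical names, and the inner loop is scalar rather than a Tensor Core tile.

```rust
/// Which triangle of the symmetric output is computed and stored.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Fill {
    Lower,
    Upper,
}

/// CPU reference for C <- alpha * A * A^T + beta * C.
/// `a` is n x k and `c` is n x n, both row-major. Entries outside the
/// selected triangle are neither read nor written.
pub fn syrk_reference(
    fill: Fill,
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    beta: f32,
    c: &mut [f32],
) {
    assert_eq!(a.len(), n * k);
    assert_eq!(c.len(), n * n);
    for i in 0..n {
        for j in 0..n {
            // The "mask": skip every entry in the triangle we do not store.
            let keep = match fill {
                Fill::Lower => j <= i,
                Fill::Upper => j >= i,
            };
            if !keep {
                continue;
            }
            // Dot product of row i and row j of A.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * a[j * k + p];
            }
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}

/// Mirror the stored triangle into the other half when a full matrix is needed.
pub fn symmetrize(fill: Fill, n: usize, c: &mut [f32]) {
    for i in 0..n {
        for j in 0..i {
            match fill {
                Fill::Lower => c[j * n + i] = c[i * n + j],
                Fill::Upper => c[i * n + j] = c[j * n + i],
            }
        }
    }
}
```

Only n(n + 1)/2 of the n² output entries are computed and stored, which is where the saving over a general GEMM comes from.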

The expanding ML-domain family (Vol.26–41). This release adds sixteen specialized crates on top of the existing layers. Those layers are BLAS; the cuDNN-equivalent oxicuda-dnn (convolution including Winograd, FlashAttention forward and backward, PagedAttention, MoE, normalization, pooling, quantization); and the scientific-computing crates: oxicuda-fft (Stockham and Bluestein), oxicuda-sparse (SpMV/SpMM/SpGEMM over CSR/CSC/COO/BSR/ELL), oxicuda-solver (LU/QR/SVD/Cholesky/CG/GMRES), and oxicuda-rand (Philox/MRG32k3a/XORWOW/Sobol). The new crates range from oxicuda-ann (flat, IVF, IVFPQ, HNSW, LSH, PQ, and k-NN-graph indexes, with Hamming/L2/inner-product distances and SQ4/SQ8 quantizers) and oxicuda-moe (top-k routing, expert dispatch, load-balancing loss) to oxicuda-nerf, oxicuda-pinn, oxicuda-quantum, and oxicuda-rlhf. Each is Pure Rust and sits on the same foundation.
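As a concrete picture of the simplest index type named above, here is a CPU-side sketch of flat (brute-force) search with squared-L2 distance and heap-based top-k selection. It does not call oxicuda-ann; flat_knn, l2_squared, and Candidate are names made up for this example, and a GPU implementation would parallelise the distance computation rather than loop over rows.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

/// A scored database row. Ordered by distance, then id, so it can live in a heap.
#[derive(Clone, Copy, Debug)]
struct Candidate {
    dist: f32,
    id: usize,
}

impl Ord for Candidate {
    fn cmp(&self, other: &Self) -> Ordering {
        // total_cmp gives f32 a total order, which BinaryHeap requires.
        self.dist.total_cmp(&other.dist).then(self.id.cmp(&other.id))
    }
}
impl PartialOrd for Candidate {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl PartialEq for Candidate {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Candidate {}

/// Squared Euclidean distance between two equal-length vectors.
pub fn l2_squared(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Exhaustive k-NN over a row-major `data` matrix with `dim` columns.
/// Returns (row id, squared distance) pairs, nearest first.
pub fn flat_knn(data: &[f32], dim: usize, query: &[f32], k: usize) -> Vec<(usize, f32)> {
    assert!(dim > 0 && data.len() % dim == 0 && query.len() == dim);
    // Max-heap holding the k best candidates seen so far; the root is the worst.
    let mut heap = BinaryHeap::with_capacity(k + 1);
    for (id, row) in data.chunks_exact(dim).enumerate() {
        heap.push(Candidate { dist: l2_squared(row, query), id });
        if heap.len() > k {
            heap.pop(); // evict the current worst
        }
    }
    heap.into_sorted_vec()
        .into_iter()
        .map(|c| (c.id, c.dist))
        .collect()
}
```

The bounded heap keeps selection at O(N log k) for N rows, which is the usual reason to prefer heap select over sorting every distance.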

All told, OxiCUDA 0.1.6 is roughly 320K lines of safe Rust across 37 crates.

Getting Started

Add OxiCUDA and turn on the BLAS layer:

cargo add oxicuda --features blas

The default features are driver, memory, and launch; every subsystem above that — blas, dnn, fft, sparse, solver, rand, autotune, ptx, or the full umbrella — is something you opt into. Here is a single-precision GEMM end to end:

use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

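    // Illustrative problem size: 32 x 32 square matrices (1024 elements each),
    // so every leading dimension is 32. The host data is placeholder input.
    let (m, n, k) = (32, 32, 32);
    let (lda, ldb, ldc) = (32, 32, 32);
    let host_a = vec![1.0f32; m * k];
    let host_b = vec![1.0f32; k * n];

    // Device buffers are sized to match the matrices and are freed on drop.
    let mut d_a = DeviceBuffer::<f32>::zeroed(m * k)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(k * n)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(m * n)?;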
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32, &d_a, lda, &d_b, ldb,
        0.0f32, &mut d_c, ldc,
    )?;
    stream.synchronize()?;

    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}

The memory is owned by DeviceBuffer, the launch is checked, and there is no nvcc step: cargo build is the entire toolchain.

What’s New in 0.1.6

Tensor Core SYRK fast path in oxicuda-blas. A triangle-masked GEMM kernel computes only the meaningful half of a symmetric rank-k update and routes the multiply through the Tensor Core units, eliminating the redundant symmetric writes a full GEMM would perform. The change lands in level3/syrk.rs, syrk_tc.rs, and syr2k.rs.
Sixteen new Pure Rust ML crates (Vol.26–41):
- Robustness and learning paradigms: oxicuda-adversarial (attack generation, adversarial-training primitives), oxicuda-ssl (contrastive, masked-autoencoder, distillation scaffolding), oxicuda-continual (PackNet, task-incremental training, forgetting mitigation), oxicuda-meta (MAML and Prototypical-Network scaffolding).
- Modalities and geometry: oxicuda-multimodal (cross-modal fusion, shared-encoder scaffolding), oxicuda-geometry3d (point-cloud ops, mesh primitives, spatial indexing), oxicuda-nerf (ray-marching, positional encoding, volume rendering), oxicuda-pinn (PDE loss terms, residual sampling).
- Retrieval and detection: oxicuda-ann (flat/IVF/IVFPQ/HNSW/LSH/PQ/k-NN-graph indexes; Hamming/L2/inner-product distances; SQ4/SQ8 quantizers; k-NN heap select), oxicuda-anomaly (Mahalanobis and COPOD density estimators, k-NN score, LOF).
- Reasoning and scaling: oxicuda-causal (do-calculus primitives, causal-graph scaffolding), oxicuda-moe (top-k routing, expert dispatch, load-balancing loss), oxicuda-quantum (qubit-state simulation primitives, variational-circuit scaffolding).
- Applications: oxicuda-recsys (collaborative filtering, embedding lookup, ranking loss), oxicuda-rlhf (reward-model scaffolding, PPO/DPO wrappers, KL-penalty helpers), oxicuda-tabular (feature encoding, gradient-boosted-tree scaffolding, TabNet blocks).
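To illustrate what "top-k routing" and "load-balancing loss" mean for the oxicuda-moe entry above, here is a small CPU sketch. It is not the crate's API: softmax, top_k_route, and load_balancing_loss are hypothetical names, and the loss shown is the Switch-Transformer-style auxiliary term (number of experts times the sum over experts of dispatch fraction times mean router probability), chosen here as an assumption about what such a helper might compute.

```rust
/// Numerically stable softmax over one token's router logits.
pub fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Pick the k most probable experts for one token and renormalise their
/// gate weights so they sum to 1. Ties go to the lower expert index.
pub fn top_k_route(probs: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].total_cmp(&probs[a])); // descending, stable
    idx.truncate(k);
    let total: f32 = idx.iter().map(|&i| probs[i]).sum();
    idx.into_iter().map(|i| (i, probs[i] / total)).collect()
}

/// Auxiliary loss: E * sum_e f_e * p_e, where f_e is the fraction of
/// token-to-expert assignments that land on expert e and p_e is the mean
/// router probability for e. It equals 1.0 when routing is perfectly uniform.
pub fn load_balancing_loss(all_probs: &[Vec<f32>], k: usize) -> f32 {
    assert!(!all_probs.is_empty() && k > 0);
    let tokens = all_probs.len();
    let experts = all_probs[0].len();
    let mut frac = vec![0.0f32; experts];
    let mut mean_prob = vec![0.0f32; experts];
    for probs in all_probs {
        for (e, _) in top_k_route(probs, k) {
            frac[e] += 1.0; // count assignments per expert
        }
        for (e, p) in probs.iter().enumerate() {
            mean_prob[e] += p; // accumulate router probability per expert
        }
    }
    let assignments = (tokens * k) as f32;
    let dot: f32 = frac
        .iter()
        .zip(&mean_prob)
        .map(|(f, p)| (f / assignments) * (p / tokens as f32))
        .sum();
    experts as f32 * dot
}
```

With two experts, two tokens routed one to each expert give a loss of 1.0, while sending both tokens to the same expert gives 2.0, so minimising the term pushes the router toward an even spread.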

Tips

Compile only the features you use. The defaults are just driver/memory/launch; reach for --features blas or --features dnn (or full when you want everything) rather than pulling in subsystems you will not call. Smaller feature sets mean smaller, faster builds.
Use the SYRK fast path for symmetric rank-k work. When you are forming a covariance, a Gram matrix, or the normal equations AᵀA, prefer the new symmetric rank-k path over a general gemm. It computes only the triangle you need and still hits the Tensor Cores, so you stop paying for the mirrored half of the output.
Let the autotuner cache pay off. Enable --features autotune and let the three-tier dispatch warm its on-disk cache once; subsequent runs resolve straight to the tuned kernel for your GPU architecture.
Reach for oxicuda-ann for vector search. Need approximate nearest neighbour at scale? The IVFPQ and HNSW indexes, with SQ4/SQ8 quantizers and inner-product or L2 distances, are all GPU-resident and Pure Rust.
Use oxicuda-moe for sparse expert models. Top-k routing, expert dispatch, and the load-balancing loss come together so you can wire mixture-of-experts layers without leaving the stack.
Remember it all cross-compiles. Because there is no C/C++ in the default build, the entire stack targets the portability backends and even WASM — write the kernel once, ship it broadly.

This is the foundation

OxiCUDA is the GPU layer beneath the rest of the COOLJAPAN ecosystem. When SciRS2 and NumRS2 run array math on device, when ToRSh and TrustformeRS train, when OxiLLaMa serves a model or OxiONNX executes a graph, and when OxiBLAS and OxiFFT need a GPU backend, OxiCUDA is what carries the kernels — safely, and without an NVIDIA SDK in the build.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if a CUDA Toolkit you can build with nothing but a Rust compiler sounds like the future you want.

Pure Rust GPU computing is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, May 9, 2026