OxiCUDA 0.1.7 Released — Tensor Core SYR2K Completes the Symmetric Rank-Update Family

Pure-Rust replacement for the entire NVIDIA CUDA Toolkit. 0.1.7 adds a SYR2K Tensor Core kernel (fused A×Bᵀ + B×Aᵀ rank-2k update) to oxicuda-blas, cross-subsystem CUDA kernel enhancements, and Multi-Operation Scheduling improvements. No CUDA SDK, no nvcc, no C/C++ toolchain.

release oxicuda cuda gpu-computing pure-rust tensor-core cublas blas ptx

The symmetric rank-update family on Tensor Cores is now complete — rank-k and rank-2k, both fused, both in pure Rust.

Today we released OxiCUDA 0.1.7 — a type-safe, memory-safe pure-Rust replacement for the NVIDIA CUDA Toolkit that now drives symmetric rank-2k updates through a fused Tensor Core SYR2K kernel.

No CUDA SDK. No nvcc. No C/C++ toolchain. OxiCUDA generates PTX directly from Rust and talks to the GPU through nothing but the NVIDIA driver (libcuda.so / nvcuda.dll). The result compiles into a single static binary, or even WASM, with no host-side build dependencies to chase. A built-in autotuner benchmarks kernel variants per architecture, from Turing all the way to Blackwell, and a set of multi-vendor backends keeps the same code running beyond NVIDIA hardware.

Why 0.1.7 matters

The CUDA Toolkit has been the price of admission to GPU compute for over a decade, and that price is steep: a sprawling C/C++ stack where a stray pointer is a silent corruption, an nvcc-shaped build dependency that pins you to a particular toolchain, vendor lock-in baked into every library boundary, and effectively zero portability off NVIDIA silicon. OxiCUDA replaces that stack wholesale in safe Rust, and each release tightens the BLAS and kernel layers that real numerical code leans on.

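This is a focused, incremental release, one week after 0.1.6, and it lands exactly where it counts: a fused SYR2K Tensor Core kernel in oxicuda-blas, cross-subsystem CUDA kernel enhancements, and Multi-Operation Scheduling improvements.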

No grand totals, no invented benchmarks — just the rank-2k companion the symmetric family was missing, plus the surrounding kernel and scheduling work.

Technical Deep Dive: Symmetric Rank Updates on Tensor Cores

What a rank-2k update is, and why fusing matters. A symmetric rank-2k update computes C ← α(A·Bᵀ + B·Aᵀ) + βC, where C is symmetric. It shows up constantly in numerical linear algebra: forming covariance and Gram matrices, assembling the normal equations behind least-squares fits, and the symmetric eigenvalue and SVD machinery underneath them. The naive route is two separate matrix multiplies (A·Bᵀ and B·Aᵀ) followed by an add — three full passes over the data. OxiCUDA’s syr2k instead fuses A×Bᵀ + B×Aᵀ into a single accumulation, exploiting the symmetry so only the relevant triangle is computed and the cross-products land in the same accumulators. Fewer passes, less memory traffic, and the symmetric structure is honored end to end. The public entry point is syr2k(...), implemented in crates/oxicuda-blas/src/level3/syr2k.rs.
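To make the fused update concrete, here is a minimal CPU reference sketch in plain Rust. It is not OxiCUDA's kernel: the function name syr2k_upper_ref and the row-major layout are assumptions made for illustration. It only shows the arithmetic, with both cross-products landing in one accumulator and only the upper triangle being written.

```rust
/// CPU reference for the upper-triangle symmetric rank-2k update:
/// C <- alpha * (A * B^T + B * A^T) + beta * C.
/// `a` and `b` are n x k, `c` is n x n. Row-major storage is an
/// assumption of this sketch, not a statement about OxiCUDA's layout.
fn syr2k_upper_ref(
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    b: &[f32],
    beta: f32,
    c: &mut [f32],
) {
    assert_eq!(a.len(), n * k);
    assert_eq!(b.len(), n * k);
    assert_eq!(c.len(), n * n);
    for i in 0..n {
        // Only j >= i is computed; the lower triangle is implied by
        // symmetry and is left untouched.
        for j in i..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                // Both cross-products feed the same accumulator,
                // which is the "fused" part of the update.
                acc += a[i * k + p] * b[j * k + p] + b[i * k + p] * a[j * k + p];
            }
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}
```

A reference like this is mainly useful as a correctness oracle: run it on small inputs and compare the upper triangle against whatever the GPU path returns.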

How it rides the PTX Tensor Core path. The kernel is emitted through OxiCUDA’s PTX codegen layer; there is no precompiled cubin and no nvcc in sight. The PTX DSL covers SM 7.5 through 10.0 and generates Tensor Core instructions across the three generations of the hardware API: WMMA on Turing-class parts (SM 7.5), MMA on Ampere and Ada, and WGMMA on Hopper and Blackwell. SYR2K’s fused accumulation maps cleanly onto these warp- and warpgroup-level matrix ops, so the same high-level update specializes to whatever Tensor Core dialect the target GPU speaks.
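The architecture-to-instruction grouping described above can be sketched as a small lookup. The enum TensorCorePath and the function tensor_core_path are hypothetical names for illustration only; they are not part of oxicuda-ptx, and the mapping simply mirrors the grouping stated in this post.

```rust
/// Hypothetical label for the Tensor Core instruction family a kernel
/// would be specialized to. Not an OxiCUDA type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TensorCorePath {
    Wmma,
    Mma,
    Wgmma,
}

/// Map a compute capability (major, minor) to an instruction family,
/// following the grouping given in the post. Anything outside the
/// stated SM 7.5 to 10.0 range yields `None`.
fn tensor_core_path(major: u32, minor: u32) -> Option<TensorCorePath> {
    match (major, minor) {
        (7, 5) => Some(TensorCorePath::Wmma),            // Turing
        (8, _) => Some(TensorCorePath::Mma),             // Ampere and Ada
        (9, 0) | (10, 0) => Some(TensorCorePath::Wgmma), // Hopper and Blackwell
        _ => None,
    }
}
```

Returning an Option rather than panicking leaves room for a caller to fall back to a non-Tensor-Core path on unsupported hardware.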

How MOS sharpens orchestration. Multi-Operation Scheduling sits above individual kernel launches and coordinates how operations are sequenced and overlapped on the device. The 0.1.7 improvements mean that when a workload fires off many GPU operations — a wave of small GEMMs, a batched factorization, a chain of rank updates — the scheduler keeps the device busier and the dependencies straighter than launching each op in isolation would.
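As a rough illustration of what "sequenced and overlapped" can mean, the sketch below groups operations into waves by their dependencies. This is a generic technique (level-by-level topological ordering), not OxiCUDA's scheduler, and the function schedule_waves and its input format are assumptions made for this example.

```rust
/// Group operations into "waves": every op in a wave depends only on
/// ops from earlier waves, so ops within one wave could in principle
/// be overlapped. `deps[i]` lists the indices of ops that op `i` must
/// wait for; each index must be smaller than `deps.len()`.
/// Returns `None` if the dependencies contain a cycle.
fn schedule_waves(deps: &[Vec<usize>]) -> Option<Vec<Vec<usize>>> {
    let n = deps.len();
    let mut done = vec![false; n];
    let mut waves = Vec::new();
    let mut remaining = n;
    while remaining > 0 {
        // An op is ready once all of its dependencies have completed.
        let wave: Vec<usize> = (0..n)
            .filter(|&i| !done[i] && deps[i].iter().all(|&d| done[d]))
            .collect();
        if wave.is_empty() {
            return None; // nothing is ready, so there must be a cycle
        }
        for &i in &wave {
            done[i] = true;
        }
        remaining -= wave.len();
        waves.push(wave);
    }
    Some(waves)
}
```

For a chain of rank updates where op 2 needs ops 0 and 1, and op 3 needs op 2, this yields three waves: the two independent ops first, then each dependent op in turn.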

How the autotuner picks the variant. OxiCUDA’s three-tier dispatch chooses, at call time, among cached, tuned, and default kernel variants, backed by a disk cache that persists across runs. For SYR2K that means the Tensor Core tiling — fragment shapes, accumulation strategy, the WMMA/MMA/WGMMA path — is selected per GPU architecture by the autotuner rather than hard-coded. The first run benchmarks; every run after that reads the winning variant straight from cache.
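One way to picture the cached / tuned / default tiers is the sketch below. It is an illustration under stated assumptions, not the oxicuda-autotune API: Dispatcher, Choice, the in-memory HashMap standing in for the disk cache, and the variant names are all hypothetical, and the benchmark step is replaced by timings passed in by the caller.

```rust
use std::collections::HashMap;

/// Which tier produced the selected variant (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum Choice {
    Cached(String),
    Tuned(String),
    Default(String),
}

/// Hypothetical dispatcher. The HashMap stands in for a persistent
/// disk cache keyed by (operation name, SM version).
struct Dispatcher {
    cache: HashMap<(String, u32), String>,
}

impl Dispatcher {
    fn new() -> Self {
        Dispatcher { cache: HashMap::new() }
    }

    /// `timings` holds (variant name, measured milliseconds) pairs and
    /// stands in for an actual benchmark run.
    fn select(&mut self, op: &str, sm: u32, timings: &[(&str, f64)]) -> Choice {
        let key = (op.to_string(), sm);
        // Tier 1: a previously recorded winner.
        if let Some(v) = self.cache.get(&key) {
            return Choice::Cached(v.clone());
        }
        // Tier 2: pick the fastest measured variant and remember it.
        let best = timings.iter().min_by(|x, y| x.1.total_cmp(&y.1));
        match best {
            Some((name, _)) => {
                self.cache.insert(key, name.to_string());
                Choice::Tuned(name.to_string())
            }
            // Tier 3: no measurements available, use a fixed fallback.
            None => Choice::Default("default".to_string()),
        }
    }
}
```

The second call for the same operation and architecture returns the cached winner without consulting the timings again, which matches the "first run benchmarks, later runs read the cache" behavior described above.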

These layers sit on the established OxiCUDA stack: the Foundation crates (oxicuda-driver, oxicuda-memory, oxicuda-launch, oxicuda-runtime); PTX Codegen and the Autotuner (oxicuda-ptx, oxicuda-autotune); a cuBLAS-equivalent BLAS layer (L1/L2/L3, GEMM in SIMT/Tensor-Core/Split-K forms, batched, across FP16/BF16/TF32/F32/F64/FP8, with the symmetric SYRK/SYR2K family now Tensor-Core accelerated); a cuDNN-equivalent DNN layer; the scientific-computing suite (FFT, sparse, solver, rand); and seven portability backends (Metal, Vulkan, WebGPU, ROCm, Intel Level Zero, the backend trait itself, and CUB-equivalent primitives).

Getting Started

Add OxiCUDA with the BLAS subsystem enabled. The default features are driver, memory, and launch; everything else is opt-in by feature flag.

cargo add oxicuda --features blas

A minimal GEMM through a BlasHandle:

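use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Illustrative sizes (an assumption of this example): square
    // 32 x 32 matrices, so every leading dimension is 32.
    let (m, n, k) = (32, 32, 32);
    let (lda, ldb, ldc) = (32, 32, 32);
    let host_a = vec![1.0f32; 1024];
    let host_b = vec![1.0f32; 1024];

    // Allocate device buffers and upload the two inputs.
    let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C <- 1.0 * A * B + 0.0 * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32, &d_a, lda, &d_b, ldb,
        0.0f32, &mut d_c, ldc,
    )?;
    stream.synchronize()?;
    Ok(())
}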

And the new symmetric rank-2k update — a single fused call instead of two GEMMs plus an add:

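use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;
    let handle = BlasHandle::new(&stream)?;

    // Illustrative setup (an assumption of this example): A and B are
    // n x k with n = k = 32, C is n x n, so every leading dimension is 32.
    // Real code would fill d_a and d_b with copy_from_host first.
    let (n, k) = (32, 32);
    let (lda, ldb, ldc) = (32, 32, 32);
    let d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    let d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;

    // C <- alpha * (A * B^T + B * A^T) + beta * C, with C symmetric.
    // Fill::Upper selects the upper triangle of C.
    handle.syr2k(
        Fill::Upper, Transpose::None,
        n, k,
        1.0f32, &d_a, lda, &d_b, ldb,
        0.0f32, &mut d_c, ldc,
    )?;
    stream.synchronize()?;
    Ok(())
}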

Opt into other subsystems the same way: dnn, fft, sparse, solver, rand, autotune, ptx, or full to pull in everything at once.

What’s New in 0.1.7

Tips

This is the foundation

OxiCUDA is the GPU layer under the COOLJAPAN numerical stack. It is the silicon-facing tier beneath SciRS2 and NumRS2 for scientific and array computing, ToRSh and TrustformeRS for tensors and transformers, OxiLLaMa and OxiONNX for inference, and OxiBLAS and OxiFFT for the linear-algebra and spectral primitives those libraries call. Every SYR2K kernel that lands here makes the symmetric algebra above it faster — in safe, portable, pure Rust.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if you want GPU compute without the C/C++ toolchain, without nvcc, and without the lock-in.

Pure Rust GPU computing is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ May 16, 2026
