OxiCUDA 0.1.7 Released — Tensor Core SYR2K Completes the Symmetric Rank-Update Family

The symmetric rank-update family on Tensor Cores is now complete — rank-k and rank-2k, both fused, both in pure Rust.

Today we released OxiCUDA 0.1.7 — a type-safe, memory-safe pure-Rust replacement for the NVIDIA CUDA Toolkit that now drives symmetric rank-2k updates through a fused Tensor Core SYR2K kernel.

No CUDA SDK. No nvcc. No C/C++ toolchain. OxiCUDA generates PTX directly from Rust and talks to the GPU through the NVIDIA driver alone (libcuda.so / nvcuda.dll). The result compiles into a single static binary, or even WASM, with no host-side build dependencies to chase. A built-in autotuner benchmarks kernel variants per architecture, from Turing through Blackwell, and a set of multi-vendor backends lets the same code run on non-NVIDIA hardware.

Why 0.1.7 matters

The CUDA Toolkit has been the price of admission to GPU compute for over a decade, and that price is steep: a sprawling C/C++ stack where a stray pointer is a silent corruption, an nvcc-shaped build dependency that pins you to a particular toolchain, vendor lock-in baked into every library boundary, and effectively zero portability off NVIDIA silicon. OxiCUDA replaces that stack wholesale in safe Rust, and each release tightens the BLAS and kernel layers that real numerical code leans on.

This is a focused, incremental release — one week after 0.1.6 — and it lands exactly where it counts:

SYR2K on Tensor Cores. A symmetric rank-2k update, C ← α(A·Bᵀ + B·Aᵀ) + βC, executed on Tensor Core hardware units with the two cross-products fused into a single accumulation pass. This is the natural companion to the rank-k SYRK fast path that 0.1.6 introduced the week before, and together they complete the symmetric-update family on Tensor Cores: rank-k (SYRK) and rank-2k (SYR2K), both Tensor-Core accelerated.
Cross-subsystem CUDA kernel enhancements. Kernel improvements landed across the driver, memory, launch, BLAS, and backend layers — broad polish rather than a single hot spot.
MOS scheduling improvements. Multi-Operation Scheduling got better at orchestrating GPU task graphs, which pays off most when you are dispatching many operations rather than one large one.

No grand totals, no invented benchmarks — just the rank-2k companion the symmetric family was missing, plus the surrounding kernel and scheduling work.

Technical Deep Dive: Symmetric Rank Updates on Tensor Cores

What a rank-2k update is, and why fusing matters. A symmetric rank-2k update computes C ← α(A·Bᵀ + B·Aᵀ) + βC, where C is symmetric and A and B share the same shape (n×k in the non-transposed form, with C being n×n). It appears wherever a symmetric matrix is updated by a pair of cross-products: the trailing-matrix update in the blocked tridiagonal reduction behind symmetric eigensolvers, symmetrized cross-covariance terms between two data sets, and the symmetric cross terms that arise when assembling normal equations for least-squares fits. Plain Gram and covariance matrices, A·Aᵀ, are the rank-k SYRK case. The naive route is two separate matrix multiplies (A·Bᵀ and B·Aᵀ) followed by an add: three full passes over the data. OxiCUDA’s syr2k instead fuses A·Bᵀ + B·Aᵀ into a single accumulation, exploiting the symmetry so that only the requested triangle is computed and both cross-products land in the same accumulators. The result is fewer passes and less memory traffic, with the symmetric structure honored end to end. The public entry point is syr2k(...), implemented in crates/oxicuda-blas/src/level3/syr2k.rs.
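
To make the fused accumulation concrete, here is a minimal CPU reference sketch of the same update. It is not OxiCUDA's kernel: the function name syr2k_upper_ref, the row-major layout, and the choice to write only the upper triangle are assumptions made for illustration.

```rust
/// Hypothetical CPU reference for the upper triangle of
/// C <- alpha * (A * B^T + B * A^T) + beta * C.
/// Assumed layout: `a` and `b` are n x k, `c` is n x n, all row-major.
fn syr2k_upper_ref(
    n: usize,
    k: usize,
    alpha: f32,
    a: &[f32],
    b: &[f32],
    beta: f32,
    c: &mut [f32],
) {
    assert_eq!(a.len(), n * k);
    assert_eq!(b.len(), n * k);
    assert_eq!(c.len(), n * n);
    for i in 0..n {
        // Only j >= i is visited, so the lower triangle is never touched.
        for j in i..n {
            // Both cross-products feed one accumulator in one pass over k.
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[j * k + p] + b[i * k + p] * a[j * k + p];
            }
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}
```

For A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]] with alpha = 1 and beta = 0, the upper triangle comes out as 34, 62, 106, and the lower-left entry keeps whatever value it had before the call.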

How it rides the PTX Tensor Core path. The kernel is emitted through OxiCUDA’s PTX codegen layer; there is no precompiled cubin and no nvcc in sight. The PTX DSL covers SM 7.5 through 10.0 and generates Tensor Core instructions across three generations of the hardware API: WMMA on Turing-class parts (Volta, at SM 7.0, falls below the supported range), MMA on Ampere and Ada, and WGMMA on Hopper and Blackwell. SYR2K’s fused accumulation maps cleanly onto these warp- and warpgroup-level matrix ops, so the same high-level update specializes to whatever Tensor Core dialect the target GPU speaks.
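
The per-architecture specialization described here can be pictured as a small lookup from compute capability to instruction family. The sketch below only restates the mapping given in this post; the enum TensorCoreApi and the function tensor_core_api are hypothetical names, not part of OxiCUDA's API.

```rust
/// Hypothetical label for the Tensor Core instruction family.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TensorCoreApi {
    Wmma,
    Mma,
    Wgmma,
}

/// Map a (major, minor) compute capability to the family this post
/// associates with it. Returns None outside the stated SM 7.5 to 10.0 range.
fn tensor_core_api(major: u32, minor: u32) -> Option<TensorCoreApi> {
    match (major, minor) {
        (7, 5) => Some(TensorCoreApi::Wmma),             // Turing
        (8, _) => Some(TensorCoreApi::Mma),              // Ampere and Ada
        (9, 0) | (10, 0) => Some(TensorCoreApi::Wgmma),  // Hopper and Blackwell
        _ => None,
    }
}
```

A real code generator would key tile shapes and data types off the same decision; this sketch stops at the family name.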

How MOS sharpens orchestration. Multi-Operation Scheduling sits above individual kernel launches and coordinates how operations are sequenced and overlapped on the device. The 0.1.7 improvements mean that when a workload fires off many GPU operations — a wave of small GEMMs, a batched factorization, a chain of rank updates — the scheduler keeps the device busier and the dependencies straighter than launching each op in isolation would.
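
One generic way to think about sequencing and overlap in a task graph is to group operations into waves, where every operation in a wave depends only on earlier waves and can therefore run concurrently with its wave-mates. The sketch below uses Kahn's topological ordering for that; it illustrates the general idea only and makes no claim about how MOS is implemented. The name schedule_waves and the dependency-list input are assumptions.

```rust
/// Hypothetical wave scheduler. `deps[i]` lists the ops that op `i` waits on
/// (indices must be < deps.len()). Returns None if the graph has a cycle.
fn schedule_waves(deps: &[Vec<usize>]) -> Option<Vec<Vec<usize>>> {
    let n = deps.len();
    // Number of unfinished dependencies per op.
    let mut remaining: Vec<usize> = deps.iter().map(|d| d.len()).collect();
    // Reverse edges: which ops are waiting on each op.
    let mut dependents: Vec<Vec<usize>> = vec![Vec::new(); n];
    for (op, ds) in deps.iter().enumerate() {
        for &d in ds {
            dependents[d].push(op);
        }
    }

    let mut ready: Vec<usize> = (0..n).filter(|&i| remaining[i] == 0).collect();
    let mut waves = Vec::new();
    let mut scheduled = 0;
    while !ready.is_empty() {
        let mut next = Vec::new();
        for &op in &ready {
            for &child in &dependents[op] {
                remaining[child] -= 1;
                if remaining[child] == 0 {
                    next.push(child);
                }
            }
        }
        scheduled += ready.len();
        // Emit the current wave and move on to the ops it unblocked.
        waves.push(std::mem::replace(&mut ready, next));
    }

    // Any op left unscheduled sits on a dependency cycle.
    if scheduled == n {
        Some(waves)
    } else {
        None
    }
}
```

With two independent ops feeding a third, and a fourth waiting on the third, the waves are [0, 1], then [2], then [3]: the first two can overlap, the rest are serialized by their dependencies.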

How the autotuner picks the variant. OxiCUDA’s three-tier dispatch chooses, at call time, among cached, tuned, and default kernel variants, backed by a disk cache that persists across runs. For SYR2K that means the Tensor Core tiling — fragment shapes, accumulation strategy, the WMMA/MMA/WGMMA path — is selected per GPU architecture by the autotuner rather than hard-coded. The first run benchmarks; every run after that reads the winning variant straight from cache.
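
A three-tier lookup of this kind can be sketched in a few lines. Everything here is an assumption for illustration: the Variant fields, the Tier names, the string key, the reading of "tuned" as "benchmark now and store the winner", and the in-memory HashMap standing in for the disk cache. OxiCUDA's actual types and cache format may differ.

```rust
use std::collections::HashMap;

/// Hypothetical kernel variant: just a tile shape.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Variant {
    tile_m: u32,
    tile_n: u32,
    tile_k: u32,
}

/// Which tier produced the answer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Tier {
    Cached,
    Tuned,
    Default,
}

/// Conservative fallback used when nothing is cached and no tuner is available.
const DEFAULT_VARIANT: Variant = Variant { tile_m: 64, tile_n: 64, tile_k: 16 };

/// Look up a variant for `key` (for example, an op plus architecture string).
/// `cache` stands in for the persistent disk cache.
fn select<F>(cache: &mut HashMap<String, Variant>, key: &str, tuner: Option<F>) -> (Tier, Variant)
where
    F: FnOnce() -> Variant,
{
    // Tier 1: a previous run already found the winner.
    if let Some(v) = cache.get(key) {
        return (Tier::Cached, *v);
    }
    // Tier 2: benchmark now and remember the result.
    if let Some(tune) = tuner {
        let best = tune();
        cache.insert(key.to_string(), best);
        return (Tier::Tuned, best);
    }
    // Tier 3: no cache entry and no tuner.
    (Tier::Default, DEFAULT_VARIANT)
}
```

The first call with a tuner pays the benchmarking cost and fills the cache; later calls for the same key return from tier 1 without running the tuner at all.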

These layers sit on the established OxiCUDA stack: the Foundation crates (oxicuda-driver, oxicuda-memory, oxicuda-launch, oxicuda-runtime); PTX Codegen and the Autotuner (oxicuda-ptx, oxicuda-autotune); a cuBLAS-equivalent BLAS layer (L1/L2/L3, GEMM in SIMT/Tensor-Core/Split-K forms, batched, across FP16/BF16/TF32/F32/F64/FP8, with the symmetric SYRK/SYR2K family now Tensor-Core accelerated); a cuDNN-equivalent DNN layer; the scientific-computing suite (FFT, sparse, solver, rand); and seven portability backends (Metal, Vulkan, WebGPU, ROCm, Intel Level Zero, the backend trait itself, and CUB-equivalent primitives).

Getting Started

Add OxiCUDA with the BLAS subsystem enabled. The default features are driver, memory, and launch; everything else is opt-in by feature flag.

cargo add oxicuda --features blas

A minimal GEMM through a BlasHandle:

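use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Square 32 x 32 operands, so every leading dimension is 32.
    let (m, n, k) = (32, 32, 32);
    let (lda, ldb, ldc) = (32, 32, 32);
    let host_a = vec![1.0f32; 1024];
    let host_b = vec![1.0f32; 1024];

    // Allocate device buffers and upload both inputs.
    let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C <- 1.0 * A * B + 0.0 * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32, &d_a, lda, &d_b, ldb,
        0.0f32, &mut d_c, ldc,
    )?;
    stream.synchronize()?;
    Ok(())
}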

And the new symmetric rank-2k update — a single fused call instead of two GEMMs plus an add:

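use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;
    let handle = BlasHandle::new(&stream)?;

    // Square 32 x 32 operands, so every leading dimension is 32.
    let (n, k) = (32, 32);
    let (lda, ldb, ldc) = (32, 32, 32);
    let host_a = vec![1.0f32; 1024];
    let host_b = vec![2.0f32; 1024];

    // Allocate device buffers and upload both inputs.
    let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // C <- alpha * (A * B^T + B * A^T) + beta * C, with C symmetric.
    handle.syr2k(
        Fill::Upper, Transpose::None,
        n, k,
        1.0f32, &d_a, lda, &d_b, ldb,
        0.0f32, &mut d_c, ldc,
    )?;
    stream.synchronize()?;
    Ok(())
}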
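
In conventional BLAS, a rank-2k update called with an upper fill mode writes only the upper triangle of C. Assuming OxiCUDA follows that convention (this post does not say either way), code that needs the full symmetric matrix on the host can mirror the triangle after copying the result back. The helper below is a hypothetical host-side sketch that assumes a row-major n x n buffer.

```rust
/// Hypothetical host-side helper: copy the strict upper triangle of a
/// row-major n x n matrix into the lower triangle, making it fully symmetric.
fn mirror_upper_to_lower(n: usize, c: &mut [f32]) {
    assert_eq!(c.len(), n * n);
    for i in 0..n {
        for j in (i + 1)..n {
            // Entry (j, i) below the diagonal takes the value of (i, j) above it.
            c[j * n + i] = c[i * n + j];
        }
    }
}
```

For a column-major buffer the same loop would copy in the opposite direction, so the layout assumption matters.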

Opt into other subsystems the same way: dnn, fft, sparse, solver, rand, autotune, ptx, or full to pull in everything at once.

What’s New in 0.1.7

SYR2K Tensor Core kernel — a symmetric rank-2k update on Tensor Core hardware with fused A×Bᵀ + B×Aᵀ accumulation, complementing the rank-k SYRK fast path from 0.1.6 and completing the symmetric-update family.
Cross-subsystem CUDA kernel enhancements — kernel improvements spanning the driver, memory, launch, BLAS, and backend layers.
MOS (Multi-Operation Scheduling) improvements — better orchestration of GPU task graphs, especially for workloads built from many operations.

Tips

Reach for SYR2K on symmetric rank-2k work. Whenever you would compute C ← A·Bᵀ + B·Aᵀ (symmetrized cross-products of two different matrices, such as the trailing update in a blocked tridiagonal reduction or a symmetrized cross-covariance), use the fused syr2k call instead of issuing two separate GEMMs and adding. One pass, the symmetry exploited, less memory traffic. Plain covariance and Gram matrices are self-products, so they belong to SYRK.
Pair SYRK and SYR2K. SYRK (rank-k, C ← A·Aᵀ) and SYR2K (rank-2k, C ← A·Bᵀ + B·Aᵀ) now cover the symmetric family on Tensor Cores. Pick rank-k for self-products and rank-2k for cross-products, and let both ride the same accelerated path.
Enable only the features you need. Defaults are driver/memory/launch. Add blas for BLAS, fft/sparse/solver/rand for the scientific suite, dnn for neural-net primitives — or full when you want it all. Smaller feature sets mean leaner builds.
Let the autotuner cache do the picking. With the autotune feature on, the first SYR2K (or GEMM) on a given GPU benchmarks the Tensor Core tilings and writes the winner to the disk cache; subsequent runs read it back. Warm the cache once and the best variant is chosen for you per architecture.
Lean on MOS for many small ops. If your hot path launches lots of little operations rather than one big kernel, the Multi-Operation Scheduling improvements in 0.1.7 are where the orchestration wins land — keep the device fed by letting MOS sequence the graph.
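
The advice to pick rank-k for self-products follows from a simple identity: with B = A, the rank-2k update computes A·Aᵀ twice, so SYR2K with α = 1/2 reproduces SYRK at roughly double the arithmetic. The CPU sketch below checks that identity; both function names and the row-major layout are assumptions for illustration, not OxiCUDA code.

```rust
/// Hypothetical rank-k reference: upper triangle of A * A^T
/// (A is n x k, row-major; result is n x n, row-major).
fn syrk_upper(n: usize, k: usize, a: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0f32; n * n];
    for i in 0..n {
        for j in i..n {
            let dot: f32 = (0..k).map(|p| a[i * k + p] * a[j * k + p]).sum();
            c[i * n + j] = dot;
        }
    }
    c
}

/// Hypothetical rank-2k reference: upper triangle of
/// alpha * (A * B^T + B * A^T), same layout as above.
fn syr2k_upper(n: usize, k: usize, alpha: f32, a: &[f32], b: &[f32]) -> Vec<f32> {
    let mut c = vec![0.0f32; n * n];
    for i in 0..n {
        for j in i..n {
            let acc: f32 = (0..k)
                .map(|p| a[i * k + p] * b[j * k + p] + b[i * k + p] * a[j * k + p])
                .sum();
            c[i * n + j] = alpha * acc;
        }
    }
    c
}
```

Calling syr2k_upper(n, k, 0.5, &a, &a) returns the same triangle as syrk_upper(n, k, &a), which is why a dedicated rank-k path is the better fit when both operands are the same matrix.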

This is the foundation

OxiCUDA is the GPU layer under the COOLJAPAN numerical stack. It is the silicon-facing tier beneath SciRS2 and NumRS2 for scientific and array computing, ToRSh and TrustformeRS for tensors and transformers, OxiLLaMa and OxiONNX for inference, and OxiBLAS and OxiFFT for the linear-algebra and spectral primitives those libraries call. Every SYR2K kernel that lands here makes the symmetric algebra above it faster — in safe, portable, pure Rust.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if you want GPU compute without the C/C++ toolchain, without nvcc, and without the lock-in.

Pure Rust GPU computing is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ May 16, 2026