COOLJAPAN
← All posts

OxiCUDA 0.4.0 Released — On-Device GPU Validation Catches What CPU Parity Tests Never Could

OxiCUDA 0.4.0 is an on-device validation pass: for the first time, hand-written PTX across 60+ crates was JIT-compiled and run on real NVIDIA hardware (RTX A4000, sm_86, CUDA 12.4) instead of only checked for CPU-logic parity — catching register-shadowing bugs, base-2/base-e math errors, invalid PTX, and literal stub kernels. 38,093 tests passing, ~1.27M SLoC, 73 crates.

release oxicuda cuda gpu-computing pure-rust ptx gpu-validation nvidia reliability

Every kernel passed its CPU-logic tests. Until today, most of them had never once executed on a real GPU.

Today we released OxiCUDA 0.4.0 — an on-device validation pass. For the first time, hand-written PTX kernels across more than 60 crates were JIT-compiled with Module::from_ptx and executed on real NVIDIA GPU hardware — an RTX A4000, sm_86, CUDA 12.4 — with their output checked against a CPU oracle, instead of only being checked for CPU-side logic parity. This surfaced and fixed dozens of genuine bugs that no amount of CPU-only testing could ever have caught.

No CUDA SDK. No nvcc. No C/C++ toolchain. OxiCUDA is a type-safe, memory-safe, pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack — cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more, across 73 crates and ~1.27M lines of safe Rust. The only thing it needs at runtime is the NVIDIA driver itself (libcuda.so / nvcuda.dll). PTX is generated directly from Rust, a built-in autotuner specializes kernels per GPU architecture from Turing through Blackwell, and the whole thing compiles to a single static binary — or to WASM, or onto multi-vendor backends — without a single line of C++ in your build. 0.4.0 is the release where that hand-written PTX finally met real silicon.

Why OxiCUDA 0.4.0 is a game changer

Every prior release validated PTX kernels the only way possible without a GPU in CI: hand-translate the kernel’s logic into Rust, run it on the CPU, and assert the two agree. That catches a large class of bugs — but it is structurally blind to anything that lives only in the PTX text itself: a kernel ptxas rejects outright, a register name that happens to collide with a CUDA built-in, an lg2.approx dropped in where the math needed ln. The CPU path never touches the actual assembly, so it can never catch what’s wrong with the actual assembly.

0.4.0 closes that blind spot. A feature-gated gpu-tests harness now JIT-compiles each crate’s real PTX via Module::from_ptx, launches it for real on an RTX A4000, and diffs the device output against the same CPU oracle the parity tests already trusted — rolled out across more than 60 crates, with every test skipping gracefully when no device is present.

The encouraging news first: 13 crates — oxicuda-pinn, oxicuda-bayes, oxicuda-federated, oxicuda-continual, oxicuda-peft, oxicuda-meta, oxicuda-tn, oxicuda-sketch, oxicuda-graphalg, oxicuda-cvx, oxicuda-gen, oxicuda-adversarial, oxicuda-hdc — ran 42+ kernels clean on the very first on-device pass, zero defects found. The rest of the story is what happens when hand-written PTX meets a real assembler and real silicon for the first time:

Every fix in this release was verified the honest way: fail, revert to confirm it fails again, re-apply, confirm it passes — against the actual device, not a CPU stand-in.

Technical Deep Dive: how the validation pass works

The harness. Each crate gets a gpu-tests Cargo feature and a src/gpu_tests.rs module. The pattern is uniform: JIT-compile the crate’s own PTX with Module::from_ptx, allocate device buffers, launch with real launch parameters, copy back, and diff against the crate’s existing CPU reference — the same oracle its CPU-parity tests already trusted. No device, no failure: tests skip rather than fail, so cargo test and CI stay green everywhere, and cargo test --features gpu-tests is what you run once a GPU is actually attached. Two crates, oxicuda-pde and oxicuda-numeric, matched their CPU oracle cleanly but still surfaced an honestly-documented caveat rather than a silent pass: pde’s pre-completion fem_assemble_kernel was confirmed to do only a signed-triangle-area scatter, and numeric’s bessel_recurrence is documented as aliasing J_0 across points when a launch batches more than one — both noted rather than hidden.

Algorithms that went from partial to complete. A few kernels turned out to be proxies rather than the full algorithm, and were finished in the same pass. oxicuda-pde’s fem_assemble_kernel is now the full unconstrained dense P1 stiffness assembly — the 3×3 local K_ij = (1/(4·Area))·(b_i·b_j + c_i·c_j) per element, atomically scattered into the global matrix, validated element-wise against the crate’s own p1_local_stiffness to 1e-4 relative / 1e-5 absolute. oxicuda-tabular’s sparsemax_kernel now runs the exact Martins & Astudillo largest-support search, quantile_norm_kernel now does true empirical-CDF interpolation, and node_tree_eval_kernel — previously hardcoded to 2 leaves regardless of depth — now runs the full multi-level NODE tree. oxicuda-moe’s soft_moe_dispatch_kernel went from a first-slot-only proxy to the real 3-pass slot-softmax dispatch matrix. oxicuda-geometry3d’s Gaussian-splatting project_kernel now emits the full EWA 2D covariance it previously never wrote at all, and sh_eval_kernel now evaluates all 9 L=0..2 spherical-harmonic terms instead of a reduced 5-term basis.

Honesty over fabrication. oxicuda-solver’s symmetric eigensolver and QR/SVD device paths were launching an incomplete GPU kernel and reading back values that were never actually computed on-device; they’re now an explicit, documented exact-CPU host fallback until the real GPU kernel lands. oxicuda-ssl had three kernels (barlow_cross_corr_wgmma, nt_xent_softmax_warp, gather_features_bulk) that turn out to be intentionally Hopper/Blackwell-only PTX (wgmma, redux.sync, TMA) — each is now documented as such, with an on-device-confirmed portable scalar fallback for Ampere and older.

New algorithmic surface. Alongside the validation sweep: oxicuda-cvx gained preconditioned conjugate gradient — pcg_solve / cg_solve_counted plus a Preconditioner trait with IdentityPrecond/JacobiPrecond, where Jacobi preconditioning cuts a κ=1e4 diagonal test system from 6 CG iterations down to 1. oxicuda-dnn gained GPT-NeoX half-split partial-rotary RoPE alongside the existing GPT-J/RoFormer interleaved variant. oxicuda-pinn replaced the FNO spectral path’s O(N²) brute-force DFT with a hand-written pure-Rust FFT (radix-2 Cooley-Tukey plus Bluestein for arbitrary N). oxicuda-tabular added VarObliviousLayer (variable-depth NODE oblivious trees) and TabRecordLayer (TabR-style retrieval attention). oxicuda-recsys implemented 4 previously-stub kernels: embedding_lookup, dot_score, bpr_gradient, lightgcn_propagate. oxicuda-webgpu moved to real WGSL parse+validate testing via naga across all 15 shader generators.

Getting Started

Add OxiCUDA and opt into the subsystems you need:

cargo add oxicuda --features blas

Default features are driver, memory, and launch. Each subsystem is its own feature flag (ptx, autotune, blas, dnn, fft, sparse, solver, rand, and full for everything) — and, if you have real GPU hardware and want to run the same on-device validation this release relies on, gpu-tests.

A complete GEMM, end to end:

use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Initialize driver and select GPU device
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Allocate device memory
    let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;

    // Copy host data to device
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // Launch a GEMM: C = alpha * A @ B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    stream.synchronize()?;

    // Copy result back to host
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}

On a machine with an NVIDIA GPU, run the same validation this release depends on:

cargo test --features gpu-tests

No nvcc invocation, no .cu files, no linker flags chasing a CUDA install. cargo build, and you ship.

What’s New in 0.4.0

Tips

This is the foundation

OxiCUDA is the GPU layer beneath the rest of the COOLJAPAN ecosystem, and this release directly derisks that role for everything built on top of it. ToRSh 0.1.3’s new CudaBackend reaches into oxicuda-backend, oxicuda-driver, oxicuda-launch, and oxicuda-ptx; sklears 0.1.2 carries the OxiCUDA v0.3 family — oxicuda-blas, oxicuda-solver, oxicuda-manifold, oxicuda-dnn, oxicuda-primitives among them — in its workspace. When SciRS2 and NumRS2 crunch numbers, when ToRSh and TrustformeRS train and run models, when OxiONNX executes graphs, when OxiBLAS and OxiFFT provide linear algebra and transforms — OxiCUDA is what carries the work to the GPU, in pure Rust, from Turing to Blackwell and onto multi-vendor backends besides. 0.4.0 is the release where the kernels underneath all of that got checked against the hardware they were always meant to run on.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if you believe GPU kernels deserve to be tested against the silicon they actually run on, not just a CPU stand-in. Every star tells us to keep building.

Pure Rust GPU computing is here — fast, safe, sovereign, and now proven on real hardware.

KitaSan at COOLJAPAN OÜ July 1, 2026

↑ Back to all posts