OxiCUDA 0.4.0 Released — On-Device GPU Validation Catches What CPU Parity Tests Never Could

Every kernel passed its CPU-logic tests. Until today, most of them had never once executed on a real GPU.

Today we released OxiCUDA 0.4.0 — an on-device validation pass. For the first time, hand-written PTX kernels across more than 60 crates were JIT-compiled with Module::from_ptx and executed on real NVIDIA GPU hardware — an RTX A4000, sm_86, CUDA 12.4 — with their output checked against a CPU oracle, instead of only being checked for CPU-side logic parity. This surfaced and fixed dozens of genuine bugs that no amount of CPU-only testing could ever have caught.

No CUDA SDK. No nvcc. No C/C++ toolchain. OxiCUDA is a type-safe, memory-safe, pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack — cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND and more, across 73 crates and ~1.27M lines of safe Rust. The only thing it needs at runtime is the NVIDIA driver itself (libcuda.so / nvcuda.dll). PTX is generated directly from Rust, a built-in autotuner specializes kernels per GPU architecture from Turing through Blackwell, and the whole thing compiles to a single static binary — or to WASM, or onto multi-vendor backends — without a single line of C++ in your build. 0.4.0 is the release where that hand-written PTX finally met real silicon.

Why OxiCUDA 0.4.0 is a game changer

Every prior release validated PTX kernels the only way possible without a GPU in CI: hand-translate the kernel’s logic into Rust, run it on the CPU, and assert the two agree. That catches a large class of bugs — but it is structurally blind to anything that lives only in the PTX text itself: a kernel ptxas rejects outright, a register name that happens to collide with a CUDA built-in, an lg2.approx dropped in where the math needed ln. The CPU path never touches the actual assembly, so it can never catch what’s wrong with the actual assembly.

0.4.0 closes that blind spot. A feature-gated gpu-tests harness now JIT-compiles each crate’s real PTX via Module::from_ptx, launches it for real on an RTX A4000, and diffs the device output against the same CPU oracle the parity tests already trusted — rolled out across more than 60 crates, with every test skipping gracefully when no device is present.

The encouraging news first: 13 crates — oxicuda-pinn, oxicuda-bayes, oxicuda-federated, oxicuda-continual, oxicuda-peft, oxicuda-meta, oxicuda-tn, oxicuda-sketch, oxicuda-graphalg, oxicuda-cvx, oxicuda-gen, oxicuda-adversarial, oxicuda-hdc — ran 42+ kernels clean on the very first on-device pass, zero defects found. The rest of the story is what happens when hand-written PTX meets a real assembler and real silicon for the first time:

Register shadowing was the single most common defect. Hand-written .reg declarations were literally named %tid, %ntid, %ctaid, or %warpid — clobbering CUDA’s own built-in special registers. It hit oxicuda-primitives, all 9 optimizer kernels in oxicuda-train, three kernels in oxicuda-ann, all 5 kernels in both oxicuda-rl and oxicuda-dist-infer, and all 7 kernels in oxicuda-timeseries.
Base-2 math standing in for base-e. PTX’s fast native ex2.approx/lg2.approx were used as drop-in exp/ln in several places — plausible-looking, quietly wrong: Cox proportional-hazards scores 18–30% off in oxicuda-survival, HMM forward log-sum-exp ~30% off in oxicuda-seq, Sinkhorn iterations ~20% off in oxicuda-ot, plus hits in oxicuda-rlhf, oxicuda-nerf, oxicuda-gnn, and oxicuda-audio.
Kernels that never compiled at all. ptxas rejected invalid PTX outright across 15+ crates — undeclared registers, illegal shared-memory addressing, malformed branch labels, a missing .reg .pred. These had been sitting in the tree, never once run through a real assembler.
oxicuda-quantum’s entire statevector simulator had 8 stacked defects — partial gate matrix-vector products, wrong bit-insertion masks, an unguarded swap race, divergent-lane shfl.sync, wrong Taylor-series constants among them. Essentially every operation was computing something other than what it claimed.
oxicuda-solver’s LU and Cholesky panel-factorization kernels were literal ret; stub bodies — they compiled, launched, and returned immediately, performing no factorization at all. Now real bar.sync-staged implementations.

Every fix in this release was verified the honest way: fail, revert to confirm it fails again, re-apply, confirm it passes — against the actual device, not a CPU stand-in.

Technical Deep Dive: how the validation pass works

The harness. Each crate gets a gpu-tests Cargo feature and a src/gpu_tests.rs module. The pattern is uniform: JIT-compile the crate’s own PTX with Module::from_ptx, allocate device buffers, launch with real launch parameters, copy back, and diff against the crate’s existing CPU reference — the same oracle its CPU-parity tests already trusted. No device, no failure: tests skip rather than fail, so cargo test and CI stay green everywhere, and cargo test --features gpu-tests is what you run once a GPU is actually attached. Two crates, oxicuda-pde and oxicuda-numeric, matched their CPU oracle cleanly but still surfaced an honestly-documented caveat rather than a silent pass: pde’s pre-completion fem_assemble_kernel was confirmed to do only a signed-triangle-area scatter, and numeric’s bessel_recurrence is documented as aliasing J_0 across points when a launch batches more than one — both noted rather than hidden.

Algorithms that went from partial to complete. A few kernels turned out to be proxies rather than the full algorithm, and were finished in the same pass. oxicuda-pde’s fem_assemble_kernel is now the full unconstrained dense P1 stiffness assembly — the 3×3 local K_ij = (1/(4·Area))·(b_i·b_j + c_i·c_j) per element, atomically scattered into the global matrix, validated element-wise against the crate’s own p1_local_stiffness to 1e-4 relative / 1e-5 absolute. oxicuda-tabular’s sparsemax_kernel now runs the exact Martins & Astudillo largest-support search, quantile_norm_kernel now does true empirical-CDF interpolation, and node_tree_eval_kernel — previously hardcoded to 2 leaves regardless of depth — now runs the full multi-level NODE tree. oxicuda-moe’s soft_moe_dispatch_kernel went from a first-slot-only proxy to the real 3-pass slot-softmax dispatch matrix. oxicuda-geometry3d’s Gaussian-splatting project_kernel now emits the full EWA 2D covariance it previously never wrote at all, and sh_eval_kernel now evaluates all 9 L=0..2 spherical-harmonic terms instead of a reduced 5-term basis.

Honesty over fabrication. oxicuda-solver’s symmetric eigensolver and QR/SVD device paths were launching an incomplete GPU kernel and reading back values that were never actually computed on-device; they’re now an explicit, documented exact-CPU host fallback until the real GPU kernel lands. oxicuda-ssl had three kernels (barlow_cross_corr_wgmma, nt_xent_softmax_warp, gather_features_bulk) that turn out to be intentionally Hopper/Blackwell-only PTX (wgmma, redux.sync, TMA) — each is now documented as such, with an on-device-confirmed portable scalar fallback for Ampere and older.

New algorithmic surface. Alongside the validation sweep: oxicuda-cvx gained preconditioned conjugate gradient — pcg_solve / cg_solve_counted plus a Preconditioner trait with IdentityPrecond/JacobiPrecond, where Jacobi preconditioning cuts a κ=1e4 diagonal test system from 6 CG iterations down to 1. oxicuda-dnn gained GPT-NeoX half-split partial-rotary RoPE alongside the existing GPT-J/RoFormer interleaved variant. oxicuda-pinn replaced the FNO spectral path’s O(N²) brute-force DFT with a hand-written pure-Rust FFT (radix-2 Cooley-Tukey plus Bluestein for arbitrary N). oxicuda-tabular added VarObliviousLayer (variable-depth NODE oblivious trees) and TabRecordLayer (TabR-style retrieval attention). oxicuda-recsys implemented 4 previously-stub kernels: embedding_lookup, dot_score, bpr_gradient, lightgcn_propagate. oxicuda-webgpu moved to real WGSL parse+validate testing via naga across all 15 shader generators.

Getting Started

Add OxiCUDA and opt into the subsystems you need:

cargo add oxicuda --features blas

Default features are driver, memory, and launch. Each subsystem is its own feature flag (ptx, autotune, blas, dnn, fft, sparse, solver, rand, and full for everything) — and, if you have real GPU hardware and want to run the same on-device validation this release relies on, gpu-tests.

A complete GEMM, end to end:

use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    // Initialize driver and select GPU device
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Allocate device memory
    let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;

    // Copy host data to device
    d_a.copy_from_host(&host_a)?;
    d_b.copy_from_host(&host_b)?;

    // Launch a GEMM: C = alpha * A @ B + beta * C
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32,            // alpha
        &d_a, lda,
        &d_b, ldb,
        0.0f32,            // beta
        &mut d_c, ldc,
    )?;

    stream.synchronize()?;

    // Copy result back to host
    let mut result = vec![0.0f32; m * n];
    d_c.copy_to_host(&mut result)?;
    Ok(())
}

On a machine with an NVIDIA GPU, run the same validation this release depends on:

cargo test --features gpu-tests

No nvcc invocation, no .cu files, no linker flags chasing a CUDA install. cargo build, and you ship.

What’s New in 0.4.0

On-device GPU validation harness — a feature-gated gpu-tests feature plus a src/gpu_tests.rs module per crate, JIT-compiling hand-written PTX via Module::from_ptx and asserting numerical equivalence to a CPU oracle on real hardware (RTX A4000, sm_86, CUDA 12.4). Rolled out across 60+ crates; 13 crates ran 42+ kernels clean on the first pass.
Register-shadowing fixes across oxicuda-primitives, oxicuda-train, oxicuda-ann, oxicuda-rl, oxicuda-dist-infer, and oxicuda-timeseries — CUDA’s built-in special registers were being redeclared as ordinary ones.
Base-2/base-e math fixes across oxicuda-survival, oxicuda-seq, oxicuda-ot, oxicuda-rlhf, oxicuda-nerf, oxicuda-gnn, and oxicuda-audio — ex2.approx/lg2.approx corrected to proper exp/ln scaling.
Invalid-PTX fixes across 15+ crates — kernels ptxas rejected outright now compile and are verified on sm_86.
oxicuda-quantum’s statevector simulator — 8 stacked defects fixed; every gate operation now computes what it claims.
oxicuda-solver’s LU/Cholesky panel factorization — was a literal ret; stub; now a real bar.sync-staged implementation.
Algorithm completions — full P1 FEM stiffness assembly (oxicuda-pde); exact sparsemax/quantile-norm/NODE-tree kernels plus new VarObliviousLayer/TabRecordLayer (oxicuda-tabular); real 3-pass soft-MoE dispatch (oxicuda-moe); full EWA covariance + 9-term spherical harmonics for Gaussian splatting (oxicuda-geometry3d).
New capability — preconditioned conjugate gradient (oxicuda-cvx), GPT-NeoX half-split RoPE (oxicuda-dnn), pure-Rust FFT for FNO spectral convolution (oxicuda-pinn).
Honest fallbacks over fabrication — oxicuda-solver’s eigensolver/QR/SVD device paths now use a documented exact-CPU fallback instead of returning never-computed values; oxicuda-ssl’s Hopper/Blackwell-only kernels are documented as such with confirmed portable fallbacks.
Test suite grew to 38,093 passing tests (--all-features; 37,166 default), up from 36,984 at 0.3.0, including large analytic test-coverage expansions across oxicuda-rlhf, oxicuda-recsys, oxicuda-peft, oxicuda-meta, oxicuda-numeric, oxicuda-evol, oxicuda-solver, and oxicuda-ann.

Tips

Turn on gpu-tests if you have real hardware. cargo test --features gpu-tests JIT-compiles the crate’s actual PTX and runs it against a live device — exactly what caught the bugs in this release. Running it on your own card (anything Turing through Blackwell) extends the same coverage to configurations we haven’t tried yet.
Reach for pcg_solve with JacobiPrecond before plain CG. For ill-conditioned systems, Jacobi preconditioning cut a κ=1e4 diagonal test system from 6 CG iterations to 1 in this release’s own validation — swap in pcg_solve and check whether your iteration count drops.
Use TabRecordLayer when rows have a strong “similar row” signal. It’s TabR-style retrieval — encode, scaled −L2 similarity, entmax attention, convex combination — a genuinely different inductive bias from the tree-structured VarObliviousLayer, and both now share oxicuda-tabular’s entmax/sparsemax simplex code.
Match the RoPE variant to your checkpoint, not just the newest option. oxicuda-dnn now offers both GPT-NeoX half-split (NeoXRopeConfig / apply_rope_neox_half_split) and the earlier GPT-J/RoFormer interleaved rotation. They rotate different pairs of dimensions — mixing them silently corrupts positional information.
If you contribute hand-written PTX, never name a register %tid, %ntid, %ctaid, or %warpid. That naming collision was this release’s single most common defect class, silently clobbering the real special register. Pick any other name.
Treat ex2.approx/lg2.approx as base-2, not a free exp/ln. Several kernels used them as drop-in replacements and produced plausible-but-wrong output that only a real device run against a CPU oracle exposed. Apply the log2(e)/ln(2) scale factor explicitly.

This is the foundation

OxiCUDA is the GPU layer beneath the rest of the COOLJAPAN ecosystem, and this release directly derisks that role for everything built on top of it. ToRSh 0.1.3’s new CudaBackend reaches into oxicuda-backend, oxicuda-driver, oxicuda-launch, and oxicuda-ptx; sklears 0.1.2 carries the OxiCUDA v0.3 family — oxicuda-blas, oxicuda-solver, oxicuda-manifold, oxicuda-dnn, oxicuda-primitives among them — in its workspace. When SciRS2 and NumRS2 crunch numbers, when ToRSh and TrustformeRS train and run models, when OxiONNX executes graphs, when OxiBLAS and OxiFFT provide linear algebra and transforms — OxiCUDA is what carries the work to the GPU, in pure Rust, from Turing to Blackwell and onto multi-vendor backends besides. 0.4.0 is the release where the kernels underneath all of that got checked against the hardware they were always meant to run on.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if you believe GPU kernels deserve to be tested against the silicon they actually run on, not just a CPU stand-in. Every star tells us to keep building.

Pure Rust GPU computing is here — fast, safe, sovereign, and now proven on real hardware.

— KitaSan at COOLJAPAN OÜ July 1, 2026