OxiCUDA 0.1.8 Released — Numerical-Stability and Allocator Tuning Polish

Small but solid: a maintenance release that sharpens the numbers and the memory path under the whole stack.

Today we released OxiCUDA 0.1.8 — a maintenance release that tightens numerical stability in the HMC variational sampler and tunes the stream-ordered allocator, with a little reduction-quality polish on the side.

No CUDA SDK. No nvcc. No C/C++ toolchain at build time. OxiCUDA is a type-safe, memory-safe pure-Rust replacement for the entire NVIDIA CUDA Toolkit software stack — cuBLAS, cuDNN, cuFFT, cuSPARSE, cuSOLVER, cuRAND, and the foundation beneath them. The only runtime dependency is the NVIDIA driver itself (libcuda.so / nvcuda.dll). PTX is generated directly from Rust, and a built-in autotuner specializes kernels per GPU architecture from Turing through Blackwell. The result is a single static binary — or a WASM module — with multi-vendor portability backends (Metal, Vulkan, WebGPU, ROCm, LevelZero) underneath the same API.

Why 0.1.8 matters

This is a small release, and we want to be honest about that. There are no new subsystems here, no headline feature. What it does contain is the kind of work a numerical GPU stack lives or dies by: stability and tuning.

That work earns its keep precisely because the failure modes it addresses are quiet ones. A Hamiltonian Monte Carlo sampler is sensitive to numerical error: with a poorly chosen step size, leapfrog integration can drift, and the sampler then rejects more proposals than it should or subtly biases a posterior without ever throwing an error. A pooled, stream-ordered allocator sits on the hot path for throughput: get its reuse and fragmentation behavior wrong and you pay for it in sync overhead and memory pressure, not in a stack trace. And the quality of a dimensionality-reduction embedding shapes everything downstream that consumes it.

None of these show up as crashes. They show up as results that are a little off, or a little slow. The way you keep them honest is a high regression bar, and at 0.1.8 the workspace carries 23,535 passing tests. That number is the point of a release like this one.

Technical Deep Dive: What Changed Under the Hood

Three changes, each in a different corner of the stack.

(a) HMC variational sampler stability — oxicuda-bayes. The Bayesian deep-learning crate provides variational inference, normalizing flows, ELBO/IWAE objectives, MC Dropout, Laplace approximation, and calibration (including ECE). Its Hamiltonian Monte Carlo sampler is one of the more numerically delicate pieces: leapfrog integration accumulates floating-point error across steps, and an unstable step size can degrade acceptance and distort the sampled posterior. This release refines the numerical handling in that path so the sampler holds together more tightly across step-size and trajectory-length regimes, giving cleaner posteriors without changing the public API.
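To make the step-size sensitivity concrete, here is a minimal, CPU-only sketch of leapfrog integration on a one-dimensional standard-normal target. It is not the oxicuda-bayes implementation; the function names (leapfrog, energy_error, acceptance_probability) and the unit-mass Hamiltonian are assumptions chosen for illustration. It shows how the energy error of a trajectory, which drives the Metropolis acceptance probability, stays small for small steps and blows up once the step size passes the stability limit of the integrator.

```rust
/// Potential energy U(q) = q^2 / 2 for a standard-normal target (illustrative choice).
fn potential(q: f64) -> f64 {
    0.5 * q * q
}

/// Gradient of the potential, dU/dq = q.
fn grad_potential(q: f64) -> f64 {
    q
}

/// Total energy H(q, p) = U(q) + p^2 / 2, assuming unit mass.
fn hamiltonian(q: f64, p: f64) -> f64 {
    potential(q) + 0.5 * p * p
}

/// One leapfrog trajectory: half momentum step, alternating full steps,
/// and a closing half momentum step.
fn leapfrog(mut q: f64, mut p: f64, step: f64, n_steps: usize) -> (f64, f64) {
    p -= 0.5 * step * grad_potential(q);
    for i in 0..n_steps {
        q += step * p;
        // Full momentum update, except a half update after the last position step.
        let scale = if i + 1 == n_steps { 0.5 } else { 1.0 };
        p -= scale * step * grad_potential(q);
    }
    (q, p)
}

/// Energy error accumulated over a trajectory; zero for an exact integrator.
fn energy_error(q0: f64, p0: f64, step: f64, n_steps: usize) -> f64 {
    let (q1, p1) = leapfrog(q0, p0, step, n_steps);
    hamiltonian(q1, p1) - hamiltonian(q0, p0)
}

/// Metropolis acceptance probability min(1, exp(-dH)).
fn acceptance_probability(dh: f64) -> f64 {
    (-dh).exp().min(1.0)
}

fn main() {
    let (q0, p0) = (1.0, 0.5);
    // For this quadratic potential the integrator is only stable for step < 2.
    for &step in &[0.05, 0.5, 1.9, 2.1] {
        let dh = energy_error(q0, p0, step, 50);
        println!(
            "step = {step:<4}  dH = {dh:+.3e}  accept = {:.3}",
            acceptance_probability(dh)
        );
    }
}
```

Nothing in this sketch raises an error when the step size is too large. The only symptom is a collapsing acceptance probability, which is exactly the quiet failure mode described above.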

(b) Stream-ordered allocator tuning — oxicuda-driver. Stream-ordered allocation is the cudaMallocAsync-style model: instead of a global synchronous malloc/free, allocations and frees are ordered with respect to a particular stream, and freed blocks are returned to a pool that can satisfy later allocations on that stream without a hard device sync. It is a strong fit for high-churn workloads where the same shapes are allocated and freed repeatedly. The 0.1.8 tuning improves the pool’s reuse and fragmentation behavior so that those churn-heavy patterns spend less time fragmenting memory and less time synchronizing.
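The pooling idea can be sketched on the host without touching a GPU. The model below is a simplified illustration, not the oxicuda-driver allocator: StreamOrderedPool, Block, StreamId, the power-of-two size classes, and the 256-byte minimum are all hypothetical. It keeps one free list per (stream, size class) pair and counts how often a request has to fall back to a fresh device allocation.

```rust
use std::collections::HashMap;

/// Hypothetical stream identifier, standing in for a real stream handle.
type StreamId = u32;

/// Hypothetical device block handle: an id plus its rounded-up size.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Block {
    id: u64,
    bytes: usize,
}

#[derive(Default)]
struct StreamOrderedPool {
    /// Free blocks keyed by (stream, size class).
    free_lists: HashMap<(StreamId, usize), Vec<Block>>,
    next_id: u64,
    /// Number of requests that could not be served from the pool.
    device_allocs: usize,
}

impl StreamOrderedPool {
    /// Round requests up to a power-of-two class so similar shapes share blocks.
    fn size_class(bytes: usize) -> usize {
        bytes.max(256).next_power_of_two()
    }

    /// Serve a request from this stream's free list when possible.
    fn alloc(&mut self, stream: StreamId, bytes: usize) -> Block {
        let class = Self::size_class(bytes);
        if let Some(block) = self
            .free_lists
            .get_mut(&(stream, class))
            .and_then(|list| list.pop())
        {
            // Reuse: the block was freed in order on this same stream,
            // so this model needs no device-wide synchronization.
            return block;
        }
        self.device_allocs += 1;
        self.next_id += 1;
        Block { id: self.next_id, bytes: class }
    }

    /// Return a block to the free list of the stream that used it.
    fn release(&mut self, stream: StreamId, block: Block) {
        self.free_lists
            .entry((stream, block.bytes))
            .or_default()
            .push(block);
    }
}

fn main() {
    let mut pool = StreamOrderedPool::default();
    let stream: StreamId = 0;
    // High-churn loop: the same three scratch shapes are requested every step.
    for _step in 0..1_000 {
        let scratch: Vec<Block> = [4_096, 65_536, 1_000_000]
            .iter()
            .map(|&bytes| pool.alloc(stream, bytes))
            .collect();
        for block in scratch {
            pool.release(stream, block);
        }
    }
    println!("fresh device allocations: {}", pool.device_allocs);
}
```

A real stream-ordered allocator also has to handle reuse across streams, trimming, and fragmentation inside the pool, none of which this sketch models. It only illustrates why repeated shapes on one stream are the favorable case.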

(c) TriMap reduction polish — oxicuda-manifold. TriMap is a triplet-based dimensionality-reduction method that lives alongside the UMAP- and t-SNE-style embeddings in oxicuda-manifold. It is particularly good at preserving global structure. This release polishes the reduction so the resulting embeddings are a little cleaner and more consistent — a quiet quality improvement for anything that feeds on those coordinates.
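For readers unfamiliar with triplet-based embeddings, the following sketch evaluates the triplet loss in the form commonly given for TriMap: for a triplet (i, j, k) in which j should sit closer to i than k does, the loss is w * s(i, k) / (s(i, j) + s(i, k)) with s(a, b) = 1 / (1 + ||a - b||^2). Whether oxicuda-manifold uses exactly this form is an assumption, and the function names here are illustrative, not the crate's API.

```rust
/// Squared Euclidean distance between two embedded points.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Heavy-tailed similarity s(a, b) = 1 / (1 + ||a - b||^2).
fn similarity(a: &[f64], b: &[f64]) -> f64 {
    1.0 / (1.0 + sq_dist(a, b))
}

/// Loss for one triplet (i, j, k), where j is meant to be nearer to i than k.
/// The value lies in (0, weight): near 0 when satisfied, near weight when violated.
fn triplet_loss(y: &[Vec<f64>], (i, j, k): (usize, usize, usize), weight: f64) -> f64 {
    let s_near = similarity(&y[i], &y[j]);
    let s_far = similarity(&y[i], &y[k]);
    weight * s_far / (s_near + s_far)
}

/// Sum of weighted triplet losses over an embedding.
fn total_loss(y: &[Vec<f64>], triplets: &[((usize, usize, usize), f64)]) -> f64 {
    triplets.iter().map(|&(t, w)| triplet_loss(y, t, w)).sum()
}

fn main() {
    // Three points on a line: 0 and 1 are neighbors, 2 is far away.
    let y = vec![vec![0.0, 0.0], vec![1.0, 0.0], vec![10.0, 0.0]];
    let satisfied = triplet_loss(&y, (0, 1, 2), 1.0);
    let violated = triplet_loss(&y, (0, 2, 1), 1.0);
    println!("satisfied triplet loss: {satisfied:.4}");
    println!("violated triplet loss:  {violated:.4}");
    println!("total: {:.4}", total_loss(&y, &[((0, 1, 2), 1.0), ((0, 2, 1), 1.0)]));
}
```

Because each triplet compares a near pair against a far pair, the objective constrains relative distances across the whole dataset, which is one intuition for why this family of methods tends to preserve global layout.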

All three ride on the same broader stack: the Foundation layer (driver/memory/launch/runtime), the PTX codegen and autotuner, the BLAS layer (cuBLAS-equivalent, including the Tensor-Core SYRK/SYR2K family), the DNN layer (cuDNN-equivalent), the scientific-computing layers (FFT/sparse/solver/rand), and the portability backends.

Getting Started

Add OxiCUDA and opt into the subsystem you need:

cargo add oxicuda --features blas

The default features are ["driver", "memory", "launch"] — the Foundation layer. Everything above it is feature-gated, so you pull in blas, dnn, fft, sparse, solver, rand, and the rest only when you use them.

A minimal GEMM looks like this:

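use oxicuda::prelude::*;

fn main() -> Result<(), oxicuda::Error> {
    let device = Device::get(0)?;
    let ctx = Context::new(device)?;
    let stream = Stream::new(&ctx)?;

    // Illustrative 32 x 32 x 32 problem: each matrix holds 1024 elements.
    let (m, n, k) = (32, 32, 32);
    // Leading dimensions for densely packed 32 x 32 matrices.
    let (lda, ldb, ldc) = (32, 32, 32);
    let host_a = vec![1.0f32; 1024];
    let host_b = vec![1.0f32; 1024];

    // Allocate device buffers and upload the two inputs.
    let mut d_a = DeviceBuffer::<f32>::zeroed(1024)?;
    d_a.copy_from_host(&host_a)?;
    let mut d_b = DeviceBuffer::<f32>::zeroed(1024)?;
    d_b.copy_from_host(&host_b)?;
    let mut d_c = DeviceBuffer::<f32>::zeroed(1024)?;

    // C = 1.0 * A * B + 0.0 * C, enqueued on the handle's stream.
    let handle = BlasHandle::new(&stream)?;
    handle.gemm(
        Transpose::None, Transpose::None,
        m, n, k,
        1.0f32, &d_a, lda, &d_b, ldb,
        0.0f32, &mut d_c, ldc,
    )?;
    // Wait for the GEMM to finish before the buffers go out of scope.
    stream.synchronize()?;
    Ok(())
}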

The handle is bound to a stream, the buffers live on the device, and the whole thing is type-checked and memory-safe — no raw pointers crossing the FFI boundary, because there is no FFI boundary beyond the driver.

What’s New in 0.1.8

This release consists of exactly three changes:

HMC variational sampler stability: numerical-stability refinements in the Hamiltonian Monte Carlo sampler in oxicuda-bayes, for tighter, cleaner posteriors.
Stream-ordered allocator tuning: tuning of the cudaMallocAsync-style pooled, stream-ordered allocator in oxicuda-driver for better reuse and less fragmentation under high-churn allocation.
TriMap reduction polish: quality polish on the TriMap triplet-based dimensionality reduction in oxicuda-manifold.

All of it lands against a reliability bar of 23,535 passing tests across the workspace.

Tips

Reach for the stream-ordered (async) allocator on high-churn workloads. If your code allocates and frees the same shapes over and over — per-step scratch buffers, batched intermediates — the pooled, stream-ordered allocation path cuts both fragmentation and synchronization overhead. Tie the allocation to the stream that consumes it and let the pool recycle freed blocks rather than going back to the device each time.
HMC users get this fix for free. If you sample with the Hamiltonian Monte Carlo path in oxicuda-bayes, the stability refinements help across step-size and trajectory-length settings — expect tighter posteriors with no code changes.
Consider TriMap when global structure matters. Among the embeddings in oxicuda-manifold, TriMap is a solid choice when you care about preserving the large-scale layout of your data rather than only local neighborhoods.
Pin to 0.1.8 to get the stability fixes. If your work touches HMC sampling or allocation-heavy kernels, this is the version you want.
Keep your feature flags minimal. Defaults are just driver/memory/launch; add blas, fft, solver, and friends only as you use them, and your build stays lean.

This is the foundation

OxiCUDA is the GPU layer beneath the rest of the COOLJAPAN stack. The libraries already shipping on top of it (SciRS2 and NumRS2 for scientific and array computing, OxiBLAS and OxiFFT for dense linear algebra and transforms, ToRSh, TrustformeRS, and OxiLLaMa for deep learning and language models, and OxiONNX for model interchange) all benefit when the layer underneath gets a little more stable and a little more efficient. A maintenance release here is maintenance for everything above it.

Repository: https://github.com/cool-japan/oxicuda

Star the repo if you want a CUDA Toolkit you can build without nvcc, and follow along as the foundation keeps getting steadier.

Pure Rust GPU computing is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ May 21, 2026