SciRS2 0.6.0 Released — Two GPU Stories, One Decentralized Core

SciRS2’s GPU story just split in two — portability in the core, raw NVIDIA performance in every crate that wants it, and not a single line of C in either path.

Today we released SciRS2 0.6.0 — the release where GPU acceleration stops being one abstraction owned by scirs2-core and becomes two deliberate, independent stories owned by the crates that actually need them. 0.5.1 made GPU/CUDA reporting honest; 0.6.0 goes further and gives ten crates a real, direct, NVIDIA-only CUDA backend of their own — the pure-Rust oxicuda-* stack — while the portable wgpu/WebGPU path that already lived in scirs2-core stays exactly where it is, now under a single, consistent feature name across the whole ecosystem.

No C. No Fortran. No CUDA Toolkit, no nvcc, no NVIDIA C++ headers vendored anywhere in the tree — and now, no cudarc either. 0.6.0 deletes scirs2-core’s old cudarc-based CUDA backend (gpu/backends/cuda.rs) outright and drops the cudarc dependency, a clean Pure-Rust win: GpuBackend::Cuda survives only as an honest enum tag whose context constructor tells you plainly to go use a crate’s own oxicuda-* cuda feature, instead of fabricating a context it can’t actually build. Every one of the ten new per-crate cuda paths is built the same way — directly against oxicuda-driver, oxicuda-memory, oxicuda-blas, oxicuda-solver, oxicuda-sparse, oxicuda-ptx, oxicuda-fft, and now oxicuda-dnn — pure-Rust PTX generation and CUDA driver JIT compilation, not a wrapped CUDA Toolkit and not a CPU/wgpu simulation wearing a CUDA label.

This is a minor release by number, but an architectural one: ten crates now carry their own NVIDIA-only cuda feature, every one of them off by default, so a default build still compiles zero oxicuda — the pure-Rust CUDA stack is there the moment you opt in, invisible the moment you don’t.

Why SciRS2 0.6.0 is a game changer

Centralized GPU abstractions are convenient right up until they become a bottleneck: every crate that wants NVIDIA performance has to route through one shared module, wait for that module’s API surface to grow to fit its needs, and inherit whatever compromises that module made for the crates that came before it. 0.6.0’s answer is architectural, not cosmetic — decentralize. Let each crate own its CUDA story, talk to oxicuda-* directly, and leave the portable wgpu path exactly where it already worked.

Concrete 0.6.0 wins:

Ten crates get a direct oxicuda cuda feature, off by default. scirs2-fft (1D C2C FFT via oxicuda-fft), scirs2-symbolic (LoweredOp-to-PTX custom-kernel batch evaluation via oxicuda-ptx), scirs2-interpolate (RBF GEMM + Cholesky solve via oxicuda-blas + oxicuda-solver), scirs2-special (custom erf PTX kernel), scirs2-stats (normal PDF/CDF PTX kernels), scirs2-graph (CSR sparse matrix-vector multiply via oxicuda-sparse), scirs2-linalg (cuda_gemm + cuda_solve_spd via oxicuda-blas + oxicuda-solver), scirs2-optimize (cuda_hessian_vector_product via GEMV), scirs2-datasets (cuda_regression_target via GEMV), and scirs2-vision (cuda_convolve_2d via oxicuda-dnn).
f64-native, not f32. Every oxicuda path is double precision — the concrete advantage a real CUDA backend has over the existing f32-only wgpu/WebGPU path.
A default build compiles zero oxicuda. All ten cuda features are additive and off by default; nothing changes for a build that never opts in.
GPU decentralized. scirs2-core is no longer the GPU aggregation hub — each crate now owns its CUDA story directly instead of routing through core.
Policy caught up with the architecture. SCIRS2_POLICY.md’s GPU Operations Policy is rewritten to the two-stories model: wgpu/WebGPU for portability (retained in core), oxicuda-* for NVIDIA performance (per-crate, never through core).

Technical Deep Dive: two stories, one runtime probe

Three modules, two stories, one crate. scirs2-linalg’s own module map makes the split concrete: gpu is a self-contained local GPU abstraction with its own GpuContext trait (cuda/opencl/rocm/metal, unrelated to scirs2-core); gpu_linalg is the portability path, built on scirs2_core::gpu’s wgpu/WebGPU layer, f32, with CPU fallback; and the new gpu_cuda is the performance path — NVIDIA-only, f64, real CUDA via oxicuda-*, no fallback of its own. Every function in gpu_cuda degrades safely instead of panicking: cuda_is_available() checks whether oxicuda_driver::init() succeeds and at least one device is visible, and every compute entry point returns a LinalgError::ComputationError rather than aborting when no device is present.

cuda_gemm and cuda_solve_spd, concretely. cuda_gemm(a: &ArrayView2<f64>, b: &ArrayView2<f64>) -> LinalgResult<Array2<f64>> validates the inner dimensions, forces both operands into contiguous row-major host slices, uploads them via oxicuda_memory::DeviceBuffer, describes all three matrices as Layout::RowMajor, and dispatches oxicuda_blas::level3::gemm_api::gemm::<f64>. cuda_solve_spd(a: &ArrayView2<f64>, b: &ArrayView1<f64>) -> LinalgResult<Array1<f64>> factors the symmetric positive-definite A in place with oxicuda_solver::dense::cholesky (FillMode::Lower) and then calls cholesky_solve — the same layout pattern already proven in scirs2-interpolate’s cuda_rbf_solve. Neither function bakes in a CPU fallback; that decision is left to the caller, which is exactly the point — see Getting Started below.

Honesty, continued. 0.5.1 taught scirs2-core, scirs2-linalg, and scirs2-fft’s GPU paths to report BackendNotAvailable instead of fabricating a context. 0.6.0 extends that same discipline to the new oxicuda paths from day one: cuda_is_available() never panics, the internal context builder returns a real LinalgError::ComputationError the moment device count is zero, and every oxicuda-backed function keeps a CPU source of truth reachable behind its own runtime probe. Nothing in the new performance path is a wgpu computation wearing a CUDA label, and nothing silently downgrades precision or substitutes a wrong answer.

One feature name, ecosystem-wide. Before 0.6.0, the portable wgpu path had a different feature name in almost every crate that carried it: wgpu_backend in scirs2-core, gpu_wgpu in datasets/stats, wgpu_fft in fft, wgpu_kernels in special, wgpu_rbf in interpolate, and a bare gpu in vision/graph/optimize. 0.6.0 collapses all of them to one name — wgpu — so every crate touched by this release now exposes the same, predictable pair: cuda for NVIDIA performance, wgpu for cross-platform portability. It’s a small rename with an ecosystem-wide blast radius, which is why it’s called out as breaking below.

Getting Started

Add the crate with the new NVIDIA CUDA path enabled:

cargo add scirs2-linalg --features cuda

cuda is off by default, NVIDIA-only, and entirely additive — it changes nothing for a build that doesn’t request it. The pattern below is the one to copy: probe with cuda_is_available(), try the GPU path, and fall back to the CPU explicitly on any error rather than letting a device-specific failure propagate into code that should run anywhere:

use ndarray::{array, Array1, Array2};
use scirs2_linalg::gpu_cuda::{cuda_gemm, cuda_is_available, cuda_solve_spd};

fn main() {
    // NVIDIA-only, runtime-probed via the CUDA driver itself.
    // cuda_is_available() never panics, so it's safe to call on any machine.
    println!("CUDA device present: {}", cuda_is_available());

    let a: Array2<f64> = array![
        [4.0, 1.0, 0.0],
        [1.0, 3.0, 1.0],
        [0.0, 1.0, 2.0],
    ];
    let b: Array2<f64> = array![[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];

    // f64 GEMM on the GPU when a device is present. With no CUDA device,
    // cuda_gemm returns an honest LinalgError::ComputationError instead of a
    // fabricated or silently-wrong result, so the CPU fallback below is
    // always an explicit branch you write, never a hidden default.
    let c: Array2<f64> = match cuda_gemm(&a.view(), &b.view()) {
        Ok(gpu_result) => gpu_result,
        Err(_) => a.dot(&b),
    };
    println!("A . B =\n{c:?}");

    // Solve the same symmetric positive-definite A via GPU Cholesky
    // factorization plus triangular solve, falling back to the CPU solver
    // on the same honest-error contract.
    let rhs: Array1<f64> = array![1.0, 2.0, 3.0];
    let x = match cuda_solve_spd(&a.view(), &rhs.view()) {
        Ok(gpu_x) => gpu_x,
        Err(_) => match scirs2_linalg::solve(&a.view(), &rhs.view(), None) {
            Ok(cpu_x) => cpu_x,
            Err(e) => {
                eprintln!("CPU solve failed too: {e}");
                return;
            }
        },
    };
    println!("x =\n{x:?}");
}

Every oxicuda-* path in 0.6.0 follows this same shape: a cheap, panic-free probe, an honest error instead of a fabricated result when no device is present, and a CPU path you control explicitly rather than one the library silently substitutes for you.

What’s New in 0.6.0

Added

oxicuda-* CUDA backends across ten crates — a new off-by-default, NVIDIA-only, runtime-probed cuda feature and gpu_cuda module in each, every path f64-native, a default build compiling zero oxicuda: scirs2-fft (oxicuda-fft, 1D C2C FFT with 1/N inverse normalization), scirs2-symbolic (oxicuda-ptx, LoweredOp-to-PTX custom-kernel f64 batch evaluation), scirs2-interpolate (oxicuda-blas + oxicuda-solver, RBF GEMM plus Cholesky solve), scirs2-special (oxicuda-ptx, custom erf PTX kernel), scirs2-stats (oxicuda-ptx, normal PDF/CDF PTX kernels), scirs2-graph (oxicuda-sparse, CSR SpMV), scirs2-linalg (oxicuda-blas + oxicuda-solver, cuda_gemm + cuda_solve_spd), scirs2-optimize (oxicuda-blas, cuda_hessian_vector_product via GEMV), scirs2-datasets (oxicuda-blas, cuda_regression_target via GEMV), scirs2-vision (oxicuda-dnn, cuda_convolve_2d).
Workspace: oxicuda-dnn added to [workspace.dependencies], joining the existing oxicuda-driver / -memory / -fft / -ptx / -launch / -blas / -solver / -sparse path dependencies.

Changed

BREAKING (GPU features): every real (dep:wgpu) wgpu/WebGPU portability feature standardized under a single name, wgpu — scirs2-core‘s wgpu_backend, the per-crate gpu in vision/graph/optimize, gpu_wgpu in datasets/stats, wgpu_fft in fft, wgpu_kernels in special, and wgpu_rbf in interpolate all become wgpu. Every touched crate now exposes a consistent cuda (oxicuda) + wgpu (portable) pair. Unchanged: core’s gpu umbrella feature, array_protocol_wgpu, stats’ gpu core-abstraction passthrough, and the empty placeholder flags scirs2-integrate/gpu_fem and scirs2-interpolate/gpu_kdtree.
GPU decentralized: scirs2-core is no longer the GPU aggregation hub — each crate now owns its CUDA story through a direct oxicuda-* dependency instead of routing through core.
GPU policy: SCIRS2_POLICY.md’s GPU Operations Policy rewritten to the two-stories model — wgpu/WebGPU for portability (retained in core), oxicuda-* for NVIDIA performance (per-crate, never through core).

Removed

scirs2-core: retired core’s own cudarc-based CUDA backend — deleted gpu/backends/cuda.rs and dropped the cudarc dependency (a Pure-Rust win); GpuBackend::Cuda now survives only as an enum tag whose context constructor returns an honest error directing users to the per-crate oxicuda-* cuda features, while the portable wgpu GpuNdarray/WebGPUContext is retained.
scirs2-core: removed the now-unused cuda and dead array_protocol_cuda features.
scirs2-cluster: removed dead GPU feature aliases (opencl/metal/oneapi).

Docs

scirs2-spatial: corrected GPU documentation to reflect reality — its gpu_accel path is a CPU-SIMD fallback, not a GPU backend.

Tips

Rename your wgpu feature flag — six old names, one new one. The portable WebGPU path is now just wgpu everywhere: wgpu_backend → wgpu in scirs2-core, gpu → wgpu in vision/graph/optimize, gpu_wgpu → wgpu in datasets/stats, wgpu_fft → wgpu in fft, wgpu_kernels → wgpu in special, and wgpu_rbf → wgpu in interpolate. If your Cargo.toml, CI config, or build scripts still reference one of the old names, update it — Cargo will reject an unknown feature name rather than silently ignoring it.
Know what did not rename. scirs2-core’s gpu umbrella feature, array_protocol_wgpu, and scirs2-stats’ gpu core-abstraction passthrough all keep their existing names — don’t touch those. The empty placeholder features scirs2-integrate/gpu_fem and scirs2-interpolate/gpu_kdtree are untouched too; they were never wired to a real dep:wgpu in the first place, so there was nothing to rename.
cuda and wgpu now form one consistent pair per crate. Every crate this release touches exposes both: cuda for direct NVIDIA performance (new, off by default) and wgpu for cross-platform portability (renamed, same behavior as before). Enable either independently, both together, or neither — they don’t conflict, and enabling one never silently pulls in the other.
Turning on cuda costs you nothing without a GPU. It’s a runtime probe, not a hard requirement: cuda_is_available() never panics, and every oxicuda-backed function (cuda_gemm, cuda_solve_spd, cuda_convolve_2d, and the rest) returns a normal error instead of aborting when no NVIDIA device is present. It’s safe to enable in a shared Cargo.toml even if half your fleet has no GPU at all.
A default build still compiles zero oxicuda. All ten new cuda features are additive and off by default, so if you haven’t opted in explicitly, nothing about your build, binary size, or compile time has changed in 0.6.0.
Write your CPU fallback explicitly — the library won’t do it for you. None of the new cuda_* functions fall back to the CPU internally on error; that’s intentional, so you always see and control the branch. Match on the Result, as in Getting Started above, rather than assuming a GPU failure degrades gracefully on its own.

This is the foundation

SciRS2 0.6.0 is the sovereign scientific-computing layer of the COOLJAPAN ecosystem — and a GPU-architecture release matters most precisely because so much is built on top of it:

NumRS2 — NumPy-compatible N-dimensional arrays in pure Rust.
PandRS — Pandas-compatible DataFrames.
OptiRS — advanced ML optimizers extending SciRS2.
ToRSh — a PyTorch-compatible deep-learning framework.
SkleaRS — a scikit-learn-compatible ML library.
TrustformeRS — Hugging Face Transformers in pure Rust.

Every one of these inherits SciRS2’s GPU story along with its gradients, its error types, and its dependency tree — so when scirs2-linalg gets a direct cuda_gemm, or scirs2-optimize gets a direct cuda_hessian_vector_product, the acceleration lands exactly where the computation already lives, with no detour through a shared core module standing between the crate and the hardware. The numeric core still rests on OxiBLAS and OxiFFT; verification on OxiZ; the new NVIDIA performance path on OxiCUDA; compression on OxiARC; storage on OxiSQL and OxiH5. No C. No Fortran. No exceptions in the default build.

Repository: https://github.com/cool-japan/scirs

Star the repo if a GPU story that’s honest about what’s portable and what’s NVIDIA-only — and doesn’t need a single line of C to be fast — is what you’ve been waiting for.

Pure Rust scientific computing — decentralized on GPU, sovereign to the core.

— KitaSan at COOLJAPAN OÜ July 1, 2026