
SciRS2 0.6.0 Released — Two GPU Stories, One Decentralized Core

SciRS2 0.6.0 introduces the pure-Rust oxicuda-* CUDA stack as a direct, per-crate NVIDIA performance backend and decentralizes GPU out of scirs2-core. Ten crates — fft, symbolic, interpolate, special, stats, graph, linalg, optimize, datasets, and vision — gain an off-by-default, runtime-probed, f64-native cuda feature standing alongside the existing wgpu/WebGPU portability path, now standardized under one wgpu feature name across the ecosystem. A default build still compiles zero oxicuda. Pure Rust, Apache-2.0.

Tags: release, scirs2, rust, scientific-computing, pure-rust, gpu, cuda, oxicuda, wgpu

SciRS2’s GPU story just split in two — portability in the core, raw NVIDIA performance in every crate that wants it, and not a single line of C in either path.

Today we released SciRS2 0.6.0 — the release where GPU acceleration stops being one abstraction owned by scirs2-core and becomes two deliberate, independent stories owned by the crates that actually need them. 0.5.1 made GPU/CUDA reporting honest; 0.6.0 goes further and gives ten crates a real, direct, NVIDIA-only CUDA backend of their own — the pure-Rust oxicuda-* stack — while the portable wgpu/WebGPU path that already lived in scirs2-core stays exactly where it is, now under a single, consistent feature name across the whole ecosystem.

No C. No Fortran. No CUDA Toolkit, no nvcc, no NVIDIA C++ headers vendored anywhere in the tree, and now no cudarc either. 0.6.0 deletes scirs2-core’s old cudarc-based CUDA backend (gpu/backends/cuda.rs) outright and drops the cudarc dependency, a clean Pure-Rust win. GpuBackend::Cuda survives only as an honest enum tag: its context constructor tells you plainly to use a crate’s own oxicuda-* cuda feature instead of fabricating a context it can’t actually build. Every one of the ten new per-crate cuda paths is built the same way, directly against oxicuda-driver, oxicuda-memory, oxicuda-blas, oxicuda-solver, oxicuda-sparse, oxicuda-ptx, oxicuda-fft, and now oxicuda-dnn. That means pure-Rust PTX generation and CUDA driver JIT compilation, not a wrapped CUDA Toolkit and not a CPU/wgpu simulation wearing a CUDA label.
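To make the "honest enum tag" idea concrete, here is a minimal stdlib-only sketch. Backend, ContextError, Context, and create_context are hypothetical stand-ins invented for illustration; they are not scirs2-core's real types or signatures. The point is only the shape: the CUDA variant still exists as a tag, but asking for a context yields an error that redirects you, never a fabricated context.

```rust
use std::fmt;

/// Hypothetical backend tag for this sketch (not scirs2-core's real enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Backend {
    Cpu,
    Cuda,
}

/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq, Eq)]
enum ContextError {
    BackendNotAvailable(String),
}

impl fmt::Display for ContextError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ContextError::BackendNotAvailable(why) => write!(f, "backend not available: {why}"),
        }
    }
}

#[derive(Debug, PartialEq, Eq)]
struct Context {
    backend: Backend,
}

/// The CUDA tag is accepted as input, but it never produces a context:
/// the caller gets an error pointing at the per-crate `cuda` feature.
fn create_context(backend: Backend) -> Result<Context, ContextError> {
    match backend {
        Backend::Cpu => Ok(Context { backend }),
        Backend::Cuda => Err(ContextError::BackendNotAvailable(
            "the core does not build CUDA contexts; enable a crate's own `cuda` feature"
                .to_string(),
        )),
    }
}

fn main() {
    for backend in [Backend::Cpu, Backend::Cuda] {
        match create_context(backend) {
            Ok(ctx) => println!("built context for {:?}", ctx.backend),
            Err(e) => println!("{backend:?}: {e}"),
        }
    }
}
```

Returning an error from the constructor, rather than removing the variant, keeps existing match statements compiling while making the missing backend impossible to overlook at run time.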

This is a minor release by number, but an architectural one: ten crates now carry their own NVIDIA-only cuda feature, every one of them off by default, so a default build still compiles zero oxicuda — the pure-Rust CUDA stack is there the moment you opt in, invisible the moment you don’t.

Why SciRS2 0.6.0 is a game changer

Centralized GPU abstractions are convenient right up until they become a bottleneck: every crate that wants NVIDIA performance has to route through one shared module, wait for that module’s API surface to grow to fit its needs, and inherit whatever compromises that module made for the crates that came before it. 0.6.0’s answer is architectural, not cosmetic — decentralize. Let each crate own its CUDA story, talk to oxicuda-* directly, and leave the portable wgpu path exactly where it already worked.

Concrete 0.6.0 wins:

Technical Deep Dive: two stories, one runtime probe

Three modules, two stories, one crate. scirs2-linalg’s own module map makes the split concrete: gpu is a self-contained local GPU abstraction with its own GpuContext trait (cuda/opencl/rocm/metal, unrelated to scirs2-core); gpu_linalg is the portability path, built on scirs2_core::gpu’s wgpu/WebGPU layer, f32, with CPU fallback; and the new gpu_cuda is the performance path — NVIDIA-only, f64, real CUDA via oxicuda-*, no fallback of its own. Every function in gpu_cuda degrades safely instead of panicking: cuda_is_available() checks whether oxicuda_driver::init() succeeds and at least one device is visible, and every compute entry point returns a LinalgError::ComputationError rather than aborting when no device is present.
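The probe-then-error contract described above can be sketched without any GPU crate. In the sketch below, DriverProbe stands in for "initialise the driver and count devices"; it, ComputeError, is_available, and device_or_error are hypothetical names for illustration, not the oxicuda or scirs2-linalg API.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ComputeError {
    Computation(String),
}

/// Stand-in for "initialise the driver and report the device count".
type DriverProbe = fn() -> Result<usize, String>;

/// Never panics: any driver failure or a zero device count is simply `false`.
fn is_available(probe: DriverProbe) -> bool {
    matches!(probe(), Ok(n) if n > 0)
}

/// Compute entry points start here and return an error instead of aborting.
fn device_or_error(probe: DriverProbe) -> Result<usize, ComputeError> {
    match probe() {
        Ok(0) => Err(ComputeError::Computation("no CUDA device visible".to_string())),
        Ok(n) => Ok(n),
        Err(e) => Err(ComputeError::Computation(format!("driver init failed: {e}"))),
    }
}

// Three simulated machines for the demo.
fn no_driver() -> Result<usize, String> {
    Err("driver library not found".to_string())
}
fn zero_devices() -> Result<usize, String> {
    Ok(0)
}
fn one_device() -> Result<usize, String> {
    Ok(1)
}

fn main() {
    let machines: [(&str, DriverProbe); 3] = [
        ("no driver", no_driver),
        ("zero devices", zero_devices),
        ("one device", one_device),
    ];
    for (name, probe) in machines {
        println!(
            "{name}: available = {}, context = {:?}",
            is_available(probe),
            device_or_error(probe)
        );
    }
}
```

Both the missing-driver case and the zero-device case collapse to the same two observable outcomes, a false probe and an ordinary error value, which is what lets callers treat "no GPU" as a normal branch.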

cuda_gemm and cuda_solve_spd, concretely. cuda_gemm(a: &ArrayView2<f64>, b: &ArrayView2<f64>) -> LinalgResult<Array2<f64>> validates the inner dimensions, forces both operands into contiguous row-major host slices, uploads them via oxicuda_memory::DeviceBuffer, describes all three matrices as Layout::RowMajor, and dispatches oxicuda_blas::level3::gemm_api::gemm::<f64>. cuda_solve_spd(a: &ArrayView2<f64>, b: &ArrayView1<f64>) -> LinalgResult<Array1<f64>> factors the symmetric positive-definite A in place with oxicuda_solver::dense::cholesky (FillMode::Lower) and then calls cholesky_solve — the same layout pattern already proven in scirs2-interpolate’s cuda_rbf_solve. Neither function bakes in a CPU fallback; that decision is left to the caller, which is exactly the point — see Getting Started below.
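For readers who want to see the arithmetic behind a lower-triangular Cholesky factorization followed by the two triangular solves, here is a small CPU reference on row-major slices. It is a plain-Rust sketch written for this explanation, with hypothetical function names; it is not the oxicuda-solver implementation and makes no claim about how the GPU kernels are organised.

```rust
/// Factor a row-major n x n symmetric positive-definite matrix as A = L * L^T.
/// Returns L (lower triangle, row-major); only the lower triangle of `a` is read.
fn cholesky_lower(a: &[f64], n: usize) -> Result<Vec<f64>, String> {
    if a.len() != n * n {
        return Err(format!("expected {} entries, got {}", n * n, a.len()));
    }
    let mut l = vec![0.0; n * n];
    for i in 0..n {
        for j in 0..=i {
            let mut sum = a[i * n + j];
            for k in 0..j {
                sum -= l[i * n + k] * l[j * n + k];
            }
            if i == j {
                // A non-positive (or NaN) pivot means A is not positive definite.
                if sum.is_nan() || sum <= 0.0 {
                    return Err("matrix is not positive definite".to_string());
                }
                l[i * n + j] = sum.sqrt();
            } else {
                l[i * n + j] = sum / l[j * n + j];
            }
        }
    }
    Ok(l)
}

/// Solve A x = b via A = L L^T: forward-substitute L y = b, then back-substitute L^T x = y.
fn solve_spd(a: &[f64], b: &[f64], n: usize) -> Result<Vec<f64>, String> {
    if b.len() != n {
        return Err(format!("rhs has length {}, expected {}", b.len(), n));
    }
    let l = cholesky_lower(a, n)?;
    let mut y = vec![0.0; n];
    for i in 0..n {
        let mut sum = b[i];
        for k in 0..i {
            sum -= l[i * n + k] * y[k];
        }
        y[i] = sum / l[i * n + i];
    }
    let mut x = vec![0.0; n];
    for i in (0..n).rev() {
        let mut sum = y[i];
        for k in (i + 1)..n {
            sum -= l[k * n + i] * x[k];
        }
        x[i] = sum / l[i * n + i];
    }
    Ok(x)
}

fn main() {
    // The same 3 x 3 matrix and right-hand side used in Getting Started.
    let a = [4.0, 1.0, 0.0, 1.0, 3.0, 1.0, 0.0, 1.0, 2.0];
    let b = [1.0, 2.0, 3.0];
    match solve_spd(&a, &b, 3) {
        Ok(x) => println!("x = {x:?}"), // exact solution is [2/9, 1/9, 13/9]
        Err(e) => eprintln!("solve failed: {e}"),
    }
}
```

A reference like this is mainly useful as a test oracle: it is slow, but it is easy to audit and returns an error for a non-SPD input instead of producing NaNs.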

Honesty, continued. 0.5.1 taught scirs2-core, scirs2-linalg, and scirs2-fft’s GPU paths to report BackendNotAvailable instead of fabricating a context. 0.6.0 extends that same discipline to the new oxicuda paths from day one: cuda_is_available() never panics, the internal context builder returns a real LinalgError::ComputationError the moment device count is zero, and every oxicuda-backed function keeps a CPU source of truth reachable behind its own runtime probe. Nothing in the new performance path is a wgpu computation wearing a CUDA label, and nothing silently downgrades precision or substitutes a wrong answer.
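One way a caller might verify the "no silent precision downgrade" claim on their own hardware is to compare a GPU result against the CPU source of truth at an f64-level tolerance. The helper below is a hypothetical sketch (max_scaled_error and agrees_within are names invented here, and the tolerances are illustrative assumptions, not values from the release); the demo simulates a downgraded result by rounding through f32.

```rust
/// Largest elementwise error, scaled by max(|reference|, 1) so that tiny
/// reference values do not inflate the ratio. Returns None on length mismatch.
fn max_scaled_error(candidate: &[f64], reference: &[f64]) -> Option<f64> {
    if candidate.len() != reference.len() {
        return None;
    }
    let mut worst = 0.0_f64;
    for (c, r) in candidate.iter().zip(reference) {
        let err = (c - r).abs() / r.abs().max(1.0);
        if err.is_nan() {
            // A NaN anywhere counts as total disagreement.
            return Some(f64::INFINITY);
        }
        worst = worst.max(err);
    }
    Some(worst)
}

/// True when every element of `candidate` is within `tol` of `reference`.
fn agrees_within(candidate: &[f64], reference: &[f64], tol: f64) -> bool {
    matches!(max_scaled_error(candidate, reference), Some(e) if e <= tol)
}

fn main() {
    // CPU "source of truth" for the demo.
    let reference = [2.0 / 9.0, 1.0 / 9.0, 13.0 / 9.0];
    // Simulate a result that was silently computed in f32.
    let downgraded: Vec<f64> = reference.iter().map(|&v| v as f32 as f64).collect();

    println!("f32-rounded passes 1e-6:  {}", agrees_within(&downgraded, &reference, 1e-6));
    println!("f32-rounded passes 1e-12: {}", agrees_within(&downgraded, &reference, 1e-12));
}
```

A loose tolerance cannot distinguish f32 from f64 work, so a check meant to catch precision loss needs a tolerance well below single-precision rounding error.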

One feature name, ecosystem-wide. Before 0.6.0, the portable wgpu path had a different feature name in almost every crate that carried it: wgpu_backend in scirs2-core, gpu_wgpu in datasets/stats, wgpu_fft in fft, wgpu_kernels in special, wgpu_rbf in interpolate, and a bare gpu in vision/graph/optimize. 0.6.0 collapses all of them to one name — wgpu — so every crate touched by this release now exposes the same, predictable pair: cuda for NVIDIA performance, wgpu for cross-platform portability. It’s a small rename with an ecosystem-wide blast radius, which is why it’s called out as breaking below.
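If you maintain a migration script or a lint for old manifests, the rename table above fits in a single match. The sketch below is a hypothetical helper, not something shipped with SciRS2, and it assumes each crate is published as scirs2-<name> (the post abbreviates most of them). The mapping is keyed on the crate as well as the feature, because the Tips section notes that some features, such as scirs2-core's gpu umbrella, keep their names.

```rust
/// Returns the 0.6.0 feature name for an old portable-path feature, or None
/// when the (crate, feature) pair is not one of the renames listed above.
/// Crate names assume the `scirs2-<name>` convention.
fn renamed_feature(krate: &str, old: &str) -> Option<&'static str> {
    match (krate, old) {
        ("scirs2-core", "wgpu_backend")
        | ("scirs2-datasets", "gpu_wgpu")
        | ("scirs2-stats", "gpu_wgpu")
        | ("scirs2-fft", "wgpu_fft")
        | ("scirs2-special", "wgpu_kernels")
        | ("scirs2-interpolate", "wgpu_rbf")
        | ("scirs2-vision", "gpu")
        | ("scirs2-graph", "gpu")
        | ("scirs2-optimize", "gpu") => Some("wgpu"),
        _ => None,
    }
}

fn main() {
    let manifest_features = [
        ("scirs2-fft", "wgpu_fft"),
        ("scirs2-vision", "gpu"),
        ("scirs2-core", "gpu"), // umbrella feature: not renamed
    ];
    for (krate, feature) in manifest_features {
        match renamed_feature(krate, feature) {
            Some(new) => println!("{krate}: rename `{feature}` to `{new}`"),
            None => println!("{krate}: keep `{feature}`"),
        }
    }
}
```

Matching on the pair rather than the feature name alone is the important design choice here: a bare gpu is renamed in vision, graph, and optimize but not in scirs2-core.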

Getting Started

Add the crate with the new NVIDIA CUDA path enabled:

cargo add scirs2-linalg --features cuda

cuda is off by default, NVIDIA-only, and entirely additive: it changes nothing for a build that doesn’t request it. The pattern below is the one to copy: call cuda_is_available() (the example only prints it, because the error path already covers the no-device case), try the GPU path, and fall back to the CPU explicitly on any error rather than letting a device-specific failure propagate into code that should run anywhere:

use ndarray::{array, Array1, Array2};
use scirs2_linalg::gpu_cuda::{cuda_gemm, cuda_is_available, cuda_solve_spd};

fn main() {
    // NVIDIA-only, runtime-probed via the CUDA driver itself.
    // cuda_is_available() never panics, so it's safe to call on any machine.
    println!("CUDA device present: {}", cuda_is_available());

    let a: Array2<f64> = array![
        [4.0, 1.0, 0.0],
        [1.0, 3.0, 1.0],
        [0.0, 1.0, 2.0],
    ];
    let b: Array2<f64> = array![[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];

    // f64 GEMM on the GPU when a device is present. With no CUDA device,
    // cuda_gemm returns an honest LinalgError::ComputationError instead of a
    // fabricated or silently-wrong result, so the CPU fallback below is
    // always an explicit branch you write, never a hidden default.
    let c: Array2<f64> = match cuda_gemm(&a.view(), &b.view()) {
        Ok(gpu_result) => gpu_result,
        Err(_) => a.dot(&b),
    };
    println!("A . B =\n{c:?}");

    // Solve the same symmetric positive-definite A via GPU Cholesky
    // factorization plus triangular solve, falling back to the CPU solver
    // on the same honest-error contract.
    let rhs: Array1<f64> = array![1.0, 2.0, 3.0];
    let x = match cuda_solve_spd(&a.view(), &rhs.view()) {
        Ok(gpu_x) => gpu_x,
        Err(_) => match scirs2_linalg::solve(&a.view(), &rhs.view(), None) {
            Ok(cpu_x) => cpu_x,
            Err(e) => {
                eprintln!("CPU solve failed too: {e}");
                return;
            }
        },
    };
    println!("x =\n{x:?}");
}

Every oxicuda-* path in 0.6.0 follows this same shape: a cheap, panic-free probe, an honest error instead of a fabricated result when no device is present, and a CPU path you control explicitly rather than one the library silently substitutes for you.
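If you repeat that match in many places, a small generic helper can keep the fallback explicit while also recording which path ran. The sketch below is one possible shape, using only the standard library; gpu_or_cpu and Ran are hypothetical names, and the closures stand in for real calls such as cuda_gemm and a CPU routine. Unlike the `Err(_)` arms in the example above, it logs the GPU error rather than discarding it.

```rust
use std::fmt::Display;

/// Which branch produced the value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Ran {
    Gpu,
    Cpu,
}

/// Try the GPU closure first; on any error, log it and run the CPU closure.
/// The fallback is still written by the caller, just in one reusable place.
fn gpu_or_cpu<T, E: Display>(
    gpu: impl FnOnce() -> Result<T, E>,
    cpu: impl FnOnce() -> T,
) -> (T, Ran) {
    match gpu() {
        Ok(value) => (value, Ran::Gpu),
        Err(e) => {
            eprintln!("GPU path failed, falling back to CPU: {e}");
            (cpu(), Ran::Cpu)
        }
    }
}

fn main() {
    let a = [1.0, 2.0, 3.0];
    let b = [4.0, 5.0, 6.0];

    // Simulated GPU call that fails as it would on a machine with no device.
    let fake_gpu_dot = || -> Result<f64, String> { Err("no CUDA device visible".to_string()) };
    let cpu_dot = || a.iter().zip(b.iter()).map(|(x, y)| x * y).sum::<f64>();

    let (dot, ran) = gpu_or_cpu(fake_gpu_dot, cpu_dot);
    println!("dot = {dot} (computed on {ran:?})");
}
```

Returning the Ran tag alongside the value makes it easy to assert in tests or metrics that a GPU-equipped machine is really taking the GPU path.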

What’s New in 0.6.0

Added

Changed

Removed

Docs

Tips

  1. Rename your wgpu feature flag: six old names, one new one. The portable WebGPU path is now just wgpu everywhere: wgpu_backend → wgpu in scirs2-core, gpu → wgpu in vision/graph/optimize, gpu_wgpu → wgpu in datasets/stats, wgpu_fft → wgpu in fft, wgpu_kernels → wgpu in special, and wgpu_rbf → wgpu in interpolate. If your Cargo.toml, CI config, or build scripts still reference one of the old names, update it; Cargo will reject an unknown feature name rather than silently ignoring it.
  2. Know what did not rename. scirs2-core’s gpu umbrella feature, array_protocol_wgpu, and scirs2-stats’ gpu core-abstraction passthrough all keep their existing names — don’t touch those. The empty placeholder features scirs2-integrate/gpu_fem and scirs2-interpolate/gpu_kdtree are untouched too; they were never wired to a real dep:wgpu in the first place, so there was nothing to rename.
  3. cuda and wgpu now form one consistent pair per crate. Every crate this release touches exposes both: cuda for direct NVIDIA performance (new, off by default) and wgpu for cross-platform portability (renamed, same behavior as before). Enable either independently, both together, or neither — they don’t conflict, and enabling one never silently pulls in the other.
  4. Turning on cuda costs you nothing without a GPU. It’s a runtime probe, not a hard requirement: cuda_is_available() never panics, and every oxicuda-backed function (cuda_gemm, cuda_solve_spd, cuda_convolve_2d, and the rest) returns a normal error instead of aborting when no NVIDIA device is present. It’s safe to enable in a shared Cargo.toml even if half your fleet has no GPU at all.
  5. A default build still compiles zero oxicuda. All ten new cuda features are additive and off by default, so if you haven’t opted in explicitly, nothing about your build, binary size, or compile time has changed in 0.6.0.
  6. Write your CPU fallback explicitly — the library won’t do it for you. None of the new cuda_* functions fall back to the CPU internally on error; that’s intentional, so you always see and control the branch. Match on the Result, as in Getting Started above, rather than assuming a GPU failure degrades gracefully on its own.

This is the foundation

SciRS2 0.6.0 is the sovereign scientific-computing layer of the COOLJAPAN ecosystem — and a GPU-architecture release matters most precisely because so much is built on top of it:

Every one of these inherits SciRS2’s GPU story along with its gradients, its error types, and its dependency tree — so when scirs2-linalg gets a direct cuda_gemm, or scirs2-optimize gets a direct cuda_hessian_vector_product, the acceleration lands exactly where the computation already lives, with no detour through a shared core module standing between the crate and the hardware. The numeric core still rests on OxiBLAS and OxiFFT; verification on OxiZ; the new NVIDIA performance path on OxiCUDA; compression on OxiARC; storage on OxiSQL and OxiH5. No C. No Fortran. No exceptions in the default build.

Repository: https://github.com/cool-japan/scirs

Star the repo if a GPU story that’s honest about what’s portable and what’s NVIDIA-only — and doesn’t need a single line of C to be fast — is what you’ve been waiting for.

Pure Rust scientific computing — decentralized on GPU, sovereign to the core.

KitaSan at COOLJAPAN OÜ, July 1, 2026
