SciRS2’s GPU story just split in two — portability in the core, raw NVIDIA performance in every crate that wants it, and not a single line of C in either path.
Today we released SciRS2 0.6.0 — the release where GPU acceleration stops being one abstraction owned by scirs2-core and becomes two deliberate, independent stories owned by the crates that actually need them. 0.5.1 made GPU/CUDA reporting honest; 0.6.0 goes further and gives ten crates a real, direct, NVIDIA-only CUDA backend of their own — the pure-Rust oxicuda-* stack — while the portable wgpu/WebGPU path that already lived in scirs2-core stays exactly where it is, now under a single, consistent feature name across the whole ecosystem.
No C. No Fortran. No CUDA Toolkit, no nvcc, no NVIDIA C++ headers vendored anywhere in the tree — and now, no cudarc either. 0.6.0 deletes scirs2-core’s old cudarc-based CUDA backend (gpu/backends/cuda.rs) outright and drops the cudarc dependency, a clean Pure-Rust win: GpuBackend::Cuda survives only as an honest enum tag whose context constructor tells you plainly to go use a crate’s own oxicuda-* cuda feature, instead of fabricating a context it can’t actually build. Every one of the ten new per-crate cuda paths is built the same way — directly against oxicuda-driver, oxicuda-memory, oxicuda-blas, oxicuda-solver, oxicuda-sparse, oxicuda-ptx, oxicuda-fft, and now oxicuda-dnn — pure-Rust PTX generation and CUDA driver JIT compilation, not a wrapped CUDA Toolkit and not a CPU/wgpu simulation wearing a CUDA label.
This is a minor release by number, but an architectural one: ten crates now carry their own NVIDIA-only cuda feature, every one of them off by default, so a default build still compiles zero oxicuda — the pure-Rust CUDA stack is there the moment you opt in, invisible the moment you don’t.
Why SciRS2 0.6.0 is a game changer
Centralized GPU abstractions are convenient right up until they become a bottleneck: every crate that wants NVIDIA performance has to route through one shared module, wait for that module’s API surface to grow to fit its needs, and inherit whatever compromises that module made for the crates that came before it. 0.6.0’s answer is architectural, not cosmetic — decentralize. Let each crate own its CUDA story, talk to oxicuda-* directly, and leave the portable wgpu path exactly where it already worked.
Concrete 0.6.0 wins:
- Ten crates get a direct oxicuda cuda feature, off by default. scirs2-fft (1D C2C FFT via oxicuda-fft), scirs2-symbolic (LoweredOp-to-PTX custom-kernel batch evaluation via oxicuda-ptx), scirs2-interpolate (RBF GEMM + Cholesky solve via oxicuda-blas + oxicuda-solver), scirs2-special (custom erf PTX kernel), scirs2-stats (normal PDF/CDF PTX kernels), scirs2-graph (CSR sparse matrix-vector multiply via oxicuda-sparse), scirs2-linalg (cuda_gemm + cuda_solve_spd via oxicuda-blas + oxicuda-solver), scirs2-optimize (cuda_hessian_vector_product via GEMV), scirs2-datasets (cuda_regression_target via GEMV), and scirs2-vision (cuda_convolve_2d via oxicuda-dnn).
- f64-native, not f32. Every oxicuda path is double precision: the concrete advantage a real CUDA backend has over the existing f32-only wgpu/WebGPU path.
- A default build compiles zero oxicuda. All ten cuda features are additive and off by default; nothing changes for a build that never opts in.
- GPU decentralized. scirs2-core is no longer the GPU aggregation hub; each crate now owns its CUDA story directly instead of routing through core.
- Policy caught up with the architecture. SCIRS2_POLICY.md’s GPU Operations Policy is rewritten to the two-stories model: wgpu/WebGPU for portability (retained in core), oxicuda-* for NVIDIA performance (per-crate, never through core).
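To make the f64-native point concrete, here is a small, standard-library-only illustration of how much faster rounding error accumulates in f32 than in f64. It is not a benchmark of either GPU backend and does not touch any SciRS2 API; the helper names `sum_f32` and `sum_f64` are invented for this sketch.

```rust
/// Naively adds `value` to an accumulator `n` times in single precision.
fn sum_f32(value: f32, n: usize) -> f32 {
    let mut acc = 0.0_f32;
    for _ in 0..n {
        acc += value;
    }
    acc
}

/// Same loop in double precision.
fn sum_f64(value: f64, n: usize) -> f64 {
    let mut acc = 0.0_f64;
    for _ in 0..n {
        acc += value;
    }
    acc
}

fn main() {
    let n = 1_000_000;
    let exact = 100_000.0_f64; // 0.1 * 1_000_000 in exact arithmetic

    let err32 = (sum_f32(0.1, n) as f64 - exact).abs();
    let err64 = (sum_f64(0.1, n) - exact).abs();

    // The f32 error is many orders of magnitude larger than the f64 error.
    println!("f32 absolute error: {err32:e}");
    println!("f64 absolute error: {err64:e}");
}
```

Long reductions such as GEMM inner products and Cholesky updates accumulate in the same way, which is why a double-precision path matters for scientific workloads.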
Technical Deep Dive: two stories, one runtime probe
Three modules, two stories, one crate. scirs2-linalg’s own module map makes the split concrete: gpu is a self-contained local GPU abstraction with its own GpuContext trait (cuda/opencl/rocm/metal, unrelated to scirs2-core); gpu_linalg is the portability path, built on scirs2_core::gpu’s wgpu/WebGPU layer, f32, with CPU fallback; and the new gpu_cuda is the performance path — NVIDIA-only, f64, real CUDA via oxicuda-*, no fallback of its own. Every function in gpu_cuda degrades safely instead of panicking: cuda_is_available() checks whether oxicuda_driver::init() succeeds and at least one device is visible, and every compute entry point returns a LinalgError::ComputationError rather than aborting when no device is present.
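The probe-and-error contract described above can be sketched without any GPU at all. The `Driver` trait, `Context`, `GpuError`, and the two stub drivers below are hypothetical stand-ins invented for this illustration; they are not the oxicuda-driver API or SciRS2's real error types. The sketch only shows the shape: a probe that maps every failure to `false`, and a context builder that returns an error rather than a fabricated context.

```rust
/// Hypothetical driver interface, used only for this sketch.
trait Driver {
    fn init(&self) -> Result<(), String>;
    fn device_count(&self) -> Result<usize, String>;
}

#[derive(Debug, PartialEq)]
struct Context {
    device_index: usize,
}

#[derive(Debug, PartialEq)]
enum GpuError {
    Computation(String),
}

/// Never panics: any driver failure simply means "not available".
fn is_available(driver: &dyn Driver) -> bool {
    driver.init().is_ok() && matches!(driver.device_count(), Ok(n) if n > 0)
}

/// Builds a context on device 0, or reports why it cannot.
fn build_context(driver: &dyn Driver) -> Result<Context, GpuError> {
    driver
        .init()
        .map_err(|e| GpuError::Computation(format!("driver init failed: {e}")))?;
    let count = driver
        .device_count()
        .map_err(|e| GpuError::Computation(format!("device query failed: {e}")))?;
    if count == 0 {
        return Err(GpuError::Computation("no CUDA device visible".to_string()));
    }
    Ok(Context { device_index: 0 })
}

/// Stub: the driver library itself is missing.
struct MissingDriver;
impl Driver for MissingDriver {
    fn init(&self) -> Result<(), String> {
        Err("driver library not found".to_string())
    }
    fn device_count(&self) -> Result<usize, String> {
        Ok(0)
    }
}

/// Stub: the driver loads and reports a fixed number of devices.
struct FixedDevices(usize);
impl Driver for FixedDevices {
    fn init(&self) -> Result<(), String> {
        Ok(())
    }
    fn device_count(&self) -> Result<usize, String> {
        Ok(self.0)
    }
}

fn main() {
    println!("missing driver: {}", is_available(&MissingDriver));
    println!("zero devices:   {}", is_available(&FixedDevices(0)));
    println!("one device:     {}", is_available(&FixedDevices(1)));
    println!("{:?}", build_context(&FixedDevices(0)));
    println!("{:?}", build_context(&FixedDevices(1)));
}
```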
cuda_gemm and cuda_solve_spd, concretely. cuda_gemm(a: &ArrayView2<f64>, b: &ArrayView2<f64>) -> LinalgResult<Array2<f64>> validates the inner dimensions, forces both operands into contiguous row-major host slices, uploads them via oxicuda_memory::DeviceBuffer, describes all three matrices as Layout::RowMajor, and dispatches oxicuda_blas::level3::gemm_api::gemm::<f64>. cuda_solve_spd(a: &ArrayView2<f64>, b: &ArrayView1<f64>) -> LinalgResult<Array1<f64>> factors the symmetric positive-definite A in place with oxicuda_solver::dense::cholesky (FillMode::Lower) and then calls cholesky_solve — the same layout pattern already proven in scirs2-interpolate’s cuda_rbf_solve. Neither function bakes in a CPU fallback; that decision is left to the caller, which is exactly the point — see Getting Started below.
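For readers who want a CPU reference to compare GPU results against, the two operations can be sketched in plain Rust on row-major slices. This is a textbook implementation written for illustration, not the code inside scirs2-linalg or oxicuda; `gemm_row_major` and `solve_spd` are invented names. It mirrors the steps named above: validate the inner dimensions, multiply row-major operands, factor the lower triangle as A = L Lᵀ, then solve by forward and back substitution.

```rust
/// Row-major GEMM: C (m x n) = A (m x k) * B (k x n).
fn gemm_row_major(
    a: &[f64],
    a_shape: (usize, usize),
    b: &[f64],
    b_shape: (usize, usize),
) -> Result<Vec<f64>, String> {
    let (m, k) = a_shape;
    let (k2, n) = b_shape;
    if k != k2 {
        return Err(format!("inner dimensions differ: {k} vs {k2}"));
    }
    if a.len() != m * k || b.len() != k * n {
        return Err("slice length does not match shape".to_string());
    }
    let mut c = vec![0.0; m * n];
    for i in 0..m {
        for p in 0..k {
            let aip = a[i * k + p];
            for j in 0..n {
                c[i * n + j] += aip * b[p * n + j];
            }
        }
    }
    Ok(c)
}

/// Solves A x = b for a symmetric positive-definite A (n x n, row-major).
/// Only the lower triangle of A is read.
fn solve_spd(a: &[f64], n: usize, b: &[f64]) -> Result<Vec<f64>, String> {
    if a.len() != n * n || b.len() != n {
        return Err("shape mismatch".to_string());
    }
    // Cholesky factorization, column by column: A = L * L^T.
    let mut l = vec![0.0; n * n];
    for j in 0..n {
        let mut d = a[j * n + j];
        for p in 0..j {
            d -= l[j * n + p] * l[j * n + p];
        }
        if d.is_nan() || d <= 0.0 {
            return Err("matrix is not positive definite".to_string());
        }
        let ljj = d.sqrt();
        l[j * n + j] = ljj;
        for i in (j + 1)..n {
            let mut s = a[i * n + j];
            for p in 0..j {
                s -= l[i * n + p] * l[j * n + p];
            }
            l[i * n + j] = s / ljj;
        }
    }
    // Forward substitution: L y = b.
    let mut y = vec![0.0; n];
    for i in 0..n {
        let mut s = b[i];
        for p in 0..i {
            s -= l[i * n + p] * y[p];
        }
        y[i] = s / l[i * n + i];
    }
    // Back substitution: L^T x = y.
    let mut x = vec![0.0; n];
    for i in (0..n).rev() {
        let mut s = y[i];
        for p in (i + 1)..n {
            s -= l[p * n + i] * x[p];
        }
        x[i] = s / l[i * n + i];
    }
    Ok(x)
}

fn main() {
    // Same A, B, and right-hand side as the Getting Started example.
    let a = [4.0, 1.0, 0.0, 1.0, 3.0, 1.0, 0.0, 1.0, 2.0];
    let b = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0];
    println!("A . B = {:?}", gemm_row_major(&a, (3, 3), &b, (3, 2)));
    println!("x     = {:?}", solve_spd(&a, 3, &[1.0, 2.0, 3.0]));
}
```

For this A and right-hand side [1, 2, 3], the exact solution is x = [2/9, 1/9, 13/9], which gives a convenient value to check a GPU result against within a floating-point tolerance.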
Honesty, continued. 0.5.1 taught the GPU paths in scirs2-core, scirs2-linalg, and scirs2-fft to report BackendNotAvailable instead of fabricating a context. 0.6.0 extends that same discipline to the new oxicuda paths from day one: cuda_is_available() never panics, the internal context builder returns a real LinalgError::ComputationError the moment device count is zero, and every oxicuda-backed function sits behind its own runtime probe, so the CPU implementation stays reachable as the source of truth for callers who fall back to it. Nothing in the new performance path is a wgpu computation wearing a CUDA label, and nothing silently downgrades precision or substitutes a wrong answer.
One feature name, ecosystem-wide. Before 0.6.0, the portable wgpu path had a different feature name in almost every crate that carried it: wgpu_backend in scirs2-core, gpu_wgpu in datasets/stats, wgpu_fft in fft, wgpu_kernels in special, wgpu_rbf in interpolate, and a bare gpu in vision/graph/optimize. 0.6.0 collapses all of them to one name — wgpu — so every crate touched by this release now exposes the same, predictable pair: cuda for NVIDIA performance, wgpu for cross-platform portability. It’s a small rename with an ecosystem-wide blast radius, which is why it’s called out as breaking below.
Getting Started
Add the crate with the new NVIDIA CUDA path enabled:
```sh
cargo add scirs2-linalg --features cuda
```
cuda is off by default, NVIDIA-only, and entirely additive — it changes nothing for a build that doesn’t request it. The pattern below is the one to copy: probe with cuda_is_available(), try the GPU path, and fall back to the CPU explicitly on any error rather than letting a device-specific failure propagate into code that should run anywhere:
```rust
use ndarray::{array, Array1, Array2};
use scirs2_linalg::gpu_cuda::{cuda_gemm, cuda_is_available, cuda_solve_spd};

fn main() {
    // NVIDIA-only, runtime-probed via the CUDA driver itself.
    // cuda_is_available() never panics, so it's safe to call on any machine.
    println!("CUDA device present: {}", cuda_is_available());

    let a: Array2<f64> = array![
        [4.0, 1.0, 0.0],
        [1.0, 3.0, 1.0],
        [0.0, 1.0, 2.0],
    ];
    let b: Array2<f64> = array![[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]];

    // f64 GEMM on the GPU when a device is present. With no CUDA device,
    // cuda_gemm returns an honest LinalgError::ComputationError instead of a
    // fabricated or silently-wrong result, so the CPU fallback below is
    // always an explicit branch you write, never a hidden default.
    let c: Array2<f64> = match cuda_gemm(&a.view(), &b.view()) {
        Ok(gpu_result) => gpu_result,
        Err(_) => a.dot(&b),
    };
    println!("A . B =\n{c:?}");

    // Solve the same symmetric positive-definite A via GPU Cholesky
    // factorization plus triangular solve, falling back to the CPU solver
    // on the same honest-error contract.
    let rhs: Array1<f64> = array![1.0, 2.0, 3.0];
    let x = match cuda_solve_spd(&a.view(), &rhs.view()) {
        Ok(gpu_x) => gpu_x,
        Err(_) => match scirs2_linalg::solve(&a.view(), &rhs.view(), None) {
            Ok(cpu_x) => cpu_x,
            Err(e) => {
                eprintln!("CPU solve failed too: {e}");
                return;
            }
        },
    };
    println!("x =\n{x:?}");
}
```
Every oxicuda-* path in 0.6.0 follows this same shape: a cheap, panic-free probe, an honest error instead of a fabricated result when no device is present, and a CPU path you control explicitly rather than one the library silently substitutes for you.
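If the same match-and-fall-back branch shows up in many call sites, it can be factored into a small helper. The sketch below is one possible shape, not something SciRS2 ships: `gpu_or_cpu`, `Path`, `fake_gpu_dot`, and `cpu_dot` are all invented for this example, and `fake_gpu_dot` merely simulates a cuda_* entry point on a machine with no device. Unlike a bare `Err(_)` arm, the helper reports the GPU error and tells the caller which path produced the result.

```rust
use std::fmt::Display;

/// Which path produced the result.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Path {
    Gpu,
    Cpu,
}

/// Tries `gpu` first; on error, reports it on stderr and runs `cpu` instead.
fn gpu_or_cpu<T, E: Display>(
    gpu: impl FnOnce() -> Result<T, E>,
    cpu: impl FnOnce() -> T,
) -> (T, Path) {
    match gpu() {
        Ok(value) => (value, Path::Gpu),
        Err(err) => {
            eprintln!("GPU path unavailable, using CPU: {err}");
            (cpu(), Path::Cpu)
        }
    }
}

/// Hypothetical stand-in for a GPU entry point with no device present.
fn fake_gpu_dot(_x: &[f64], _y: &[f64]) -> Result<f64, String> {
    Err("no CUDA device visible".to_string())
}

/// CPU reference implementation of the same operation.
fn cpu_dot(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

fn main() {
    let x = [1.0, 2.0, 3.0];
    let y = [4.0, 5.0, 6.0];
    let (value, path) = gpu_or_cpu(|| fake_gpu_dot(&x, &y), || cpu_dot(&x, &y));
    println!("dot = {value} via {path:?}");
}
```

This version assumes the CPU path cannot fail. When it can, as with the CPU solver in the example above, the second closure would return a `Result` as well and the helper would propagate that error.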
What’s New in 0.6.0
Added
- oxicuda-* CUDA backends across ten crates: a new off-by-default, NVIDIA-only, runtime-probed cuda feature and gpu_cuda module in each, every path f64-native, with a default build compiling zero oxicuda. The ten: scirs2-fft (oxicuda-fft, 1D C2C FFT with 1/N inverse normalization), scirs2-symbolic (oxicuda-ptx, LoweredOp-to-PTX custom-kernel f64 batch evaluation), scirs2-interpolate (oxicuda-blas + oxicuda-solver, RBF GEMM plus Cholesky solve), scirs2-special (oxicuda-ptx, custom erf PTX kernel), scirs2-stats (oxicuda-ptx, normal PDF/CDF PTX kernels), scirs2-graph (oxicuda-sparse, CSR SpMV), scirs2-linalg (oxicuda-blas + oxicuda-solver, cuda_gemm + cuda_solve_spd), scirs2-optimize (oxicuda-blas, cuda_hessian_vector_product via GEMV), scirs2-datasets (oxicuda-blas, cuda_regression_target via GEMV), scirs2-vision (oxicuda-dnn, cuda_convolve_2d).
- Workspace: oxicuda-dnn added to [workspace.dependencies], joining the existing oxicuda-driver/-memory/-fft/-ptx/-launch/-blas/-solver/-sparse path dependencies.
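The "1/N inverse normalization" noted for the FFT path refers to a common convention, which a naive CPU DFT makes easy to see. The sketch below assumes the forward transform is unscaled and only the inverse divides by N; the forward half of that convention is an assumption here, since the notes above only state the inverse scaling. The `dft` function and `Complex` alias are written for this illustration and are unrelated to the oxicuda-fft implementation.

```rust
use std::f64::consts::PI;

/// (real, imaginary) pair; a minimal stand-in for a complex number type.
type Complex = (f64, f64);

/// Naive O(N^2) DFT.
/// `inverse = false`: exponent sign -1, no scaling.
/// `inverse = true`:  exponent sign +1, result scaled by 1/N.
fn dft(input: &[Complex], inverse: bool) -> Vec<Complex> {
    let n = input.len();
    let sign = if inverse { 1.0 } else { -1.0 };
    let scale = if inverse { 1.0 / n as f64 } else { 1.0 };
    (0..n)
        .map(|k| {
            let mut re = 0.0;
            let mut im = 0.0;
            for (j, &(xr, xi)) in input.iter().enumerate() {
                let angle = sign * 2.0 * PI * ((j * k) % n) as f64 / n as f64;
                let (s, c) = angle.sin_cos();
                // (xr + i*xi) * (c + i*s)
                re += xr * c - xi * s;
                im += xr * s + xi * c;
            }
            (re * scale, im * scale)
        })
        .collect()
}

fn main() {
    let signal: Vec<Complex> = vec![(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)];
    let spectrum = dft(&signal, false);
    let round_trip = dft(&spectrum, true);
    println!("spectrum:   {spectrum:?}");
    println!("round trip: {round_trip:?}");
}
```

With this convention, a forward transform followed by an inverse returns the original signal up to rounding. Dropping the 1/N factor would return the signal multiplied by N instead.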
Changed
- BREAKING (GPU features): every real (dep:wgpu) wgpu/WebGPU portability feature standardized under a single name, wgpu. scirs2-core’s wgpu_backend, the per-crate gpu in vision/graph/optimize, gpu_wgpu in datasets/stats, wgpu_fft in fft, wgpu_kernels in special, and wgpu_rbf in interpolate all become wgpu. Every touched crate now exposes a consistent cuda (oxicuda) + wgpu (portable) pair. Unchanged: core’s gpu umbrella feature, array_protocol_wgpu, stats’ gpu core-abstraction passthrough, and the empty placeholder flags scirs2-integrate/gpu_fem and scirs2-interpolate/gpu_kdtree.
- GPU decentralized: scirs2-core is no longer the GPU aggregation hub; each crate now owns its CUDA story through a direct oxicuda-* dependency instead of routing through core.
- GPU policy: SCIRS2_POLICY.md’s GPU Operations Policy rewritten to the two-stories model: wgpu/WebGPU for portability (retained in core), oxicuda-* for NVIDIA performance (per-crate, never through core).
Removed
- scirs2-core: retired core’s own cudarc-based CUDA backend: deleted gpu/backends/cuda.rs and dropped the cudarc dependency (a Pure-Rust win); GpuBackend::Cuda now survives only as an enum tag whose context constructor returns an honest error directing users to the per-crate oxicuda-* cuda features, while the portable wgpu GpuNdarray/WebGPUContext is retained.
- scirs2-core: removed the now-unused cuda and dead array_protocol_cuda features.
- scirs2-cluster: removed dead GPU feature aliases (opencl/metal/oneapi).
Docs
- scirs2-spatial: corrected GPU documentation to reflect reality: its gpu_accel path is a CPU-SIMD fallback, not a GPU backend.
Tips
- Rename your wgpu feature flag: six old names, one new one. The portable WebGPU path is now just wgpu everywhere: wgpu_backend → wgpu in scirs2-core, gpu → wgpu in vision/graph/optimize, gpu_wgpu → wgpu in datasets/stats, wgpu_fft → wgpu in fft, wgpu_kernels → wgpu in special, and wgpu_rbf → wgpu in interpolate. If your Cargo.toml, CI config, or build scripts still reference one of the old names, update it; Cargo will reject an unknown feature name rather than silently ignoring it.
- Know what did not rename. scirs2-core’s gpu umbrella feature, array_protocol_wgpu, and scirs2-stats’ gpu core-abstraction passthrough all keep their existing names, so don’t touch those. The empty placeholder features scirs2-integrate/gpu_fem and scirs2-interpolate/gpu_kdtree are untouched too; they were never wired to a real dep:wgpu in the first place, so there was nothing to rename.
- cuda and wgpu now form one consistent pair per crate. Every crate this release touches exposes both: cuda for direct NVIDIA performance (new, off by default) and wgpu for cross-platform portability (renamed, same behavior as before). Enable either independently, both together, or neither; they don’t conflict, and enabling one never silently pulls in the other.
- Turning on cuda costs you nothing without a GPU. It’s a runtime probe, not a hard requirement: cuda_is_available() never panics, and every oxicuda-backed function (cuda_gemm, cuda_solve_spd, cuda_convolve_2d, and the rest) returns a normal error instead of aborting when no NVIDIA device is present. It’s safe to enable in a shared Cargo.toml even if half your fleet has no GPU at all.
- A default build still compiles zero oxicuda. All ten new cuda features are additive and off by default, so if you haven’t opted in explicitly, nothing about your build, binary size, or compile time has changed in 0.6.0.
- Write your CPU fallback explicitly; the library won’t do it for you. None of the new cuda_* functions fall back to the CPU internally on error; that’s intentional, so you always see and control the branch. Match on the Result, as in Getting Started above, rather than assuming a GPU failure degrades gracefully on its own.
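The rename rules above have one trap: gpu becomes wgpu in vision/graph/optimize but stays gpu in scirs2-core and scirs2-stats. For anyone scripting a migration check over many Cargo.toml files, the table can be written down as a lookup. The `Migration` enum and `migrate` function below are invented for this sketch; the entries are transcribed from the Changed and Removed sections of these notes, and anything not listed is reported as `Keep` without checking that the feature exists.

```rust
/// What happens to a (crate, feature) pair when moving to 0.6.0.
#[derive(Debug, PartialEq)]
enum Migration {
    /// Not affected by the 0.6.0 rename or removals listed in the notes.
    Keep,
    /// Feature was renamed; use the given name instead.
    RenameTo(&'static str),
    /// Feature no longer exists.
    Removed,
}

fn migrate(krate: &str, feature: &str) -> Migration {
    match (krate, feature) {
        // The six old wgpu names that collapse to `wgpu`.
        ("scirs2-core", "wgpu_backend") => Migration::RenameTo("wgpu"),
        ("scirs2-vision" | "scirs2-graph" | "scirs2-optimize", "gpu") => {
            Migration::RenameTo("wgpu")
        }
        ("scirs2-datasets" | "scirs2-stats", "gpu_wgpu") => Migration::RenameTo("wgpu"),
        ("scirs2-fft", "wgpu_fft") => Migration::RenameTo("wgpu"),
        ("scirs2-special", "wgpu_kernels") => Migration::RenameTo("wgpu"),
        ("scirs2-interpolate", "wgpu_rbf") => Migration::RenameTo("wgpu"),
        // Features removed outright.
        ("scirs2-core", "cuda" | "array_protocol_cuda") => Migration::Removed,
        ("scirs2-cluster", "opencl" | "metal" | "oneapi") => Migration::Removed,
        // Everything else, including core's `gpu` umbrella and stats' `gpu`.
        _ => Migration::Keep,
    }
}

fn main() {
    for (krate, feature) in [
        ("scirs2-core", "wgpu_backend"),
        ("scirs2-vision", "gpu"),
        ("scirs2-stats", "gpu"),
        ("scirs2-core", "gpu"),
        ("scirs2-core", "cuda"),
    ] {
        println!("{krate}/{feature}: {:?}", migrate(krate, feature));
    }
}
```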
This is the foundation
SciRS2 is the sovereign scientific-computing layer of the COOLJAPAN ecosystem, and a GPU-architecture release like 0.6.0 matters most precisely because so much is built on top of it:
- NumRS2 — NumPy-compatible N-dimensional arrays in pure Rust.
- PandRS — Pandas-compatible DataFrames.
- OptiRS — advanced ML optimizers extending SciRS2.
- ToRSh — a PyTorch-compatible deep-learning framework.
- SkleaRS — a scikit-learn-compatible ML library.
- TrustformeRS — Hugging Face Transformers in pure Rust.
Every one of these inherits SciRS2’s GPU story along with its gradients, its error types, and its dependency tree — so when scirs2-linalg gets a direct cuda_gemm, or scirs2-optimize gets a direct cuda_hessian_vector_product, the acceleration lands exactly where the computation already lives, with no detour through a shared core module standing between the crate and the hardware. The numeric core still rests on OxiBLAS and OxiFFT; verification on OxiZ; the new NVIDIA performance path on OxiCUDA; compression on OxiARC; storage on OxiSQL and OxiH5. No C. No Fortran. No exceptions in the default build.
Repository: https://github.com/cool-japan/scirs
Star the repo if a GPU story that’s honest about what’s portable and what’s NVIDIA-only — and doesn’t need a single line of C to be fast — is what you’ve been waiting for.
Pure Rust scientific computing — decentralized on GPU, sovereign to the core.
KitaSan at COOLJAPAN OÜ, July 1, 2026