SciRS2 0.5.0 Released — Pure-Rust GPU Acceleration Goes Real (wgpu) Across the Stack

GPU-accelerated scientific computing, written entirely in Rust — no CUDA C, no vendor toolchain, the same code running native on your laptop and inside a browser tab.

Today we released SciRS2 0.5.0 — GPU acceleration goes real across the workspace via pure-Rust wgpu, alongside serious advanced numerics and a maturing computer algebra system.

No C. No Fortran. No CUDA toolchain. No NumPy/SciPy system dependencies. The headline of 0.5.0 is that the GPU story is now real — and it is real on pure-Rust WebGPU (wgpu), not on NVIDIA’s CUDA C. That distinction matters: because the compute kernels are WGSL running through wgpu, the very same SciRS2 code path runs natively on Linux/macOS/Windows and compiles to WebAssembly to run in the browser via WebGPU. Everything still compiles down to a single static binary (or a WASM module), with graceful CPU fallback when no GPU adapter is present — so it stays pure Rust by default.

This is a confident minor milestone (0.4.x → 0.5.0): 36,082 tests passing across 29 workspace crates, nearly 4 million lines of Rust, 80,800+ public API items, zero warnings (clippy + rustdoc + fmt clean), Apache-2.0 licensed.

Why SciRS2 0.5.0 is a game changer

The pain is familiar. NumPy and SciPy are CPU-bound and Python-slow, and the moment you reach for the GPU you inherit the CUDA C toolchain, a driver/version matrix, and vendor lock-in. You write your science in Python, then rewrite the hot loops in C/C++/CUDA, then babysit the build. SciRS2 0.5.0 takes a different path: GPU acceleration that is pure Rust, portable, and browser-ready.

Concrete 0.5.0 wins:

Real wgpu GpuNdarray<f32> in scirs2-core — a singleton WebGPU context behind a OnceLock, backed by 7 hand-written WGSL kernels (elementwise add/sub/mul/scalar, naive matmul, two-pass sum, 16×16 transpose). 8 tests cover the round-trip.
GPU graph algorithms in scirs2-graph — real WGSL BFS (level-synchronous), Bellman-Ford SSSP, and delta-stepping, with a rayon CPU fallback.
GPU optimizers in scirs2-optimize — L-BFGS, CG, and Newton built on GpuNdarray, with the Hessian-vector matmul of the Newton CG subsolver running on the GPU.
GPU RBF interpolation in scirs2-interpolate — a real wgpu radial-basis-function kernel-matrix build and evaluation, with per-stage timing.
Correct Pantelides DAE index reduction in scirs2-integrate — full graph algorithms (Hopcroft-Karp + Tarjan) replacing the old heuristic, with 13 tests.
Wiktorsson (2001) Lévy-area for SDE strong order 1.5, wired into the strong general SRK solver, with 10 tests.
A 60× clustering speedup in scirs2-cluster: the LRSC subspace-clustering path drops from 120s to 2s.

The test counts back these claims: 13 Pantelides tests, 10 Lévy-area tests, and 8 GpuNdarray tests, all green inside that 36,082-test sweep.

Technical Deep Dive: Pure-Rust GPU via wgpu

Layer 1 — GpuNdarray<f32> in scirs2-core. The foundation lives in array_protocol/gpu_ndarray.rs. A single WebGPUContext is initialized lazily through a OnceLock singleton and shared across the workspace. On top of it sit 7 WGSL compute kernels: elementwise add/sub/mul/scalar, a naive matmul, a two-pass parallel sum, and a 16×16 tiled transpose. The 0.5.0 release also adds concat_axis.wgsl (uniform-stride gather for axis > 0) and reduce_sum_axis.wgsl (per-output axis reduction for rank ≥ 3), and fills in 13 WGSL optimizer/integrator kernel slots — Adam/SGD/RMSprop/Adagrad/LAMB, memcpy/fill, reduce_sum/reduce_max, and the RK4 stages. The whole layer is gated behind the array_protocol_wgpu feature, and a public GpuNdarray::matmul() wrapper exposes the matmul kernel to downstream crates.
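To make the "two-pass parallel sum" idea concrete, here is a minimal CPU-side sketch of the same reduction shape. It is not the WGSL kernel from the crate: the function names and the workgroup size of 256 are assumptions chosen for illustration, and the chunks are processed sequentially here, where a GPU would reduce them in parallel.

```rust
/// Pass 1: each "workgroup" reduces one fixed-size chunk to a partial sum.
/// (Hypothetical helper; on a GPU every chunk would be its own workgroup.)
fn partial_sums(data: &[f32], workgroup_size: usize) -> Vec<f32> {
    assert!(workgroup_size > 0, "workgroup_size must be non-zero");
    data.chunks(workgroup_size)
        .map(|chunk| chunk.iter().sum::<f32>())
        .collect()
}

/// Pass 2: a single reduction over the (much shorter) list of partial sums.
fn two_pass_sum(data: &[f32], workgroup_size: usize) -> f32 {
    partial_sums(data, workgroup_size).iter().sum()
}

fn main() {
    let data: Vec<f32> = (1..=1000).map(|i| i as f32).collect();
    // 256 is an assumed workgroup size, not a value taken from the release.
    let total = two_pass_sum(&data, 256);
    println!("total = {total}");
}
```

The split matters because a single workgroup cannot see the whole buffer: the first pass shrinks n elements to roughly n / workgroup_size partials, and the second pass finishes the job on that small buffer.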

Layer 2 — GPU across the crates.

scirs2-graph ships real WGSL graph traversal: a level-synchronous BFS using atomicCompareExchange, a Bellman-Ford SSSP that does edge-parallel atomicMin on the f32 bit-pattern, and a true delta-stepping kernel with light/heavy phases driven to convergence by a changed_flag. Small graphs are not worth the dispatch cost, so inputs with n_edges < 4096 fall back to a CPU-parallel BFS/Bellman-Ford built on rayon + AtomicU32. A CpuParallel dispatch bug was fixed along the way.
scirs2-optimize builds lbfgs_gpu.rs (two-loop recursion with dot/scale/add/subtract on the GPU), cg_gpu.rs, and newton_gpu.rs (Hessian-vector matmul on the GPU for the CG subsolver) — all on GpuNdarray, all with a gpu_threshold_override knob.
scirs2-interpolate adds the wgpu_rbf feature: a real RBF kernel-matrix + evaluation WGSL (a kernel_id uniform, 16×16 and 64-wide workgroups), a real is_gpu_available() OnceLock probe, and per-stage GpuStats timing. The module was split into gpu_accelerated/mod.rs + wgpu_rbf.rs with 5 tests.
scirs2-special adds wgpu_kernels batch kernels for gamma, erf, bessel_j0, and lgamma, each with a graceful GpuNotAvailable fallback.
scirs2-transform turns GpuPCA::fit/transform/fit_transform into a real computation that delegates to CPU SVD (no longer a NotImplementedError).
scirs2-symbolic gains an eval_batch GPU path (with an f64→f32 cast at the buffer boundary and a GpuError::NoAdapter variant) and 4 GPU smoke tests.
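The "atomicMin on the f32 bit-pattern" trick in the Bellman-Ford kernel is worth a closer look. The sketch below reproduces the idea on the CPU with std atomics, under the assumption that all distances are non-negative: for non-negative IEEE-754 floats (including +infinity), comparing the raw bits as unsigned integers gives the same order as comparing the floats. The function names are hypothetical, the edge loop is sequential here where the GPU version would run one invocation per edge, and how the crate treats negative weights is not stated in this post.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Relax edge (u, v, w) with an integer atomic min on the f32 bit pattern.
/// Returns true if dist[v] was lowered. Assumes non-negative distances.
fn relax(dist: &[AtomicU32], u: usize, v: usize, w: f32) -> bool {
    let du = f32::from_bits(dist[u].load(Ordering::Relaxed));
    if !du.is_finite() {
        return false; // u has not been reached yet
    }
    let candidate = (du + w).to_bits();
    let previous = dist[v].fetch_min(candidate, Ordering::Relaxed);
    candidate < previous
}

/// Bellman-Ford over an edge list; stops early once a full pass changes nothing.
fn bellman_ford(n: usize, edges: &[(usize, usize, f32)], source: usize) -> Vec<f32> {
    let dist: Vec<AtomicU32> = (0..n)
        .map(|_| AtomicU32::new(f32::INFINITY.to_bits()))
        .collect();
    dist[source].store(0.0f32.to_bits(), Ordering::Relaxed);

    for _ in 0..n.saturating_sub(1) {
        let mut changed = false;
        for &(u, v, w) in edges {
            changed |= relax(&dist, u, v, w);
        }
        if !changed {
            break;
        }
    }
    dist.iter()
        .map(|d| f32::from_bits(d.load(Ordering::Relaxed)))
        .collect()
}

fn main() {
    let edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)];
    // Vertex 4 is unreachable and keeps its initial distance of infinity.
    println!("{:?}", bellman_ford(5, &edges, 0));
}
```

The payoff is that the min happens in one atomic integer instruction, which WGSL supports, instead of a compare-and-swap loop on floats.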

Layer 3 — advanced numerics. Beyond the GPU, 0.5.0 lands genuinely hard algorithms. scirs2-integrate replaces the heuristic find_singular_subsets with the full Pantelides machinery — Hopcroft-Karp O(E√V) bipartite matching plus Tarjan iterative SCC — so DAE index reduction is correct, not approximate; and it adds the Wiktorsson 2001 truncated-series Lévy-area in sde/levy_area.rs. scirs2-spatial ships 2D + 3D Hilbert curve sorting, including a 24-state Butz/Hamilton lookup table for the 3D case (hilbert_d2/hilbert_d3, inverse, f64, and hilbert_sort_2d/hilbert_sort_3d, 8 tests). scirs2-core adds NUMA-locality par_map_chunks (Linux pthread affinity pinning, rayon fallback for Darwin/WASM). And the scirs2-cluster 60× win was a real algorithmic fix: the LRSC/SSC timeouts came from a full eigendecomposition inside ADMM, replaced by a sign-aware early-exit power iteration with min-eigenvalue / min-σ² thresholding to skip sub-threshold SVT modes — LRSC went 120s→2s, SSC 120s→33s, all 18 subspace tests green.
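For readers new to Hilbert ordering, here is a minimal sketch of the 2D case using the classic bit-by-bit rotate-and-flip construction on an integer grid. It is not the scirs2-spatial implementation (which works on f64 points and uses a lookup table in 3D); the function names and the integer-grid signature are assumptions for illustration.

```rust
/// Map cell (x, y) of an n-by-n grid (n a power of two) to its index
/// along a 2D Hilbert curve. Hypothetical helper, not the crate's API.
fn hilbert_index_2d(n: u32, mut x: u32, mut y: u32) -> u32 {
    debug_assert!(n.is_power_of_two() && x < n && y < n);
    let mut d = 0;
    let mut s = n / 2;
    while s > 0 {
        let rx = u32::from((x & s) != 0);
        let ry = u32::from((y & s) != 0);
        // Each quadrant at this scale contributes s*s cells to the index.
        d += s * s * ((3 * rx) ^ ry);
        // Rotate/flip so the sub-curve has the standard orientation.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s /= 2;
    }
    d
}

/// Sort grid cells by Hilbert index so neighbours in space end up
/// close together in memory.
fn hilbert_sort_2d_sketch(points: &mut [(u32, u32)], n: u32) {
    points.sort_by_key(|&(x, y)| hilbert_index_2d(n, x, y));
}

fn main() {
    let mut points = vec![(3, 0), (0, 0), (2, 2), (0, 3), (1, 1)];
    hilbert_sort_2d_sketch(&mut points, 4);
    println!("{:?}", points);
}
```

The property that makes this useful is that consecutive indices always map to adjacent cells, so a linear scan over the sorted data never jumps across the domain.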

Layer 4 — the maturing CAS. The computer algebra system introduced in 0.4.4 keeps growing. scirs2-symbolic adds ALiBi symbolic positional bias (attention/symbolic_alibi.rs: alibi_slope, alibi_bias_expr, alibi_bias_matrix_symbolic, and verify_symbolic_vs_numerical confirming max_diff < 1e-14 against the scirs2-neural baseline). The differential-geometry layer now computes the Riemann tensor R^μ_{νρσ} (4-term formula via symbolic gradients of the Christoffel symbols), the Ricci trace, and a full-n Weyl decomposition, with 10 integration tests spanning Schwarzschild, Minkowski, Bianchi, and the Kretschmann scalar. neural_priors.rs adds discover_series_prior (sliding-window symbolic regression) and series_prior_regularization, with NUMA wire-up for parallel predict. scirs2-neural exposes a SymbolicPriorLoss, and scirs2-autograd lands a correctness repair (a ScalarMulOp added to gradient name-dispatch) plus a published jit_fusion module that extends fusion to matmul epilogues and batched-matmul→reduction.
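For reference, the "4-term formula" most likely refers to the standard coordinate expression of the Riemann tensor in terms of the Christoffel symbols. The block below writes it out in the index order used above; sign and index-placement conventions vary between textbooks, so treat the exact convention as an assumption rather than a statement about the crate.

```latex
% Christoffel symbols from the metric g:
\Gamma^{\mu}_{\nu\sigma}
  = \tfrac{1}{2}\, g^{\mu\lambda}
    \left( \partial_{\nu} g_{\lambda\sigma}
         + \partial_{\sigma} g_{\lambda\nu}
         - \partial_{\lambda} g_{\nu\sigma} \right)

% Riemann tensor: two derivative terms plus two quadratic terms.
R^{\mu}{}_{\nu\rho\sigma}
  = \partial_{\rho} \Gamma^{\mu}_{\sigma\nu}
  - \partial_{\sigma} \Gamma^{\mu}_{\rho\nu}
  + \Gamma^{\mu}_{\rho\lambda} \Gamma^{\lambda}_{\sigma\nu}
  - \Gamma^{\mu}_{\sigma\lambda} \Gamma^{\lambda}_{\rho\nu}

% Ricci tensor as the trace over the first and third index:
R_{\nu\sigma} = R^{\mu}{}_{\nu\mu\sigma}
```

For the Minkowski metric every Christoffel symbol vanishes, so all four terms are zero and the tensor is identically zero, which makes it a natural first integration test.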

Throughout, every GPU path is feature-gated with graceful CPU fallback (GpuNotAvailable / NoAdapter) — so SciRS2 stays pure Rust by default and you opt into the GPU only when you want it.
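The fallback pattern described above can be sketched in a few lines of std-only Rust. Everything here is a stand-in: the probe always reports "no adapter", and the names only mirror the ones mentioned in this post (is_gpu_available, GpuError::NoAdapter) without claiming to match the real signatures.

```rust
use std::sync::OnceLock;

/// Stand-in error type; only mirrors the variant name mentioned above.
#[derive(Debug, PartialEq)]
enum GpuError {
    NoAdapter,
}

/// The probe result is computed once and cached for the process lifetime.
static GPU_AVAILABLE: OnceLock<bool> = OnceLock::new();

/// Hypothetical probe. A real one would ask wgpu for an adapter;
/// this sketch always reports that none was found.
fn probe_adapter() -> bool {
    false
}

fn is_gpu_available() -> bool {
    *GPU_AVAILABLE.get_or_init(probe_adapter)
}

fn add_cpu(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// GPU path: fails fast with NoAdapter instead of panicking.
fn add_gpu(a: &[f32], b: &[f32]) -> Result<Vec<f32>, GpuError> {
    if !is_gpu_available() {
        return Err(GpuError::NoAdapter);
    }
    // A real implementation would upload buffers and dispatch a kernel here.
    Ok(add_cpu(a, b))
}

/// Public entry point: try the GPU, fall back to the CPU on any GPU error.
fn add(a: &[f32], b: &[f32]) -> Vec<f32> {
    add_gpu(a, b).unwrap_or_else(|_| add_cpu(a, b))
}

fn main() {
    println!("gpu available: {}", is_gpu_available());
    println!("{:?}", add(&[1.0, 2.0], &[10.0, 20.0]));
}
```

Caching the probe in a OnceLock means the adapter lookup cost is paid at most once, and callers never need to know which path actually ran.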

Getting Started

Add the crate:

cargo add scirs2

A minimal GPU example — build a GpuNdarray<f32>, add elementwise on the GPU, and read the result back. It transparently falls back to CPU when no adapter is present:

use scirs2_core::array_protocol::gpu_ndarray::GpuNdarray;

fn main() {
    // Two small f32 arrays uploaded to the GPU (pure-Rust wgpu).
    let a = GpuNdarray::<f32>::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2]);
    let b = GpuNdarray::<f32>::from_vec(vec![10.0, 20.0, 30.0, 40.0], vec![2, 2]);

    // Elementwise add runs the WGSL kernel on the GPU,
    // or falls back to CPU gracefully if no adapter is found.
    let sum = a.add(&b);

    // Matmul exercises the naive matmul kernel.
    let prod = a.matmul(&b);

    println!("sum  = {:?}", sum.to_vec());
    println!("prod = {:?}", prod.to_vec());
}

To actually dispatch to the GPU, enable the feature in Cargo.toml:

[dependencies]
scirs2 = { version = "0.5.0", features = ["array_protocol_wgpu"] }

Prefer to stay on the CPU and exercise the new spatial work instead? Sort 3D points along a Hilbert curve for better locality:

use scirs2_spatial::hilbert_sort_3d;

fn main() {
    let mut points = vec![
        [0.10_f64, 0.90, 0.40],
        [0.80, 0.20, 0.95],
        [0.50, 0.50, 0.50],
        [0.05, 0.05, 0.05],
    ];

    // Reorders points by their position along a 3D Hilbert curve
    // (24-state Butz/Hamilton lookup) so spatially-near points
    // end up near each other in memory.
    hilbert_sort_3d(&mut points);

    println!("{:?}", points);
}

What’s New in 0.5.0

GPU / wgpu (pure-Rust WebGPU):

GpuNdarray<f32> in scirs2-core with 7 WGSL kernels (elementwise, matmul, sum, transpose), plus concat_axis / reduce_sum_axis and 13 optimizer/RK4 kernel slots; array_protocol_wgpu feature.
GPU graph algorithms in scirs2-graph: BFS, Bellman-Ford, delta-stepping, with rayon CPU fallback.
GPU optimizers in scirs2-optimize: L-BFGS, CG, Newton — built on GpuNdarray.
GPU RBF interpolation in scirs2-interpolate (wgpu_rbf, is_gpu_available() probe, GpuStats timing).
GPU special functions in scirs2-special (wgpu_kernels: gamma/erf/bessel_j0/lgamma).
Working GpuPCA in scirs2-transform (fit/transform/fit_transform, currently delegating to CPU SVD).

Advanced numerics:

Correct Pantelides DAE index reduction (Hopcroft-Karp + Tarjan) in scirs2-integrate.
Wiktorsson Lévy-area for SDE strong order 1.5.
2D + 3D Hilbert curves in scirs2-spatial (hilbert_sort_2d/hilbert_sort_3d).
NUMA-locality par_map_chunks in scirs2-core.
60× cluster speedup (LRSC 120s→2s) in scirs2-cluster.

CAS maturation:

ALiBi symbolic positional bias in scirs2-symbolic.
Differential geometry: Riemann / Ricci / Weyl tensors.
Symbolic priors (discover_series_prior) + SymbolicPriorLoss in scirs2-neural.
jit_fusion module and a gradient-dispatch correctness fix in scirs2-autograd.

Tips

Opt into the GPU per feature. Enable array_protocol_wgpu for the core GpuNdarray, and per-crate features wgpu_rbf (interpolate) and wgpu_kernels (special). Everything falls back to CPU gracefully (GpuNotAvailable / NoAdapter), so it is safe to ship the feature on even where no GPU exists.
GPU only wins above a threshold. WGSL kernels pay a dispatch + upload cost, so small problems are faster on the CPU. Graph algorithms dispatch to the GPU at n_edges ≥ 4096; the optimizers expose gpu_threshold_override so you can tune the crossover for your hardware.
Sort before you search. Run hilbert_sort_2d / hilbert_sort_3d over your points before building a k-d tree or doing nearest-neighbor queries — the improved spatial locality pays off in cache behavior.
Use the new DAE path for high-index systems. For DAEs where the old heuristic mis-detected singular subsets, the new Pantelides index reduction (Hopcroft-Karp + Tarjan) gives the correct structural result, so reach for it whenever index reduction matters.
For SDEs, use the Lévy-area path. The Wiktorsson Lévy-area gives you strong order 1.5 in the strong general SRK solver — a real accuracy upgrade over lower-order schemes.
Ship to the browser. Because the GPU layer is wgpu/WebGPU, the same code compiles to wasm32 and runs on WebGPU in the browser — no separate kernel rewrite.
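The threshold tip boils down to a small decision function. The sketch below is hypothetical: it merges the graph crate's 4096-edge cutoff and the optimizers' gpu_threshold_override idea into one illustrative helper, and neither the enum nor the function is an API from the release.

```rust
/// Which backend a call should run on (illustrative type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Cpu,
    Gpu,
}

/// Default crossover taken from the n_edges >= 4096 rule quoted above.
const DEFAULT_GPU_EDGE_THRESHOLD: usize = 4096;

/// Pick a backend from the problem size, adapter availability, and an
/// optional user-supplied threshold (hypothetical parameter).
fn choose_backend(
    n_edges: usize,
    gpu_available: bool,
    threshold_override: Option<usize>,
) -> Backend {
    let threshold = threshold_override.unwrap_or(DEFAULT_GPU_EDGE_THRESHOLD);
    if gpu_available && n_edges >= threshold {
        Backend::Gpu
    } else {
        Backend::Cpu
    }
}

fn main() {
    println!("{:?}", choose_backend(1_000, true, None));
    println!("{:?}", choose_backend(10_000, true, None));
    println!("{:?}", choose_backend(1_000, true, Some(512)));
    println!("{:?}", choose_backend(10_000, false, None));
}
```

Lowering the override is worth trying on integrated GPUs with cheap uploads, while raising it can help on discrete cards where the transfer dominates small problems.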

Part of the COOLJAPAN ecosystem

SciRS2 is pure-Rust scientific computing, and 0.5.0 makes its place in the ecosystem clearer than ever:

GPU compute is pure-Rust wgpu / WebGPU — portable and browser-ready. If you specifically want NVIDIA CUDA, that path lives in oxicuda; SciRS2’s default GPU is vendor-neutral WebGPU.
Numeric core: OxiBLAS for linear algebra and OxiFFT for transforms — no OpenBLAS, no FFTW, no C.
CAS verification & parity: the symbolic layer verifies against OxiZ and a clean-room oxieml parity baseline.
Peers: it composes with torsh, optirs, numrs, pandrs, and sklears for tensors, optimization, ndarray-style numerics, dataframes, and classical ML.

Repository: https://github.com/cool-japan/scirs

Star the repo if you want NumPy/SciPy/scikit-learn-grade scientific computing without the C, Fortran, or CUDA toolchain.

Pure Rust scientific computing — now GPU-accelerated, browser-ready, and sovereign.

— KitaSan at COOLJAPAN OÜ June 3, 2026