GPU-accelerated scientific computing, written entirely in Rust — no CUDA C, no vendor toolchain, the same code running native on your laptop and inside a browser tab.
Today we released SciRS2 0.5.0 — GPU acceleration goes real across the workspace via pure-Rust wgpu, alongside serious advanced numerics and a maturing computer algebra system.
No C. No Fortran. No CUDA toolchain. No NumPy/SciPy system dependencies. The headline of 0.5.0 is that the GPU story is now real — and it is real on pure-Rust WebGPU (wgpu), not on NVIDIA’s CUDA C. That distinction matters: because the compute kernels are WGSL running through wgpu, the very same SciRS2 code path runs natively on Linux/macOS/Windows and compiles to WebAssembly to run in the browser via WebGPU. Everything still compiles down to a single static binary (or a WASM module), with graceful CPU fallback when no GPU adapter is present — so it stays pure Rust by default.
This is a confident minor milestone (0.4.x → 0.5.0): 36,082 tests passing across 29 workspace crates, nearly 4 million lines of Rust, 80,800+ public API items, zero warnings (clippy + rustdoc + fmt clean), Apache-2.0.
Why SciRS2 0.5.0 is a game changer
The pain is familiar. NumPy and SciPy are CPU-bound and Python-slow, and the moment you reach for the GPU you inherit the CUDA C toolchain, a driver/version matrix, and vendor lock-in. You write your science in Python, then rewrite the hot loops in C/C++/CUDA, then babysit the build. SciRS2 0.5.0 takes a different path: GPU acceleration that is pure Rust, portable, and browser-ready.
Concrete 0.5.0 wins:
- Real wgpu GpuNdarray<f32> in scirs2-core: a singleton WebGPU context behind a OnceLock, backed by 7 hand-written WGSL kernels (elementwise add/sub/mul/scalar, naive matmul, two-pass sum, 16×16 transpose). 8 tests cover the round-trip.
- GPU graph algorithms in scirs2-graph: real WGSL BFS (level-synchronous), Bellman-Ford SSSP, and delta-stepping, with a rayon CPU fallback.
- GPU optimizers in scirs2-optimize: L-BFGS, CG, and Newton built on GpuNdarray, with the Hessian-vector matmul of the Newton CG subsolver running on the GPU.
- GPU RBF interpolation in scirs2-interpolate: a real wgpu radial-basis-function kernel-matrix build and evaluation, with per-stage timing.
- Correct Pantelides DAE index reduction in scirs2-integrate: full graph algorithms (Hopcroft-Karp + Tarjan) replacing the old heuristic, with 13 tests.
- Wiktorsson (2001) Lévy-area for SDE strong order 1.5, wired into the strong general SRK solver, with 10 tests.
- A 60× cluster speedup: the LRSC subspace-clustering path drops from 120s to 2s.
The test counts back this up: 13 Pantelides tests, 10 Lévy-area tests, and 8 GpuNdarray tests, all green inside that 36,082-test sweep.
Technical Deep Dive: Pure-Rust GPU via wgpu
Layer 1 — GpuNdarray<f32> in scirs2-core. The foundation lives in array_protocol/gpu_ndarray.rs. A single WebGPUContext is initialized lazily through a OnceLock singleton and shared across the workspace. On top of it sit 7 WGSL compute kernels: elementwise add/sub/mul/scalar, a naive matmul, a two-pass parallel sum, and a 16×16 tiled transpose. The 0.5.0 release also adds concat_axis.wgsl (uniform-stride gather for axis > 0) and reduce_sum_axis.wgsl (per-output axis reduction for rank ≥ 3), and fills in 13 WGSL optimizer/integrator kernel slots — Adam/SGD/RMSprop/Adagrad/LAMB, memcpy/fill, reduce_sum/reduce_max, and the RK4 stages. The whole layer is gated behind the array_protocol_wgpu feature, and a public GpuNdarray::matmul() wrapper exposes the matmul kernel to downstream crates.
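The lazily initialized, shared context described above can be illustrated with the standard library alone. This is a minimal sketch, not the code in scirs2-core: GpuContext, probe_adapter, and shared_context are hypothetical names, and the probe is mocked where real code would presumably query wgpu for an adapter.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a GPU context. A real one would hold a
/// device and queue; this mock only carries a name.
struct GpuContext {
    adapter_name: String,
}

/// Process-wide slot that is filled at most once.
/// `None` models "no adapter found", which is where a CPU fallback would kick in.
static CONTEXT: OnceLock<Option<GpuContext>> = OnceLock::new();

/// Mocked adapter probe (assumption: real code would ask wgpu here).
fn probe_adapter() -> Option<GpuContext> {
    Some(GpuContext {
        adapter_name: "mock-adapter".to_string(),
    })
}

/// Returns the shared context, running the probe only on the first call.
fn shared_context() -> Option<&'static GpuContext> {
    CONTEXT.get_or_init(probe_adapter).as_ref()
}

fn main() {
    let first = shared_context().expect("the mock adapter is always present");
    let second = shared_context().expect("the mock adapter is always present");
    // Both calls hand back the very same instance.
    assert!(std::ptr::eq(first, second));
    println!("adapter = {}", first.adapter_name);
}
```

Storing an Option inside the OnceLock means a failed probe is also cached, so callers that find no adapter do not pay for the probe again.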
Layer 2 — GPU across the crates.
- scirs2-graph ships real WGSL graph traversal: a level-synchronous BFS using atomicCompareExchange, a Bellman-Ford SSSP that does edge-parallel atomicMin on the f32 bit-pattern, and a true delta-stepping kernel with light/heavy phases driven to convergence by a changed_flag. When the GPU isn’t worth it, it falls back to a CPU-parallel BFS/Bellman-Ford via rayon + AtomicU32. The CPU fallback applies when n_edges < 4096; larger graphs dispatch to the GPU. A CpuParallel dispatch bug was fixed along the way.
- scirs2-optimize builds lbfgs_gpu.rs (two-loop recursion with dot/scale/add/subtract on the GPU), cg_gpu.rs, and newton_gpu.rs (Hessian-vector matmul on the GPU for the CG subsolver), all on GpuNdarray and all with a gpu_threshold_override knob.
- scirs2-interpolate adds the wgpu_rbf feature: a real RBF kernel-matrix + evaluation WGSL (a kernel_id uniform, 16×16 and 64-wide workgroups), a real is_gpu_available() OnceLock probe, and per-stage GpuStats timing. The module was split into gpu_accelerated/mod.rs + wgpu_rbf.rs with 5 tests.
- scirs2-special adds wgpu_kernels batch kernels for gamma, erf, bessel_j0, and lgamma, each with a graceful GpuNotAvailable fallback.
- scirs2-transform turns GpuPCA::fit/transform/fit_transform into a real computation that delegates to CPU SVD (no longer a NotImplementedError).
- scirs2-symbolic gains an eval_batch GPU path (with an f64→f32 cast at the buffer boundary and a GpuError::NoAdapter variant) and 4 GPU smoke tests.
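The "atomicMin on the f32 bit-pattern" trick relies on a property of IEEE-754: for non-negative floats, ordering the raw bits as unsigned integers matches ordering the values. The sketch below shows the idea on the CPU with AtomicU32, looping over edges sequentially where a GPU would run one thread per edge. The function names are illustrative, not the scirs2-graph API, and the sketch assumes non-negative edge weights.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// One relaxation sweep over every edge. Returns true if any distance improved.
/// Distances are stored as the bit patterns of non-negative f32 values, so an
/// unsigned integer minimum is also a floating-point minimum.
fn relax_all(edges: &[(usize, usize, f32)], dist: &[AtomicU32]) -> bool {
    let mut changed = false;
    for &(u, v, w) in edges {
        let du = f32::from_bits(dist[u].load(Ordering::Relaxed));
        if du.is_infinite() {
            continue; // source vertex of this edge is not reached yet
        }
        let candidate = (du + w).to_bits();
        // fetch_min stores min(current, candidate) and returns the old value.
        let previous = dist[v].fetch_min(candidate, Ordering::Relaxed);
        if candidate < previous {
            changed = true;
        }
    }
    changed
}

/// Bellman-Ford with early exit once a sweep changes nothing.
fn bellman_ford(n: usize, edges: &[(usize, usize, f32)], source: usize) -> Vec<f32> {
    let dist: Vec<AtomicU32> = (0..n)
        .map(|_| AtomicU32::new(f32::INFINITY.to_bits()))
        .collect();
    dist[source].store(0.0_f32.to_bits(), Ordering::Relaxed);
    for _ in 0..n.saturating_sub(1) {
        if !relax_all(edges, &dist) {
            break;
        }
    }
    dist.iter()
        .map(|d| f32::from_bits(d.load(Ordering::Relaxed)))
        .collect()
}

fn main() {
    let edges: [(usize, usize, f32); 4] =
        [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0)];
    let dist = bellman_ford(4, &edges, 0);
    println!("{:?}", dist); // prints [0.0, 3.0, 1.0, 8.0]
}
```

The early exit in bellman_ford plays the same role as the changed_flag mentioned above: sweeps stop as soon as one of them improves nothing.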
Layer 3 — advanced numerics. Beyond the GPU, 0.5.0 lands genuinely hard algorithms. scirs2-integrate replaces the heuristic find_singular_subsets with the full Pantelides machinery — Hopcroft-Karp O(E√V) bipartite matching plus Tarjan iterative SCC — so DAE index reduction is correct, not approximate; and it adds the Wiktorsson 2001 truncated-series Lévy-area in sde/levy_area.rs. scirs2-spatial ships 2D + 3D Hilbert curve sorting, including a 24-state Butz/Hamilton lookup table for the 3D case (hilbert_d2/hilbert_d3, inverse, f64, and hilbert_sort_2d/hilbert_sort_3d, 8 tests). scirs2-core adds NUMA-locality par_map_chunks (Linux pthread affinity pinning, rayon fallback for Darwin/WASM). And the scirs2-cluster 60× win was a real algorithmic fix: the LRSC/SSC timeouts came from a full eigendecomposition inside ADMM, replaced by a sign-aware early-exit power iteration with min-eigenvalue / min-σ² thresholding to skip sub-threshold SVT modes — LRSC went 120s→2s, SSC 120s→33s, all 18 subspace tests green.
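To give a feel for why an early-exit power iteration can stand in for a full eigendecomposition, here is a simplified sketch. It is not the scirs2-cluster implementation: dominant_exceeds and its parameters are hypothetical, and it only answers one question, namely whether the largest eigenvalue of a symmetric positive semidefinite matrix exceeds a threshold. It exploits the fact that the Rayleigh quotient of a unit vector never overshoots the largest eigenvalue, so the loop can stop as soon as the quotient crosses the threshold.

```rust
/// Dense row-major matrix-vector product for an n x n matrix.
fn mat_vec(a: &[f64], n: usize, x: &[f64]) -> Vec<f64> {
    (0..n)
        .map(|i| (0..n).map(|j| a[i * n + j] * x[j]).sum::<f64>())
        .collect()
}

/// Euclidean norm.
fn norm(x: &[f64]) -> f64 {
    x.iter().map(|v| v * v).sum::<f64>().sqrt()
}

/// Returns true once the largest eigenvalue of the symmetric PSD matrix `a`
/// is known to exceed `threshold`. Returns false if the estimate converges
/// (or `max_iter` runs out) without crossing it.
/// Caveat: a start vector orthogonal to the dominant eigenvector would give
/// a false negative; this sketch uses a fixed deterministic start.
fn dominant_exceeds(a: &[f64], n: usize, threshold: f64, max_iter: usize, tol: f64) -> bool {
    let mut x: Vec<f64> = (0..n).map(|i| 1.0 + i as f64).collect();
    let scale = norm(&x);
    x.iter_mut().for_each(|v| *v /= scale);

    let mut previous = f64::NEG_INFINITY;
    for _ in 0..max_iter {
        let y = mat_vec(a, n, &x);
        // Rayleigh quotient x^T A x for a unit vector x: a lower bound on the
        // largest eigenvalue of a symmetric matrix.
        let rayleigh: f64 = x.iter().zip(&y).map(|(xi, yi)| xi * yi).sum();
        if rayleigh > threshold {
            return true; // early exit: the bound already clears the threshold
        }
        let y_norm = norm(&y);
        if y_norm == 0.0 || (rayleigh - previous).abs() <= tol {
            return false; // converged below the threshold
        }
        previous = rayleigh;
        x = y.iter().map(|v| v / y_norm).collect();
    }
    false
}

fn main() {
    // diag(2, 1): the largest eigenvalue is 2.
    let a = [2.0, 0.0, 0.0, 1.0];
    println!("{}", dominant_exceeds(&a, 2, 1.5, 100, 1e-12)); // prints true
    println!("{}", dominant_exceeds(&a, 2, 2.5, 100, 1e-12)); // prints false
}
```

In a thresholding context, a false result means every mode sits at or below the cutoff, so the expensive decomposition for that step could be skipped.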
Layer 4 — the maturing CAS. The computer algebra system introduced in 0.4.4 keeps growing. scirs2-symbolic adds ALiBi symbolic positional bias (attention/symbolic_alibi.rs: alibi_slope, alibi_bias_expr, alibi_bias_matrix_symbolic, and verify_symbolic_vs_numerical confirming max_diff < 1e-14 against the scirs2-neural baseline). The differential-geometry layer now computes the Riemann tensor R^μ_{νρσ} (4-term formula via symbolic gradients of the Christoffel symbols), the Ricci trace, and a full-n Weyl decomposition, with 10 integration tests spanning Schwarzschild, Minkowski, Bianchi, and the Kretschmann scalar. neural_priors.rs adds discover_series_prior (sliding-window symbolic regression) and series_prior_regularization, with NUMA wire-up for parallel predict. scirs2-neural exposes a SymbolicPriorLoss, and scirs2-autograd lands a correctness repair (a ScalarMulOp added to gradient name-dispatch) plus a published jit_fusion module that extends fusion to matmul epilogues and batched-matmul→reduction.
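For readers unfamiliar with ALiBi, the numerical side that the symbolic version is checked against is small. The sketch below follows the standard ALiBi definition for a power-of-two head count: head h (counting from 1) gets slope 2^(-8h/n), and the bias for query position i attending to key position j ≤ i is the slope times -(i - j). The names slope_for_head and bias_matrix are illustrative and are not the signatures of alibi_slope or alibi_bias_matrix_symbolic in scirs2-symbolic.

```rust
/// ALiBi slope for a zero-based head index, assuming `n_heads` is a power of two.
fn slope_for_head(head: usize, n_heads: usize) -> f64 {
    2f64.powf(-8.0 * (head as f64 + 1.0) / n_heads as f64)
}

/// Lower-triangular ALiBi bias for one head: entry (i, j) is -slope * (i - j)
/// for j <= i. Entries above the diagonal are left at 0.0 here, on the
/// assumption that a separate causal mask handles future positions.
fn bias_matrix(seq_len: usize, slope: f64) -> Vec<Vec<f64>> {
    (0..seq_len)
        .map(|i| {
            (0..seq_len)
                .map(|j| if j <= i { -slope * (i - j) as f64 } else { 0.0 })
                .collect()
        })
        .collect()
}

fn main() {
    let n_heads = 8;
    for head in 0..n_heads {
        println!("head {head}: slope = {}", slope_for_head(head, n_heads));
    }
    let bias = bias_matrix(3, slope_for_head(0, n_heads));
    for row in &bias {
        println!("{row:?}");
    }
}
```

With 8 heads the slopes run from 1/2 down to 1/256, so early heads penalize distance sharply while later heads attend almost uniformly.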
Throughout, every GPU path is feature-gated with graceful CPU fallback (GpuNotAvailable / NoAdapter) — so SciRS2 stays pure Rust by default and you opt into the GPU only when you want it.
Getting Started
Add the crate:
cargo add scirs2
A minimal GPU example — build a GpuNdarray<f32>, add elementwise on the GPU, and read the result back. It transparently falls back to CPU when no adapter is present:
use scirs2_core::array_protocol::gpu_ndarray::GpuNdarray;

fn main() {
    // Two small f32 arrays uploaded to the GPU (pure-Rust wgpu).
    let a = GpuNdarray::<f32>::from_vec(vec![1.0, 2.0, 3.0, 4.0], vec![2, 2]);
    let b = GpuNdarray::<f32>::from_vec(vec![10.0, 20.0, 30.0, 40.0], vec![2, 2]);

    // Elementwise add runs the WGSL kernel on the GPU,
    // or falls back to CPU gracefully if no adapter is found.
    let sum = a.add(&b);

    // Matmul exercises the naive matmul kernel.
    let prod = a.matmul(&b);

    println!("sum = {:?}", sum.to_vec());
    println!("prod = {:?}", prod.to_vec());
}
To actually dispatch to the GPU, enable the feature in Cargo.toml:
[dependencies]
scirs2 = { version = "0.5.0", features = ["array_protocol_wgpu"] }
Prefer to stay on the CPU and exercise the new spatial work instead? Sort 3D points along a Hilbert curve for better locality:
use scirs2_spatial::hilbert_sort_3d;

fn main() {
    let mut points = vec![
        [0.10_f64, 0.90, 0.40],
        [0.80, 0.20, 0.95],
        [0.50, 0.50, 0.50],
        [0.05, 0.05, 0.05],
    ];

    // Reorders points by their position along a 3D Hilbert curve
    // (24-state Butz/Hamilton lookup) so spatially-near points
    // end up near each other in memory.
    hilbert_sort_3d(&mut points);

    println!("{:?}", points);
}
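To show what "position along a Hilbert curve" means in practice, here is a self-contained 2D sketch using the classic bit-twiddling conversion from grid coordinates to curve index. It is a simpler technique than the 24-state lookup table that scirs2-spatial uses for 3D, and hilbert_index_2d and sort_by_hilbert are hypothetical names, not the crate's functions. The sketch assumes points lie in the unit square and that order is at most 16.

```rust
/// Maps grid cell (x, y) on a 2^order by 2^order grid to its index along
/// the Hilbert curve, using the standard rotate-and-accumulate scheme.
fn hilbert_index_2d(order: u32, mut x: u32, mut y: u32) -> u64 {
    let n: u32 = 1 << order;
    let mut d: u64 = 0;
    let mut s = n / 2;
    while s > 0 {
        let rx = u32::from((x & s) > 0);
        let ry = u32::from((y & s) > 0);
        d += u64::from(s) * u64::from(s) * u64::from((3 * rx) ^ ry);
        // Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0 {
            if rx == 1 {
                x = n - 1 - x;
                y = n - 1 - y;
            }
            std::mem::swap(&mut x, &mut y);
        }
        s /= 2;
    }
    d
}

/// Sorts points in the unit square by the Hilbert index of their grid cell.
fn sort_by_hilbert(points: &mut [[f64; 2]], order: u32) {
    let scale = f64::from((1u32 << order) - 1);
    points.sort_by_key(|p| {
        let qx = (p[0].clamp(0.0, 1.0) * scale) as u32;
        let qy = (p[1].clamp(0.0, 1.0) * scale) as u32;
        hilbert_index_2d(order, qx, qy)
    });
}

fn main() {
    let mut points = vec![[0.9, 0.1], [0.1, 0.9], [0.9, 0.9], [0.1, 0.1]];
    sort_by_hilbert(&mut points, 16);
    // The curve visits lower-left, upper-left, upper-right, then lower-right.
    println!("{:?}", points); // prints [[0.1, 0.1], [0.1, 0.9], [0.9, 0.9], [0.9, 0.1]]
}
```

Consecutive indices always belong to adjacent grid cells, which is why sorting by this key tends to keep spatial neighbors close together in memory.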
What’s New in 0.5.0
GPU / wgpu (pure-Rust WebGPU):
- GpuNdarray<f32> in scirs2-core with 7 WGSL kernels (elementwise, matmul, sum, transpose), plus concat_axis / reduce_sum_axis and 13 optimizer/RK4 kernel slots; array_protocol_wgpu feature.
- GPU graph algorithms in scirs2-graph: BFS, Bellman-Ford, delta-stepping, with rayon CPU fallback.
- GPU optimizers in scirs2-optimize: L-BFGS, CG, Newton, built on GpuNdarray.
- GPU RBF interpolation in scirs2-interpolate (wgpu_rbf, is_gpu_available() probe, GpuStats timing).
- GPU special functions in scirs2-special (wgpu_kernels: gamma/erf/bessel_j0/lgamma).
- Real GpuPCA in scirs2-transform.
Advanced numerics:
- Correct Pantelides DAE index reduction (Hopcroft-Karp + Tarjan) in scirs2-integrate.
- Wiktorsson Lévy-area for SDE strong order 1.5.
- 2D + 3D Hilbert curves in scirs2-spatial (hilbert_sort_2d / hilbert_sort_3d).
- NUMA-locality par_map_chunks in scirs2-core.
- 60× cluster speedup (LRSC 120s→2s) in scirs2-cluster.
CAS maturation:
- ALiBi symbolic positional bias in scirs2-symbolic.
- Differential geometry: Riemann / Ricci / Weyl tensors.
- Symbolic priors (discover_series_prior) + SymbolicPriorLoss in scirs2-neural.
- jit_fusion module and a gradient-dispatch correctness fix in scirs2-autograd.
Tips
- Opt into the GPU per feature. Enable array_protocol_wgpu for the core GpuNdarray, and per-crate features wgpu_rbf (interpolate) and wgpu_kernels (special). Everything falls back to CPU gracefully (GpuNotAvailable / NoAdapter), so it is safe to ship the feature on even where no GPU exists.
- GPU only wins above a threshold. WGSL kernels pay a dispatch + upload cost, so small problems are faster on the CPU. Graph algorithms dispatch to the GPU at n_edges ≥ 4096; the optimizers expose gpu_threshold_override so you can tune the crossover for your hardware.
- Sort before you search. Run hilbert_sort_2d / hilbert_sort_3d over your points before building a k-d tree or doing nearest-neighbor queries; the improved spatial locality pays off in cache behavior.
- Use the new DAE path for stiff systems. For DAEs where the old heuristic mis-detected singular subsets, the new Pantelides index reduction (Hopcroft-Karp + Tarjan) is correct; reach for it when index reduction matters.
- For SDEs, use the Lévy-area path. The Wiktorsson Lévy-area gives you strong order 1.5 in the strong general SRK solver, a real accuracy upgrade over lower-order schemes.
- Ship to the browser. Because the GPU layer is wgpu/WebGPU, the same code compiles to wasm32 and runs on WebGPU in the browser, with no separate kernel rewrite.
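The threshold tip boils down to a small decision rule: use the GPU only when one is available and the problem is large enough to amortize upload and dispatch. A minimal sketch of that rule follows. Backend and DispatchPolicy are hypothetical types for illustration, not SciRS2 APIs; only the 4096-edge crossover comes from the release notes.

```rust
/// Where a computation should run.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Backend {
    Cpu,
    Gpu,
}

/// Hypothetical dispatch policy: a GPU availability flag plus a size crossover.
struct DispatchPolicy {
    gpu_available: bool,
    threshold: usize,
}

impl DispatchPolicy {
    /// Picks the GPU only when it exists and the problem reaches the threshold.
    fn choose(&self, problem_size: usize) -> Backend {
        if self.gpu_available && problem_size >= self.threshold {
            Backend::Gpu
        } else {
            Backend::Cpu
        }
    }
}

fn main() {
    // 4096 mirrors the graph crossover quoted above.
    let policy = DispatchPolicy { gpu_available: true, threshold: 4096 };
    println!("{:?}", policy.choose(1_000)); // prints Cpu
    println!("{:?}", policy.choose(10_000)); // prints Gpu

    // Without an adapter, everything stays on the CPU regardless of size.
    let cpu_only = DispatchPolicy { gpu_available: false, threshold: 4096 };
    println!("{:?}", cpu_only.choose(10_000)); // prints Cpu
}
```

Making the threshold a field rather than a constant is what lets a knob like gpu_threshold_override move the crossover per machine.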
Part of the COOLJAPAN ecosystem
SciRS2 is pure-Rust scientific computing, and 0.5.0 makes its place in the ecosystem clearer than ever:
- GPU compute is pure-Rust wgpu / WebGPU — portable and browser-ready. If you specifically want NVIDIA CUDA, that path lives in oxicuda; SciRS2’s default GPU is vendor-neutral WebGPU.
- Numeric core: OxiBLAS for linear algebra and OxiFFT for transforms — no OpenBLAS, no FFTW, no C.
- CAS verification & parity: the symbolic layer verifies against OxiZ and a clean-room oxieml parity baseline.
- Peers: it composes with torsh, optirs, numrs, pandrs, and sklears for tensors, optimization, ndarray-style numerics, dataframes, and classical ML.
Repository: https://github.com/cool-japan/scirs
Star the repo if you want NumPy/SciPy/scikit-learn-grade scientific computing without the C, Fortran, or CUDA toolchain.
Pure Rust scientific computing — now GPU-accelerated, browser-ready, and sovereign.
— KitaSan at COOLJAPAN OÜ June 3, 2026