ToRSh 0.1.3 Released — GPU Backend via OxiCUDA and Zero C/asm in the Build

The GPU backend is real, the last C/asm dep is gone, and ToRSh now speaks JavaScript.

Today we released ToRSh 0.1.3, the GPU and sovereignty release. The OxiCUDA compute backend plugs in without requiring the CUDA SDK at build time, the final C/asm dependency (the ring crate) is replaced by pure-Rust RustCrypto crates, ring all-reduce arrives for multi-GPU training, and the Node.js N-API binding layer reaches completion.

ToRSh — “Tensor Operations in Rust with Sharding” — is a PyTorch-compatible deep-learning framework built entirely in pure Rust. No C. No C++. No Fortran. No Python runtime. Where PyTorch depends on libtorch/ATen, a full CUDA toolchain, and a Python interpreter just to run inference, ToRSh compiles to a single static binary you can ship to bare metal, a container, or a WASM target with nothing else installed. As of 0.1.3, that binary is also free of C/asm: the ring crate — the last non-Rust spot in the default build — has been swapped for RustCrypto’s aes-gcm, chacha20poly1305, pbkdf2, and hmac.

Why ToRSh 0.1.3 is the GPU inflection point

Every previous release deferred true GPU work: 0.1.0 built the foundation, 0.1.1 added domain crates, 0.1.2 made CPU SIMD real. 0.1.3 is where the GPU story begins in earnest:

OxiCUDA backend, no SDK required. The new CudaBackend in torsh-tensor implements the oxicuda_backend::ComputeBackend trait by loading the CUDA driver at runtime via libloading — no CUDA SDK, no -lcudart, no build-time linkage. Enable it with features = ["cuda"]; the binary still works on CPU-only boxes.
PTX-backed kernel dispatch. ptx_ops.rs handles unary, binary, and reduction kernels through OxiCUDA’s PTX layer. gemm, conv2d_forward, and attention return BackendError::Unsupported pending upstream kernel additions — an honest stub, not a silent lie.
Real CUDA memory management. cudaMalloc, cudaMallocManaged (unified memory), and cudaHostAlloc (pinned host memory) are all live. The CudaMemoryManagerCoordinator has real fragmentation analysis, real pressure levels (Normal/Low/Medium/High/Critical), and real pool wiring — not stubs that silently return Default::default().
Ring all-reduce — bandwidth-optimal. The distributed backend now ships a proper ring all-reduce with the theoretical bandwidth bound: 2(N-1)/N × buffer_size. The previous gather+broadcast approach wasted bandwidth proportional to N; the ring approach does not. All ReduceOp variants — Sum, Product, Min, Max, Average, Mean — are covered.
15–30% automatic throughput gain. Phase-4 chunking helpers (ChunkingUtils::matrix_blocks, chunked_elementwise, chunked_sum, chunked_mean) tile large tensor operations into cache-friendly blocks and deliver measured 15–30% throughput improvements on large tensors with no code change from users.
COOLJAPAN Pure Rust Policy: complete. ring (C/asm AEAD) replaced by aes-gcm 0.11.0-rc.4, chacha20poly1305 0.11.0-rc.3, pbkdf2 0.13, and hmac 0.13. The default build is now 100% Rust.
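
To make the ring all-reduce bandwidth figure concrete, here is a small back-of-the-envelope sketch in plain Rust. It is not ToRSh code: both function names are hypothetical, and the gather+broadcast cost model (a root that sends one full buffer to each of the other N-1 ranks) is our simplifying assumption about the naive approach.

```rust
/// Bytes each rank sends during a ring all-reduce of `buffer_bytes` across `n` ranks.
/// Reduce-scatter and all-gather each take n-1 steps, and every step moves one
/// chunk of buffer_bytes / n, which gives 2(N-1)/N x buffer_size in total.
/// Integer division rounds down when the buffer does not split evenly.
pub fn ring_bytes_sent_per_rank(n: u64, buffer_bytes: u64) -> u64 {
    assert!(n >= 1, "need at least one rank");
    2 * (n - 1) * buffer_bytes / n
}

/// Bytes the root sends in a naive gather+broadcast all-reduce.
/// Assumed model: after gathering, the root sends one full buffer to each of
/// the other n-1 ranks (and it received the same amount during the gather).
pub fn naive_root_bytes_sent(n: u64, buffer_bytes: u64) -> u64 {
    assert!(n >= 1, "need at least one rank");
    (n - 1) * buffer_bytes
}
```

Under this model the ring cost per rank never reaches 2 x buffer_size, while the root's cost in the naive scheme keeps growing with every GPU added.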

Technical Deep Dive: the GPU stack

The compute abstraction. At the bottom sits oxicuda_backend::ComputeBackend, OxiCUDA’s trait for GPU dispatch. torsh-tensor’s new CudaBackend is a thin adapter over three OxiCUDA leaf crates: oxicuda-driver (driver API), oxicuda-launch (kernel launch), and oxicuda-ptx (PTX JIT). The gpu_dispatch.rs module sits above that and routes tensor operations through the trait. Because every operation goes through the trait, CpuBackend can satisfy the same interface for tests on machines without a GPU.
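
The shape of that abstraction can be sketched in a few lines of standard-library Rust. Everything here is a stand-in: the trait, the error enum, StubGpuBackend, and Dispatcher are hypothetical and do not reproduce OxiCUDA's or ToRSh's real signatures. Retrying on the CPU when the primary backend reports Unsupported is also our assumption about one reasonable policy, not a documented behaviour.

```rust
/// Hypothetical error type, modeled loosely on the `BackendError::Unsupported`
/// variant named in the post; the real OxiCUDA type may differ.
#[derive(Debug, PartialEq)]
pub enum BackendError {
    Unsupported(&'static str),
}

/// Hypothetical stand-in for a compute-backend trait (not OxiCUDA's signature).
pub trait ComputeBackend {
    fn name(&self) -> &'static str;
    fn relu(&self, x: &[f32]) -> Result<Vec<f32>, BackendError>;
    fn gemm(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize)
        -> Result<Vec<f32>, BackendError>;
}

pub struct CpuBackend;
/// Pretend GPU backend: ReLU is "supported" (computed on the host in this
/// sketch), gemm is an explicit Unsupported stub.
pub struct StubGpuBackend;

impl ComputeBackend for CpuBackend {
    fn name(&self) -> &'static str { "cpu" }
    fn relu(&self, x: &[f32]) -> Result<Vec<f32>, BackendError> {
        Ok(x.iter().map(|&v| v.max(0.0)).collect())
    }
    fn gemm(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize)
        -> Result<Vec<f32>, BackendError> {
        assert_eq!(a.len(), m * k);
        assert_eq!(b.len(), k * n);
        let mut out = vec![0.0; m * n];
        for i in 0..m {
            for p in 0..k {
                for j in 0..n {
                    out[i * n + j] += a[i * k + p] * b[p * n + j];
                }
            }
        }
        Ok(out)
    }
}

impl ComputeBackend for StubGpuBackend {
    fn name(&self) -> &'static str { "stub-gpu" }
    fn relu(&self, x: &[f32]) -> Result<Vec<f32>, BackendError> {
        Ok(x.iter().map(|&v| v.max(0.0)).collect())
    }
    fn gemm(&self, _: &[f32], _: &[f32], _: usize, _: usize, _: usize)
        -> Result<Vec<f32>, BackendError> {
        Err(BackendError::Unsupported("gemm"))
    }
}

/// Routes each op through `primary`, retrying on `fallback` when the primary
/// reports Unsupported. Returns the result plus the backend that produced it.
pub struct Dispatcher {
    pub primary: Box<dyn ComputeBackend>,
    pub fallback: Box<dyn ComputeBackend>,
}

impl Dispatcher {
    fn run<F>(&self, op: F) -> Result<(Vec<f32>, &'static str), BackendError>
    where
        F: Fn(&dyn ComputeBackend) -> Result<Vec<f32>, BackendError>,
    {
        match op(&*self.primary) {
            Ok(out) => Ok((out, self.primary.name())),
            Err(BackendError::Unsupported(_)) => {
                op(&*self.fallback).map(|out| (out, self.fallback.name()))
            }
        }
    }
    pub fn relu(&self, x: &[f32]) -> Result<(Vec<f32>, &'static str), BackendError> {
        self.run(|be| be.relu(x))
    }
    pub fn gemm(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize)
        -> Result<(Vec<f32>, &'static str), BackendError> {
        self.run(|be| be.gemm(a, b, m, k, n))
    }
}
```

Returning an explicit Unsupported error, rather than a zeroed tensor, is what lets a caller decide between falling back and failing loudly.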

The memory coordinator. torsh-backend/src/cuda/memory/manager.rs now boots a real CudaMemoryManagerCoordinator via OnceLock, wires allocate_from_device_pool and return_to_device_pool through cust::cuda_malloc/free, and exposes configure_predictive_allocation, get_memory_statistics, and get_performance_metrics through live paths. Machines without a GPU return Default::default() before init — no panic, no fabricated data.
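
A minimal sketch of the OnceLock pattern described above, using only the standard library. The struct layout, the function bodies, and especially the pressure thresholds are hypothetical illustrations; only the level names and the "Default before init" behaviour come from this post.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

#[derive(Debug, Default, Clone, PartialEq)]
pub struct MemoryStatistics {
    pub allocated_bytes: u64,
    pub total_bytes: u64,
}

/// Level names follow the post; the thresholds below are invented.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PressureLevel { Normal, Low, Medium, High, Critical }

/// Hypothetical coordinator: tracks a byte counter instead of a real device pool.
pub struct Coordinator {
    total_bytes: u64,
    allocated: AtomicU64,
}

static COORDINATOR: OnceLock<Coordinator> = OnceLock::new();

/// Initializes the global coordinator once; later calls return false.
pub fn init_coordinator(total_bytes: u64) -> bool {
    COORDINATOR
        .set(Coordinator { total_bytes, allocated: AtomicU64::new(0) })
        .is_ok()
}

/// Records an allocation; returns false when no coordinator exists yet.
pub fn record_allocation(bytes: u64) -> bool {
    match COORDINATOR.get() {
        Some(c) => {
            c.allocated.fetch_add(bytes, Ordering::Relaxed);
            true
        }
        None => false,
    }
}

/// Before init (for example on a machine without a GPU) this returns the
/// Default value instead of panicking or inventing numbers.
pub fn get_memory_statistics() -> MemoryStatistics {
    match COORDINATOR.get() {
        Some(c) => MemoryStatistics {
            allocated_bytes: c.allocated.load(Ordering::Relaxed),
            total_bytes: c.total_bytes,
        },
        None => MemoryStatistics::default(),
    }
}

/// Maps utilization to a pressure level using illustrative cut-offs.
pub fn pressure_level(stats: &MemoryStatistics) -> PressureLevel {
    if stats.total_bytes == 0 {
        return PressureLevel::Normal;
    }
    let pct = stats.allocated_bytes.saturating_mul(100) / stats.total_bytes;
    match pct {
        0..=49 => PressureLevel::Normal,
        50..=69 => PressureLevel::Low,
        70..=84 => PressureLevel::Medium,
        85..=94 => PressureLevel::High,
        _ => PressureLevel::Critical,
    }
}
```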

The distributed layer. Ring all-reduce in multi_gpu.rs replaces unsafe { mem::transmute } with a ReducibleElement type-safe dispatch trait for f32/f64. The algorithm is the standard Horovod-style ring: 2(N-1) send-recv steps around the ring (N-1 for reduce-scatter, N-1 for all-gather), each moving one buffer_size/N chunk, so every rank transfers 2(N-1)/N × buffer_size in total. Partial results accumulate into the local buffer without a gather step that would blow bandwidth.
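
For readers who have not met the algorithm, here is an in-process simulation of a ring all-reduce: a reduce-scatter phase followed by an all-gather phase, N-1 steps each. It is a sketch, not the multi_gpu.rs implementation. Ranks are plain Vecs instead of GPUs, only the Sum reduction is shown, and the trait body and function names are hypothetical apart from the ReducibleElement name borrowed from this release.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a type-safe reduction trait over f32/f64.
pub trait ReducibleElement: Copy {
    fn combine(self, other: Self) -> Self;
}
impl ReducibleElement for f32 {
    fn combine(self, other: Self) -> Self { self + other }
}
impl ReducibleElement for f64 {
    fn combine(self, other: Self) -> Self { self + other }
}

/// Index range of chunk `c` when `len` elements are split into `n` chunks.
fn chunk_range(len: usize, n: usize, c: usize) -> Range<usize> {
    (c * len / n)..((c + 1) * len / n)
}

/// Simulates a sum all-reduce over `ranks` (one buffer per rank).
/// Returns the number of elements rank 0 sent, for comparison with
/// 2(N-1)/N x buffer_size.
pub fn ring_all_reduce_sum<T: ReducibleElement>(ranks: &mut [Vec<T>]) -> usize {
    let n = ranks.len();
    if n <= 1 {
        return 0;
    }
    let len = ranks[0].len();
    assert!(ranks.iter().all(|b| b.len() == len), "buffers must have equal length");
    let mut sent_by_rank0 = 0;

    // Phase 0 is reduce-scatter (receiver accumulates);
    // phase 1 is all-gather (receiver overwrites).
    for phase in 0..2 {
        for step in 0..n - 1 {
            // Snapshot what every rank sends this step, so that all sends
            // appear to happen simultaneously.
            let outgoing: Vec<(usize, Vec<T>)> = (0..n)
                .map(|r| {
                    let c = (r + phase + n - step) % n;
                    (c, ranks[r][chunk_range(len, n, c)].to_vec())
                })
                .collect();
            sent_by_rank0 += outgoing[0].1.len();
            for (r, (c, data)) in outgoing.into_iter().enumerate() {
                let dst = (r + 1) % n; // right-hand neighbour on the ring
                let range = chunk_range(len, n, c);
                for (slot, v) in ranks[dst][range].iter_mut().zip(data) {
                    *slot = if phase == 0 { (*slot).combine(v) } else { v };
                }
            }
        }
    }
    sent_by_rank0
}
```

After the first phase, rank r holds the fully reduced chunk (r+1) mod N; the second phase circulates those finished chunks until every rank holds the whole result.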

The binding surface. Nine N-API handler modules — activations, creation, ops, nn, optim, reductions, clone_detach, helpers, utils_js — complete the Node.js JavaScript binding layer in torsh-ffi. TypeScript definitions ship alongside a Jest test suite. The Python side (torsh-python) migrates to PyO3 0.28’s Bound<'_, PyModule> API and re-enables torsh-data, torsh-autograd, and torsh-distributed, with a new src/data.rs exposing PyDataset, PyDataLoader, and PyDataLoaderIter.

Getting Started

# CPU (SIMD + parallel on by default)
cargo add torsh

# Enable the GPU backend (runtime CUDA driver load — no SDK required at build time)
cargo add torsh --features cuda

use torsh::prelude::*;

fn main() -> Result<()> {
    // Phase-4 chunking kicks in automatically on large tensors (15–30% faster)
    let x = randn(&[512, 512])?;
    let y = randn(&[512, 512])?;
    let out = x.matmul(&y)?;
    println!("shape: {:?}", out.shape());
    Ok(())
}

Opt into the GPU dispatch path:

use torsh::prelude::*;
use torsh_tensor::gpu_dispatch::GpuDispatch;

fn main() -> Result<()> {
    // GpuDispatch routes unary/binary f32 ops through CudaBackend when available,
    // falling back to CpuBackend on machines without a driver.
    let dispatch = GpuDispatch::new()?;
    let x = randn_f32(&[1024])?;
    let y = dispatch.relu_f32(&x)?;
    println!("relu output on GPU (or CPU fallback): {:?}", y.shape());
    Ok(())
}

From Node.js, after building the native module:

const torsh = require('@torsh/core');
const x = torsh.randn([64, 64]);
const y = torsh.matmul(x, x);
console.log('output shape:', y.shape());

What’s New in 0.1.3

Added

CudaBackend in torsh-tensor: OxiCUDA compute backend, no CUDA SDK at build time
ptx_ops.rs: PTX-backed unary/binary/reduce dispatch through oxicuda-ptx
gpu_dispatch.rs: unified dispatch layer over dyn ComputeBackend
gpu and cuda feature flags in torsh-tensor; oxicuda-backend/driver/launch/ptx in workspace deps
Real CUDA allocators: cudaMalloc, cudaMallocManaged, cudaHostAlloc; real calculate_fragmentation_level
ReducibleElement type-safe dispatch trait (replaces unsafe { mem::transmute } in multi_gpu.rs)
Ring all-reduce: bandwidth-optimal 2(N-1)/N × buffer_size algorithm; all ReduceOp variants
14 new distributed tests (tests_p3) covering ring all-reduce correctness and edge cases
Phase-4 chunking helpers: ChunkingUtils::matrix_blocks, chunked_elementwise, chunked_sum, chunked_mean
SIMD-accelerated forward pass: simd_optimized_forward using scirs2_core::simd_ops::simd_matrix_multiply_f32
9 N-API handler modules completing the Node.js binding layer in torsh-ffi
TypeScript definitions and Jest test suite for Node.js bindings
PyO3 0.28 migration in torsh-python; PyDataset / PyDataLoader / PyDataLoaderIter in src/data.rs
NaturalCubicSpline struct and spline interpolation in torsh-series
SsaModel (Singular Spectrum Analysis): fit / forecast with power-iteration eigenvectors
MSTL (Multiple Seasonal-Trend decomposition using Loess) in torsh-series
LSTM, Transformer, and CNN-based forecasters in torsh-series::forecast::deep
Proper p-values for augmented_dickey_fuller_test, kpss_test, phillips_perron_test
DifferentialFlamegraph::compare() in torsh-autograd: real FlamegraphComparison / FrameDelta
torsh-vision: ImageRegistrar::apply_transformation, FramePreprocessor, 3D visualization utilities
Flash attention (causal and non-causal) in torsh-functional::attention
Cross-platform SIMD validation benchmark in torsh-benches

Changed

All scirs2 deps bumped 0.4.2 → 0.5.1 (19 sub-crates)
oxiarc-archive / oxiarc-core bumped 0.2.7 → 0.3.3; oxiarc-deflate / oxiarc-zstd bumped 0.2 → 0.3.3
oxifft updated to 0.3.2; oxionnx to 0.1.4; oxicode to 0.2.4
gpu feature in torsh-autograd is now empty (backward GPU dispatch deferred to OxiCUDA)
CUDA memory optimization files refactored below the 2000-line policy limit

Fixed

ring (C/asm) replaced by pure-Rust RustCrypto AEAD (aes-gcm, chacha20poly1305, pbkdf2, hmac) — COOLJAPAN Pure Rust Policy now fully enforced in the default build
2 root compilation errors that were behind 1100+ downstream errors in the optimization modules
torsh-distributed: eradicated silent fabrications in cluster state
BernoulliDistribution sampling now generates correct binary outputs from probabilities
QuadraticProgrammingLayer::backward shape mismatch fixed

Tips

Enable CUDA with a single feature flag — no SDK needed at build. Add features = ["cuda"] to your torsh dependency. The driver is loaded at runtime via libloading; the binary compiles and runs on CPU-only boxes without change.
Use chunking helpers for large matrix ops. ChunkingUtils::matrix_blocks(m, n, k, 4) gives you row-strip blocking automatically; chunked_elementwise / chunked_sum / chunked_mean tile reductions. These are the same helpers that deliver the 15–30% throughput gains and they are available in your own kernel code.
Migrate distributed ops to ring all-reduce. If you were manually gathering results across GPUs, replace with ring_all_reduce. Per-rank traffic is 2(N-1)/N × buffer_size, which stays below 2 × buffer_size however many GPUs you add, whereas the old gather approach scaled linearly with N.
Node.js bindings are now complete. torsh-ffi exposes activations, creation, elementwise ops, nn layers, optimizers, and reductions through nine N-API modules with TypeScript definitions. Build with cargo build --release --features nodejs and npm run build:js.
Python dataloader now round-trips. torsh-python’s PyDataLoader and PyDataLoaderIter wrap the real torsh-data API; torsh-autograd and torsh-distributed are re-enabled. Use import rstorch; dl = rstorch.data.DataLoader(dataset, batch_size=32).
Time-series forecasting now has deep models. torsh-series::forecast::deep ships LSTM, Transformer, and CNN forecasters alongside the new SSA and MSTL decompositions. Statistical test p-values (adf, kpss, pp) are now computed properly rather than hardcoded.
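
To illustrate the idea behind the chunking tip, here is a standalone sketch of chunked reductions and row-strip blocking in plain Rust. None of this is the ChunkingUtils API: the function names, signatures, and the reading of the last argument as a strip count are assumptions made for illustration.

```rust
use std::ops::Range;

/// Sums `data` one fixed-size chunk at a time, then adds the partial sums.
pub fn chunked_sum(data: &[f32], chunk_len: usize) -> f32 {
    assert!(chunk_len > 0, "chunk_len must be positive");
    data.chunks(chunk_len).map(|c| c.iter().sum::<f32>()).sum()
}

/// Mean built on top of the chunked sum; None for an empty slice.
pub fn chunked_mean(data: &[f32], chunk_len: usize) -> Option<f32> {
    if data.is_empty() {
        None
    } else {
        Some(chunked_sum(data, chunk_len) / data.len() as f32)
    }
}

/// Splits `m` rows into at most `strips` contiguous, non-empty row ranges.
pub fn row_strips(m: usize, strips: usize) -> Vec<Range<usize>> {
    assert!(strips > 0, "need at least one strip");
    (0..strips)
        .map(|s| (s * m / strips)..((s + 1) * m / strips))
        .filter(|r| !r.is_empty())
        .collect()
}

/// Row-major (m x k) times (k x n) matmul, computed one row strip at a time
/// so each strip of the output is finished before moving to the next.
pub fn matmul_row_strips(
    a: &[f32], b: &[f32], m: usize, k: usize, n: usize, strips: usize,
) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut out = vec![0.0; m * n];
    for rows in row_strips(m, strips) {
        for i in rows {
            for p in 0..k {
                let a_ip = a[i * k + p];
                for j in 0..n {
                    out[i * n + j] += a_ip * b[p * n + j];
                }
            }
        }
    }
    out
}
```

The blocked version computes the same product as the unblocked loop; the only change is the order in which rows are visited, which is where any cache benefit would come from.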

This is the foundation

ToRSh 0.1.3 is powered by — and in turn powers — the wider COOLJAPAN ecosystem:

OxiCUDA (oxicuda-backend, oxicuda-driver, oxicuda-launch, oxicuda-ptx) — the pure-Rust GPU compute stack that makes the new backend possible
SciRS2 0.5.1 — 19 sub-crates providing the scientific computing foundation: SIMD ops, BLAS, FFT, autograd, sparse, signal, graph, series, vision, text, and more
OxiBLAS — pure-Rust BLAS/LAPACK, reached through scirs2-core’s oxiblas-blas and oxiblas-lapack features
OxiARC 0.3.3 — pure-Rust compression/archives; powers model-hub streaming extraction
OxiCode 0.2.4 — binary serialization for model checkpoints
OxiFFT 0.3.2 — FFT used throughout the signal and series crates
OxiONNX 0.1.4 — ONNX interop
OptiRS 0.3.1 — 70+ optimizers available through torsh-optim

Repository: https://github.com/cool-japan/torsh

Star the repo if a pure-Rust, single-binary deep-learning framework with a real GPU backend — and not a line of C/asm in the default build — is something you want to see reach 1.0.

The era of mandatory C++ runtimes and CUDA SDK lockout is over. Pure Rust deep learning is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ
June 30, 2026