The GPU backend is real, the last C/asm dep is gone, and ToRSh now speaks JavaScript.
Today we released ToRSh 0.1.3, the GPU and sovereignty release. The OxiCUDA compute backend plugs in without requiring the CUDA SDK at build time, the final C/asm dependency (the ring crate) is replaced by pure-Rust RustCrypto crates, ring all-reduce arrives for multi-GPU training, and the Node.js N-API binding layer reaches completion.
ToRSh — “Tensor Operations in Rust with Sharding” — is a PyTorch-compatible deep-learning framework built entirely in pure Rust. No C. No C++. No Fortran. No Python runtime. Where PyTorch depends on libtorch/ATen, a full CUDA toolchain, and a Python interpreter just to run inference, ToRSh compiles to a single static binary you can ship to bare metal, a container, or a WASM target with nothing else installed. As of 0.1.3, that binary is also free of C/asm: the ring crate — the last non-Rust spot in the default build — has been swapped for RustCrypto’s aes-gcm, chacha20poly1305, pbkdf2, and hmac.
Why ToRSh 0.1.3 is the GPU inflection point
Every previous release deferred true GPU work: 0.1.0 built the foundation, 0.1.1 added domain crates, 0.1.2 made CPU SIMD real. 0.1.3 is where the GPU story begins in earnest:
- OxiCUDA backend, no SDK required. The new CudaBackend in torsh-tensor implements the oxicuda_backend::ComputeBackend trait by loading the CUDA driver at runtime via libloading: no CUDA SDK, no -lcudart, no build-time linkage. Enable it with features = ["cuda"]; the binary still works on CPU-only boxes.
- PTX-backed kernel dispatch. ptx_ops.rs handles unary, binary, and reduction kernels through OxiCUDA’s PTX layer. gemm, conv2d_forward, and attention return BackendError::Unsupported pending upstream kernel additions: an honest stub, not a silent lie.
- Real CUDA memory management. cudaMalloc, cudaMallocManaged (unified memory), and cudaHostAlloc (pinned host memory) are all live. The CudaMemoryManagerCoordinator has real fragmentation analysis, real pressure levels (Normal/Low/Medium/High/Critical), and real pool wiring, not stubs that silently return Default::default().
- Ring all-reduce, bandwidth-optimal. The distributed backend now ships a proper ring all-reduce with the theoretical bandwidth bound: 2(N-1)/N × buffer_size. The previous gather+broadcast approach wasted bandwidth proportional to N; the ring approach does not. All ReduceOp variants (Sum, Product, Min, Max, Average, Mean) are covered.
- 15–30% automatic throughput gain. Phase-4 chunking helpers (ChunkingUtils::matrix_blocks, chunked_elementwise, chunked_sum, chunked_mean) tile large tensor operations into cache-friendly blocks and deliver measured 15–30% throughput improvements on large tensors with no code change from users.
- COOLJAPAN Pure Rust Policy: complete. ring (C/asm AEAD) replaced by aes-gcm 0.11.0-rc.4, chacha20poly1305 0.11.0-rc.3, pbkdf2 0.13, and hmac 0.13. The default build is now 100% Rust.
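To make the chunking idea concrete, here is a minimal sketch of row-strip blocking and a chunked reduction. The function names echo the helpers named above, but the signatures and bodies are assumptions written for illustration, not ToRSh's actual ChunkingUtils API, and the sketch makes no claim about the measured throughput figure.

```rust
use std::ops::Range;

/// HYPOTHETICAL helper: split `rows` matrix rows into at most `strips`
/// contiguous row ranges whose lengths differ by at most one.
pub fn row_strips(rows: usize, strips: usize) -> Vec<Range<usize>> {
    assert!(strips > 0, "strips must be non-zero");
    let base = rows / strips;
    let extra = rows % strips;
    let mut out = Vec::with_capacity(strips);
    let mut start = 0;
    for i in 0..strips {
        // The first `extra` strips absorb the remainder, one row each.
        let len = base + usize::from(i < extra);
        if len == 0 {
            break; // more strips requested than rows available
        }
        out.push(start..start + len);
        start += len;
    }
    out
}

/// HYPOTHETICAL helper: sum a slice one fixed-size chunk at a time,
/// then combine the per-chunk partial sums.
pub fn chunked_sum(data: &[f32], chunk_len: usize) -> f32 {
    assert!(chunk_len > 0, "chunk_len must be non-zero");
    data.chunks(chunk_len)
        .map(|chunk| chunk.iter().sum::<f32>())
        .sum()
}

/// HYPOTHETICAL helper: mean built on `chunked_sum`; `None` for empty input.
pub fn chunked_mean(data: &[f32], chunk_len: usize) -> Option<f32> {
    if data.is_empty() {
        return None;
    }
    Some(chunked_sum(data, chunk_len) / data.len() as f32)
}
```

Each strip or chunk is an independent unit of work, so it can be processed while its data is still in cache or handed to a separate worker.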
Technical Deep Dive: the GPU stack
The compute abstraction. At the bottom sits oxicuda_backend::ComputeBackend — OxiCUDA’s trait for GPU dispatch. torsh-tensor’s new CudaBackend is a thin adapter over three OxiCUDA leaf crates: oxicuda-driver (driver API), oxicuda-launch (kernel launch), and oxicuda-ptx (PTX JIT). The gpu_dispatch.rs module sits above that, routing tensor operations through the trait — the same layer that lets CpuBackend satisfy the same interface for tests on machines without a GPU.
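The layering can be pictured with a small trait-object sketch. Everything below is an assumption for illustration: the trait is a cut-down stand-in rather than the real oxicuda_backend::ComputeBackend, MockGpuBackend runs on the host, and select_backend is a hypothetical helper. It shows two ideas from the text: a CPU backend satisfying the same interface as the GPU one, and an unsupported kernel reporting itself instead of returning made-up data.

```rust
#[derive(Debug, PartialEq)]
pub enum BackendError {
    /// The backend has no kernel for the named operation yet.
    Unsupported(&'static str),
}

/// Cut-down stand-in for a compute-backend trait (NOT the OxiCUDA definition).
pub trait ComputeBackend {
    fn name(&self) -> &'static str;
    fn relu(&self, input: &[f32]) -> Result<Vec<f32>, BackendError>;
    /// Row-major (m x k) times (k x n) matrix product.
    fn gemm(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize)
        -> Result<Vec<f32>, BackendError>;
}

pub struct CpuBackend;

impl ComputeBackend for CpuBackend {
    fn name(&self) -> &'static str { "cpu" }

    fn relu(&self, input: &[f32]) -> Result<Vec<f32>, BackendError> {
        Ok(input.iter().map(|&v| v.max(0.0)).collect())
    }

    fn gemm(&self, a: &[f32], b: &[f32], m: usize, k: usize, n: usize)
        -> Result<Vec<f32>, BackendError> {
        assert_eq!(a.len(), m * k);
        assert_eq!(b.len(), k * n);
        let mut c = vec![0.0f32; m * n];
        for i in 0..m {
            for p in 0..k {
                let a_ip = a[i * k + p];
                for j in 0..n {
                    c[i * n + j] += a_ip * b[p * n + j];
                }
            }
        }
        Ok(c)
    }
}

/// HYPOTHETICAL GPU stand-in: computes on the host so the sketch runs anywhere.
pub struct MockGpuBackend;

impl ComputeBackend for MockGpuBackend {
    fn name(&self) -> &'static str { "mock-gpu" }

    fn relu(&self, input: &[f32]) -> Result<Vec<f32>, BackendError> {
        Ok(input.iter().map(|&v| v.max(0.0)).collect())
    }

    fn gemm(&self, _a: &[f32], _b: &[f32], _m: usize, _k: usize, _n: usize)
        -> Result<Vec<f32>, BackendError> {
        // No kernel yet: say so rather than fabricate a result.
        Err(BackendError::Unsupported("gemm"))
    }
}

/// HYPOTHETICAL selection helper: callers only ever see `dyn ComputeBackend`.
pub fn select_backend(driver_available: bool) -> Box<dyn ComputeBackend> {
    if driver_available {
        Box::new(MockGpuBackend)
    } else {
        Box::new(CpuBackend)
    }
}
```

Because callers hold a Box<dyn ComputeBackend>, the same test code can run against either backend, and an Unsupported error is visible at the call site.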
The memory coordinator. torsh-backend/src/cuda/memory/manager.rs now boots a real CudaMemoryManagerCoordinator via OnceLock, wires allocate_from_device_pool and return_to_device_pool through cust::cuda_malloc/free, and exposes configure_predictive_allocation, get_memory_statistics, and get_performance_metrics through live paths. Machines without a GPU return Default::default() before init — no panic, no fabricated data.
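The OnceLock pattern described here can be sketched in a few lines of standard-library Rust. This is an illustration under stated assumptions, not the code in manager.rs: the Coordinator struct, the free functions, and above all the percentage thresholds for each pressure level are invented for the example. Only the level names and the "default before init" behaviour come from the text above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::OnceLock;

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub enum PressureLevel {
    #[default]
    Normal,
    Low,
    Medium,
    High,
    Critical,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq, Default)]
pub struct MemoryStatistics {
    pub used_bytes: u64,
    pub total_bytes: u64,
    pub pressure: PressureLevel,
}

/// HYPOTHETICAL coordinator state; the real type tracks far more.
struct Coordinator {
    total_bytes: u64,
    used_bytes: AtomicU64,
}

static COORDINATOR: OnceLock<Coordinator> = OnceLock::new();

/// ASSUMED thresholds, chosen only to make the example concrete.
pub fn classify_pressure(used: u64, total: u64) -> PressureLevel {
    if total == 0 {
        return PressureLevel::Normal;
    }
    match (used as u128) * 100 / (total as u128) {
        0..=49 => PressureLevel::Normal,
        50..=69 => PressureLevel::Low,
        70..=84 => PressureLevel::Medium,
        85..=94 => PressureLevel::High,
        _ => PressureLevel::Critical,
    }
}

/// Initialise once; returns false if a coordinator already exists.
pub fn init_coordinator(total_bytes: u64) -> bool {
    COORDINATOR
        .set(Coordinator { total_bytes, used_bytes: AtomicU64::new(0) })
        .is_ok()
}

/// Record an allocation; returns false when no coordinator is initialised.
pub fn record_allocation(bytes: u64) -> bool {
    match COORDINATOR.get() {
        Some(c) => {
            c.used_bytes.fetch_add(bytes, Ordering::Relaxed);
            true
        }
        None => false,
    }
}

/// Before init this returns the default value instead of panicking.
pub fn get_memory_statistics() -> MemoryStatistics {
    match COORDINATOR.get() {
        Some(c) => {
            let used = c.used_bytes.load(Ordering::Relaxed);
            MemoryStatistics {
                used_bytes: used,
                total_bytes: c.total_bytes,
                pressure: classify_pressure(used, c.total_bytes),
            }
        }
        None => MemoryStatistics::default(),
    }
}
```

The design point is that callers never need to know whether a device was found: the uninitialised path is an explicit branch that returns a zeroed statistics value.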
The distributed layer. Ring all-reduce in multi_gpu.rs replaces unsafe { mem::transmute } with a ReducibleElement type-safe dispatch trait for f32/f64. The algorithm is the standard Horovod-style ring: 2(N-1) send-recv steps around the ring (N-1 for reduce-scatter, N-1 for all-gather), each moving a 1/N slice of the buffer, so every GPU sends 2(N-1)/N × buffer_size in total and no gather step concentrates traffic on one device.
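A single-process simulation makes the ring schedule easier to follow. The sketch below is an assumption-laden illustration, not the code in multi_gpu.rs: ranks are plain Vec buffers instead of GPUs, the ReducibleElement trait shown is a minimal hypothetical version with one reduce_sum method, and only the Sum operation is covered. Each rank performs N-1 reduce-scatter sends followed by N-1 all-gather sends, each of one 1/N chunk.

```rust
use std::ops::Range;

/// HYPOTHETICAL minimal version of a type-safe reduction trait.
pub trait ReducibleElement: Copy {
    fn reduce_sum(self, other: Self) -> Self;
}
impl ReducibleElement for f32 {
    fn reduce_sum(self, other: Self) -> Self { self + other }
}
impl ReducibleElement for f64 {
    fn reduce_sum(self, other: Self) -> Self { self + other }
}

/// Element range of `chunk` when `len` elements are split across `ranks`.
fn chunk_range(len: usize, ranks: usize, chunk: usize) -> Range<usize> {
    (chunk * len / ranks)..((chunk + 1) * len / ranks)
}

/// Snapshot what every rank sends this step, so all sends happen "at once".
fn outgoing<T: ReducibleElement>(
    buffers: &[Vec<T>],
    chunk_of: impl Fn(usize) -> usize,
) -> Vec<(usize, Vec<T>)> {
    let n = buffers.len();
    let len = buffers[0].len();
    (0..n)
        .map(|rank| {
            let chunk = chunk_of(rank);
            (chunk, buffers[rank][chunk_range(len, n, chunk)].to_vec())
        })
        .collect()
}

/// Sum-reduce `buffers` in place so every rank ends with the elementwise total.
/// Returns the number of elements rank 0 sent (the same for every rank when
/// the buffer length is divisible by the rank count).
pub fn ring_all_reduce_sum<T: ReducibleElement>(buffers: &mut [Vec<T>]) -> usize {
    let n = buffers.len();
    if n < 2 {
        return 0;
    }
    let len = buffers[0].len();
    assert!(buffers.iter().all(|b| b.len() == len), "buffers must have equal length");
    let mut sent_by_rank0 = 0;

    // Phase 1, reduce-scatter: rank r sends chunk (r - step) to rank r + 1,
    // which accumulates it. Afterwards rank r owns the full sum of chunk r + 1.
    for step in 0..n - 1 {
        let messages = outgoing(buffers, |rank| (rank + n - step) % n);
        sent_by_rank0 += messages[0].1.len();
        for (rank, (chunk, data)) in messages.into_iter().enumerate() {
            let dst = &mut buffers[(rank + 1) % n];
            for (slot, value) in dst[chunk_range(len, n, chunk)].iter_mut().zip(data) {
                *slot = (*slot).reduce_sum(value);
            }
        }
    }

    // Phase 2, all-gather: rank r forwards chunk (r + 1 - step), which is
    // already fully reduced; the receiver overwrites its copy.
    for step in 0..n - 1 {
        let messages = outgoing(buffers, |rank| (rank + 1 + n - step) % n);
        sent_by_rank0 += messages[0].1.len();
        for (rank, (chunk, data)) in messages.into_iter().enumerate() {
            let dst = &mut buffers[(rank + 1) % n];
            dst[chunk_range(len, n, chunk)].copy_from_slice(&data);
        }
    }
    sent_by_rank0
}
```

With 4 ranks and 8 elements per buffer, each rank sends 2 × 3 chunks of 2 elements, which is 12 elements and matches 2(N-1)/N × buffer_size.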
The binding surface. Nine N-API handler modules — activations, creation, ops, nn, optim, reductions, clone_detach, helpers, utils_js — complete the Node.js JavaScript binding layer in torsh-ffi. TypeScript definitions ship alongside a Jest test suite. The Python side (torsh-python) migrates to PyO3 0.28’s Bound<'_, PyModule> API and re-enables torsh-data, torsh-autograd, and torsh-distributed, with a new src/data.rs exposing PyDataset, PyDataLoader, and PyDataLoaderIter.
Getting Started
# CPU (SIMD + parallel on by default)
cargo add torsh
# Enable the GPU backend (runtime CUDA driver load; no SDK required at build time)
cargo add torsh --features cuda

use torsh::prelude::*;

fn main() -> Result<()> {
    // Phase-4 chunking kicks in automatically on large tensors (15–30% faster)
    let x = randn(&[512, 512])?;
    let y = randn(&[512, 512])?;
    let out = x.matmul(&y)?;
    println!("shape: {:?}", out.shape());
    Ok(())
}
Opt into the GPU dispatch path:
use torsh::prelude::*;
use torsh_tensor::gpu_dispatch::GpuDispatch;

fn main() -> Result<()> {
    // GpuDispatch routes unary/binary f32 ops through CudaBackend when available,
    // falling back to CpuBackend on machines without a driver.
    let dispatch = GpuDispatch::new()?;
    let x = randn_f32(&[1024])?;
    let y = dispatch.relu_f32(&x)?;
    println!("relu output on GPU (or CPU fallback): {:?}", y.shape());
    Ok(())
}
From Node.js, after building the native module:
const torsh = require('@torsh/core');
const x = torsh.randn([64, 64]);
const y = torsh.matmul(x, x);
console.log('output shape:', y.shape());
What’s New in 0.1.3
Added
- CudaBackend in torsh-tensor: OxiCUDA compute backend, no CUDA SDK at build time
- ptx_ops.rs: PTX-backed unary/binary/reduce dispatch through oxicuda-ptx
- gpu_dispatch.rs: unified dispatch layer over dyn ComputeBackend
- gpu and cuda feature flags in torsh-tensor; oxicuda-backend/driver/launch/ptx in workspace deps
- Real CUDA allocators: cudaMalloc, cudaMallocManaged, cudaHostAlloc; real calculate_fragmentation_level
- ReducibleElement type-safe dispatch trait (replaces unsafe { mem::transmute } in multi_gpu.rs)
- Ring all-reduce: bandwidth-optimal 2(N-1)/N × buffer_size algorithm; all ReduceOp variants
- 14 new distributed tests (tests_p3) covering ring all-reduce correctness and edge cases
- Phase-4 chunking helpers: ChunkingUtils::matrix_blocks, chunked_elementwise, chunked_sum, chunked_mean
- SIMD-accelerated forward pass: simd_optimized_forward using scirs2_core::simd_ops::simd_matrix_multiply_f32
- 9 N-API handler modules completing the Node.js binding layer in torsh-ffi
- TypeScript definitions and Jest test suite for Node.js bindings
- PyO3 0.28 migration in torsh-python; PyDataset/PyDataLoader/PyDataLoaderIter in src/data.rs
- NaturalCubicSpline struct and spline interpolation in torsh-series
- SsaModel (Singular Spectrum Analysis): fit/forecast with power-iteration eigenvectors
- MSTL (Multiple Seasonal-Trend decomposition using Loess) in torsh-series
- LSTM, Transformer, and CNN-based forecasters in torsh-series::forecast::deep
- Proper p-values for augmented_dickey_fuller_test, kpss_test, phillips_perron_test
- DifferentialFlamegraph::compare() in torsh-autograd: real FlamegraphComparison/FrameDelta
- torsh-vision: ImageRegistrar::apply_transformation, FramePreprocessor, 3D visualization utilities
- Flash attention (causal and non-causal) in torsh-functional::attention
- Cross-platform SIMD validation benchmark in torsh-benches
Changed
- All scirs2 deps bumped 0.4.2 → 0.5.1 (19 sub-crates)
- oxiarc-archive/oxiarc-core bumped 0.2.7 → 0.3.3; oxiarc-deflate/oxiarc-zstd bumped 0.2 → 0.3.3
- oxifft updated to 0.3.2; oxionnx to 0.1.4; oxicode to 0.2.4
- gpu feature in torsh-autograd is now empty (backward GPU dispatch deferred to OxiCUDA)
- CUDA memory optimization files refactored below the 2000-line policy limit
Fixed
- ring (C/asm) replaced by pure-Rust RustCrypto AEAD (aes-gcm, chacha20poly1305, pbkdf2, hmac); COOLJAPAN Pure Rust Policy now fully enforced in the default build
- 2 root compilation errors that were blocking 1100+ downstream optimization-module errors
- torsh-distributed: eradicated silent fabrications in cluster state
- BernoulliDistribution sampling now generates correct binary outputs from probabilities
- QuadraticProgrammingLayer::backward shape mismatch fixed
Tips
- Enable CUDA with a single feature flag, no SDK needed at build. Add features = ["cuda"] to your torsh dependency. The driver is loaded at runtime via libloading; the binary compiles and runs on CPU-only boxes without change.
- Use chunking helpers for large matrix ops. ChunkingUtils::matrix_blocks(m, n, k, 4) gives you row-strip blocking automatically; chunked_elementwise/chunked_sum/chunked_mean tile elementwise ops and reductions. These are the same helpers that deliver the 15–30% throughput gains, and they are available in your own kernel code.
- Migrate distributed ops to ring all-reduce. If you were manually gathering results across GPUs, replace that with ring_all_reduce: each GPU sends 2(N-1)/N × buffer_size, which stays below 2 × buffer_size however many GPUs you add, whereas the old gather approach scaled linearly with N.
- Node.js bindings are now complete. torsh-ffi exposes activations, creation, elementwise ops, nn layers, optimizers, and reductions through nine N-API modules with TypeScript definitions. Build with cargo build --release --features nodejs and npm run build:js.
- Python dataloader now round-trips. torsh-python’s PyDataLoader and PyDataLoaderIter wrap the real torsh-data API; torsh-autograd and torsh-distributed are re-enabled. Use import rstorch; dl = rstorch.data.DataLoader(dataset, batch_size=32).
- Time-series forecasting now has deep models. torsh-series::forecast::deep ships LSTM, Transformer, and CNN forecasters alongside the new SSA and MSTL decompositions. Statistical test p-values (adf, kpss, pp) are now computed properly rather than hardcoded.
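For a feel of what the ring all-reduce tip means in numbers, the arithmetic can be written as two small functions. The ring formula is the one quoted in this post. The gather model is an assumption made for comparison: it supposes the older approach collected every buffer on one root GPU and then sent the result back to each peer, which may not match what your own gather code did.

```rust
/// Bytes each rank sends in a ring all-reduce: 2(N-1)/N x buffer_size.
pub fn ring_bytes_per_rank(ranks: u64, buffer_bytes: u64) -> u64 {
    if ranks < 2 {
        return 0;
    }
    2 * (ranks - 1) * buffer_bytes / ranks
}

/// ASSUMED model of gather + broadcast: the root receives N-1 buffers and
/// then sends N-1 copies of the result, so 2(N-1) x buffer_size crosses
/// the root's link.
pub fn gather_broadcast_root_bytes(ranks: u64, buffer_bytes: u64) -> u64 {
    if ranks < 2 {
        return 0;
    }
    2 * (ranks - 1) * buffer_bytes
}
```

With 8 GPUs and a 1 GiB buffer, the ring moves 1.75 GiB per GPU, while the modeled root would move 14 GiB through a single link.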
This is the foundation
ToRSh 0.1.3 is powered by — and in turn powers — the wider COOLJAPAN ecosystem:
- OxiCUDA (oxicuda-backend, oxicuda-driver, oxicuda-launch, oxicuda-ptx): the pure-Rust GPU compute stack that makes the new backend possible
- SciRS2 0.5.1: 19 sub-crates providing the scientific computing foundation: SIMD ops, BLAS, FFT, autograd, sparse, signal, graph, series, vision, text, and more
- OxiBLAS: pure-Rust BLAS/LAPACK, reached through scirs2-core’s oxiblas-blas and oxiblas-lapack features
- OxiARC 0.3.3: pure-Rust compression/archives; powers model-hub streaming extraction
- OxiCode 0.2.4: binary serialization for model checkpoints
- OxiFFT 0.3.2: FFT used throughout the signal and series crates
- OxiONNX 0.1.4: ONNX interop
- OptiRS 0.3.1: 70+ optimizers available through torsh-optim
Repository: https://github.com/cool-japan/torsh
Star the repo if a pure-Rust, single-binary deep-learning framework with a real GPU backend — and not a line of C/asm in the default build — is something you want to see reach 1.0.
The era of mandatory C++ runtimes and CUDA SDK lockout is over. Pure Rust deep learning is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ
June 30, 2026