ToRSh 0.1.3 Released — GPU Backend via OxiCUDA and Zero C/asm in the Build

ToRSh 0.1.3 lands the oxicuda GPU backend (no CUDA SDK at build time), eliminates the last C/asm dependency, ships a bandwidth-optimal ring all-reduce, completes the Node.js N-API layer, and delivers 15–30% throughput gains from phase-4 chunking helpers.


The GPU backend is real, the last C/asm dep is gone, and ToRSh now speaks JavaScript.

Today we released ToRSh 0.1.3, the GPU and sovereignty release. The OxiCUDA compute backend plugs in without requiring the CUDA SDK at build time; the final C/asm dependency (the ring crate) is replaced by pure-Rust RustCrypto crates; ring all-reduce arrives for multi-GPU training; and the Node.js N-API binding layer reaches completion.

ToRSh — “Tensor Operations in Rust with Sharding” — is a PyTorch-compatible deep-learning framework built entirely in pure Rust. No C. No C++. No Fortran. No Python runtime. Where PyTorch depends on libtorch/ATen, a full CUDA toolchain, and a Python interpreter just to run inference, ToRSh compiles to a single static binary you can ship to bare metal, a container, or a WASM target with nothing else installed. As of 0.1.3, that binary is also free of C/asm: the ring crate — the last non-Rust spot in the default build — has been swapped for RustCrypto’s aes-gcm, chacha20poly1305, pbkdf2, and hmac.

Why ToRSh 0.1.3 is the GPU inflection point

Every previous release deferred true GPU work: 0.1.0 built the foundation, 0.1.1 added domain crates, 0.1.2 made CPU SIMD real. 0.1.3 is where the GPU story begins in earnest:

Technical Deep Dive: the GPU stack

The compute abstraction. At the bottom sits oxicuda_backend::ComputeBackend, OxiCUDA’s trait for GPU dispatch. torsh-tensor’s new CudaBackend is a thin adapter over three OxiCUDA leaf crates: oxicuda-driver (driver API), oxicuda-launch (kernel launch), and oxicuda-ptx (PTX JIT). The gpu_dispatch.rs module sits above that and routes tensor operations through the trait. CpuBackend implements the same trait, so tests still run on machines without a GPU.
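
As a rough illustration of that layering, here is a minimal sketch of a backend trait with a CPU implementation and a dispatcher that prefers a GPU backend when one is present. Every name in it (`Backend`, `CpuOnly`, `Dispatcher`) is a hypothetical stand-in; it does not reproduce the real `oxicuda_backend::ComputeBackend` signature or ToRSh's `gpu_dispatch.rs`.

```rust
// Hypothetical sketch: these names do not mirror the real OxiCUDA or ToRSh APIs.

/// A minimal compute interface that both a CPU and a GPU backend could satisfy.
trait Backend {
    fn name(&self) -> &'static str;
    fn relu(&self, input: &[f32]) -> Vec<f32>;
}

/// Reference implementation that runs on the host.
struct CpuOnly;

impl Backend for CpuOnly {
    fn name(&self) -> &'static str {
        "cpu"
    }

    fn relu(&self, input: &[f32]) -> Vec<f32> {
        input.iter().map(|v| v.max(0.0)).collect()
    }
}

/// Routes operations to a GPU backend when one was found, otherwise to the CPU.
struct Dispatcher {
    gpu: Option<Box<dyn Backend>>,
    cpu: CpuOnly,
}

impl Dispatcher {
    /// `gpu` is `None` on machines without a usable driver.
    fn new(gpu: Option<Box<dyn Backend>>) -> Self {
        Dispatcher { gpu, cpu: CpuOnly }
    }

    /// The backend that will actually execute operations.
    fn active(&self) -> &dyn Backend {
        match &self.gpu {
            Some(backend) => &**backend,
            None => &self.cpu,
        }
    }

    fn relu(&self, input: &[f32]) -> Vec<f32> {
        self.active().relu(input)
    }
}
```

Because callers only ever see the trait, the same test suite can run against `Dispatcher::new(None)` on a laptop and against a GPU-backed dispatcher on a training node.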

The memory coordinator. torsh-backend/src/cuda/memory/manager.rs now boots a real CudaMemoryManagerCoordinator via OnceLock, wires allocate_from_device_pool and return_to_device_pool through cust::cuda_malloc/free, and exposes configure_predictive_allocation, get_memory_statistics, and get_performance_metrics through live paths. Machines without a GPU return Default::default() before init — no panic, no fabricated data.
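
The "return defaults before init" behaviour can be sketched with `std::sync::OnceLock` alone. The types and functions below (`MemoryStats`, `Coordinator`, `init_coordinator`, `memory_statistics`) are illustrative assumptions, not the actual contents of `manager.rs`.

```rust
use std::sync::OnceLock;

// Hypothetical sketch: these types and functions are illustrative names,
// not the ToRSh memory-manager API.

#[derive(Debug, Default, Clone, PartialEq)]
struct MemoryStats {
    allocated_bytes: u64,
    live_allocations: u64,
}

struct Coordinator {
    stats: MemoryStats,
}

/// Process-wide slot that can be filled at most once.
static COORDINATOR: OnceLock<Coordinator> = OnceLock::new();

/// Initialise the coordinator; returns `false` if it was already set.
fn init_coordinator(stats: MemoryStats) -> bool {
    COORDINATOR.set(Coordinator { stats }).is_ok()
}

/// Before initialisation (for example, when no GPU is present) this returns
/// the zeroed default instead of panicking or inventing numbers.
fn memory_statistics() -> MemoryStats {
    COORDINATOR
        .get()
        .map(|c| c.stats.clone())
        .unwrap_or_default()
}
```

The design choice worth noting is that callers never need to check whether a GPU exists: they always get a well-formed `MemoryStats`, and a zeroed one simply means nothing has been initialised.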

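The distributed layer. Ring all-reduce in multi_gpu.rs replaces unsafe { mem::transmute } with ReducibleElement, a type-safe dispatch trait for f32/f64. The algorithm is the standard Horovod-style ring: 2(N-1) send-recv steps around the ring (N-1 for reduce-scatter, then N-1 for all-gather), each moving 1/N of the buffer, so every rank sends roughly 2(N-1)/N times the buffer size in total. Partial sums accumulate into the local buffer, with no gather to a single root that would blow bandwidth.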
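
To make the ring schedule concrete, here is an in-process simulation of a textbook ring all-reduce, with each "rank" represented by a `Vec` rather than a GPU. It is a sketch under stated assumptions: the trait body, `chunk_range`, and `ring_all_reduce` are written for this post and are not taken from `multi_gpu.rs`; only the trait name `ReducibleElement` comes from the release.

```rust
use std::ops::Range;

// Hypothetical sketch: the trait body and functions are illustrative,
// not the code in ToRSh's multi_gpu.rs.

/// Element types that can be summed across ranks without any transmute.
trait ReducibleElement: Copy {
    fn reduce_sum(self, other: Self) -> Self;
}

impl ReducibleElement for f32 {
    fn reduce_sum(self, other: Self) -> Self {
        self + other
    }
}

impl ReducibleElement for f64 {
    fn reduce_sum(self, other: Self) -> Self {
        self + other
    }
}

/// Index range of chunk `c` when a buffer of `len` elements is split `n` ways.
fn chunk_range(len: usize, n: usize, c: usize) -> Range<usize> {
    (c * len / n)..((c + 1) * len / n)
}

/// Simulated ring all-reduce: `buffers[r]` is rank r's local buffer.
/// Returns the number of send-recv steps performed, which is 2 * (n - 1).
fn ring_all_reduce<T: ReducibleElement>(buffers: &mut [Vec<T>]) -> usize {
    let n = buffers.len();
    if n < 2 {
        return 0;
    }
    let len = buffers[0].len();
    assert!(buffers.iter().all(|b| b.len() == len), "buffer lengths differ");

    let mut steps = 0;
    // phase 0 = reduce-scatter (accumulate), phase 1 = all-gather (overwrite).
    for phase in 0..2 {
        for s in 0..n - 1 {
            // Snapshot what every rank sends, so all sends in a step are
            // effectively simultaneous. Rank r sends chunk (r + phase - s) mod n.
            let outgoing: Vec<(usize, Vec<T>)> = (0..n)
                .map(|r| {
                    let c = (r + phase + n - s) % n;
                    (c, buffers[r][chunk_range(len, n, c)].to_vec())
                })
                .collect();

            // Each rank r delivers its chunk to its right-hand neighbour.
            for (r, (c, data)) in outgoing.into_iter().enumerate() {
                let dst = &mut buffers[(r + 1) % n][chunk_range(len, n, c)];
                for (slot, v) in dst.iter_mut().zip(data) {
                    *slot = if phase == 0 { (*slot).reduce_sum(v) } else { v };
                }
            }
            steps += 1;
        }
    }
    steps
}
```

With three ranks the function performs four steps, and every buffer ends up holding the element-wise sum. In each step a rank touches only one chunk of roughly `len / n` elements, which is where the 2(N-1)/N bandwidth figure comes from.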

The binding surface. Nine N-API handler modules (activations, creation, ops, nn, optim, reductions, clone_detach, helpers, utils_js) complete the Node.js binding layer in torsh-ffi. TypeScript definitions ship alongside a Jest test suite. The Python side (torsh-python) migrates to PyO3 0.28’s Bound<'_, PyModule> API and re-enables torsh-data, torsh-autograd, and torsh-distributed, with a new src/data.rs exposing PyDataset, PyDataLoader, and PyDataLoaderIter.

Getting Started

# CPU (SIMD + parallel on by default)
cargo add torsh

# Enable the GPU backend (runtime CUDA driver load — no SDK required at build time)
cargo add torsh --features cuda

use torsh::prelude::*;

fn main() -> Result<()> {
    // Phase-4 chunking kicks in automatically on large tensors (15–30% faster)
    let x = randn(&[512, 512])?;
    let y = randn(&[512, 512])?;
    let out = x.matmul(&y)?;
    println!("shape: {:?}", out.shape());
    Ok(())
}

Opt into the GPU dispatch path:

use torsh::prelude::*;
use torsh_tensor::gpu_dispatch::GpuDispatch;

fn main() -> Result<()> {
    // GpuDispatch routes unary/binary f32 ops through CudaBackend when available,
    // falling back to CpuBackend on machines without a driver.
    let dispatch = GpuDispatch::new()?;
    let x = randn_f32(&[1024])?;
    let y = dispatch.relu_f32(&x)?;
    println!("relu output on GPU (or CPU fallback): {:?}", y.shape());
    Ok(())
}

From Node.js, after building the native module:

const torsh = require('@torsh/core');
const x = torsh.randn([64, 64]);
const y = torsh.matmul(x, x);
console.log('output shape:', y.shape());

What’s New in 0.1.3

Added

Changed

Fixed

Tips

This is the foundation

ToRSh 0.1.3 is powered by — and in turn powers — the wider COOLJAPAN ecosystem:

Repository: https://github.com/cool-japan/torsh

Star the repo if a pure-Rust, single-binary deep-learning framework with a real GPU backend — and not a line of C/asm in the default build — is something you want to see reach 1.0.

The era of mandatory C++ runtimes and CUDA SDK lockout is over. Pure Rust deep learning is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ
June 30, 2026
