ToRSh 0.1.2 Released — Real AVX2/NEON SIMD and a Zero-Copy Tensor Memory Pool

ToRSh is a pure-Rust, PyTorch-compatible deep-learning framework with native tensor sharding. 0.1.2 lands real AVX2/NEON SIMD for f32 ops and activations, a true zero-copy buffer pool (100% heap-block reduction on hot loops), and SIMD + parallel enabled by default.

The fake SIMD is gone. The real AVX2/NEON arithmetic is here — and the tensor memory pool finally stops copying.

Today we released ToRSh 0.1.2 — the SIMD performance release, where the placeholder fast paths from earlier versions are replaced with genuine AVX2/NEON-accelerated f32 arithmetic and activations, and the memory pool becomes a true zero-copy buffer-reuse pool.

ToRSh — “Tensor Operations in Rust with Sharding” — is a PyTorch-compatible deep-learning framework built entirely in pure Rust. No C. No C++. No Fortran. No Python runtime. Where PyTorch leans on libtorch/ATen (a vast C++ codebase), the CUDA toolchain, and a Python interpreter just to add two tensors, ToRSh compiles to a single static binary you can drop onto a server with nothing else installed. And in 0.1.2 even the SIMD is pure Rust: there are no hand-written intrinsics buried in a C shim and no MKL dependency. The vectorized f32 kernels dispatch through SciRS2’s SimdUnifiedOps, which selects AVX2 on x86-64, NEON on ARM, and a scalar fallback elsewhere — all from safe, portable Rust.

Why 0.1.2 is a performance leap

This release is honest about a problem. In 0.1.1, the simd feature was, in part, a placeholder: behind the feature gate sat a par_iter branch dressed up as SIMD, doing scalar work across threads rather than processing several f32 values per instruction. Worse, the GlobalMemoryPool had a quiet bug: every “pool hit” copied the pooled data into a fresh Vec, so the pool saved no allocations at all. It looked like a pool. It did not behave like one.
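To make the pool bug concrete, here is a minimal sketch of the difference between a hit that copies and a hit that reuses. NaivePool and both method names are hypothetical and written only for this illustration; this is not ToRSh's GlobalMemoryPool code.

```rust
/// Hypothetical toy pool for illustration only (not ToRSh's GlobalMemoryPool).
pub struct NaivePool {
    free: Vec<Vec<f32>>,
}

impl NaivePool {
    pub fn new() -> Self {
        Self { free: Vec::new() }
    }

    /// Hand a buffer back to the pool for later reuse.
    pub fn put(&mut self, buf: Vec<f32>) {
        self.free.push(buf);
    }

    /// The behaviour described above: a "hit" clones the pooled data into a
    /// fresh Vec, so a new heap block is allocated anyway.
    pub fn take_copying(&mut self, len: usize) -> Vec<f32> {
        match self.free.iter().find(|b| b.len() >= len) {
            Some(b) => b[..len].to_vec(), // fresh allocation plus a copy
            None => vec![0.0; len],
        }
    }

    /// A real hit: move the pooled Vec out and reuse its heap block.
    pub fn take_reusing(&mut self, len: usize) -> Vec<f32> {
        match self.free.iter().position(|b| b.capacity() >= len) {
            Some(i) => {
                let mut b = self.free.swap_remove(i);
                b.resize(len, 0.0); // stays within capacity, so no reallocation
                b
            }
            None => vec![0.0; len],
        }
    }
}
```

Comparing as_ptr() before and after shows the difference: take_copying returns a buffer at a new address, while take_reusing returns the very block that was put into the pool.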

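0.1.2 makes both of these real: the simd feature now runs genuine AVX2/NEON kernels, and a pool hit hands back the pooled buffer itself instead of a copy.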

Technical Deep Dive: the new fast path

The SIMD layer. At the bottom sits scirs2_core::simd_ops::SimdUnifiedOps, the unified SIMD abstraction from SciRS2, which picks AVX2, NEON, or a scalar fallback at runtime. On top of it, ToRSh’s new simd_ops_f32 module provides zero-allocation helpers: add_into_f32, sub_into_f32, mul_into_f32, and div_into_f32 write into a caller-supplied output buffer; add_assign_f32, sub_assign_f32, mul_assign_f32, and div_assign_f32 update their operand in place; and relu_assign_f32, leaky_relu_assign_f32, and clamp_assign_f32 cover the activations. Op selection costs one branch per call rather than one per element: a BinaryF32Op enum with dispatch_into and dispatch_inplace chooses the kernel up front, so there is no per-element match. The tensor layer wires this together with a simple rule: any f32 tensor op on 1024 elements or more auto-selects the SIMD path, while smaller tensors stay scalar, where the dispatch overhead would not pay off. The default features are now ["std", "simd", "parallel"], and the simd feature also enables scirs2-core/simd, so the underlying SciRS2 intrinsics activate automatically.
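The dispatch shape described above can be sketched in plain Rust. Everything here is an assumption made for illustration: ToyBinaryOp, LoopKind, and apply_into are hypothetical names, and the "fast" loop is an 8-wide chunked loop that the compiler may auto-vectorize, not the SciRS2 kernels ToRSh actually calls. Only the 1024-element threshold comes from the post.

```rust
/// Hypothetical stand-in for the op enum named above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ToyBinaryOp { Add, Sub, Mul, Div }

/// Which loop was taken; returned only so the choice is observable.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LoopKind { Scalar, Chunked }

/// Element count at which the post says the SIMD path is selected.
pub const SIMD_THRESHOLD: usize = 1024;

fn scalar_loop(a: &[f32], b: &[f32], out: &mut [f32], f: impl Fn(f32, f32) -> f32) {
    for i in 0..out.len() {
        out[i] = f(a[i], b[i]);
    }
}

fn chunked_loop(a: &[f32], b: &[f32], out: &mut [f32], f: impl Fn(f32, f32) -> f32) {
    let n = out.len();
    let main = n - n % 8; // 8 f32 values fill one 256-bit register
    let lanes = out[..main]
        .chunks_exact_mut(8)
        .zip(a[..main].chunks_exact(8))
        .zip(b[..main].chunks_exact(8));
    for ((o, x), y) in lanes {
        for i in 0..8 {
            o[i] = f(x[i], y[i]);
        }
    }
    // Leftover elements that do not fill a whole chunk.
    for i in main..n {
        out[i] = f(a[i], b[i]);
    }
}

fn run(kind: LoopKind, a: &[f32], b: &[f32], out: &mut [f32], f: impl Fn(f32, f32) -> f32) {
    match kind {
        LoopKind::Chunked => chunked_loop(a, b, out, f),
        LoopKind::Scalar => scalar_loop(a, b, out, f),
    }
}

/// Writes `a op b` into `out`. The match on `op` runs once per call; each arm
/// gets its own monomorphized loop, so no per-element match remains.
pub fn apply_into(op: ToyBinaryOp, a: &[f32], b: &[f32], out: &mut [f32]) -> LoopKind {
    assert!(a.len() == out.len() && b.len() == out.len(), "length mismatch");
    let kind = if out.len() >= SIMD_THRESHOLD { LoopKind::Chunked } else { LoopKind::Scalar };
    match op {
        ToyBinaryOp::Add => run(kind, a, b, out, |x, y| x + y),
        ToyBinaryOp::Sub => run(kind, a, b, out, |x, y| x - y),
        ToyBinaryOp::Mul => run(kind, a, b, out, |x, y| x * y),
        ToyBinaryOp::Div => run(kind, a, b, out, |x, y| x / y),
    }
    kind
}
```

Writing into a caller-supplied out slice is what keeps the helper allocation-free: the caller decides where the result lives, and the kernel never touches the heap.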

The memory subsystem. The headline fix lives in GlobalMemoryPool. The new acquire_uninit<T> method (and its free-function form global_acquire_uninit<T>) returns a ReusedBuffer<T> — a real RAII handle over the pooled allocation. On a pool hit it is genuinely zero-copy: you get the buffer the pool was holding, not a clone of its contents. ReusedBuffer<T> exposes as_uninit_slice_mut to write into the raw storage, into_vec to take ownership, and release_to_pool to return it explicitly — and if you do nothing, it auto-returns to the pool on drop. The old GlobalMemoryPool::allocate<T> is now deprecated (still present for compatibility); new code should move to global_acquire_uninit.
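The handle lifecycle described above (acquire, write into uninitialized storage, return on drop) can be sketched with the standard library alone. ToyPool and ToyBuffer are hypothetical types for illustration; they mirror the method names from the post but make no claim about how ReusedBuffer<T> is implemented.

```rust
use std::mem::MaybeUninit;
use std::sync::{Arc, Mutex};

type Block = Vec<MaybeUninit<f32>>;

/// Hypothetical toy pool; a sketch of the pattern, not ToRSh's implementation.
#[derive(Clone, Default)]
pub struct ToyPool {
    free: Arc<Mutex<Vec<Block>>>,
}

/// RAII handle: owns one block and hands it back to the pool when dropped.
pub struct ToyBuffer {
    block: Option<Block>,
    home: Arc<Mutex<Vec<Block>>>,
}

impl ToyPool {
    /// On a hit, the pooled block is moved out and reused; nothing is copied.
    pub fn acquire_uninit(&self, len: usize) -> ToyBuffer {
        let mut free = self.free.lock().unwrap();
        let hit = free.iter().position(|b| b.capacity() >= len);
        let mut block = match hit {
            Some(i) => free.swap_remove(i),
            None => Vec::with_capacity(len),
        };
        // Adjust the length without initializing the contents.
        block.resize_with(len, MaybeUninit::uninit);
        ToyBuffer { block: Some(block), home: Arc::clone(&self.free) }
    }

    /// Number of blocks currently waiting in the pool.
    pub fn idle_blocks(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

impl ToyBuffer {
    /// Raw storage for the caller to fill before reading it.
    pub fn as_uninit_slice_mut(&mut self) -> &mut [MaybeUninit<f32>] {
        self.block.as_mut().expect("block is present until drop").as_mut_slice()
    }
}

impl Drop for ToyBuffer {
    fn drop(&mut self) {
        if let Some(block) = self.block.take() {
            self.home.lock().unwrap().push(block);
        }
    }
}
```

Because the return happens in Drop, a scratch buffer acquired inside a loop body goes back to the pool at the end of every iteration, and the next iteration's acquire is a hit on the same block.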

The model-hub I/O path. Downloading a large pretrained model used to mean materializing the whole archive in a Vec<u8>. 0.1.2 replaces that with TarStreamReader, which streams .tar.gz extraction through OxiARC (oxiarc-* 0.2.7, the pure-Rust compression/archive stack). Memory stays constant at roughly the size of a single TAR record header, regardless of how big the model is.
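The constant-memory idea can be illustrated with a small streaming reader over an uncompressed tar stream. This is a sketch under stated assumptions: extract_streaming is a hypothetical function, it uses only the standard library, it skips gzip decoding and checksum validation, and it says nothing about the actual TarStreamReader or OxiARC APIs. The header offsets follow the ustar layout (name at bytes 0..100, octal size at 124..136, type flag at 156).

```rust
use std::io::{self, Read};

const BLOCK: usize = 512;

/// Parse a tar size field: optional leading spaces, then octal digits.
fn parse_octal(field: &[u8]) -> u64 {
    field
        .iter()
        .skip_while(|b| **b == b' ')
        .take_while(|b| (b'0'..=b'7').contains(*b))
        .fold(0u64, |acc, b| acc * 8 + u64::from(*b - b'0'))
}

/// Walk a tar stream and pass each regular file's data to `on_chunk` piece by
/// piece. Only one 512-byte header and one fixed-size chunk are held at a time.
/// Returns the total number of file bytes seen.
pub fn extract_streaming<R: Read>(
    mut src: R,
    mut on_chunk: impl FnMut(&str, &[u8]),
) -> io::Result<u64> {
    let mut header = [0u8; BLOCK];
    let mut chunk = [0u8; 8192];
    let mut total = 0u64;
    loop {
        match src.read_exact(&mut header) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        if header.iter().all(|&b| b == 0) {
            break; // an all-zero block marks the end of the archive
        }
        let name_end = header[..100].iter().position(|&b| b == 0).unwrap_or(100);
        let name = String::from_utf8_lossy(&header[..name_end]).into_owned();
        let size = parse_octal(&header[124..136]);
        let is_file = header[156] == b'0' || header[156] == 0;

        let mut left = size;
        while left > 0 {
            let want = left.min(chunk.len() as u64) as usize;
            src.read_exact(&mut chunk[..want])?;
            if is_file {
                on_chunk(&name, &chunk[..want]);
            }
            left -= want as u64;
        }
        // Entry data is padded up to the next 512-byte boundary.
        let pad = ((BLOCK as u64 - size % BLOCK as u64) % BLOCK as u64) as usize;
        src.read_exact(&mut chunk[..pad])?;
        if is_file {
            total += size;
        }
    }
    Ok(total)
}
```

The callback receives slices of the reused chunk buffer, so peak memory depends on the buffer sizes chosen here and not on the size of the archive or of any entry in it.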

The validation harness. None of the above is taken on faith. benches/alloc_tracking.rs uses dhat to count heap blocks and demonstrate the 10,000-to-0 heap-block reduction on hot loops; benches/regression_baselines.rs provides criterion baselines for the hot kernels; and scripts/check_perf_regression.sh is the CI gate that fails a build when those baselines slip. ToRSh ships with more than 9,600 tests across the workspace, and 0.1.2 keeps that suite green.

Getting Started

# SIMD and parallel are now ON by default — no extra feature flags needed.
cargo add torsh

use torsh::prelude::*;

fn main() -> Result<()> {
    // Tensors with >= 1024 elements automatically take the real AVX2/NEON path.
    let x = randn(&[64, 64])?; // 4096 f32 elements
    let y = randn(&[64, 64])?;

    // Elementwise add and multiply dispatch to vectorized f32 kernels.
    let sum = x.add(&y)?;
    let prod = sum.mul(&x)?;

    // In-place, zero-allocation SIMD activation on the fast path.
    let mut activated = prod;
    activated.relu_()?;

    println!("output shape: {:?}", activated.shape());
    Ok(())
}

If you are managing scratch buffers in a hot loop, you can pull a genuinely pooled, zero-copy allocation directly:

use torsh::prelude::*;

fn main() -> Result<()> {
    // On a pool hit this reuses the existing allocation (no copy).
    // The ReusedBuffer returns itself to the pool when it drops.
    let mut buf = global_acquire_uninit::<f32>(4096)?;
    let slice = buf.as_uninit_slice_mut();
    // ... fill `slice`, run your kernel ...
    Ok(())
}

What’s New in 0.1.2

Added

Changed

Fixed

Tips

ToRSh 0.1.2 keeps the foundation from 0.1.0 and 0.1.1: roughly 400 PyTorch-compatible tensor ops, reverse-mode autograd, a full nn layer set, 70+ optimizers (including OptiRS), a parallel DataLoader, a Cranelift JIT, INT8 quantization, a model hub, distributed training (DDP/FSDP), CPU/CUDA/Metal backends, and deep SciRS2 scientific integration — GNNs via torsh-graph, time-series via torsh-series, computer vision via torsh-vision, plus sparse and special-function support — across a 33-crate workspace.

Part of the COOLJAPAN ecosystem

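ToRSh is built on the COOLJAPAN pure-Rust stack, including SciRS2 for SIMD and scientific computing, OptiRS for optimizers, and OxiARC for compression and archives.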

ToRSh follows the COOLJAPAN Pure Rust Policy: all compression and decompression goes through OxiARC. There is no C-backed zip, zstd, or flate2 in the default build — the same sovereignty that lets ToRSh skip libtorch, CUDA, and the Python interpreter.

Repository: https://github.com/cool-japan/torsh

Star the repo if a pure-Rust, single-binary alternative to PyTorch — with real SIMD and a real memory pool — is something you want to see grow.

Pure Rust deep learning just got faster — and it is still 100% sovereign.

KitaSan at COOLJAPAN OÜ April 27, 2026
