ToRSh 0.1.2 Released — Real AVX2/NEON SIMD and a Zero-Copy Tensor Memory Pool

The fake SIMD is gone. The real AVX2/NEON arithmetic is here — and the tensor memory pool finally stops copying.

Today we released ToRSh 0.1.2 — the SIMD performance release, where the placeholder fast paths from earlier versions are replaced with genuine AVX2/NEON-accelerated f32 arithmetic and activations, and the memory pool becomes a true zero-copy buffer-reuse pool.

ToRSh — “Tensor Operations in Rust with Sharding” — is a PyTorch-compatible deep-learning framework built entirely in pure Rust. No C. No C++. No Fortran. No Python runtime. Where PyTorch leans on libtorch/ATen (a vast C++ codebase), the CUDA toolchain, and a Python interpreter just to add two tensors, ToRSh compiles to a single static binary you can drop onto a server with nothing else installed. And in 0.1.2 even the SIMD is pure Rust: there are no hand-written intrinsics buried in a C shim and no MKL dependency. The vectorized f32 kernels dispatch through SciRS2’s SimdUnifiedOps, which selects AVX2 on x86-64, NEON on ARM, and a scalar fallback elsewhere — all from safe, portable Rust.

Why 0.1.2 is a performance leap

This release is honest about a problem. In 0.1.1, the simd feature was, in part, a placeholder: behind the feature gate sat a par_iter branch dressed up as SIMD, doing scalar work across threads rather than vectorizing across SIMD lanes. Worse, the GlobalMemoryPool had a quiet bug: every “pool hit” copied the pooled data into a fresh Vec, so the pool was not actually saving any allocations. It looked like a pool. It did not behave like one.

0.1.2 makes both of these real:

Real AVX2/NEON f32 dispatch. Tensor::add / sub / mul / div for f32 tensors of 1024 elements or more now route into a new simd_ops_f32 module that performs genuine vectorized arithmetic — AVX2 on x86-64, NEON on ARM — instead of the old fake par_iter branch.
Real SIMD in-place activations. Tensor::relu_ / leaky_relu_ / clamp_ for f32 tensors of 1024+ elements now run as zero-allocation, in-place SIMD kernels with PyTorch-compatible NaN passthrough.
True zero-copy pooled buffers. The new ReusedBuffer<T> RAII type hands you the actual pooled allocation on a pool hit — no copy, no fresh Vec.
Proven, not promised. A new dhat allocation-tracking benchmark shows the pool taking a hot loop from 10,000 heap allocation blocks down to 0 — a 100% reduction — when pooling is enabled.
Performance on by default. default = ["std", "simd", "parallel"] — SIMD and parallelism now ship enabled. You no longer have to remember to opt in.
Streaming model I/O. torsh-hub now streams .tar.gz extraction in constant memory, roughly one 512-byte TAR block, instead of buffering the whole archive. That is a real win for large pretrained models.
Regressions get caught. A criterion-based regression suite plus a CI threshold script guard against future performance drift.
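
The NaN passthrough noted for the in-place activations is easy to get wrong in Rust, because f32::max returns the non-NaN operand. What follows is a minimal scalar sketch of the semantics only, not ToRSh's SIMD kernel; both function names are hypothetical and exist purely for illustration.

```rust
/// In-place ReLU that leaves NaN untouched, which is the behavior the
/// release notes describe as PyTorch-compatible. Hypothetical helper.
pub fn relu_inplace_nan_passthrough(data: &mut [f32]) {
    for v in data.iter_mut() {
        // `NaN < 0.0` is false, so NaN falls through unchanged.
        if *v < 0.0 {
            *v = 0.0;
        }
    }
}

/// The tempting one-liner. `f32::max` ignores a NaN operand, so a NaN
/// input silently becomes 0.0 here.
pub fn relu_inplace_max(data: &mut [f32]) {
    for v in data.iter_mut() {
        *v = v.max(0.0);
    }
}
```

The comparison form keeps NaN visible downstream, which matters when you are trying to find where a training run first diverged.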

Technical Deep Dive: the new fast path

The SIMD layer. At the bottom sits scirs2_core::simd_ops::SimdUnifiedOps, the unified SIMD abstraction from SciRS2 that picks AVX2, NEON, or a scalar fallback at runtime. On top of it, ToRSh’s new simd_ops_f32 module provides zero-allocation helpers: add_into_f32, sub_into_f32, mul_into_f32, and div_into_f32 write into an output buffer; add_assign_f32, sub_assign_f32, mul_assign_f32, and div_assign_f32 perform true in-place updates; and relu_assign_f32, leaky_relu_assign_f32, and clamp_assign_f32 cover the activations. Op selection stays out of the inner loop: a BinaryF32Op enum with dispatch_into and dispatch_inplace chooses the kernel once per call, without a per-element match. The tensor layer wires this together with a simple rule: any f32 tensor op on 1024 elements or more auto-selects the SIMD path, while smaller tensors stay scalar, where the dispatch overhead would not pay off. Note the default-features change here: with ["std", "simd", "parallel"], the simd feature also enables scirs2-core/simd, so the underlying SciRS2 intrinsics activate automatically.
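
To make the dispatch rule concrete, here is a small std-only sketch of the idea: match on the op once per call, then pick a loop based on the 1024-element threshold. It is an illustration under assumptions, not ToRSh's code. BinaryOp, Path, SIMD_THRESHOLD, and the loop helpers are hypothetical names, and the "wide" loop only uses fixed-size blocks that the compiler may auto-vectorize; it does not call SciRS2 or any explicit intrinsics.

```rust
/// Element count at which the release notes say f32 ops switch paths.
pub const SIMD_THRESHOLD: usize = 1024;

/// Hypothetical stand-in for the BinaryF32Op enum described above.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

/// Which loop a call used; returned only to make the sketch observable.
#[derive(Debug, PartialEq)]
pub enum Path {
    Scalar,
    Wide,
}

fn scalar_loop(a: &[f32], b: &[f32], out: &mut [f32], f: impl Fn(f32, f32) -> f32) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = f(*x, *y);
    }
}

/// Fixed-width blocks first, then a scalar tail. Not explicit SIMD.
fn wide_loop(a: &[f32], b: &[f32], out: &mut [f32], f: impl Fn(f32, f32) -> f32) {
    const LANES: usize = 8;
    let body = a.len() / LANES * LANES;
    let blocks = out[..body]
        .chunks_exact_mut(LANES)
        .zip(a[..body].chunks_exact(LANES))
        .zip(b[..body].chunks_exact(LANES));
    for ((o, x), y) in blocks {
        for i in 0..LANES {
            o[i] = f(x[i], y[i]);
        }
    }
    scalar_loop(&a[body..], &b[body..], &mut out[body..], &f);
}

fn run(wide: bool, a: &[f32], b: &[f32], out: &mut [f32], f: impl Fn(f32, f32) -> f32) {
    if wide {
        wide_loop(a, b, out, f)
    } else {
        scalar_loop(a, b, out, f)
    }
}

/// Write `a (op) b` into `out`. The op is matched once per call, so each
/// arm gets its own monomorphized loop with no per-element branch on `op`.
pub fn dispatch_into(op: BinaryOp, a: &[f32], b: &[f32], out: &mut [f32]) -> Path {
    assert!(a.len() == b.len() && a.len() == out.len(), "length mismatch");
    let wide = a.len() >= SIMD_THRESHOLD;
    match op {
        BinaryOp::Add => run(wide, a, b, out, |x, y| x + y),
        BinaryOp::Sub => run(wide, a, b, out, |x, y| x - y),
        BinaryOp::Mul => run(wide, a, b, out, |x, y| x * y),
        BinaryOp::Div => run(wide, a, b, out, |x, y| x / y),
    }
    if wide { Path::Wide } else { Path::Scalar }
}
```

Passing a closure per match arm, rather than a function pointer chosen up front, is what lets the compiler specialize the inner loop for each operation.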

The memory subsystem. The headline fix lives in GlobalMemoryPool. The new acquire_uninit<T> method (and its free-function form global_acquire_uninit<T>) returns a ReusedBuffer<T> — a real RAII handle over the pooled allocation. On a pool hit it is genuinely zero-copy: you get the buffer the pool was holding, not a clone of its contents. ReusedBuffer<T> exposes as_uninit_slice_mut to write into the raw storage, into_vec to take ownership, and release_to_pool to return it explicitly — and if you do nothing, it auto-returns to the pool on drop. The old GlobalMemoryPool::allocate<T> is now deprecated (still present for compatibility); new code should move to global_acquire_uninit.
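
The difference between copying on a pool hit and handing back the pooled allocation is easiest to see in a toy. The sketch below is a single-threaded, std-only illustration of an RAII handle that returns its buffer on drop. ToyPool, Reused, RawBuf, and idle_buffers are hypothetical names, and nothing here reflects how GlobalMemoryPool is actually implemented (which would at least need to be thread-safe).

```rust
use std::cell::RefCell;
use std::mem::MaybeUninit;
use std::rc::Rc;

pub type RawBuf = Vec<MaybeUninit<f32>>;

/// Toy single-threaded pool of idle buffers (illustration only).
#[derive(Default)]
pub struct ToyPool {
    free: RefCell<Vec<RawBuf>>,
}

/// RAII handle: owns one buffer and gives it back to the pool on drop.
pub struct Reused {
    buf: Option<RawBuf>,
    pool: Rc<ToyPool>,
}

impl ToyPool {
    pub fn acquire_uninit(pool: &Rc<ToyPool>, len: usize) -> Reused {
        let mut free = pool.free.borrow_mut();
        // Pool hit: move the stored Vec out. Its heap allocation is
        // reused as-is; no contents are copied anywhere.
        let hit = free.iter().position(|v| v.capacity() >= len);
        let mut buf = match hit {
            Some(i) => free.swap_remove(i),
            None => Vec::with_capacity(len),
        };
        // Capacity already covers `len`, so this never reallocates.
        buf.resize_with(len, MaybeUninit::uninit);
        Reused { buf: Some(buf), pool: Rc::clone(pool) }
    }

    pub fn idle_buffers(&self) -> usize {
        self.free.borrow().len()
    }
}

impl Reused {
    pub fn as_uninit_slice_mut(&mut self) -> &mut [MaybeUninit<f32>] {
        self.buf.as_mut().expect("buffer already taken").as_mut_slice()
    }

    /// Take ownership; the buffer will not go back to the pool.
    pub fn into_vec(mut self) -> RawBuf {
        self.buf.take().expect("buffer already taken")
    }
}

impl Drop for Reused {
    fn drop(&mut self) {
        if let Some(buf) = self.buf.take() {
            self.pool.free.borrow_mut().push(buf);
        }
    }
}
```

Acquiring, dropping, and acquiring again yields the same data pointer, which is the property a copy-on-hit pool cannot offer.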

The model-hub I/O path. Downloading a large pretrained model used to mean materializing the whole archive in a Vec<u8>. 0.1.2 replaces that with TarStreamReader, which streams .tar.gz extraction through OxiARC (oxiarc-* 0.2.7, the pure-Rust compression/archive stack). Memory stays constant at roughly the size of a single TAR record header, regardless of how big the model is.
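
For readers unfamiliar with why TAR streams so cheaply: the format is a sequence of 512-byte blocks, with each entry's header carrying the name and an octal size, followed by the payload padded to a whole block. The sketch below walks such a stream with a single block-sized buffer. It is not TarStreamReader and does not use OxiARC; stream_tar and parse_octal are hypothetical names, gzip decoding is assumed to happen upstream in whatever Read is passed in, and every entry is treated as a regular file (no long-name or pax extensions).

```rust
use std::io::{self, Read, Write};

const BLOCK: usize = 512;

/// Parse a space/NUL-terminated octal field from a TAR header.
fn parse_octal(field: &[u8]) -> u64 {
    field
        .iter()
        .skip_while(|b| **b == b' ')
        .take_while(|b| (b'0'..=b'7').contains(*b))
        .fold(0u64, |acc, b| acc * 8 + u64::from(b - b'0'))
}

/// Walk a TAR stream, copying every payload into `sink` through one
/// 512-byte buffer. Only names and sizes are retained in memory.
pub fn stream_tar<R: Read, W: Write>(mut src: R, mut sink: W) -> io::Result<Vec<(String, u64)>> {
    let mut block = [0u8; BLOCK];
    let mut entries = Vec::new();
    loop {
        // Read the next header block; running out of input ends the walk.
        match src.read_exact(&mut block) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => break,
            Err(e) => return Err(e),
        }
        // An all-zero block marks the end of the archive.
        if block.iter().all(|&b| b == 0) {
            break;
        }
        let name_end = block[..100].iter().position(|&b| b == 0).unwrap_or(100);
        let name = String::from_utf8_lossy(&block[..name_end]).into_owned();
        let size = parse_octal(&block[124..136]);
        entries.push((name, size));

        // Payload occupies whole blocks, so reading block by block also
        // consumes the padding after the last data byte.
        let mut remaining = size;
        while remaining > 0 {
            src.read_exact(&mut block)?;
            let take = remaining.min(BLOCK as u64) as usize;
            sink.write_all(&block[..take])?;
            remaining -= take as u64;
        }
    }
    Ok(entries)
}
```

Because the working buffer never grows, peak memory for the payload is independent of archive size, which is the property the paragraph above describes.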

The validation harness. None of the above is taken on faith. benches/alloc_tracking.rs uses dhat to count heap blocks and prove the 10,000-to-0 reduction; benches/regression_baselines.rs provides criterion baselines for the hot kernels; and scripts/check_perf_regression.sh is the CI gate that fails a build when those baselines slip. ToRSh ships with more than 9,600 tests across the workspace, and 0.1.2 keeps that suite green.
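
The core of any threshold gate is a comparison of a fresh measurement against a stored baseline with some tolerance. The sketch below shows that comparison in isolation. It is an assumption about the general shape of such a check, not a description of scripts/check_perf_regression.sh; the Verdict type, the check_regression function, and the percentage-based tolerance are all hypothetical.

```rust
/// Outcome of comparing one benchmark against its stored baseline.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Ok,
    Regressed { slowdown_pct: f64 },
}

/// Flag a regression when the current time exceeds the baseline by more
/// than `max_slowdown_pct` percent. Speedups always pass.
pub fn check_regression(baseline_ns: f64, current_ns: f64, max_slowdown_pct: f64) -> Verdict {
    let slowdown_pct = (current_ns - baseline_ns) / baseline_ns * 100.0;
    if slowdown_pct > max_slowdown_pct {
        Verdict::Regressed { slowdown_pct }
    } else {
        Verdict::Ok
    }
}
```

A tolerance is needed because benchmark timings are noisy; a gate with zero tolerance would fail builds on measurement jitter alone.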

Getting Started

# SIMD and parallel are now ON by default — no extra feature flags needed.
cargo add torsh

use torsh::prelude::*;

fn main() -> Result<()> {
    // Tensors with >= 1024 elements automatically take the real AVX2/NEON path.
    let x = randn(&[64, 64])?; // 4096 f32 elements
    let y = randn(&[64, 64])?;

    // Elementwise add and multiply dispatch to vectorized f32 kernels.
    let sum = x.add(&y)?;
    let prod = sum.mul(&x)?;

    // In-place, zero-allocation SIMD activation on the fast path.
    let mut activated = prod;
    activated.relu_()?;

    println!("output shape: {:?}", activated.shape());
    Ok(())
}

If you are managing scratch buffers in a hot loop, you can pull a genuinely pooled, zero-copy allocation directly:

use torsh::prelude::*;

fn main() -> Result<()> {
    // On a pool hit this reuses the existing allocation (no copy).
    // The ReusedBuffer returns itself to the pool when it drops.
    let mut buf = global_acquire_uninit::<f32>(4096)?;
    let slice = buf.as_uninit_slice_mut();
    // ... fill `slice`, run your kernel ...
    let _ = slice;
    Ok(())
}
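
Filling uninitialized storage safely takes a little care. The sketch below assumes that as_uninit_slice_mut returns a &mut [MaybeUninit<f32>], which the name suggests but the release notes do not state. fill_uninit is a hypothetical helper built only on the standard library.

```rust
use std::mem::MaybeUninit;

/// Write `value` into every slot, then view the slice as initialized.
pub fn fill_uninit(slots: &mut [MaybeUninit<f32>], value: f32) -> &mut [f32] {
    for slot in slots.iter_mut() {
        // `write` stores the value without reading the old contents.
        slot.write(value);
    }
    // SAFETY: every element was written in the loop above, and
    // MaybeUninit<f32> has the same size and alignment as f32.
    unsafe { &mut *(slots as *mut [MaybeUninit<f32>] as *mut [f32]) }
}
```

The important rule is to never read a slot before writing it; a reused buffer may hold stale values from its previous user, and a fresh one holds nothing defined at all.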

What’s New in 0.1.2

Added

New simd_ops_f32 module: zero-allocation SIMD helpers add_into_f32 / sub_into_f32 / mul_into_f32 / div_into_f32 and in-place add_assign_f32 / sub_assign_f32 / mul_assign_f32 / div_assign_f32.
SIMD in-place activations: relu_assign_f32, leaky_relu_assign_f32, clamp_assign_f32 with PyTorch-compatible NaN passthrough.
BinaryF32Op enum with dispatch_into / dispatch_inplace for branch-free op dispatch.
GlobalMemoryPool::acquire_uninit<T> / global_acquire_uninit<T> returning a real ReusedBuffer<T> RAII handle (as_uninit_slice_mut, into_vec, release_to_pool, auto-return on drop) — true zero-copy pooled reuse.
dhat allocation-tracking benchmark (benches/alloc_tracking.rs) proving 10,000 heap blocks drop to 0 (100% reduction) on hot loops.
Criterion performance-regression framework (benches/regression_baselines.rs) plus the scripts/check_perf_regression.sh CI threshold script.
torsh-hub TarStreamReader for streaming .tar.gz extraction in constant memory (roughly one 512-byte TAR block).

Changed

Tensor::add / sub / mul / div for f32 tensors >= 1024 elements now dispatch to real AVX2/NEON SIMD (replacing the old fake par_iter branch).
Tensor::add_ / sub_ / mul_ / div_ for f32 >= 1024 elements are now zero-allocation SIMD in-place.
Tensor::relu_ / leaky_relu_ / clamp_ for f32 >= 1024 elements are now SIMD-dispatched in place.
Default features are now ["std", "simd", "parallel"] — SIMD and parallelism on by default; simd also enables scirs2-core/simd.
GlobalMemoryPool::allocate<T> is deprecated in favor of global_acquire_uninit.
Dependency upgrades: wgpu 28.0.0 → 29.0.1 (full API migration), sha2 0.10 → 0.11 (hash output via hex::encode()), cranelift 0.130 → 0.131, prometheus 0.13 → 0.14, quickcheck 1.0 → 1.1, unicode-segmentation 1.12 → 1.13, imageproc 0.25 → 0.26. All local oxiarc-* path deps moved to published registry versions (0.2.7); scirs2-core to 0.4.2 and OptiRS to 0.3.1.
math_ops.rs split into math_ops.rs and math_ops_tests.rs to stay under the 2000-line policy limit.

Fixed

GlobalMemoryPool pool hits no longer copy into a new Vec — the pool now actually saves allocations.
5 doctests in torsh-distributed fixed (missing .await on async calls in nccl_ops, alerting, prometheus_exporter, three_d_parallelism, and zero_3_cpu_offload).

Tips

Stay at or above 1024 elements to keep the SIMD fast path. f32 tensor ops only dispatch to AVX2/NEON at 1024 elements or more; below that they run scalar. Batch small work together where you can.
Prefer in-place ops in hot loops. add_, mul_, relu_, clamp_ on f32 >= 1024 elements give you the zero-allocation SIMD path. The out-of-place variants are convenient but allocate an output.
Migrate off allocate. Replace GlobalMemoryPool::allocate with global_acquire_uninit and let ReusedBuffer<T> drop back into the pool — that is where the real allocation savings (and the 10,000-to-0 win) come from.
You no longer need --features simd. SIMD and parallel are default-on now; passing them explicitly is harmless but unnecessary.
Wire the regression gate into your own CI. Drop scripts/check_perf_regression.sh into your pipeline to fail builds when the criterion baselines regress.
Large model downloads now stream. Pulling big pretrained weights through torsh-hub runs in constant memory thanks to OxiARC’s streaming TAR extraction — no need to size your box for the archive.

ToRSh 0.1.2 keeps the foundation from 0.1.0 and 0.1.1: roughly 400 PyTorch-compatible tensor ops, reverse-mode autograd, a full nn layer set, 70+ optimizers (including OptiRS), a parallel DataLoader, a Cranelift JIT, INT8 quantization, a model hub, distributed training (DDP/FSDP), CPU/CUDA/Metal backends, and deep SciRS2 scientific integration — GNNs via torsh-graph, time-series via torsh-series, computer vision via torsh-vision, plus sparse and special-function support — across a 33-crate workspace.

Part of the COOLJAPAN ecosystem

ToRSh is built on the COOLJAPAN pure-Rust stack:

SciRS2 — the scientific computing platform; its simd_ops::SimdUnifiedOps is exactly what powers ToRSh’s new AVX2/NEON SIMD path.
OxiBLAS — pure-Rust BLAS/LAPACK, reached through scirs2-core’s oxiblas features.
OxiCode — binary serialization.
OptiRS — the optimizer library (0.3.1).
OxiARC — pure-Rust compression and archives (oxiarc-* 0.2.7), powering the new streaming TAR extraction in torsh-hub.

ToRSh follows the COOLJAPAN Pure Rust Policy: all compression and decompression goes through OxiARC. There is no C-backed zip, zstd, or flate2 in the default build — the same sovereignty that lets ToRSh skip libtorch, CUDA, and the Python interpreter.

Repository: https://github.com/cool-japan/torsh

Star the repo if a pure-Rust, single-binary alternative to PyTorch — with real SIMD and a real memory pool — is something you want to see grow.

Pure Rust deep learning just got faster — and it is still 100% sovereign.

KitaSan at COOLJAPAN OÜ, April 27, 2026