The fake SIMD is gone. The real AVX2/NEON arithmetic is here — and the tensor memory pool finally stops copying.
Today we released ToRSh 0.1.2 — the SIMD performance release, where the placeholder fast paths from earlier versions are replaced with genuine AVX2/NEON-accelerated f32 arithmetic and activations, and the memory pool becomes a true zero-copy buffer-reuse pool.
ToRSh — “Tensor Operations in Rust with Sharding” — is a PyTorch-compatible deep-learning framework built entirely in pure Rust. No C. No C++. No Fortran. No Python runtime. Where PyTorch leans on libtorch/ATen (a vast C++ codebase), the CUDA toolchain, and a Python interpreter just to add two tensors, ToRSh compiles to a single static binary you can drop onto a server with nothing else installed. And in 0.1.2 even the SIMD is pure Rust: there are no hand-written intrinsics buried in a C shim and no MKL dependency. The vectorized f32 kernels dispatch through SciRS2’s SimdUnifiedOps, which selects AVX2 on x86-64, NEON on ARM, and a scalar fallback elsewhere — all from safe, portable Rust.
Why 0.1.2 is a performance leap
This release is honest about a problem. In 0.1.1, the simd feature was, in part, a placeholder: behind the feature gate sat a par_iter branch dressed up as SIMD, doing scalar work across threads rather than vectorizing across lanes within each thread. Worse, the GlobalMemoryPool had a quiet bug: every “pool hit” copied the pooled data into a fresh Vec, which meant the pool was not actually saving any allocations. It looked like a pool. It did not behave like one.
0.1.2 makes both of these real:
- Real AVX2/NEON f32 dispatch. Tensor::add/sub/mul/div for f32 tensors of 1024 elements or more now route into a new simd_ops_f32 module that performs genuine vectorized arithmetic (AVX2 on x86-64, NEON on ARM) instead of the old fake par_iter branch.
- Real SIMD in-place activations. Tensor::relu_/leaky_relu_/clamp_ for f32 tensors of 1024+ elements now run as zero-allocation, in-place SIMD kernels with PyTorch-compatible NaN passthrough.
- True zero-copy pooled buffers. The new ReusedBuffer<T> RAII type hands you the actual pooled allocation on a pool hit: no copy, no fresh Vec.
- Proven, not promised. A new dhat allocation-tracking benchmark shows the pool taking a hot loop from 10,000 heap allocation blocks down to 0, a 100% reduction, when pooling is enabled.
- Performance on by default. default = ["std", "simd", "parallel"], so SIMD and parallelism now ship enabled. You no longer have to remember to opt in.
- Streaming model I/O. torsh-hub now streams .tar.gz extraction in O(512 B) memory instead of buffering the whole archive, a real win for large pretrained models.
- Regressions get caught. A criterion-based regression suite plus a CI threshold script guard against future performance drift.
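To make "NaN passthrough" concrete, here is a minimal scalar sketch of why it needs care. It is not ToRSh's kernel, and both function names are hypothetical. In Rust, f32::max returns the non-NaN operand when exactly one side is NaN, so a ReLU written with max silently turns NaN into 0.0, while a comparison-based ReLU leaves NaN in place, which matches what PyTorch's relu does.

```rust
/// ReLU via `f32::max` (hypothetical helper).
/// `max` returns the non-NaN operand, so a NaN input becomes 0.0.
pub fn relu_max(data: &mut [f32]) {
    for x in data.iter_mut() {
        *x = x.max(0.0);
    }
}

/// ReLU via comparison (hypothetical helper).
/// `NaN < 0.0` is false, so NaN values are left untouched.
pub fn relu_nan_passthrough(data: &mut [f32]) {
    for x in data.iter_mut() {
        if *x < 0.0 {
            *x = 0.0;
        }
    }
}
```

The second form keeps a NaN visible downstream instead of masking it as a zero activation, which is usually what you want when debugging a diverging training run.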
Technical Deep Dive: the new fast path
The SIMD layer. At the bottom sits scirs2_core::simd_ops::SimdUnifiedOps, the unified SIMD abstraction from SciRS2 that picks AVX2, NEON, or a scalar fallback at runtime. On top of it, ToRSh’s new simd_ops_f32 module provides zero-allocation helpers: add_into_f32, sub_into_f32, mul_into_f32, and div_into_f32 write into an output buffer; add_assign_f32, sub_assign_f32, mul_assign_f32, and div_assign_f32 update in place; and relu_assign_f32, leaky_relu_assign_f32, and clamp_assign_f32 cover the activations. Op selection is branch-free in the inner loop: a BinaryF32Op enum with dispatch_into and dispatch_inplace chooses the kernel without a per-element match. The tensor layer wires this together with a simple rule: any f32 tensor op on 1024 elements or more auto-selects the SIMD path, while smaller tensors stay scalar, where the dispatch overhead would not pay off. Note the default-features change here: with default = ["std", "simd", "parallel"], the simd feature also enables scirs2-core/simd, so the underlying SciRS2 intrinsics activate automatically.
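The dispatch rule can be sketched in plain Rust. This is an illustration under stated assumptions, not ToRSh's code. BinaryOp, Path, select_path, and this dispatch_into are hypothetical stand-ins. The "fast" path here is a chunked loop that the compiler may auto-vectorize, rather than a call into SimdUnifiedOps. The chunk width of 8 is an assumption (eight f32 values fill one 256-bit AVX2 register). Only the 1024-element cutoff comes from the release notes.

```rust
/// Hypothetical stand-in for the op selector; not ToRSh's BinaryF32Op.
#[derive(Clone, Copy, Debug)]
pub enum BinaryOp {
    Add,
    Sub,
    Mul,
    Div,
}

/// Which loop ran; returned only so the choice is observable.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Path {
    Scalar,
    Chunked,
}

/// Element-count cutoff quoted in the release notes.
pub const SIMD_THRESHOLD: usize = 1024;
/// Assumed chunk width: eight f32 values fill one 256-bit register.
const LANES: usize = 8;

/// Pick a path from the element count alone.
pub fn select_path(len: usize) -> Path {
    if len >= SIMD_THRESHOLD { Path::Chunked } else { Path::Scalar }
}

/// Apply `f` element-wise, writing into `out` without allocating.
fn run<F: Fn(f32, f32) -> f32>(a: &[f32], b: &[f32], out: &mut [f32], f: F) -> Path {
    let path = select_path(out.len());
    match path {
        Path::Scalar => {
            for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
                *o = f(*x, *y);
            }
        }
        Path::Chunked => {
            // Fixed-width chunks first, then the leftover tail.
            let main = out.len() - out.len() % LANES;
            let (head, tail) = out.split_at_mut(main);
            let chunks = head
                .chunks_exact_mut(LANES)
                .zip(a[..main].chunks_exact(LANES))
                .zip(b[..main].chunks_exact(LANES));
            for ((o, x), y) in chunks {
                for i in 0..LANES {
                    o[i] = f(x[i], y[i]);
                }
            }
            for ((o, x), y) in tail.iter_mut().zip(&a[main..]).zip(&b[main..]) {
                *o = f(*x, *y);
            }
        }
    }
    path
}

/// The op is matched once per call, so the loops above carry no branch on `op`.
pub fn dispatch_into(op: BinaryOp, a: &[f32], b: &[f32], out: &mut [f32]) -> Path {
    assert!(a.len() == out.len() && b.len() == out.len(), "length mismatch");
    match op {
        BinaryOp::Add => run(a, b, out, |x, y| x + y),
        BinaryOp::Sub => run(a, b, out, |x, y| x - y),
        BinaryOp::Mul => run(a, b, out, |x, y| x * y),
        BinaryOp::Div => run(a, b, out, |x, y| x / y),
    }
}
```

Because run is generic over the closure, each match arm gets its own monomorphized loop, so the choice of operation is resolved before the loop starts rather than inside it.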
The memory subsystem. The headline fix lives in GlobalMemoryPool. The new acquire_uninit<T> method (and its free-function form global_acquire_uninit<T>) returns a ReusedBuffer<T> — a real RAII handle over the pooled allocation. On a pool hit it is genuinely zero-copy: you get the buffer the pool was holding, not a clone of its contents. ReusedBuffer<T> exposes as_uninit_slice_mut to write into the raw storage, into_vec to take ownership, and release_to_pool to return it explicitly — and if you do nothing, it auto-returns to the pool on drop. The old GlobalMemoryPool::allocate<T> is now deprecated (still present for compatibility); new code should move to global_acquire_uninit.
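The difference between a pool that copies and one that hands back the real allocation is easiest to see in a toy version. The sketch below is not ToRSh's GlobalMemoryPool. Pool and Pooled are hypothetical names, and the pool is single-threaded (Rc/RefCell) for brevity, where a real global pool would need synchronization. It shows the same idea: a pool hit moves the stored Vec out, the handle exposes uninitialized storage, and dropping the handle returns the allocation.

```rust
use std::cell::RefCell;
use std::mem::MaybeUninit;
use std::rc::Rc;

/// Hypothetical single-threaded pool: a free list of spare allocations.
#[derive(Clone, Default)]
pub struct Pool {
    free: Rc<RefCell<Vec<Vec<f32>>>>,
}

/// Hypothetical RAII handle that owns one allocation until it is dropped.
pub struct Pooled {
    buf: Option<Vec<f32>>,
    len: usize,
    pool: Pool,
}

impl Pool {
    /// Hand out a buffer with room for `len` elements, reusing a spare if one fits.
    pub fn acquire(&self, len: usize) -> Pooled {
        let mut free = self.free.borrow_mut();
        let buf = match free.iter().position(|v| v.capacity() >= len) {
            Some(i) => free.swap_remove(i), // pool hit: the Vec is moved, not copied
            None => Vec::with_capacity(len), // pool miss: fresh allocation
        };
        Pooled { buf: Some(buf), len, pool: self.clone() }
    }

    /// Number of spare allocations currently held.
    pub fn spare_count(&self) -> usize {
        self.free.borrow().len()
    }
}

impl Pooled {
    /// The first `len` uninitialized slots of the underlying allocation.
    pub fn as_uninit_slice_mut(&mut self) -> &mut [MaybeUninit<f32>] {
        let len = self.len;
        &mut self.buf.as_mut().expect("buffer present").spare_capacity_mut()[..len]
    }

    /// Address of the allocation, exposed so reuse can be observed.
    pub fn as_ptr(&self) -> *const f32 {
        self.buf.as_ref().expect("buffer present").as_ptr()
    }

    /// Take ownership of the allocation; it will not go back to the pool.
    pub fn into_vec(mut self) -> Vec<f32> {
        self.buf.take().expect("buffer present")
    }
}

impl Drop for Pooled {
    /// Return the allocation to the pool unless `into_vec` already took it.
    fn drop(&mut self) {
        if let Some(mut v) = self.buf.take() {
            v.clear(); // keeps the capacity, drops the contents
            self.pool.free.borrow_mut().push(v);
        }
    }
}
```

Acquiring, dropping, and acquiring again yields the same pointer the second time, which is the observable meaning of zero-copy reuse. The bug described above corresponds to replacing the swap_remove line with a clone of the stored Vec.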
The model-hub I/O path. Downloading a large pretrained model used to mean materializing the whole archive in a Vec<u8>. 0.1.2 replaces that with TarStreamReader, which streams .tar.gz extraction through OxiARC (oxiarc-* 0.2.7, the pure-Rust compression/archive stack). Memory stays constant at roughly the size of a single TAR record header, regardless of how big the model is.
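The constant-memory claim follows from how TAR is laid out: every header is one 512-byte block, and each payload is padded to a multiple of 512 bytes. The sketch below is a std-only illustration, not TarStreamReader and not the OxiARC API. The name walk_tar is hypothetical, gzip decoding is left out, and header checksums are not verified. It walks an uncompressed TAR stream while holding a single 512-byte block, plus the current entry name, in memory.

```rust
use std::io::{self, Read};

/// Size of one TAR block; headers and payloads are padded to this.
const BLOCK: usize = 512;

/// Parse an octal header field, stopping at the first non-octal byte.
fn parse_octal(field: &[u8]) -> u64 {
    field
        .iter()
        .skip_while(|&&b| b == b' ')
        .take_while(|&&b| (b'0'..=b'7').contains(&b))
        .fold(0u64, |acc, &b| acc * 8 + u64::from(b - b'0'))
}

/// Walk an uncompressed TAR stream, calling `on_entry(name, size)` per entry.
/// The same 512-byte buffer is reused for every header and payload block.
pub fn walk_tar<R: Read>(mut reader: R, mut on_entry: impl FnMut(&str, u64)) -> io::Result<()> {
    let mut block = [0u8; BLOCK];
    loop {
        // EOF while reading a header (including a truncated one) ends the walk.
        match reader.read_exact(&mut block) {
            Ok(()) => {}
            Err(e) if e.kind() == io::ErrorKind::UnexpectedEof => return Ok(()),
            Err(e) => return Err(e),
        }
        // An all-zero block is the end-of-archive marker.
        if block.iter().all(|&b| b == 0) {
            return Ok(());
        }
        // Name: bytes 0..100, NUL-terminated. Size: bytes 124..136, octal.
        let name_len = block[..100].iter().position(|&b| b == 0).unwrap_or(100);
        let name = String::from_utf8_lossy(&block[..name_len]).into_owned();
        let size = parse_octal(&block[124..136]);
        on_entry(&name, size);
        // Skip the payload block by block; a real extractor would write it out here.
        let mut remaining = (size + BLOCK as u64 - 1) / BLOCK as u64;
        while remaining > 0 {
            reader.read_exact(&mut block)?;
            remaining -= 1;
        }
    }
}
```

Because walk_tar accepts any Read, a gzip decoder could be placed in front of it without changing the memory profile of the TAR layer itself.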
The validation harness. None of the above is taken on faith. benches/alloc_tracking.rs uses dhat to count heap blocks and prove the 10,000-to-0 reduction; benches/regression_baselines.rs provides criterion baselines for the hot kernels; and scripts/check_perf_regression.sh is the CI gate that fails a build when those baselines slip. ToRSh ships with more than 9,600 tests across the workspace, and 0.1.2 keeps that suite green.
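The logic of a threshold gate is small enough to sketch. The following is an assumption about the general shape of such a check, not a transcription of scripts/check_perf_regression.sh. The Verdict type, the check function, and the percent-based threshold are all hypothetical.

```rust
/// Outcome of comparing one benchmark against its stored baseline.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Ok,
    Regressed { slowdown_pct: f64 },
}

/// Flag a regression when `current_ns` exceeds `baseline_ns`
/// by more than `threshold_pct` percent.
pub fn check(baseline_ns: f64, current_ns: f64, threshold_pct: f64) -> Verdict {
    let slowdown_pct = (current_ns - baseline_ns) / baseline_ns * 100.0;
    if slowdown_pct > threshold_pct {
        Verdict::Regressed { slowdown_pct }
    } else {
        Verdict::Ok
    }
}
```

A CI job built on this would run check for every benchmark and exit non-zero if any verdict is Regressed. The tolerance absorbs ordinary run-to-run noise, so it should be chosen with the variance of the CI machines in mind.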
Getting Started
    # SIMD and parallel are now ON by default; no extra feature flags needed.
    cargo add torsh

    use torsh::prelude::*;

    fn main() -> Result<()> {
        // Tensors with >= 1024 elements automatically take the real AVX2/NEON path.
        let x = randn(&[64, 64])?; // 4096 f32 elements
        let y = randn(&[64, 64])?;

        // Elementwise add and multiply dispatch to vectorized f32 kernels.
        let sum = x.add(&y)?;
        let prod = sum.mul(&x)?;

        // In-place, zero-allocation SIMD activation on the fast path.
        let mut activated = prod;
        activated.relu_()?;

        println!("output shape: {:?}", activated.shape());
        Ok(())
    }
If you are managing scratch buffers in a hot loop, you can pull a genuinely pooled, zero-copy allocation directly:
    use torsh::prelude::*;

    fn main() -> Result<()> {
        // On a pool hit this reuses the existing allocation (no copy).
        // The ReusedBuffer returns itself to the pool when it drops.
        let mut buf = global_acquire_uninit::<f32>(4096)?;
        let slice = buf.as_uninit_slice_mut();
        // ... fill `slice`, run your kernel ...
        Ok(())
    }
What’s New in 0.1.2
Added
- New simd_ops_f32 module: zero-allocation SIMD helpers add_into_f32/sub_into_f32/mul_into_f32/div_into_f32 and in-place add_assign_f32/sub_assign_f32/mul_assign_f32/div_assign_f32.
- SIMD in-place activations: relu_assign_f32, leaky_relu_assign_f32, clamp_assign_f32 with PyTorch-compatible NaN passthrough.
- BinaryF32Op enum with dispatch_into/dispatch_inplace for branch-free op dispatch.
- GlobalMemoryPool::acquire_uninit<T> / global_acquire_uninit<T> returning a real ReusedBuffer<T> RAII handle (as_uninit_slice_mut, into_vec, release_to_pool, auto-return on drop) for true zero-copy pooled reuse.
- dhat allocation-tracking benchmark (benches/alloc_tracking.rs) proving 10,000 heap blocks drop to 0 (100% reduction) on hot loops.
- Criterion performance-regression framework (benches/regression_baselines.rs) plus the scripts/check_perf_regression.sh CI threshold script.
- torsh-hub TarStreamReader for streaming .tar.gz extraction in O(512 B) memory.
Changed
- Tensor::add/sub/mul/div for f32 tensors >= 1024 elements now dispatch to real AVX2/NEON SIMD (replacing the old fake par_iter branch).
- Tensor::add_/sub_/mul_/div_ for f32 >= 1024 elements are now zero-allocation SIMD in-place.
- Tensor::relu_/leaky_relu_/clamp_ for f32 >= 1024 elements are now SIMD-dispatched in place.
- Default features are now ["std", "simd", "parallel"]: SIMD and parallelism on by default; simd also enables scirs2-core/simd.
- GlobalMemoryPool::allocate<T> is deprecated in favor of global_acquire_uninit.
- Dependency upgrades: wgpu 28.0.0 → 29.0.1 (full API migration), sha2 0.10 → 0.11 (hash output via hex::encode()), cranelift 0.130 → 0.131, prometheus 0.13 → 0.14, quickcheck 1.0 → 1.1, unicode-segmentation 1.12 → 1.13, imageproc 0.25 → 0.26. All local oxiarc-* path deps moved to published registry versions (0.2.7); scirs2-core to 0.4.2 and OptiRS to 0.3.1.
- math_ops.rs split into math_ops.rs and math_ops_tests.rs to stay under the 2000-line policy limit.
Fixed
- GlobalMemoryPool pool hits no longer copy into a new Vec; the pool now actually saves allocations.
- 5 doctests in torsh-distributed fixed (missing .await on async calls in nccl_ops, alerting, prometheus_exporter, three_d_parallelism, and zero_3_cpu_offload).
Tips
- Stay above 1024 elements to keep the SIMD fast path. f32 tensor ops only dispatch to AVX2/NEON at 1024 elements or more; below that they run scalar. Batch small work together where you can.
- Prefer in-place ops in hot loops. add_, mul_, relu_, clamp_ on f32 >= 1024 elements give you the zero-allocation SIMD path. The out-of-place variants are convenient but allocate an output.
- Migrate off allocate. Replace GlobalMemoryPool::allocate with global_acquire_uninit and let ReusedBuffer<T> drop back into the pool; that is where the real allocation savings (and the 10,000-to-0 win) come from.
- You no longer need --features simd. SIMD and parallel are default-on now; passing them explicitly is harmless but unnecessary.
- Wire the regression gate into your own CI. Drop scripts/check_perf_regression.sh into your pipeline to fail builds when the criterion baselines regress.
- Large model downloads now stream. Pulling big pretrained weights through torsh-hub runs in constant memory thanks to OxiARC’s streaming TAR extraction, so there is no need to size your box for the archive.
ToRSh 0.1.2 keeps the foundation from 0.1.0 and 0.1.1: roughly 400 PyTorch-compatible tensor ops, reverse-mode autograd, a full nn layer set, 70+ optimizers (including OptiRS), a parallel DataLoader, a Cranelift JIT, INT8 quantization, a model hub, distributed training (DDP/FSDP), CPU/CUDA/Metal backends, and deep SciRS2 scientific integration — GNNs via torsh-graph, time-series via torsh-series, computer vision via torsh-vision, plus sparse and special-function support — across a 33-crate workspace.
Part of the COOLJAPAN ecosystem
ToRSh is built on the COOLJAPAN pure-Rust stack:
- SciRS2: the scientific computing platform; its simd_ops::SimdUnifiedOps is exactly what powers ToRSh’s new AVX2/NEON SIMD path.
- OxiBLAS: pure-Rust BLAS/LAPACK, reached through scirs2-core’s oxiblas features.
- OxiCode: binary serialization.
- OptiRS: the optimizer library (0.3.1).
- OxiARC: pure-Rust compression and archives (oxiarc-* 0.2.7), powering the new streaming TAR extraction in torsh-hub.
ToRSh follows the COOLJAPAN Pure Rust Policy: all compression and decompression goes through OxiARC. There is no C-backed zip, zstd, or flate2 in the default build — the same sovereignty that lets ToRSh skip libtorch, CUDA, and the Python interpreter.
Repository: https://github.com/cool-japan/torsh
Star the repo if a pure-Rust, single-binary alternative to PyTorch — with real SIMD and a real memory pool — is something you want to see grow.
Pure Rust deep learning just got faster — and it is still 100% sovereign.
KitaSan at COOLJAPAN OÜ, April 27, 2026