TenRSo 0.1.0-rc.1 Released — A Rust-Native Tensor Stack: Generalized Contraction, Decompositions, and Out-of-Core, the Last Step Before Stable

Higher-order tensors deserve a first-class, sovereign home in Rust — and we are one checkpoint away from giving them one.

Today we released TenRSo 0.1.0-rc.1 — a release candidate for our Rust-native tensor computing stack, the last step before our first stable cut.

No C. No Fortran. No native tensor toolkits bolted onto an interpreter, no BLAS-shaped dependency you cannot read. The higher-order tensor world has long leaned on C/C++ libraries — the kind of compiled code scientific Python quietly links against for contractions and decompositions. TenRSo takes a different path: it compiles to a single static binary (and to WASM), Pure Rust end to end. Its compression is OxiARC, its codec is OxiCode, and its linear algebra is SciRS2. Nothing to install alongside, nothing to mismatch, nothing you cannot audit.

TenRSo is also one of the three oldest projects in the COOLJAPAN ecosystem: its first commit landed on 2025-11-08, the same day as its sibling-in-age TensorLogic. After four months of steady maturing, this RC brings the project close to its first stable release.

Why 0.1.0-rc.1 matters

Doing serious higher-order tensor work outside Rust is a thicket. You fight contraction order by hand, you bounce between dense and sparse representations with glue code, you reach for a decomposition library that does not speak to your execution layer, and the moment a tensor outgrows memory you are writing your own streaming. TenRSo folds all of that into one coherent stack — and this RC sharpens several edges:

Masked einsum, with real subset reductions. masked_einsum now drives contractions over a mask and returns a sparse result, with companions like masked_sum, masked_mean, masked_max, masked_min, and masked_variance for reductions over just the entries you care about.
Executor element-wise operations with automatic parallel dispatch. A new ScalarOp enum, fused tensor-scalar math, and unary/binary element-wise ops that automatically hand off to Rayon once a tensor crosses 10,000 elements, so small tensors stay on the serial path and large ones parallelize without any hand-threading.
TT-SVD gradients are real now. The backward pass for TT reconstruction replaced a TODO stub with a full implementation, verified numerically against central finite differences.
CP decomposition gains regularization. L1 via soft-thresholding, L2 (Tikhonov/ridge), a hardened non-negative CP-ALS, and cross-validation for rank selection.
A maturity signal you can measure. The full suite went from 2036 to 2109 tests (+73), all passing, with zero compiler and clippy warnings across all targets — and the whole suite now runs 4.8x faster, from roughly 963s down to about 198s, with no test taking longer than 30s.
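To make the automatic parallel dispatch item above concrete, here is a minimal sketch of threshold-based dispatch. It is not TenRSo's implementation: the ScalarOp enum is re-declared locally with the variants listed in the release notes, the free function scalar_op and the constant PARALLEL_THRESHOLD are hypothetical names, and std::thread::scope stands in for Rayon so the snippet builds with the standard library alone.

```rust
use std::thread;

/// Local stand-in for the ScalarOp enum named in the release notes.
#[allow(dead_code)]
#[derive(Clone, Copy, Debug)]
enum ScalarOp {
    Add,
    Sub,
    Mul,
    Div,
    Pow,
}

/// Element count above which work is split across threads
/// (hypothetical constant; the value is the figure quoted above).
const PARALLEL_THRESHOLD: usize = 10_000;

fn apply(op: ScalarOp, x: f64, s: f64) -> f64 {
    match op {
        ScalarOp::Add => x + s,
        ScalarOp::Sub => x - s,
        ScalarOp::Mul => x * s,
        ScalarOp::Div => x / s,
        ScalarOp::Pow => x.powf(s),
    }
}

/// Applies `op` to every element: serially for small inputs,
/// across scoped threads once the input exceeds the threshold.
fn scalar_op(data: &[f64], op: ScalarOp, s: f64) -> Vec<f64> {
    if data.len() <= PARALLEL_THRESHOLD {
        return data.iter().map(|&x| apply(op, x, s)).collect();
    }
    let mut out = vec![0.0; data.len()];
    let workers = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    // Ceiling division so every element lands in exactly one chunk.
    let chunk = (data.len() + workers - 1) / workers;
    thread::scope(|scope| {
        for (src, dst) in data.chunks(chunk).zip(out.chunks_mut(chunk)) {
            scope.spawn(move || {
                for (d, &x) in dst.iter_mut().zip(src) {
                    *d = apply(op, x, s);
                }
            });
        }
    });
    out
}

fn main() {
    let small = vec![1.0, 2.0, 3.0];
    let large = vec![1.5; 50_000];
    println!("{:?}", scalar_op(&small, ScalarOp::Mul, 2.0));
    println!("{}", scalar_op(&large, ScalarOp::Add, 0.5).iter().sum::<f64>());
}
```

The point of the threshold is that spawning work costs more than it saves on tiny inputs, so the caller gets one entry point and the size of the data decides the path.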

Technical Deep Dive

TenRSo is a workspace of focused crates, re-exported through the tenrso meta crate as clean modules.

Core and kernels. tenrso-core provides DenseND — dense N-dimensional tensors with axis metadata, views, unfold/fold, and reshape/permute. On top of it, tenrso-kernels supplies the primitives that tensor algebra actually runs on: Khatri-Rao, Kronecker, and Hadamard products, n-mode products, MTTKRP, and the TTM/TTT machinery. These are the building blocks every decomposition and contraction leans on.
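For readers new to these primitives, the Khatri-Rao product is the column-wise Kronecker product that CP-style algorithms lean on. The sketch below shows the textbook definition on a tiny row-major matrix type; Matrix and khatri_rao are illustrative names defined here, not the tenrso-kernels API.

```rust
/// Row-major dense matrix; a deliberately small stand-in, not TenRSo's DenseND.
struct Matrix {
    rows: usize,
    cols: usize,
    data: Vec<f64>,
}

impl Matrix {
    fn new(rows: usize, cols: usize, data: Vec<f64>) -> Self {
        assert_eq!(data.len(), rows * cols, "shape and data length must agree");
        Matrix { rows, cols, data }
    }

    fn at(&self, r: usize, c: usize) -> f64 {
        self.data[r * self.cols + c]
    }
}

/// Khatri-Rao product of `a` (I x R) and `b` (J x R): an (I*J) x R matrix
/// whose row `i * J + j` holds a[i, r] * b[j, r] in column r.
fn khatri_rao(a: &Matrix, b: &Matrix) -> Matrix {
    assert_eq!(a.cols, b.cols, "both factors need the same number of columns");
    let mut data = Vec::with_capacity(a.rows * b.rows * a.cols);
    for i in 0..a.rows {
        for j in 0..b.rows {
            for r in 0..a.cols {
                data.push(a.at(i, r) * b.at(j, r));
            }
        }
    }
    Matrix::new(a.rows * b.rows, a.cols, data)
}

fn main() {
    let a = Matrix::new(2, 2, vec![1.0, 2.0, 3.0, 4.0]);
    let b = Matrix::new(3, 2, vec![1.0, 0.0, 0.0, 1.0, 2.0, 2.0]);
    let k = khatri_rao(&a, &b);
    println!("{} x {}: {:?}", k.rows, k.cols, k.data);
}
```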

Decompositions. tenrso-decomp carries CP-ALS, Tucker-HOOI, and TT-SVD. In this RC the CP path grows up: L1/L2 regularization, a validated non-negative variant, and rank selection by cross-validation. On the TT side, the new TtReconstructionGrad (in tenrso-ad) rebuilds the full tensor from TT cores via sequential matrix products and runs a proper backward pass with left/right chain products. As part of the cleanup, the monolithic 3212-line cp.rs was split into a tidy cp/ module (core, advanced, helpers, types, tests).
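The idea behind the TT backward pass and its finite-difference check can be sketched on a three-core train. This is an illustration under stated assumptions, not the TtReconstructionGrad code: Tt3, grad_g2, fd_grad_g2, ramp, and max_grad_error are names invented here, only the middle core's gradient is computed, and the loss is simply the upstream-weighted sum of the reconstruction.

```rust
/// Three TT cores for an n1 x n2 x n3 tensor, flat and row-major.
/// g1 is n1 x r1, g2 is r1 x n2 x r2, g3 is r2 x n3 (boundary ranks of 1 dropped).
struct Tt3 {
    n: [usize; 3],
    r: [usize; 2],
    g1: Vec<f64>,
    g2: Vec<f64>,
    g3: Vec<f64>,
}

impl Tt3 {
    /// T[i, j, k] = sum over a, b of g1[i, a] * g2[a, j, b] * g3[b, k].
    fn reconstruct(&self) -> Vec<f64> {
        let ([n1, n2, n3], [r1, r2]) = (self.n, self.r);
        let mut t = vec![0.0; n1 * n2 * n3];
        for i in 0..n1 {
            for j in 0..n2 {
                for k in 0..n3 {
                    for a in 0..r1 {
                        for b in 0..r2 {
                            t[(i * n2 + j) * n3 + k] += self.g1[i * r1 + a]
                                * self.g2[(a * n2 + j) * r2 + b]
                                * self.g3[b * n3 + k];
                        }
                    }
                }
            }
        }
        t
    }

    /// For the loss L = sum(upstream * T), dL/dg2[a, j, b] is the upstream
    /// gradient contracted with the left chain (g1) and the right chain (g3).
    fn grad_g2(&self, upstream: &[f64]) -> Vec<f64> {
        let ([n1, n2, n3], [r1, r2]) = (self.n, self.r);
        let mut grad = vec![0.0; r1 * n2 * r2];
        for a in 0..r1 {
            for j in 0..n2 {
                for b in 0..r2 {
                    for i in 0..n1 {
                        for k in 0..n3 {
                            grad[(a * n2 + j) * r2 + b] += upstream[(i * n2 + j) * n3 + k]
                                * self.g1[i * r1 + a]
                                * self.g3[b * n3 + k];
                        }
                    }
                }
            }
        }
        grad
    }
}

/// Scalar loss used for the check: the upstream-weighted sum of the reconstruction.
fn loss(tt: &Tt3, upstream: &[f64]) -> f64 {
    tt.reconstruct().iter().zip(upstream).map(|(t, w)| t * w).sum()
}

/// Central finite differences of the loss over every entry of g2.
fn fd_grad_g2(tt: &mut Tt3, upstream: &[f64], h: f64) -> Vec<f64> {
    let mut grad = Vec::with_capacity(tt.g2.len());
    for idx in 0..tt.g2.len() {
        let orig = tt.g2[idx];
        tt.g2[idx] = orig + h;
        let plus = loss(tt, upstream);
        tt.g2[idx] = orig - h;
        let minus = loss(tt, upstream);
        tt.g2[idx] = orig;
        grad.push((plus - minus) / (2.0 * h));
    }
    grad
}

/// Deterministic filler values so the example needs no RNG crate.
fn ramp(len: usize, scale: f64) -> Vec<f64> {
    (0..len).map(|v| scale * (v as f64 + 1.0)).collect()
}

/// Largest absolute difference between the analytic and numeric gradients.
fn max_grad_error() -> f64 {
    let mut tt = Tt3 { n: [2, 3, 2], r: [2, 2], g1: ramp(4, 0.1), g2: ramp(12, 0.05), g3: ramp(4, 0.2) };
    let upstream = ramp(12, 0.3);
    let analytic = tt.grad_g2(&upstream);
    let numeric = fd_grad_g2(&mut tt, &upstream, 1e-5);
    analytic.iter().zip(&numeric).map(|(a, n)| (a - n).abs()).fold(0.0, f64::max)
}

fn main() {
    println!("max |analytic - numeric| = {:e}", max_grad_error());
}
```

A check of this shape is cheap to keep in a test suite: if the analytic backward pass drifts from the numeric one, the maximum error jumps by orders of magnitude.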

Sparse and masked einsum. tenrso-sparse spans COO/CSR/BCSR alongside CSC/CSF/HiCOO, with SpMM and SpSpMM. The headline addition is masked einsum: masked_einsum(spec_str, inputs, mask) returns a sparse CooTensor, dispatching to specialized kernels for masked matmul, masked element-wise, and masked outer products, with a generic fallback — and the masked_* reductions let you summarize sparse data without densifying it.
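The core idea of a masked contraction is to compute only the output entries the mask asks for and hand them back in coordinate form. Here is a minimal sketch for the matmul case; masked_matmul and CooEntry are hypothetical names for illustration and do not reflect the signature of masked_einsum or the layout of CooTensor.

```rust
/// One stored entry of a 2-D coordinate-format result: (row, column, value).
type CooEntry = (usize, usize, f64);

/// Computes C = A * B only at the (row, col) positions listed in `mask`.
/// `a` is m x k and `b` is k x n, both flat and row-major.
fn masked_matmul(
    a: &[f64],
    b: &[f64],
    k: usize,
    n: usize,
    mask: &[(usize, usize)],
) -> Vec<CooEntry> {
    mask.iter()
        .map(|&(i, j)| {
            // Dot product of row i of A with column j of B.
            let value: f64 = (0..k).map(|p| a[i * k + p] * b[p * n + j]).sum();
            (i, j, value)
        })
        .collect()
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2 x 3
    let b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0]; // 3 x 2
    let mask = [(0, 0), (1, 1)];
    for (i, j, v) in masked_matmul(&a, &b, 3, 2, &mask) {
        println!("C[{i}, {j}] = {v}");
    }
}
```

The work scales with the number of masked positions rather than with the full output size, which is where the saving over a dense matmul followed by a filter comes from.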

Planner, exec, and out-of-core. tenrso-planner chooses contraction order, selects representations, and decides on tiling/streaming/out-of-core strategies. tenrso-exec is the unified execution API that mixes dense, sparse, and low-rank paths — and its new CpuExecutor::scalar_op(), parallel_elem_op(), parallel_binary_op(), and full_reduce()/parallel_reduce() give you fused, auto-parallel element-wise math. When tensors do not fit in memory, tenrso-ooc streams them via Arrow/Parquet readers with chunked and mmap access. Throughout, the numerical heavy lifting rides on the SciRS2 backend — scirs2-core, scirs2-linalg, and scirs2-fft — which is exactly the division of labor we want: SciRS2 owns matrix-centric linalg, and TenRSo owns the higher-order tensor side.
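Why contraction order deserves a planner at all is easiest to see on a three-matrix chain. The sketch below uses the textbook multiplication-count cost model for "ij,jk,kl->il"; it is not tenrso-planner's cost model, and Order, chain_costs, and pick_order are names made up for this example. A real planner also has to weigh representations, memory, and tiling.

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum Order {
    LeftFirst,  // (A * B) * C
    RightFirst, // A * (B * C)
}

/// Scalar multiplication counts for contracting "ij,jk,kl->il" in each order,
/// where A is i x j, B is j x k, and C is k x l.
fn chain_costs(i: usize, j: usize, k: usize, l: usize) -> (usize, usize) {
    let left_first = i * j * k + i * k * l;
    let right_first = j * k * l + i * j * l;
    (left_first, right_first)
}

/// Picks the cheaper order under the multiplication-count model.
fn pick_order(i: usize, j: usize, k: usize, l: usize) -> Order {
    let (left, right) = chain_costs(i, j, k, l);
    if left <= right {
        Order::LeftFirst
    } else {
        Order::RightFirst
    }
}

fn main() {
    // A is 10 x 1000, B is 1000 x 10, C is 10 x 1000.
    let (left, right) = chain_costs(10, 1000, 10, 1000);
    let order = pick_order(10, 1000, 10, 1000);
    println!("left-first: {left}, right-first: {right}, pick: {order:?}");
}
```

For these shapes the two orders differ by a factor of 100 in multiplications (200,000 versus 20,000,000), even though both produce the same 10 x 1000 result.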

Getting Started

Add it to your project, pinning the release candidate:

cargo add tenrso@0.1.0-rc.1

A minimal, copy-pasteable example:

use tenrso::core::{DenseND, TensorHandle};
use tenrso::exec::{einsum_ex, ExecHints};
use tenrso::decomp::{cp_als, InitStrategy};

// `?` needs an enclosing function that returns Result
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A small dense 3-D tensor
    let t = DenseND::<f64>::random_uniform(&[16, 16, 16], 0.0, 1.0);

    // CP-ALS decomposition (rank 8, 50 iterations)
    let cp = cp_als(&t, 8, 50, 1e-4, InitStrategy::Random, None)?;
    println!("CP factors: {}", cp.factors.len());

    // Planner-optimized einsum contraction (dense/sparse/low-rank auto-mixed)
    let a = TensorHandle::from_dense_auto(DenseND::<f64>::ones(&[64, 128]));
    let b = TensorHandle::from_dense_auto(DenseND::<f64>::ones(&[128, 32]));
    let y = einsum_ex::<f64>("ij,jk->ik")
        .inputs(&[a, b])
        .hints(&ExecHints { prefer_lowrank: true, ..Default::default() })
        .run()?;

    Ok(())
}

The ooc feature is on by default; enable the ad feature for autodiff (including the new TT-SVD gradients), or the full feature to pull everything in.

What’s New in 0.1.0-rc.1

Executor element-wise operations (tenrso-exec): a ScalarOp enum (Add, Sub, Mul, Div, Pow) for tensor-scalar ops, scalar_op(), parallel_elem_op() for unary work, parallel_binary_op() for tensor-tensor, and full_reduce()/parallel_reduce() — all auto-dispatching to Rayon above a 10,000-element threshold.
TT-SVD gradient backward pass (tenrso-ad): a complete TtReconstructionGrad replacing the prior TODO stub, with reconstruct() and compute_core_gradients(), numerically verified by central finite differences.
Masked einsum (tenrso-sparse): masked_einsum returning a sparse CooTensor, with specialized kernels plus the masked_sum/masked_mean/masked_max/masked_min/masked_variance/masked_extract family.
CP regularization (tenrso-decomp): L1 soft-thresholding, L2 ridge, a validated non-negative CP-ALS, and cross-validation for rank selection.
Refactoring: the 3212-line cp.rs split into a cp/ module.
Fixed: every slow test (>30s) brought under 10s; clippy cleanups (redundant as f64 casts removed, loop indices replaced with enumerate()); stray Rng import warnings gone.
Changed: all subcrates now use version.workspace = true; the test count rose 2036 → 2109 (+73) while the suite runtime dropped ~963s → ~198s (4.8x faster). Bumped from 0.1.0-alpha.2 to 0.1.0-rc.1.
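For the CP regularization entry above, the two penalties reduce to very small numerical building blocks. The sketch below shows the standard soft-thresholding operator behind L1 and the one-dimensional ridge solution behind L2; soft_threshold and ridge_1d are illustrative names defined here, and how tenrso-decomp wires them into its ALS updates is not shown.

```rust
/// Soft-thresholding: shrinks `x` toward zero by `lambda` and
/// zeroes anything whose magnitude is at most `lambda`.
fn soft_threshold(x: f64, lambda: f64) -> f64 {
    if x > lambda {
        x - lambda
    } else if x < -lambda {
        x + lambda
    } else {
        0.0
    }
}

/// One-dimensional ridge (Tikhonov) solution of
/// min over x of |a * x - b|^2 + lambda * x^2, which is (a . b) / (a . a + lambda).
fn ridge_1d(a: &[f64], b: &[f64], lambda: f64) -> f64 {
    let ab: f64 = a.iter().zip(b).map(|(p, q)| p * q).sum();
    let aa: f64 = a.iter().map(|p| p * p).sum();
    ab / (aa + lambda)
}

fn main() {
    let factor_column = [1.5, -2.0, 0.25, -0.1];
    let sparse: Vec<f64> = factor_column.iter().map(|&v| soft_threshold(v, 0.5)).collect();
    println!("after L1 shrinkage: {sparse:?}");
    println!("ridge, lambda = 0: {}", ridge_1d(&[1.0, 2.0], &[2.0, 4.0], 0.0));
    println!("ridge, lambda = 5: {}", ridge_1d(&[1.0, 2.0], &[2.0, 4.0], 5.0));
}
```

L1 drives small factor entries to exactly zero, which is what yields sparse factors; L2 shrinks the whole solution smoothly and keeps the normal equations well conditioned.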

Quality at this RC: a 100% pass rate across 2109+ tests with all features, zero compiler and clippy warnings on all targets, and no test exceeding 30 seconds.

Tips

Bias the planner. Pass ExecHints { prefer_lowrank: true, .. } (or favor sparse paths) to nudge the executor toward representations that suit your data.
Let the executor parallelize for you. Element-wise ops auto-dispatch to Rayon past 10,000 elements, so there is no need to thread them by hand: call parallel_elem_op() / parallel_binary_op() and let the threshold choose the serial or parallel path.
Reduce over masks, not over everything. For subset statistics on sparse data, combine masked_einsum with masked_sum or masked_mean instead of densifying first.
Fuse tensor-scalar math. Use ScalarOp through CpuExecutor::scalar_op():

use tenrso::exec::{CpuExecutor, ScalarOp};

// `t` is a tensor you have already built; `?` requires an enclosing function that returns Result
let exec = CpuExecutor::default();
let scaled = exec.scalar_op(&t, ScalarOp::Mul, 2.0)?;

Turn on gradients when you need them. Enable the ad feature to get the TT-SVD backward pass.
Stream what does not fit. Keep the default ooc feature on and read large tensors straight from Parquet rather than loading them whole.
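The tip about reducing over masks can be pictured with a few lines over coordinate-format entries. This is a sketch, not the tenrso-sparse API: the functions are local re-implementations that borrow the names from the release notes, Coord is a made-up alias, and it assumes that only explicitly stored entries selected by the mask take part and that variance means population variance.

```rust
use std::collections::HashSet;

/// 2-D coordinate of a stored entry (illustrative alias).
type Coord = (usize, usize);

/// Values of the stored entries whose coordinates appear in the mask.
fn masked_values<'a>(
    entries: &'a [(Coord, f64)],
    mask: &'a HashSet<Coord>,
) -> impl Iterator<Item = f64> + 'a {
    entries
        .iter()
        .filter(move |(coord, _)| mask.contains(coord))
        .map(|&(_, value)| value)
}

fn masked_sum(entries: &[(Coord, f64)], mask: &HashSet<Coord>) -> f64 {
    masked_values(entries, mask).sum()
}

/// Mean over the selected stored entries, or None if the mask selects nothing.
fn masked_mean(entries: &[(Coord, f64)], mask: &HashSet<Coord>) -> Option<f64> {
    let (count, sum) = masked_values(entries, mask).fold((0usize, 0.0), |(n, s), v| (n + 1, s + v));
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

/// Population variance over the selected stored entries (two-pass).
fn masked_variance(entries: &[(Coord, f64)], mask: &HashSet<Coord>) -> Option<f64> {
    let mean = masked_mean(entries, mask)?;
    let (count, sq) = masked_values(entries, mask)
        .fold((0usize, 0.0), |(n, s), v| (n + 1, s + (v - mean).powi(2)));
    Some(sq / count as f64)
}

fn main() {
    let entries = [((0, 0), 1.0), ((0, 1), 2.0), ((1, 1), 3.0), ((2, 0), 10.0)];
    let mask: HashSet<Coord> = [(0, 0), (1, 1), (2, 2)].into_iter().collect();
    println!("sum = {}", masked_sum(&entries, &mask));
    println!("mean = {:?}", masked_mean(&entries, &mask));
    println!("variance = {:?}", masked_variance(&entries, &mask));
}
```

Nothing here ever allocates the dense tensor: the cost is one pass (two for variance) over the stored entries.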

Approaching first stable

As of early March 2026, TenRSo is the higher-order tensor layer of the COOLJAPAN ecosystem: the complement to SciRS2’s matrix-centric linalg, Pure Rust the whole way down through OxiCode and OxiARC. This is a release candidate, not a finished stable release. It is the last checkpoint we wanted before stamping 0.1.0. If the RC holds up in your hands the way it holds up in ours, the first stable release follows. We are sharing it now precisely so that final cut earns its label.

Repository: https://github.com/cool-japan/tenrso

Star the repo if you want higher-order tensor computing that you can read, audit, and ship as a single static binary. Sovereign tensors, Pure Rust, no native baggage — that is the whole idea.

— KitaSan at COOLJAPAN OÜ March 6, 2026