OxiBLAS 0.2.0 Released — Recursive Factorizations, Batched BLAS, and no_std

The pure Rust BLAS/LAPACK foundation just grew up.

Today we released OxiBLAS 0.2.0 — the largest update yet to our pure Rust implementation of BLAS and LAPACK. Recursive and parallel factorizations, batched BLAS, runtime auto-tuning, multifrontal sparse solvers, mixed-precision refinement, NUMA-aware memory, and no_std support all land in one release.

No C. No Fortran. No external shared libraries. No FFI overhead. No build hell. Just clean, memory-safe linear algebra that compiles to a single static binary (or WASM) and runs everywhere — now down to no_std + alloc targets.

Why OxiBLAS 0.2.0 is a game changer

The 0.1.x line proved a pure Rust BLAS could match OpenBLAS on dense GEMM. 0.2.0 is about everything that surrounds GEMM in real numerical workloads: factorizations that adapt to cache, solvers that scale across cores, batched kernels for small-matrix workloads, and accuracy that doesn’t force you to pay full double-precision cost.

A few of the headline wins, grounded in this release:

Recursive, cache-oblivious factorizations — Cholesky::compute_recursive(), Lu::compute_recursive(), Qr::compute_recursive() adapt to the cache hierarchy automatically via divide-and-conquer.
Parallel blocked factorizations — Cholesky::compute_blocked_par() and Lu::compute_blocked_par() distribute the Level 3 BLAS updates across threads with rayon.
Real LAPACK speedups — per the README’s comparison tables, Cholesky runs 6–10× and LU 14–23× faster than naive implementations, and the corrected blocked QR (WY representation) now fully realizes its 3–7× speedup on large matrices.
Mixed-precision iterative refinement — mixed_precision_solve (and Cholesky/symmetric/QR variants) factor in f32 and refine the residual in f64, trading a controlled accuracy budget for speed.
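
To make the mixed-precision idea concrete, here is a minimal standalone sketch of the general technique in plain std Rust. It is not the OxiBLAS implementation or its API: the function names (lu_factor_f32, lu_solve_f32, mixed_precision_solve_sketch), the row-major slice layout, and the fixed iteration count are all assumptions made for illustration.

```rust
/// LU-factor a dense row-major n x n matrix in f32 with partial pivoting.
/// Returns the packed LU factors and the row permutation, or None if singular.
/// (Hypothetical helper for illustration, not an OxiBLAS function.)
fn lu_factor_f32(a: &[f64], n: usize) -> Option<(Vec<f32>, Vec<usize>)> {
    let mut lu: Vec<f32> = a.iter().map(|&x| x as f32).collect();
    let mut piv: Vec<usize> = (0..n).collect();
    for k in 0..n {
        // Choose the largest entry in column k as the pivot.
        let mut p = k;
        for i in k + 1..n {
            if lu[i * n + k].abs() > lu[p * n + k].abs() {
                p = i;
            }
        }
        if lu[p * n + k] == 0.0 {
            return None;
        }
        if p != k {
            for j in 0..n {
                lu.swap(k * n + j, p * n + j);
            }
            piv.swap(k, p);
        }
        // Eliminate below the pivot, storing the multipliers in place of L.
        for i in k + 1..n {
            lu[i * n + k] /= lu[k * n + k];
            let m = lu[i * n + k];
            for j in k + 1..n {
                lu[i * n + j] -= m * lu[k * n + j];
            }
        }
    }
    Some((lu, piv))
}

/// Solve L U x = P r with the f32 factors (the cheap, low-precision solve).
fn lu_solve_f32(lu: &[f32], piv: &[usize], r: &[f64], n: usize) -> Vec<f64> {
    let mut x: Vec<f32> = piv.iter().map(|&p| r[p] as f32).collect();
    // Forward substitution with the unit-diagonal L.
    for i in 0..n {
        for j in 0..i {
            x[i] -= lu[i * n + j] * x[j];
        }
    }
    // Back substitution with U.
    for i in (0..n).rev() {
        for j in i + 1..n {
            x[i] -= lu[i * n + j] * x[j];
        }
        x[i] /= lu[i * n + i];
    }
    x.into_iter().map(f64::from).collect()
}

/// Factor once in f32, then run `iters` refinement steps with f64 residuals.
fn mixed_precision_solve_sketch(a: &[f64], b: &[f64], n: usize, iters: usize) -> Option<Vec<f64>> {
    let (lu, piv) = lu_factor_f32(a, n)?;
    let mut x = lu_solve_f32(&lu, &piv, b, n);
    for _ in 0..iters {
        // Residual r = b - A x, accumulated in f64.
        let r: Vec<f64> = (0..n)
            .map(|i| b[i] - (0..n).map(|j| a[i * n + j] * x[j]).sum::<f64>())
            .collect();
        // Correction from the f32 factors, applied to the f64 solution.
        let d = lu_solve_f32(&lu, &piv, &r, n);
        for i in 0..n {
            x[i] += d[i];
        }
    }
    Some(x)
}
```

The expensive O(n³) factorization runs entirely in f32, while each refinement step costs only O(n²). A production routine would normally stop on a residual tolerance and fall back to a full f64 solve when refinement fails to converge; this sketch simply runs a fixed number of steps.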

Technical Deep Dive: what changed under the hood

0.2.0 deepens every layer of the workspace:

Core (oxiblas-core): New runtime SIMD dispatch infrastructure (SimdCapabilities, SimdDispatcher, KernelSelector, and a simd_dispatch! macro for function multi-versioning). NUMA-aware allocation arrives via NumaVec<T> and MatNuma<T> with real Linux topology detection, plus customizable thread pools (set_global_thread_pool, OxiblasThreadConfig). And crucially, oxiblas-core and oxiblas-matrix now support #![no_std] with alloc.
BLAS (oxiblas-blas): Batched operations (gemm_batched, gemm_strided_batched, axpy_batched, gemv_batched), each with parallel variants. Runtime auto-tuning via RuntimeAutoTuner and the gemm_auto_tuned() convenience function. New SSE4.2 intermediate GEMM micro-kernels (F64x2Sse, F32x4Sse, 4×4 tiles) fill the gap between scalar and AVX2 on older x86_64.
LAPACK (oxiblas-lapack): Recursive and parallel factorizations, complex bidiagonal reduction (ComplexBidiagFactors), and the full mixed-precision refinement family. A new tests/lapack_compat.rs integration suite adds 61 tests across LU, Cholesky, QR, SVD, EVD, and solve.
Sparse (oxiblas-sparse): Multifrontal factorizations (MultifrontalCholesky and MultifrontalLU, with elimination-tree construction and supernodal aggregation), advanced sparse pivoting strategies (SparseLuThreshold, SuperLU-style SparseLuStaticPivot, and Bunch-Kaufman SparseLdlt), and standard test-matrix generators (laplacian_2d/3d, tridiagonal, arrow_matrix, random_spd, poisson_1d).
Tooling & ndarray: A performance regression framework (PerfBaseline, RegressionChecker, JSON baselines) with a regress CLI (capture / check / report / list) for CI throughput tracking, parallel matmul_par for ndarray, and sparse interop (array2_to_csr, spmv_ndarray, sparse_solve_ndarray).
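
For readers new to runtime SIMD dispatch, the pattern looks roughly like the sketch below: detect CPU features once, then route calls through a function pointer to the best available kernel. This is an illustration using only the standard library, not the OxiBLAS SimdDispatcher. KernelTier, detect_tier, select_dot, and the dot_* kernels are hypothetical names, and the "wide" kernels are plain unrolled loops standing in for real intrinsics.

```rust
/// Kernel tiers, from widest to narrowest. (Hypothetical names for illustration.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KernelTier {
    Avx2,
    Sse42,
    Scalar,
}

/// Probe the CPU at runtime; on non-x86 targets this always reports Scalar.
fn detect_tier() -> KernelTier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return KernelTier::Avx2;
        }
        if std::arch::is_x86_feature_detected!("sse4.2") {
            return KernelTier::Sse42;
        }
    }
    KernelTier::Scalar
}

type DotFn = fn(&[f64], &[f64]) -> f64;

/// Baseline kernel: one accumulator.
fn dot_scalar(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Stand-in for a vector kernel: W independent accumulators mimic W SIMD lanes.
fn dot_lanes<const W: usize>(x: &[f64], y: &[f64]) -> f64 {
    let n = x.len().min(y.len());
    let body = n - n % W;
    let mut acc = [0.0f64; W];
    for i in (0..body).step_by(W) {
        for l in 0..W {
            acc[l] += x[i + l] * y[i + l];
        }
    }
    // Reduce the lanes, then handle the leftover tail elements.
    let mut sum: f64 = acc.iter().sum();
    for i in body..n {
        sum += x[i] * y[i];
    }
    sum
}

/// Map a detected tier to the kernel that should serve it.
fn select_dot(tier: KernelTier) -> DotFn {
    match tier {
        KernelTier::Avx2 => dot_lanes::<4>,  // 4 x f64 per 256-bit register
        KernelTier::Sse42 => dot_lanes::<2>, // 2 x f64 per 128-bit register
        KernelTier::Scalar => dot_scalar,
    }
}
```

A caller would typically run detect_tier() once at startup, cache the resulting function pointer, and call through it afterwards, so the feature probe stays off the hot path.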

This release also marks a milestone for the Pure Rust ecosystem policy: oxiblas-ffi has been retired from the workspace (the directory remains as a deprecated archive). OxiBLAS is now an end-to-end pure Rust stack with zero unwrap() calls in production code and every source file under the 2,000-line limit.

Getting Started

cargo add oxiblas

Recursive, cache-oblivious Cholesky straight from the prelude:

use oxiblas::prelude::*;

// A symmetric positive-definite matrix
let a = Mat::from_rows(&[
    &[4.0, 1.0, 1.0],
    &[1.0, 3.0, 0.0],
    &[1.0, 0.0, 2.0],
]);

// Divide-and-conquer factorization that adapts to the cache hierarchy
let chol = Cholesky::compute_recursive(a.as_ref()).expect("not positive definite");
let l = chol.l_factor(); // lower-triangular factor, A = L * Lᵀ
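
If you are curious what "divide-and-conquer" means here, the sketch below shows the textbook recursive Cholesky scheme in plain std Rust. It is an illustration of the general technique, not the code behind Cholesky::compute_recursive(). The function name chol_recursive, the row-major layout, and recursing all the way down to 1×1 blocks are assumptions for the example.

```rust
/// In-place recursive Cholesky of the n x n diagonal block of `a` starting at (off, off).
/// `a` is row-major with leading dimension `ld`. Only the lower triangle is read and
/// overwritten with L; the strict upper triangle is left untouched.
/// (Hypothetical helper for illustration, not an OxiBLAS function.)
fn chol_recursive(a: &mut [f64], ld: usize, off: usize, n: usize) -> Result<(), &'static str> {
    if n == 0 {
        return Ok(());
    }
    if n == 1 {
        let d = a[off * ld + off];
        // Written as a negated comparison so NaN is rejected too.
        if !(d > 0.0) {
            return Err("matrix is not positive definite");
        }
        a[off * ld + off] = d.sqrt();
        return Ok(());
    }
    let n1 = n / 2;
    let n2 = n - n1;
    let mid = off + n1;

    // 1. Factor the top-left block: A11 = L11 * L11^T.
    chol_recursive(a, ld, off, n1)?;

    // 2. Triangular solve for the bottom-left block: L21 = A21 * L11^-T.
    for i in mid..mid + n2 {
        for j in off..mid {
            let mut s = a[i * ld + j];
            for k in off..j {
                s -= a[i * ld + k] * a[j * ld + k];
            }
            a[i * ld + j] = s / a[j * ld + j];
        }
    }

    // 3. Symmetric update of the trailing block: A22 -= L21 * L21^T (lower part only).
    for i in mid..mid + n2 {
        for j in mid..=i {
            let mut s = 0.0;
            for k in off..mid {
                s += a[i * ld + k] * a[j * ld + k];
            }
            a[i * ld + j] -= s;
        }
    }

    // 4. Recurse on the updated trailing block: A22 = L22 * L22^T.
    chol_recursive(a, ld, mid, n2)
}
```

Because each level halves the problem, some level of the recursion fits every cache size without any tuned block parameter, which is where the "cache-oblivious" label comes from. Practical implementations usually stop recursing at a small block size and hand off to an optimized kernel, and they run steps 2 and 3 as Level 3 BLAS calls rather than scalar loops.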

Track throughput regressions in CI with the bundled regress binary:

# Capture a baseline, then fail the build if performance drops > 5%
cargo run -p oxiblas-benchmarks --bin regress -- capture --output baseline.json
cargo run -p oxiblas-benchmarks --bin regress -- check --baseline baseline.json --threshold 5.0
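
Conceptually, a threshold check like this boils down to comparing each benchmark's current throughput with its baseline. The sketch below shows that arithmetic in plain Rust. It is an assumption about how such a check can work, not the actual logic of the regress binary. The function names and the benchmark names in the usage note are hypothetical.

```rust
/// Percentage drop of `current` relative to `baseline` (positive means slower).
/// Throughput is in any consistent unit, for example GFLOP/s.
fn throughput_drop_percent(baseline: f64, current: f64) -> f64 {
    (baseline - current) / baseline * 100.0
}

/// Names of benchmarks whose throughput fell by more than `threshold_percent`.
/// Benchmarks missing from `current` are skipped rather than reported.
fn regressions<'a>(
    baseline: &[(&'a str, f64)],
    current: &[(&str, f64)],
    threshold_percent: f64,
) -> Vec<&'a str> {
    baseline
        .iter()
        .filter_map(|&(name, base)| {
            // Look up the matching measurement in the current run.
            let &(_, now) = current.iter().find(|(n, _)| *n == name)?;
            (throughput_drop_percent(base, now) > threshold_percent).then_some(name)
        })
        .collect()
}
```

With a baseline of 100.0 and 50.0 for two hypothetical benchmarks "dgemm_512" and "dpotrf_512", and current measurements of 96.0 and 45.0, a 5.0 threshold flags only "dpotrf_512": the first dropped 4%, the second 10%. A CI job would fail the build whenever the returned list is non-empty.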

What’s New in 0.2.0

LAPACK: recursive (compute_recursive) Cholesky/LU/QR, parallel blocked (compute_blocked_par) Cholesky/LU, complex bidiagonal reduction, mixed-precision iterative refinement (LU, Cholesky, symmetric, QR), and a 61-test LAPACK compatibility suite
BLAS: batched GEMM/GEMV/AXPY and strided-batched GEMM with parallel variants, RuntimeAutoTuner + gemm_auto_tuned(), and new SSE4.2 micro-kernels
Sparse: multifrontal Cholesky/LU, threshold/static/Bunch-Kaufman pivoting, standard test-matrix generators, and 27 memory-usage integration tests
Core: runtime SIMD dispatch (SimdDispatcher, simd_dispatch!), NUMA-aware allocation, customizable thread pools, and no_std support for oxiblas-core/oxiblas-matrix
Performance tooling: regression framework with JSON baselines and the regress CLI for CI
Fixed: blocked QR (WY representation) T-matrix construction corrected per the DLARFT specification, fully realizing the 3–7× large-matrix speedup
Pure Rust: oxiblas-ffi retired from the workspace; zero unwrap() in production code; all files under 2,000 lines

The release ships with roughly 169,900 lines of Rust across 359 files, along with 2,835 passing tests and 195 doctests.

Tips

Pick the right factorization variant. Use compute_recursive() for cache-adaptive performance on large dense matrices, compute_blocked_par() (with the parallel feature) to scale across cores, and the plain compute() for small problems where overhead dominates — the README’s Algorithm Selection Guide spells out the crossovers.
Batch your small matrices. For many independent small GEMMs, gemm_batched (or gemm_batched_parallel) is typically faster than a loop of single calls, because per-call overhead is paid once per batch rather than once per matrix; gemm_strided_batched avoids building slices when your data is laid out contiguously.
Let the tuner pick block sizes. Enable the runtime-tuning feature and call gemm_auto_tuned() to have RuntimeAutoTuner select blocking parameters for your CPU at runtime.
Trade precision for speed deliberately. mixed_precision_solve factors in f32 and refines in f64 — ideal when your conditioning allows it and throughput matters.
Introspect features at compile time. oxiblas::features::{HAS_PARALLEL, HAS_SPARSE, HAS_F128, HAS_RUNTIME_TUNING, ...} let you branch on enabled capabilities without cfg gymnastics.
Guard performance in CI. Wire regress capture then regress check --threshold 5.0 into your pipeline to catch throughput regressions automatically.
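
To illustrate the strided-batched layout mentioned in the batching tip, here is a minimal sketch in plain std Rust. It is not the OxiBLAS gemm_strided_batched signature. The function name, the argument order, the row-major layout, and the tightly packed strides are assumptions for the example.

```rust
/// Compute C_i = A_i * B_i for `batch` independent row-major matrices stored back to back.
/// A_i is m x k, B_i is k x n, C_i is m x n; matrix i starts at offset i * stride in its
/// buffer, where the stride is simply the size of one matrix (tightly packed).
/// (Hypothetical helper for illustration, not an OxiBLAS function.)
fn gemm_strided_batched_sketch(
    a: &[f64],
    b: &[f64],
    c: &mut [f64],
    m: usize,
    k: usize,
    n: usize,
    batch: usize,
) {
    let (stride_a, stride_b, stride_c) = (m * k, k * n, m * n);
    assert!(a.len() >= batch * stride_a, "A buffer too small");
    assert!(b.len() >= batch * stride_b, "B buffer too small");
    assert!(c.len() >= batch * stride_c, "C buffer too small");

    for i in 0..batch {
        // Locate matrix i by offset arithmetic; no per-matrix slice list is built.
        let ai = &a[i * stride_a..(i + 1) * stride_a];
        let bi = &b[i * stride_b..(i + 1) * stride_b];
        let ci = &mut c[i * stride_c..(i + 1) * stride_c];
        for r in 0..m {
            for col in 0..n {
                let mut s = 0.0;
                for p in 0..k {
                    s += ai[r * k + p] * bi[p * n + col];
                }
                ci[r * n + col] = s;
            }
        }
    }
}
```

The caller passes three flat buffers and a batch count, so thousands of small products need only one call and no per-matrix allocation. Since every product is independent, the outer loop over the batch is the natural place for a parallel variant to split the work across threads.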

This is the foundation

By March 2026 the COOLJAPAN scientific stack is in full bloom, and OxiBLAS is its mathematical bedrock:

SciRS2 / NumRS2 — all core numerical operations
PandRS — dataframe numerics
OptiRS — optimization routines
ToRSh — high-performance tensor math
Future integration with OxiLean for formally verified numerics

OxiBLAS 0.2.0 makes that foundation faster, broader, and — with the FFI retired — completely pure Rust, top to bottom.

Repository: https://github.com/cool-japan/oxiblas

Star the repo if you want high-performance scientific computing without the traditional toolchain headaches.

The era of “just link OpenBLAS” is ending.

Pure Rust numerical linear algebra is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ March 7, 2026