The pure Rust BLAS/LAPACK foundation just grew up.
Today we released OxiBLAS 0.2.0 — the largest update yet to our pure Rust implementation of BLAS and LAPACK. Recursive and parallel factorizations, batched BLAS, runtime auto-tuning, multifrontal sparse solvers, mixed-precision refinement, NUMA-aware memory, and no_std support all land in one release.
No C. No Fortran. No external shared libraries.
No FFI overhead. No build hell.
Just clean, memory-safe linear algebra that compiles to a single static binary (or WASM) and runs everywhere — now down to no_std + alloc targets.
Why OxiBLAS 0.2.0 is a game changer
The 0.1.x line proved a pure Rust BLAS could match OpenBLAS on dense GEMM. 0.2.0 is about everything that surrounds GEMM in real numerical workloads: factorizations that adapt to cache, solvers that scale across cores, batched kernels for small-matrix workloads, and accuracy that doesn’t force you to pay full double-precision cost.
A few of the headline wins, grounded in this release:
- Recursive, cache-oblivious factorizations: Cholesky::compute_recursive(), Lu::compute_recursive(), and Qr::compute_recursive() adapt to the cache hierarchy automatically via divide-and-conquer.
- Parallel blocked factorizations: Cholesky::compute_blocked_par() and Lu::compute_blocked_par() distribute the Level 3 BLAS updates across threads with rayon.
- Real LAPACK speedups: per the README’s comparison tables, Cholesky runs 6–10× and LU 14–23× faster than naive implementations, and the corrected blocked QR (WY representation) now fully realizes its 3–7× speedup on large matrices.
- Mixed-precision iterative refinement: mixed_precision_solve (and its Cholesky/symmetric/QR variants) factors in f32 and refines the residual in f64, trading a controlled accuracy budget for speed.
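To make the mixed-precision idea concrete, here is a minimal standalone sketch of iterative refinement. It is not OxiBLAS's implementation: the names LuF32, lu_factor_f32, and refine_solve_sketch are hypothetical, the matrix is a plain row-major slice, and the factorization is a textbook LU with partial pivoting. What it shows is the general pattern: factor once in f32, then repeatedly compute the residual in f64 and correct the solution using the cheap f32 factors.

```rust
/// LU factors of a row-major n x n matrix, stored in f32 (hypothetical type).
struct LuF32 {
    n: usize,
    lu: Vec<f32>,     // unit-lower L below the diagonal, U on and above it
    perm: Vec<usize>, // row i of the factored matrix is row perm[i] of A
}

/// Round A to f32 and factor it with partial pivoting. None if a pivot is zero.
fn lu_factor_f32(a: &[f64], n: usize) -> Option<LuF32> {
    assert_eq!(a.len(), n * n);
    let mut lu: Vec<f32> = a.iter().map(|&v| v as f32).collect();
    let mut perm: Vec<usize> = (0..n).collect();
    for k in 0..n {
        // Choose the largest entry in column k as the pivot.
        let mut p = k;
        for i in (k + 1)..n {
            if lu[i * n + k].abs() > lu[p * n + k].abs() {
                p = i;
            }
        }
        if lu[p * n + k] == 0.0 {
            return None;
        }
        if p != k {
            for j in 0..n {
                lu.swap(k * n + j, p * n + j);
            }
            perm.swap(k, p);
        }
        for i in (k + 1)..n {
            let f = lu[i * n + k] / lu[k * n + k];
            lu[i * n + k] = f; // store the multiplier in L
            for j in (k + 1)..n {
                lu[i * n + j] -= f * lu[k * n + j];
            }
        }
    }
    Some(LuF32 { n, lu, perm })
}

impl LuF32 {
    /// Solve with the f32 factors; the right-hand side is rounded to f32.
    fn solve(&self, b: &[f64]) -> Vec<f64> {
        let n = self.n;
        let mut y: Vec<f32> = self.perm.iter().map(|&p| b[p] as f32).collect();
        for i in 0..n {
            for j in 0..i {
                y[i] -= self.lu[i * n + j] * y[j]; // forward substitution with L
            }
        }
        for i in (0..n).rev() {
            for j in (i + 1)..n {
                y[i] -= self.lu[i * n + j] * y[j]; // back substitution with U
            }
            y[i] /= self.lu[i * n + i];
        }
        y.iter().map(|&v| v as f64).collect()
    }
}

/// Factor in f32, then refine: r = b - A x in f64, solve A d = r, x += d.
fn refine_solve_sketch(
    a: &[f64],
    b: &[f64],
    n: usize,
    max_iters: usize,
    tol: f64,
) -> Option<Vec<f64>> {
    let lu = lu_factor_f32(a, n)?;
    let mut x = lu.solve(b);
    for _ in 0..max_iters {
        let r: Vec<f64> = (0..n)
            .map(|i| {
                let ax: f64 = (0..n).map(|j| a[i * n + j] * x[j]).sum();
                b[i] - ax
            })
            .collect();
        let r_norm = r.iter().fold(0.0_f64, |m, v| m.max(v.abs()));
        if r_norm <= tol {
            break;
        }
        let d = lu.solve(&r);
        for i in 0..n {
            x[i] += d[i];
        }
    }
    Some(x)
}
```

The expensive O(n³) step runs once in single precision, while each refinement pass costs only O(n²). The loop converges quickly when the matrix is well conditioned relative to f32 precision; for badly conditioned systems it may stall, which is why the accuracy budget has to be chosen deliberately.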
Technical Deep Dive: what changed under the hood
0.2.0 deepens every layer of the workspace:
- Core (oxiblas-core): New runtime SIMD dispatch infrastructure (SimdCapabilities, SimdDispatcher, KernelSelector, and a simd_dispatch! macro for function multi-versioning). NUMA-aware allocation arrives via NumaVec<T> and MatNuma<T> with real Linux topology detection, plus customizable thread pools (set_global_thread_pool, OxiblasThreadConfig). And crucially, oxiblas-core and oxiblas-matrix now support #![no_std] with alloc.
- BLAS (oxiblas-blas): Batched operations (gemm_batched, gemm_strided_batched, axpy_batched, gemv_batched), each with parallel variants. Runtime auto-tuning via RuntimeAutoTuner and the gemm_auto_tuned() convenience function. New SSE4.2 intermediate GEMM micro-kernels (F64x2Sse, F32x4Sse, 4×4 tiles) fill the gap between scalar and AVX2 on older x86_64.
- LAPACK (oxiblas-lapack): Recursive and parallel factorizations, complex bidiagonal reduction (ComplexBidiagFactors), and the full mixed-precision refinement family. A new tests/lapack_compat.rs integration suite adds 61 tests across LU, Cholesky, QR, SVD, EVD, and solve.
- Sparse (oxiblas-sparse): Multifrontal factorizations (MultifrontalCholesky and MultifrontalLU, with elimination-tree construction and supernodal aggregation), plus advanced sparse LU pivoting (SparseLuThreshold, SuperLU-style SparseLuStaticPivot, and Bunch-Kaufman SparseLdlt), and standard test-matrix generators (laplacian_2d/3d, tridiagonal, arrow_matrix, random_spd, poisson_1d).
- Tooling & ndarray: A performance regression framework (PerfBaseline, RegressionChecker, JSON baselines) with a regress CLI (capture/check/report/list) for CI throughput tracking, parallel matmul_par for ndarray, and sparse interop (array2_to_csr, spmv_ndarray, sparse_solve_ndarray).
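For readers unfamiliar with runtime SIMD dispatch and function multi-versioning, the following is a rough sketch of the general technique using only the standard library. It does not use or reproduce SimdDispatcher or simd_dispatch!; the names KernelTier, detect_tier, and dot are hypothetical, and the "kernels" are the same scalar loop compiled under different target features so the compiler is free to vectorize each copy differently.

```rust
/// Kernel tiers, mirroring the scalar / SSE4.2 / AVX2 ladder (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KernelTier {
    Scalar,
    Sse42,
    Avx2,
}

/// Probe the CPU at runtime and pick the widest tier it supports.
/// On non-x86 targets this always falls back to Scalar.
fn detect_tier() -> KernelTier {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::is_x86_feature_detected!("avx2") {
            return KernelTier::Avx2;
        }
        if std::is_x86_feature_detected!("sse4.2") {
            return KernelTier::Sse42;
        }
    }
    KernelTier::Scalar
}

/// Portable reference kernel; also the body shared by the specialized copies.
#[inline(always)]
fn dot_generic(x: &[f64], y: &[f64]) -> f64 {
    x.iter().zip(y).map(|(a, b)| a * b).sum()
}

/// Same body, compiled with AVX2 enabled.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn dot_avx2(x: &[f64], y: &[f64]) -> f64 {
    dot_generic(x, y)
}

/// Same body, compiled with SSE4.2 enabled.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "sse4.2")]
unsafe fn dot_sse42(x: &[f64], y: &[f64]) -> f64 {
    dot_generic(x, y)
}

/// Public entry point: choose a kernel based on what the CPU reports.
fn dot(x: &[f64], y: &[f64]) -> f64 {
    assert_eq!(x.len(), y.len());
    match detect_tier() {
        // Safety: detect_tier() just confirmed the CPU supports AVX2.
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        KernelTier::Avx2 => unsafe { dot_avx2(x, y) },
        // Safety: detect_tier() just confirmed the CPU supports SSE4.2.
        #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
        KernelTier::Sse42 => unsafe { dot_sse42(x, y) },
        _ => dot_generic(x, y),
    }
}
```

A single binary built this way runs on older CPUs yet still uses wider instructions where they exist, which is the gap the SSE4.2 tier is meant to fill. A production dispatcher would typically resolve the choice once and store a function pointer rather than matching on every call.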
This release also marks a milestone for the Pure Rust ecosystem policy: oxiblas-ffi has been retired from the workspace (the directory remains as a deprecated archive). OxiBLAS is now an end-to-end pure Rust stack with zero unwrap() calls in production code and every source file under the 2,000-line limit.
Getting Started
cargo add oxiblas
Recursive, cache-oblivious Cholesky straight from the prelude:
use oxiblas::prelude::*;
// A symmetric positive-definite matrix
let a = Mat::from_rows(&[
&[4.0, 1.0, 1.0],
&[1.0, 3.0, 0.0],
&[1.0, 0.0, 2.0],
]);
// Divide-and-conquer factorization that adapts to the cache hierarchy
let chol = Cholesky::compute_recursive(a.as_ref()).expect("not positive definite");
let l = chol.l_factor(); // lower-triangular factor, A = L * Lᵀ
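To see what "divide-and-conquer" means for a factorization, here is a standalone sketch of a recursive Cholesky on a plain row-major slice. It is an illustration under stated assumptions, not the code behind compute_recursive(): the function names are hypothetical, and the triangular solve and symmetric update are written as simple loops where a real library would call Level 3 BLAS routines.

```rust
/// In-place lower Cholesky of the n x n diagonal block that starts at row and
/// column r0 of a row-major matrix with leading dimension ld.
/// Returns Err(i) with the index of the first non-positive pivot.
fn chol_recursive_sketch(a: &mut [f64], ld: usize, r0: usize, n: usize) -> Result<(), usize> {
    if n == 0 {
        return Ok(());
    }
    if n == 1 {
        let d = a[r0 * ld + r0];
        if d <= 0.0 || d.is_nan() {
            return Err(r0);
        }
        a[r0 * ld + r0] = d.sqrt();
        return Ok(());
    }
    // Split the block into [A11 . ; A21 A22] with A11 of size n1 x n1.
    let n1 = n / 2;
    let n2 = n - n1;
    let r1 = r0 + n1;

    // Step 1: factor the top-left block, A11 = L11 * L11^T.
    chol_recursive_sketch(a, ld, r0, n1)?;

    // Step 2: L21 = A21 * L11^{-T}, one forward substitution per row of A21.
    for i in r1..r1 + n2 {
        for j in r0..r0 + n1 {
            let mut s: f64 = a[i * ld + j];
            for k in r0..j {
                s -= a[i * ld + k] * a[j * ld + k];
            }
            a[i * ld + j] = s / a[j * ld + j];
        }
    }

    // Step 3: Schur complement, A22 -= L21 * L21^T (lower triangle only).
    for i in r1..r1 + n2 {
        for j in r1..=i {
            let mut s: f64 = 0.0;
            for k in r0..r0 + n1 {
                s += a[i * ld + k] * a[j * ld + k];
            }
            a[i * ld + j] -= s;
        }
    }

    // Step 4: factor the updated bottom-right block.
    chol_recursive_sketch(a, ld, r1, n2)
}

/// Convenience wrapper: returns L (row-major, upper triangle zeroed) with A = L * L^T.
fn cholesky_lower_sketch(a: &[f64], n: usize) -> Result<Vec<f64>, usize> {
    assert_eq!(a.len(), n * n);
    let mut l = a.to_vec();
    chol_recursive_sketch(&mut l, n, 0, n)?;
    for i in 0..n {
        for j in (i + 1)..n {
            l[i * n + j] = 0.0;
        }
    }
    Ok(l)
}
```

Because each level halves the problem, the sub-blocks eventually fit in every level of the cache hierarchy without any block size being tuned by hand. That is the sense in which the recursion is cache-oblivious.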
Track throughput regressions in CI with the bundled regress binary:
# Capture a baseline, then fail the build if performance drops > 5%
cargo run -p oxiblas-benchmarks --bin regress -- capture --output baseline.json
cargo run -p oxiblas-benchmarks --bin regress -- check --baseline baseline.json --threshold 5.0
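As a reading aid for the threshold, the sketch below shows one plausible interpretation of "fail if performance drops more than 5%": compare current throughput against the baseline as a relative percentage. The function names are hypothetical, and the actual comparison performed by the regress tool may differ.

```rust
/// Relative throughput drop in percent; negative values mean a speedup.
fn throughput_drop_percent(baseline: f64, current: f64) -> f64 {
    (baseline - current) / baseline * 100.0
}

/// True when throughput fell by more than `threshold_percent` versus the baseline.
fn is_regression(baseline: f64, current: f64, threshold_percent: f64) -> bool {
    throughput_drop_percent(baseline, current) > threshold_percent
}
```

Under this reading, a kernel that measured 100 GFLOP/s at capture time and 94 GFLOP/s now would fail a 5.0 threshold, while 96 GFLOP/s would pass.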
What’s New in 0.2.0
- LAPACK: recursive (compute_recursive) and parallel blocked (compute_blocked_par) Cholesky/LU/QR, complex bidiagonal reduction, mixed-precision iterative refinement (LU, Cholesky, symmetric, QR), and a 61-test LAPACK compatibility suite
- BLAS: batched and strided-batched GEMM/GEMV/AXPY with parallel variants, RuntimeAutoTuner + gemm_auto_tuned(), and new SSE4.2 micro-kernels
- Sparse: multifrontal Cholesky/LU, threshold/static/Bunch-Kaufman pivoting, standard test-matrix generators, and 27 memory-usage integration tests
- Core: runtime SIMD dispatch (SimdDispatcher, simd_dispatch!), NUMA-aware allocation, customizable thread pools, and no_std support for oxiblas-core/oxiblas-matrix
- Performance tooling: regression framework with JSON baselines and the regress CLI for CI
- Fixed: blocked QR (WY representation) T-matrix construction corrected per the DLARFT specification, fully realizing the 3–7× large-matrix speedup
- Pure Rust: oxiblas-ffi retired from the workspace; zero unwrap() in production code; all files under 2,000 lines
The release ships with roughly 169,900 lines of Rust across 359 files, covered by 2,835 passing tests plus 195 doctests.
Tips
- Pick the right factorization variant. Use compute_recursive() for cache-adaptive performance on large dense matrices, compute_blocked_par() (with the parallel feature) to scale across cores, and the plain compute() for small problems where overhead dominates; the README’s Algorithm Selection Guide spells out the crossovers.
- Batch your small matrices. For many independent small GEMMs, gemm_batched (or gemm_batched_parallel) is far better than a loop of single calls; gemm_strided_batched avoids building slices when your data is laid out contiguously.
- Let the tuner pick block sizes. Enable the runtime-tuning feature and call gemm_auto_tuned() to have RuntimeAutoTuner select blocking parameters for your CPU at runtime.
- Trade precision for speed deliberately. mixed_precision_solve factors in f32 and refines in f64, which is ideal when your conditioning allows it and throughput matters.
- Introspect features at compile time. oxiblas::features::{HAS_PARALLEL, HAS_SPARSE, HAS_F128, HAS_RUNTIME_TUNING, ...} let you branch on enabled capabilities without cfg gymnastics.
- Guard performance in CI. Wire regress capture, then regress check --threshold 5.0, into your pipeline to catch throughput regressions automatically.
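The batching tip is easier to picture with the memory layout in front of you. The sketch below shows what a strided-batched GEMM computes, assuming row-major storage and a fixed element stride between consecutive matrices. The function name and argument order are hypothetical and are not the signature of gemm_strided_batched; the triple loop stands in for an optimized kernel.

```rust
/// For each p in 0..batch, compute C_p = alpha * A_p * B_p + beta * C_p, where
/// A_p is m x k, B_p is k x n, C_p is m x n (all row-major), and matrix p of
/// each operand starts p * stride elements into its buffer.
fn gemm_strided_batched_sketch(
    batch: usize,
    m: usize,
    n: usize,
    k: usize,
    alpha: f64,
    a: &[f64],
    stride_a: usize,
    b: &[f64],
    stride_b: usize,
    beta: f64,
    c: &mut [f64],
    stride_c: usize,
) {
    for p in 0..batch {
        // Each problem is a window into one contiguous buffer, so no
        // per-matrix slice list has to be built by the caller.
        let a_p = &a[p * stride_a..p * stride_a + m * k];
        let b_p = &b[p * stride_b..p * stride_b + k * n];
        let c_p = &mut c[p * stride_c..p * stride_c + m * n];
        for i in 0..m {
            for j in 0..n {
                let mut acc: f64 = 0.0;
                for l in 0..k {
                    acc += a_p[i * k + l] * b_p[l * n + j];
                }
                c_p[i * n + j] = alpha * acc + beta * c_p[i * n + j];
            }
        }
    }
}
```

With densely packed operands the strides are simply m * k, k * n, and m * n. Since the problems are independent, the outer loop over p is also the natural place to parallelize.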
This is the foundation
By March 2026 the COOLJAPAN scientific stack is in full bloom, and OxiBLAS is its mathematical bedrock:
- SciRS2 / NumRS2 — all core numerical operations
- PandRS — dataframe numerics
- OptiRS — optimization routines
- ToRSh — high-performance tensor math
- Future integration with OxiLean for formally verified numerics
OxiBLAS 0.2.0 makes that foundation faster, broader, and — with the FFI retired — completely pure Rust, top to bottom.
Repository: https://github.com/cool-japan/oxiblas
Star the repo if you want high-performance scientific computing without the traditional toolchain headaches.
The era of “just link OpenBLAS” is ending.
Pure Rust numerical linear algebra is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ March 7, 2026