OxiFFT 0.3.0 Released — ~4× faster DCT, FFTW parity gates, GPU batch & pencil 3D

The spectral backbone of the Rust numerical stack just hit FFTW speed — without a single line of C.

Today we released OxiFFT 0.3.0 — a Pure Rust port of FFTW3 that now ships an FFT-based DCT, a seven-gate FFTW parity harness, GPU batch transforms, and 3D pencil-decomposed MPI, all behind a default build that never leaves Rust.

No C. No Fortran. No FFTW. No cuFFT. No FFI. OxiFFT is Pure Rust to the metal: its default features are 100% Rust, it compiles to a single static binary or to WebAssembly, and it carries no build.rs that shells out to a system toolchain. It is the COOLJAPAN Pure Rust answer to the spectral layer — the drop-in rustfft replacement that simultaneously displaces the heavyweight incumbents FFTW3 and cuFFT. The GPU backends are an opt-in feature; the build you get out of the box is pure, portable, and self-contained.

Why OxiFFT 0.3.0 is a game changer

For two decades, fast transforms in production meant FFTW: a brilliant C library whose price of admission is a C build, autoconf, codelet generation, and an FFI boundary you have to babysit on every platform you ship to. On the Rust side, rustfft gave us complex FFTs in safe code, but there was no credible story for the DCT, the transform that underpins JPEG, MP3/AAC, and half of audio DSP. OxiFFT 0.3.0 closes that gap, and then keeps going.

The concrete wins:

~4× faster DCT-II @ 1024 vs v0.2.0. The DCT/DST family is now FFT-based via Makhoul reduction — O(n log n) instead of the old O(n²) complex-DFT approach.
A 7-gate FFTW parity harness. Seven representative workloads, each with a committed ratio baseline, so “competitive with FFTW” is something you can re-measure, not something we just claim.
GPU batch FFT with automatic chunking. Submit N independent same-size FFTs in one GPU dispatch; OxiFFT chunks them under CUDA_BATCH_LIMIT=4096 / METAL_BATCH_LIMIT=1024 for you.
Pencil-decomposed 3D MPI FFT for distributed spectral solvers.
Cache-oblivious Frigo–Johnson 4-step FFT for large 1D transforms that blow past cache.
Real WASM SIMD v128 via core::arch::wasm32.
1360 tests passing (up from 858 in v0.2.0).

For reference, the pre-Makhoul FFTW ratio on the v0.2.0 baseline was 7.39× on the DCT gate — the new FFT-based path is what carves that down by roughly 4×.

Technical Deep Dive

Makhoul DCT/DST — the headline

The DCT-II/III/IV transforms are no longer computed by embedding the signal into a 2N- or 4N-point complex DFT. In 0.3.0 they go through Makhoul’s reduction: an N-point real-to-complex FFT followed by an O(N) post-twiddle pass. That is the whole trick — one real FFT does the heavy lifting, and a linear twiddle loop reshapes the spectrum into DCT coefficients. The result is a ~4× flop reduction versus v0.2.0.
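To make the reduction concrete, here is a minimal, std-only sketch of the Makhoul idea for DCT-II. It is not OxiFFT's code: the function names are hypothetical, it uses a small recursive complex FFT on real input where the library is described as using a real-to-complex FFT, it only handles power-of-two lengths, and it assumes the unnormalized convention X[k] = Σ x[n]·cos(πk(2n+1)/(2N)).

```rust
use std::f64::consts::PI;

/// Direct O(n^2) DCT-II reference: X[k] = sum_n x[n] * cos(pi * k * (2n + 1) / (2N)).
fn dct2_direct(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    (0..n)
        .map(|k| {
            x.iter()
                .enumerate()
                .map(|(i, &v)| v * (PI * k as f64 * (2 * i + 1) as f64 / (2.0 * n as f64)).cos())
                .sum::<f64>()
        })
        .collect()
}

/// Recursive radix-2 complex FFT on (re, im) pairs; length must be a power of two.
fn fft(a: &[(f64, f64)]) -> Vec<(f64, f64)> {
    let n = a.len();
    if n == 1 {
        return vec![a[0]];
    }
    let even: Vec<(f64, f64)> = a.iter().step_by(2).copied().collect();
    let odd: Vec<(f64, f64)> = a.iter().skip(1).step_by(2).copied().collect();
    let (e, o) = (fft(&even), fft(&odd));
    let mut out = vec![(0.0, 0.0); n];
    for k in 0..n / 2 {
        let ang = -2.0 * PI * k as f64 / n as f64;
        let (c, s) = (ang.cos(), ang.sin());
        // t = o[k] * exp(-2*pi*i*k/n)
        let t = (o[k].0 * c - o[k].1 * s, o[k].0 * s + o[k].1 * c);
        out[k] = (e[k].0 + t.0, e[k].1 + t.1);
        out[k + n / 2] = (e[k].0 - t.0, e[k].1 - t.1);
    }
    out
}

/// Makhoul-style DCT-II: even/odd permutation, one N-point FFT, O(N) post-twiddle.
fn dct2_makhoul(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    assert!(n >= 2 && n.is_power_of_two(), "sketch handles power-of-two n >= 2 only");
    // Even-indexed samples go to the front, odd-indexed samples to the back in reverse.
    let mut v = vec![(0.0, 0.0); n];
    for i in 0..n / 2 {
        v[i] = (x[2 * i], 0.0);
        v[n - 1 - i] = (x[2 * i + 1], 0.0);
    }
    let spec = fft(&v);
    // Post-twiddle: X[k] = Re(exp(-i*pi*k/(2N)) * V[k]).
    (0..n)
        .map(|k| {
            let ang = -PI * k as f64 / (2.0 * n as f64);
            spec[k].0 * ang.cos() - spec[k].1 * ang.sin()
        })
        .collect()
}
```

Calling dct2_makhoul and dct2_direct on the same power-of-two input should agree to within floating-point rounding, which is a convenient sanity check when experimenting with the permutation or the twiddle sign.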

The DCT-II default path is FFT-based for n ≥ 16 (O(n log n)), and the original direct O(n²) solver is retained as a reference fallback for n < 16, where the asymptotics don’t pay off and the direct loop is actually simpler and just as fast. The implementation lives in oxifft/src/rdft/solvers/r2r.rs.

The second half of the DCT story is plan caching. R2rPlan now caches its R2rSolver at construction, so the twiddle tables and the inner FFT plans are built once and reused on every execute(). Before this, a single dct2_1024 call constructed two Plans and recomputed its trig tables every time — concretely, 2 Plan constructions plus 2561 sin_cos calls per dct2_1024 call were being thrown away and redone. Now they survive the call. Reuse the plan and you pay that setup cost exactly once.
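The caching pattern itself is easy to picture. The sketch below is a hypothetical stand-in, not R2rPlan: Dct2PlanSketch builds its cosine table once in new(), and execute() only reads from it, so repeated calls make no trig calls. To stay short it uses the direct O(n²) formula rather than the FFT path.

```rust
use std::f64::consts::PI;

/// Hypothetical plan type for this sketch; not OxiFFT's `R2rPlan`.
pub struct Dct2PlanSketch {
    n: usize,
    /// cos(pi * m / (2n)) for m in 0..4n, computed once at construction.
    cos_table: Vec<f64>,
}

impl Dct2PlanSketch {
    /// All trigonometry happens here, exactly once per plan.
    pub fn new(n: usize) -> Self {
        let cos_table = (0..4 * n)
            .map(|m| (PI * m as f64 / (2.0 * n as f64)).cos())
            .collect();
        Self { n, cos_table }
    }

    /// Unnormalized DCT-II that only reads the cached table.
    /// cos(pi * k * (2i + 1) / (2n)) is periodic in k * (2i + 1) with period 4n,
    /// so the table index is reduced modulo 4n.
    pub fn execute(&self, x: &[f64]) -> Vec<f64> {
        assert_eq!(x.len(), self.n, "input length must match the plan size");
        (0..self.n)
            .map(|k| {
                x.iter()
                    .enumerate()
                    .map(|(i, &v)| v * self.cos_table[(k * (2 * i + 1)) % (4 * self.n)])
                    .sum::<f64>()
            })
            .collect()
    }
}
```

Building Dct2PlanSketch::new(1024) once and calling execute() per frame mirrors the build-once, execute-many usage the release recommends.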

FFTW parity gates

“FFTW-fast” is only meaningful if it is measured. 0.3.0 ships a benchmark harness of 7 parity gates at oxifft-bench/benches/fftw_parity_gates.rs:

1024-point complex FFT
2^20-point complex FFT
1024-point real FFT
1024×1024 2D FFT
1000×256 batched FFT
2017-point prime FFT (the Bluestein/Rader path)
1024-point DCT-II

Each gate has a committed ratio baseline under oxifft-bench/benches/baselines/v0.3.0/, so you can run the suite on your own CPU and see exactly where OxiFFT sits against FFTW for your hardware and workload mix — no hand-waving.

Two pieces of plumbing make the prime-size and batched gates competitive. First, SIMD pointwise-multiply helpers for the Bluestein/Rader convolutions live in kernel/complex_mul.rs: complex_mul_aos_f64 / complex_mul_aos_f32 dispatch across AVX2+FMA / NEON / SSE2 / scalar at runtime. Second, the scratch buffers are now thread-local, keyed by solver ID, which removes the mutex contention that used to serialize parallel transforms sharing a plan.
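For readers unfamiliar with runtime SIMD dispatch, the following sketch shows the general shape of a pointwise complex multiply over interleaved (re, im) data. It is an illustration under assumptions, not the code in kernel/complex_mul.rs: the function names are hypothetical, only the x86_64 branch is shown, and the "AVX2" variant simply recompiles the scalar loop with the target features enabled, where a real kernel would presumably use explicit intrinsics.

```rust
/// Scalar kernel: out[i] = a[i] * b[i] over interleaved (re, im) pairs.
fn complex_mul_scalar(a: &[f64], b: &[f64], out: &mut [f64]) {
    for ((o, x), y) in out
        .chunks_exact_mut(2)
        .zip(a.chunks_exact(2))
        .zip(b.chunks_exact(2))
    {
        o[0] = x[0] * y[0] - x[1] * y[1];
        o[1] = x[0] * y[1] + x[1] * y[0];
    }
}

/// Same loop, compiled with AVX2+FMA enabled so the compiler may vectorize it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn complex_mul_avx2(a: &[f64], b: &[f64], out: &mut [f64]) {
    complex_mul_scalar(a, b, out)
}

/// Hypothetical dispatcher: picks a kernel at runtime, falling back to scalar.
pub fn complex_mul_aos_sketch(a: &[f64], b: &[f64], out: &mut [f64]) {
    assert_eq!(a.len(), b.len());
    assert_eq!(a.len(), out.len());
    assert!(a.len() % 2 == 0, "interleaved complex data needs an even length");

    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: the required CPU features were detected just above.
            unsafe { complex_mul_avx2(a, b, out) };
            return;
        }
    }
    complex_mul_scalar(a, b, out);
}
```

The check runs on every call here for simplicity; a production dispatcher would more likely resolve the kernel once and cache the choice.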

GPU & distributed

OxiFFT 0.3.0 grows real teeth on accelerators. The new gpu/batch.rs introduces a GpuBatchFft<T> trait for running N independent same-size FFTs in a single GPU submission — exactly the shape that spectrogram pipelines and per-channel audio processing want. Batches larger than the device limit are chunked automatically under METAL_BATCH_LIMIT=1024 and CUDA_BATCH_LIMIT=4096.
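The chunking arithmetic is simple enough to sketch. The two constants below come from the release notes; chunk_batches is a hypothetical helper written for this post, not a function in gpu/batch.rs, and it only computes which transform indices would land in each submission.

```rust
use std::ops::Range;

// Limits as stated in the release notes.
const CUDA_BATCH_LIMIT: usize = 4096;
const METAL_BATCH_LIMIT: usize = 1024;

/// Hypothetical helper: split `batch` transforms into index ranges of at most `limit`.
fn chunk_batches(batch: usize, limit: usize) -> Vec<Range<usize>> {
    assert!(limit > 0, "batch limit must be positive");
    (0..batch)
        .step_by(limit)
        .map(|start| start..(start + limit).min(batch))
        .collect()
}
```

With these definitions, 10,000 transforms under CUDA_BATCH_LIMIT split into 0..4096, 4096..8192, and a shorter final chunk 8192..10000, while 2048 transforms under METAL_BATCH_LIMIT split into exactly two full chunks.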

Device discovery is now real rather than a placeholder. Metal is probed via oxicuda_metal::device::MetalDevice::new() and CUDA via oxicuda_driver::init(), replacing the old hardcoded / filesystem-check stubs. GPU kernel dispatch currently falls back to CPU pending the oxicuda-launch integration: the batch path is wired end-to-end and returns correct results today, but the transforms still execute on the CPU until the device-side kernels land.

These GPU backends ride on COOLJAPAN’s own OxiCUDA stack — oxicuda-driver, oxicuda-fft, and oxicuda-metal — and are gated behind the cuda / metal / gpu features. Nothing about the default build touches a GPU runtime.

For distributed spectral work, pencil decomposition for 3D MPI FFT lands in mpi/plans/plan_3d_pencil.rs, replacing slab-only decomposition for transforms that need to scale past a single dimension of process count.
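The difference is in how the grid is carved up. A slab decomposition splits one axis, so an N×N×N grid can use at most N ranks; a pencil decomposition splits two axes across a 2D process grid, so each rank owns a full line along the third axis. The sketch below computes the local extents of such a pencil. All names are hypothetical, and the row-major rank layout and remainder handling are assumptions for illustration, not a description of plan_3d_pencil.rs.

```rust
/// Block-distribute `n` items over `parts` owners; returns (start, len) for owner `idx`.
/// The first `n % parts` owners receive one extra item.
fn block_range(n: usize, parts: usize, idx: usize) -> (usize, usize) {
    assert!(parts > 0 && idx < parts);
    let base = n / parts;
    let rem = n % parts;
    let len = base + usize::from(idx < rem);
    let start = idx * base + idx.min(rem);
    (start, len)
}

/// Local (start, len) per axis for an x-aligned pencil on a `rows x cols` process grid.
/// Axis 0 stays whole on every rank; axes 1 and 2 are split across the grid.
fn x_pencil(dims: [usize; 3], grid: (usize, usize), rank: usize) -> [(usize, usize); 3] {
    let (rows, cols) = grid;
    assert!(rank < rows * cols);
    let (r, c) = (rank / cols, rank % cols); // assumed row-major rank layout
    [
        (0, dims[0]),
        block_range(dims[1], rows, r),
        block_range(dims[2], cols, c),
    ]
}
```

On an 8×8×8 grid with a 2×2 process grid, rank 3 would own all 8 points along axis 0 and the upper half of axes 1 and 2. A full 3D transform then alternates local 1D FFTs along the whole axis with transposes that realign the pencils.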

Portability & rigor

Beyond raw speed, 0.3.0 tightens the foundations:

Cache-oblivious 4-step FFT (Frigo–Johnson) in dft/solvers/cache_oblivious.rs — large 1D transforms that exceed cache no longer thrash.
Real WASM SIMD v128 via core::arch::wasm32 (wasm/simd.rs), with a module-split scalar fallback so non-simd128 targets still build and run.
Work-stealing scheduler for Plan2D / Plan3D in threading/work_stealing.rs, with a user-pool override if you want to bring your own threads.
Send + Sync compile-time assertions on every public plan type in assertions.rs — a thread-safety regression now fails the build, not your service.
SVE detection moved to std::arch::is_aarch64_feature_detected!("sve"); the libc dependency was removed entirely.
Production .unwrap() calls were stripped from rader_omega.rs, spectral.rs, and threading/mod.rs.
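For the 4-step FFT in the list above, the underlying factorization can be sketched in a few lines. This is the generic four-step decomposition of a length N = N1·N2 transform (column DFTs, twiddle multiply, row DFTs, transposed write-out), with naive O(n²) DFTs standing in for the sub-transforms. It illustrates the index algebra only; it is not the code in dft/solvers/cache_oblivious.rs, and every name in it is hypothetical.

```rust
use std::f64::consts::PI;

type C = (f64, f64); // (re, im)

fn cmul(a: C, b: C) -> C {
    (a.0 * b.0 - a.1 * b.1, a.0 * b.1 + a.1 * b.0)
}

/// exp(-2*pi*i * num / den)
fn twiddle(num: usize, den: usize) -> C {
    let ang = -2.0 * PI * num as f64 / den as f64;
    (ang.cos(), ang.sin())
}

/// Naive O(n^2) DFT, used as the sub-transform and as a reference.
fn dft(x: &[C]) -> Vec<C> {
    let n = x.len();
    (0..n)
        .map(|k| {
            x.iter().enumerate().fold((0.0, 0.0), |acc, (j, &v)| {
                let t = cmul(v, twiddle((j * k) % n, n));
                (acc.0 + t.0, acc.1 + t.1)
            })
        })
        .collect()
}

/// Four-step DFT of length n1 * n2.
/// Input index is n2 * j1 + j2; output index is k1 + n1 * k2.
fn four_step(x: &[C], n1: usize, n2: usize) -> Vec<C> {
    let n = n1 * n2;
    assert_eq!(x.len(), n);
    let mut a = vec![(0.0, 0.0); n]; // a[k1 * n2 + j2]
    for j2 in 0..n2 {
        // Step 1: n1-point DFT down column j2.
        let col: Vec<C> = (0..n1).map(|j1| x[n2 * j1 + j2]).collect();
        for (k1, v) in dft(&col).into_iter().enumerate() {
            // Step 2: multiply by the twiddle exp(-2*pi*i * j2 * k1 / n).
            a[k1 * n2 + j2] = cmul(v, twiddle(j2 * k1, n));
        }
    }
    let mut out = vec![(0.0, 0.0); n];
    for k1 in 0..n1 {
        // Step 3: n2-point DFT along row k1.
        for (k2, v) in dft(&a[k1 * n2..(k1 + 1) * n2]).into_iter().enumerate() {
            // Step 4: transposed write so the result is in natural order.
            out[k1 + n1 * k2] = v;
        }
    }
    out
}
```

Each sub-transform touches only n1 or n2 elements at a time, which is what lets a large 1D transform work on cache-sized pieces; four_step(&x, 3, 4) should match dft(&x) on any 12-point input to within rounding.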

On documentation rigor: every unsafe function carries a # Safety note and every fallible function carries an # Errors note (84+ unsafe functions and 84+ fallible functions documented), enforced at the crate level via #![warn(clippy::missing_safety_doc)] and clippy::missing_errors_doc.
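As a reminder of what those two rustdoc sections look like in practice, here is a small sketch. The items are hypothetical and exist only to show the shape that clippy's missing_errors_doc and missing_safety_doc lints look for on public functions; none of them are OxiFFT APIs.

```rust
/// Error for an unusable transform length (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
pub struct InvalidLength(pub usize);

/// Validates a transform length.
///
/// # Errors
///
/// Returns [`InvalidLength`] if `n` is zero.
pub fn check_len(n: usize) -> Result<usize, InvalidLength> {
    if n == 0 {
        Err(InvalidLength(n))
    } else {
        Ok(n)
    }
}

/// Reads element `i` of `data` without a bounds check.
///
/// # Safety
///
/// `i` must be less than `data.len()`.
pub unsafe fn read_unchecked(data: &[f64], i: usize) -> f64 {
    // SAFETY: the caller guarantees that `i` is in bounds.
    unsafe { *data.get_unchecked(i) }
}
```

The # Errors section tells callers which failures to handle, and the # Safety section states the contract an unsafe caller must uphold.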

Getting Started

Add OxiFFT to your project:

cargo add oxifft

Then reach for the headline feature — a Pure Rust DCT-II:

use oxifft::dct2;

// FFT-based DCT-II (O(n log n) via Makhoul) — the JPEG/audio transform, in Pure Rust
let signal: Vec<f64> = (0..1024).map(|n| (n as f64 * 0.01).cos()).collect();
let coeffs: Vec<f64> = dct2(&signal);
println!("DC term = {}", coeffs[0]);

GPU batch transforms are opt-in:

cargo add oxifft --features gpu

What’s New in 0.3.0

Added

FFT-based Makhoul DCT-II/III/IV and DST (N-point R2C + O(N) post-twiddle).
7-gate FFTW parity benchmark harness with committed v0.3.0 ratio baselines.
GpuBatchFft<T> trait and gpu/batch.rs for batched GPU FFT with auto-chunking.
Real Metal/CUDA device probes via oxicuda_metal and oxicuda_driver.
Pencil decomposition for 3D MPI FFT (mpi/plans/plan_3d_pencil.rs).
Cache-oblivious Frigo–Johnson 4-step FFT.
Real WASM SIMD v128 path with a scalar fallback module.
Work-stealing scheduler for Plan2D / Plan3D with user-pool override.
SIMD complex_mul_aos_f64 / _f32 helpers (AVX2+FMA / NEON / SSE2 / scalar).
# Safety and # Errors rustdoc across 84+ unsafe and 84+ fallible functions.

Changed

R2rPlan caches its R2rSolver at construction (plan + twiddle reuse across calls).
DCT-II default path is FFT-based for n ≥ 16; direct O(n²) solver retained for n < 16.
Scratch buffers are now thread-local, keyed by solver ID (no mutex contention).
SVE detection via std::arch::is_aarch64_feature_detected!.

Removed

The libc dependency.
Production .unwrap() from rader_omega.rs, spectral.rs, threading/mod.rs.

Fixed

Hardcoded / filesystem-check GPU device placeholders replaced with real probes.

Performance

~4× flop reduction on DCT-II @ 1024 vs v0.2.0 (O(n log n) vs O(n²)).
2 Plan constructions + 2561 sin_cos calls eliminated per dct2_1024 call.
1360 tests passing (up from 858 in v0.2.0).

Tips

Reuse your R2rPlan across calls. Construction is where the twiddle tables and inner FFT plans are built; hold the plan and you amortize that setup over every execute() instead of rebuilding it each time.

// Build once, execute many — setup cost paid a single time.
let plan = oxifft::R2rPlan::new_dct2(1024)?;
for frame in &frames {
    let coeffs = plan.execute(frame);
    // ...
}

Below n = 16, the direct O(n²) DCT is intentional. There is no log n win at those sizes, so OxiFFT deliberately uses the simpler direct solver — don’t be surprised to see it on the small path.
Run the FFTW parity gates on your own CPU. Speedup is hardware-specific; cargo bench --bench fftw_parity_gates against the committed v0.3.0 baselines tells you where you actually stand.
Mind the GPU batch limits. Batches are chunked under CUDA_BATCH_LIMIT=4096 and METAL_BATCH_LIMIT=1024; sizing your submissions as whole multiples of the relevant limit avoids a ragged final chunk.
#[non_exhaustive] enums still need a _ => arm. Carried over from 0.2.0: matching on OxiFFT’s public enums requires a catch-all so future variants don’t break your build.
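The catch-all tip looks like this in code. TransformKind is a hypothetical enum invented for the example, not one of OxiFFT's public enums. Note that the compiler only enforces the wildcard arm when the enum is matched from a different crate; inside the defining crate the attribute has no effect on matching.

```rust
/// Hypothetical enum for illustration; not one of OxiFFT's public enums.
#[non_exhaustive]
#[derive(Debug, Clone, Copy)]
pub enum TransformKind {
    Dct2,
    Dct3,
    Dct4,
}

/// Matches the variants we care about and routes everything else to a catch-all.
pub fn describe(kind: TransformKind) -> &'static str {
    match kind {
        TransformKind::Dct2 => "DCT-II",
        TransformKind::Dct3 => "DCT-III",
        // Catch-all: covers Dct4 today and any variant added in a later release.
        _ => "other",
    }
}
```

Downstream code written this way keeps compiling when a new variant appears; the trade-off is that new variants fall into the catch-all silently, so it is worth making that arm do something visible.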

The foundation

OxiFFT is the spectral layer of the COOLJAPAN ecosystem. By late April 2026 it sits beside a roster of mature siblings: SciRS2 and NumRS2 for scientific and array computing, OxiBLAS for linear algebra, OxiCUDA — its actual GPU backend, via oxicuda-driver / oxicuda-fft / oxicuda-metal — for accelerator dispatch, and the ML stack of ToRSh, TenfloweRS, TrustformeRS, and SkleaRS. It also underpins signal and audio work in OxiWhisper and physical simulation in OxiPhysics. Every FFT, DCT, and spectral solve in that stack can now run on a Pure Rust foundation, with GPU acceleration available the moment you flip a feature flag.

Repository: https://github.com/cool-japan/oxifft

Star the repo if you believe fast transforms shouldn’t require a C toolchain.

Pure Rust spectral computing — fast, safe, sovereign, and now FFTW-fast.

— KitaSan at COOLJAPAN OÜ April 25, 2026