OxiFFT 0.3.0 Released — ~4× faster DCT, FFTW parity gates, GPU batch & pencil 3D

Pure Rust FFT and the rustfft replacement. OxiFFT 0.3.0 lands an FFT-based Makhoul DCT (~4× flop reduction), a 7-gate FFTW parity harness, GPU batch FFT with auto-chunking, 3D pencil MPI, a cache-oblivious 4-step transform, and real WASM SIMD v128 — 1360 tests passing, default build still 100% Rust.

Tags: release, oxifft, fft, fftw, dct, gpu, simd, mpi, rust, rustfft

The spectral backbone of the Rust numerical stack just hit FFTW speed — without a single line of C.

Today we released OxiFFT 0.3.0 — a Pure Rust port of FFTW3 that now ships an FFT-based DCT, a seven-gate FFTW parity harness, GPU batch transforms, and 3D pencil-decomposed MPI, all behind a default build that never leaves Rust.

No C. No Fortran. No FFTW. No cuFFT. No FFI. OxiFFT is Pure Rust to the metal: its default features are 100% Rust, it compiles to a single static binary or to WebAssembly, and it carries no build.rs that shells out to a system toolchain. It is the COOLJAPAN Pure Rust answer to the spectral layer — the drop-in rustfft replacement that simultaneously displaces the heavyweight incumbents FFTW3 and cuFFT. The GPU backends are an opt-in feature; the build you get out of the box is pure, portable, and self-contained.

Why OxiFFT 0.3.0 is a game changer

For two decades, fast transforms in production meant FFTW: a brilliant C library whose price of admission is a C build, autoconf, codelet generation, and an FFI boundary you have to babysit on every platform you ship to. On the Rust side, rustfft gave us complex FFTs in safe code, but it left no credible story for the DCT, the transform family that underpins JPEG, MP3/AAC, and half of audio DSP. OxiFFT 0.3.0 closes that gap, and then keeps going.

The concrete wins:

For reference, on the v0.2.0 baseline, before the Makhoul path existed, the DCT gate's ratio against FFTW stood at 7.39×; the new FFT-based path is what carves that down by roughly 4×.

Technical Deep Dive

Makhoul DCT/DST — the headline

The DCT-II/III/IV transforms are no longer computed by embedding the signal into a 2N- or 4N-point complex DFT. In 0.3.0 they go through Makhoul’s reduction: an N-point real-to-complex FFT followed by an O(N) post-twiddle pass. That is the whole trick — one real FFT does the heavy lifting, and a linear twiddle loop reshapes the spectrum into DCT coefficients. The result is a ~4× flop reduction versus v0.2.0.
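The reduction described above can be sketched in a few dozen lines. This is an illustration of Makhoul's reordering, not OxiFFT's code: the function names are hypothetical, a textbook recursive radix-2 complex FFT stands in for the real-to-complex FFT, only power-of-two lengths are handled, and the output is left unnormalized (OxiFFT's own scaling convention is not stated in this post).

```rust
use std::f64::consts::PI;

/// Minimal complex number as (re, im); enough for this sketch.
type C = (f64, f64);

fn c_mul(a: C, b: C) -> C {
    (a.0 * b.0 - a.1 * b.1, a.0 * b.1 + a.1 * b.0)
}

/// Textbook recursive radix-2 FFT (length must be a power of two).
/// Stands in for the real-to-complex FFT a production library would use.
fn fft(input: &[C]) -> Vec<C> {
    let n = input.len();
    if n == 1 {
        return vec![input[0]];
    }
    let even: Vec<C> = input.iter().step_by(2).copied().collect();
    let odd: Vec<C> = input.iter().skip(1).step_by(2).copied().collect();
    let (e, o) = (fft(&even), fft(&odd));
    let mut out = vec![(0.0, 0.0); n];
    for k in 0..n / 2 {
        let ang = -2.0 * PI * k as f64 / n as f64;
        let t = c_mul((ang.cos(), ang.sin()), o[k]);
        out[k] = (e[k].0 + t.0, e[k].1 + t.1);
        out[k + n / 2] = (e[k].0 - t.0, e[k].1 - t.1);
    }
    out
}

/// Unnormalized DCT-II via Makhoul's reordering:
/// X[k] = sum_n x[n] * cos(pi * (2n + 1) * k / (2N)).
pub fn dct2_makhoul(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    assert!(n.is_power_of_two(), "this sketch only handles power-of-two lengths");
    if n == 1 {
        return vec![x[0]];
    }
    // Step 1: even-indexed samples ascending, odd-indexed samples descending.
    let mut v = vec![(0.0, 0.0); n];
    for i in 0..n / 2 {
        v[i] = (x[2 * i], 0.0);
        v[n - 1 - i] = (x[2 * i + 1], 0.0);
    }
    // Step 2: a single N-point FFT of the reordered signal.
    let spectrum = fft(&v);
    // Step 3: O(N) post-twiddle, X[k] = Re(exp(-i*pi*k/(2N)) * V[k]).
    spectrum
        .iter()
        .enumerate()
        .map(|(k, &(re, im))| {
            let theta = PI * k as f64 / (2.0 * n as f64);
            re * theta.cos() + im * theta.sin()
        })
        .collect()
}

/// Direct O(n^2) evaluation of the same definition, used as a reference.
pub fn dct2_direct(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    (0..n)
        .map(|k| {
            (0..n)
                .map(|i| x[i] * (PI * ((2 * i + 1) * k) as f64 / (2.0 * n as f64)).cos())
                .sum::<f64>()
        })
        .collect()
}
```

Calling `dct2_makhoul(&signal)` and `dct2_direct(&signal)` on the same power-of-two-length input should agree to within floating-point rounding, which is a convenient sanity check when experimenting with the reordering.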

The DCT-II default path is FFT-based for n ≥ 16 (O(n log n)), and the original direct O(n²) solver is retained as a reference fallback for n < 16, where the asymptotics don’t pay off and the direct loop is actually simpler and just as fast. The implementation lives in oxifft/src/rdft/solvers/r2r.rs.

The second half of the DCT story is plan caching. R2rPlan now caches its R2rSolver at construction, so the twiddle tables and the inner FFT plans are built once and reused on every execute(). Before this, a single dct2_1024 call constructed two Plans and recomputed its trig tables every time — concretely, 2 Plan constructions plus 2561 sin_cos calls per dct2_1024 call were being thrown away and redone. Now they survive the call. Reuse the plan and you pay that setup cost exactly once.
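The caching idea itself is independent of the transform algorithm. As a minimal sketch, assuming a hypothetical `CachedDct2` type (not OxiFFT's `R2rPlan`), the plan below computes its cosine table once in `new` and performs no trigonometric calls in `execute`. It uses the direct O(n²) definition purely to keep the example short.

```rust
use std::f64::consts::PI;

/// Hypothetical plan type for illustration; not OxiFFT's `R2rPlan`.
pub struct CachedDct2 {
    n: usize,
    /// cos(pi * m / (2n)) for m in 0..4n, computed once in `new`.
    cos_table: Vec<f64>,
}

impl CachedDct2 {
    /// Pays the trigonometric setup cost exactly once.
    pub fn new(n: usize) -> Self {
        let cos_table = (0..4 * n)
            .map(|m| (PI * m as f64 / (2.0 * n as f64)).cos())
            .collect();
        Self { n, cos_table }
    }

    /// Unnormalized DCT-II using only table lookups; no trig calls here.
    pub fn execute(&self, x: &[f64]) -> Vec<f64> {
        assert_eq!(x.len(), self.n, "input length must match the plan size");
        (0..self.n)
            .map(|k| {
                x.iter()
                    .enumerate()
                    // The cosine argument (2i + 1) * k is periodic with period 4n.
                    .map(|(i, &xi)| xi * self.cos_table[((2 * i + 1) * k) % (4 * self.n)])
                    .sum::<f64>()
            })
            .collect()
    }
}
```

Constructing the plan outside a loop and calling `execute` inside it mirrors the build-once, execute-many pattern: the table survives across calls instead of being rebuilt each time.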

FFTW parity gates

“FFTW-fast” is only meaningful if it is measured. 0.3.0 ships a benchmark harness of 7 parity gates at oxifft-bench/benches/fftw_parity_gates.rs:

  1. 1024-point complex FFT
  2. 2^20-point complex FFT
  3. 1024-point real FFT
  4. 1024×1024 2D FFT
  5. 1000×256 batched FFT
  6. 2017-point prime FFT (the Bluestein/Rader path)
  7. 1024-point DCT-II

Each gate has a committed ratio baseline under oxifft-bench/benches/baselines/v0.3.0/, so you can run the suite on your own CPU and see exactly where OxiFFT sits against FFTW for your hardware and workload mix — no hand-waving.
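How the harness evaluates a gate is not spelled out here, so the following is only one plausible reading: each gate compares a measured OxiFFT-to-FFTW time ratio against its committed baseline, with some tolerance for noise. The `Gate` struct, the function names, and the tolerance parameter are all assumptions made for illustration.

```rust
/// One parity gate: a label plus its committed ratio baseline (hypothetical layout).
pub struct Gate {
    pub name: &'static str,
    pub baseline_ratio: f64,
}

/// Ratio of timings; under this sketch's convention, values above 1.0
/// mean OxiFFT took longer than FFTW on the same workload.
pub fn time_ratio(oxifft_ns: f64, fftw_ns: f64) -> f64 {
    oxifft_ns / fftw_ns
}

/// Passes if the measured ratio has not regressed past the baseline by more
/// than `tolerance` (for example 0.10 for 10%). The tolerance is an assumption.
pub fn gate_passes(gate: &Gate, measured_ratio: f64, tolerance: f64) -> bool {
    measured_ratio <= gate.baseline_ratio * (1.0 + tolerance)
}
```

A ratio-based gate has the advantage that it is largely hardware-independent: absolute timings change from CPU to CPU, but the ratio against FFTW on the same machine is what the baseline tracks.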

Two pieces of plumbing make the prime-size and batched gates competitive. First, SIMD pointwise-multiply helpers for the Bluestein/Rader convolutions live in kernel/complex_mul.rs: complex_mul_aos_f64 / complex_mul_aos_f32 dispatch across AVX2+FMA / NEON / SSE2 / scalar at runtime. Second, the scratch buffers are now thread-local, keyed by solver ID, which removes the mutex contention that used to serialize parallel transforms sharing a plan.
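To make the runtime-dispatch idea concrete, here is a minimal sketch of a pointwise complex multiply over interleaved (array-of-structures) data. It is not the code in kernel/complex_mul.rs: the function names are hypothetical, only the x86_64 AVX2+FMA and scalar paths are shown, and the "SIMD" path simply recompiles the scalar loop with those features enabled rather than using hand-written intrinsics.

```rust
/// Scalar pointwise product of interleaved complex arrays: [re0, im0, re1, im1, ...].
#[inline]
fn complex_mul_aos_scalar(dst: &mut [f64], a: &[f64], b: &[f64]) {
    for ((d, x), y) in dst
        .chunks_exact_mut(2)
        .zip(a.chunks_exact(2))
        .zip(b.chunks_exact(2))
    {
        d[0] = x[0] * y[0] - x[1] * y[1];
        d[1] = x[0] * y[1] + x[1] * y[0];
    }
}

/// Same loop compiled with AVX2+FMA enabled, so the compiler may vectorize it.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn complex_mul_aos_avx2(dst: &mut [f64], a: &[f64], b: &[f64]) {
    complex_mul_aos_scalar(dst, a, b)
}

/// Picks the best available path at runtime (hypothetical name).
pub fn complex_mul_aos_dispatch(dst: &mut [f64], a: &[f64], b: &[f64]) {
    assert!(dst.len() == a.len() && a.len() == b.len(), "length mismatch");
    assert!(a.len() % 2 == 0, "interleaved data needs an even length");
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // SAFETY: the required CPU features were just detected at runtime.
            unsafe { complex_mul_aos_avx2(dst, a, b) };
            return;
        }
    }
    complex_mul_aos_scalar(dst, a, b);
}
```

The feature check runs on every call in this sketch; a real kernel would more likely resolve the choice once and cache a function pointer.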

GPU & distributed

OxiFFT 0.3.0 grows real teeth on accelerators. The new gpu/batch.rs introduces a GpuBatchFft<T> trait for running N independent same-size FFTs in a single GPU submission, exactly the shape that spectrogram pipelines and per-channel audio processing want. Batches larger than the device limit (METAL_BATCH_LIMIT = 1024, CUDA_BATCH_LIMIT = 4096) are split into chunks automatically.
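The chunking step is simple index arithmetic. As a sketch, assuming a hypothetical helper `chunk_batch` (the two constants carry the values quoted above; everything else is illustrative and not the GpuBatchFft API):

```rust
use std::ops::Range;

/// Per-submission batch limits, with the values quoted in this post.
pub const METAL_BATCH_LIMIT: usize = 1024;
pub const CUDA_BATCH_LIMIT: usize = 4096;

/// Splits `batch` transforms into consecutive index ranges of at most `limit` each.
/// Hypothetical helper; each range would become one GPU submission.
pub fn chunk_batch(batch: usize, limit: usize) -> Vec<Range<usize>> {
    assert!(limit > 0, "limit must be positive");
    (0..batch)
        .step_by(limit)
        .map(|start| start..(start + limit).min(batch))
        .collect()
}
```

For example, 2500 transforms under the Metal limit would be split into three submissions covering indices 0..1024, 1024..2048, and 2048..2500.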

Device discovery is now real rather than a placeholder. Metal is probed via oxicuda_metal::device::MetalDevice::new() and CUDA via oxicuda_driver::init(), replacing the old hardcoded / filesystem-check stubs. GPU kernel dispatch currently falls back to CPU pending the oxicuda-launch integration, so the path is wired end-to-end and correct today, with the device-side kernels landing next.

These GPU backends ride on COOLJAPAN’s own OxiCUDA stack — oxicuda-driver, oxicuda-fft, and oxicuda-metal — and are gated behind the cuda / metal / gpu features. Nothing about the default build touches a GPU runtime.

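For distributed spectral work, pencil decomposition for 3D MPI FFT lands in mpi/plans/plan_3d_pencil.rs, replacing the slab-only decomposition. A slab decomposition splits the grid along one axis, which caps the process count at that axis's extent; a pencil decomposition splits along two, so the transform can scale across a 2D process grid.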
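The bookkeeping behind a pencil layout can be sketched without any MPI calls. The helpers below are hypothetical and only compute which index ranges a given process would own; they say nothing about how plan_3d_pencil.rs actually distributes or transposes data.

```rust
use std::ops::Range;

/// Contiguous block of `n` indices owned by `rank` out of `parts`,
/// with any remainder spread over the lowest ranks.
pub fn block_range(n: usize, parts: usize, rank: usize) -> Range<usize> {
    assert!(parts > 0 && rank < parts, "rank must be in 0..parts");
    let (base, rem) = (n / parts, n % parts);
    let start = rank * base + rank.min(rem);
    let len = base + usize::from(rank < rem);
    start..start + len
}

/// Local x-pencil for the process at `coord` on a `grid[0]` x `grid[1]` process grid:
/// the full x axis, with y split over grid rows and z split over grid columns.
pub fn x_pencil(dims: [usize; 3], grid: [usize; 2], coord: [usize; 2]) -> [Range<usize>; 3] {
    [
        0..dims[0],
        block_range(dims[1], grid[0], coord[0]),
        block_range(dims[2], grid[1], coord[1]),
    ]
}
```

With an 8×8×8 grid, a slab split along z gives at most 8 non-empty ranks, while a pencil split over y and z can use up to 64 (ignoring the extra constraints that the transposed layouts impose in a full implementation).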

Portability & rigor

Beyond raw speed, 0.3.0 tightens the foundations:

On documentation rigor: every unsafe function carries a # Safety section and every fallible function carries an # Errors section (84+ unsafe fns and 84+ fallible fns documented), enforced at the crate level via #![warn(clippy::missing_safety_doc)] and the matching clippy::missing_errors_doc lint.
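For readers unfamiliar with the convention, this is the shape of doc comment those two Clippy lints look for. The functions and the error type are hypothetical examples, not items from OxiFFT.

```rust
/// Returned when a buffer does not match the expected length (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct LengthMismatch {
    pub expected: usize,
    pub actual: usize,
}

/// Scales every sample in `buf` by `factor`.
///
/// # Errors
///
/// Returns [`LengthMismatch`] if `buf.len()` differs from `expected_len`.
pub fn scale_checked(
    buf: &mut [f64],
    expected_len: usize,
    factor: f64,
) -> Result<(), LengthMismatch> {
    if buf.len() != expected_len {
        return Err(LengthMismatch { expected: expected_len, actual: buf.len() });
    }
    buf.iter_mut().for_each(|s| *s *= factor);
    Ok(())
}

/// Scales `len` samples starting at `ptr` by `factor`.
///
/// # Safety
///
/// `ptr` must be non-null, aligned, and valid for reads and writes of `len`
/// consecutive `f64` values, and no other reference may alias that memory
/// for the duration of the call.
pub unsafe fn scale_raw(ptr: *mut f64, len: usize, factor: f64) {
    // SAFETY: the caller upholds the contract documented above.
    let buf = unsafe { std::slice::from_raw_parts_mut(ptr, len) };
    buf.iter_mut().for_each(|s| *s *= factor);
}
```

With the lints enabled, Clippy warns on a public `unsafe fn` that lacks a `# Safety` section and on a public function returning `Result` that lacks an `# Errors` section.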

Getting Started

Add OxiFFT to your project:

cargo add oxifft

Then reach for the headline feature — a Pure Rust DCT-II:

use oxifft::dct2;

fn main() {
    // FFT-based DCT-II (O(n log n) via Makhoul): the JPEG/audio transform, in Pure Rust
    let signal: Vec<f64> = (0..1024).map(|n| (n as f64 * 0.01).cos()).collect();
    let coeffs: Vec<f64> = dct2(&signal);
    println!("DC term = {}", coeffs[0]);
}

GPU batch transforms are opt-in:

cargo add oxifft --features gpu

What’s New in 0.3.0

Added

Changed

Removed

Fixed

Performance

Tips

// Build once, execute many — setup cost paid a single time.
let plan = oxifft::R2rPlan::new_dct2(1024)?;
for frame in &frames {
    let coeffs = plan.execute(frame);
    // ...
}

The foundation

OxiFFT is the spectral layer of the COOLJAPAN ecosystem. By late April 2026 it sits beside a roster of mature siblings: SciRS2 and NumRS2 for scientific and array computing, OxiBLAS for linear algebra, OxiCUDA — its actual GPU backend, via oxicuda-driver / oxicuda-fft / oxicuda-metal — for accelerator dispatch, and the ML stack of ToRSh, TenfloweRS, TrustformeRS, and SkleaRS. It also underpins signal and audio work in OxiWhisper and physical simulation in OxiPhysics. Every FFT, DCT, and spectral solve in that stack can now run on a Pure Rust foundation, with GPU acceleration available the moment you flip a feature flag.

Repository: https://github.com/cool-japan/oxifft

Star the repo if you believe fast transforms shouldn’t require a C toolchain.

Pure Rust spectral computing — fast, safe, sovereign, and now FFTW-fast.

KitaSan at COOLJAPAN OÜ, April 25, 2026
