OxiWhisper 0.1.1 Released — GGUF Support, Parallel Attention, and Stable Word Timestamps

The Pure Rust Whisper engine just got a bigger appetite: GGUF models, more audio formats, and word timestamps you can trust.

Today we released OxiWhisper 0.1.1, an incremental update that teaches the loader the modern GGUF format, adds optional multi-core attention, cuts KV-cache memory by roughly 25–50% with f16, and graduates word-level timestamps to Stable.

The ground rule has not moved. OxiWhisper is still a Pure Rust OpenAI Whisper inference engine with no C, no C++, and no Python in the build. While the alternatives — whisper.cpp as a vendored C++ tree, Python + PyTorch with libtorch behind it, and ONNX Runtime as a native provider — keep their foreign-language dependencies, OxiWhisper stays a single Rust crate that compiles to one static binary or to WebAssembly. 0.1.1 widens what it can read and how fast it can run without spending any of that sovereignty.

Why OxiWhisper 0.1.1

If 0.1.0 proved the engine was real, 0.1.1 is about meeting the formats and hardware people actually have. The headline is GGUF: the community has largely moved from the legacy ggml-*.bin files to *.gguf, and you should not have to think about which one you grabbed. There were also two performance gaps: single-threaded decoding leaves cores idle on a laptop, and large models hold more RSS than they need to. This release closes both.

The numbers grew with the surface area:

17,101 lines of Rust across 24 modules, with 10 examples still shipped.
399 tests, up from 278.
Word-level timestamps graduated from Alpha to Stable, now backed by real banded DTW.
f16 KV-cache that trims roughly 25–50% of cache memory.
Memory-mapped loading and an optional parallel feature that spreads attention across cores.

All of that builds on the same foundation as before — the Arc copy-on-write KV cache that already saved about 4.5 GB of beam-search allocations, and quantized models where tiny still drops from roughly 150 MB to about 40 MB at Q4_0.

Technical Deep Dive

The architecture is unchanged; the pipeline still runs the genuine Whisper encoder-decoder:

Audio -> Mel Spectrogram (OxiFFT) -> Encoder (Conv + Transformer)
-> Decoder (Autoregressive + KV Cache + Beam Search) -> Tokenizer -> Text

What changed sits at several points along that path. At the very front, load_audio() in the audio layer is a new magic-byte auto-detecting loader: behind the symphonia-backed audio-flac, audio-ogg, audio-mp3, audio-aac, and audio-opus features (or audio-all for everything), it decodes those containers directly — no external ffmpeg, all still Rust. The mel front end remains OxiFFT, now upgraded to oxifft 0.3.
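To make "magic-byte auto-detecting" concrete, here is a minimal sketch of container sniffing. It is not OxiWhisper's implementation: the AudioContainer enum and sniff_audio function are hypothetical names, and the MPEG/ADTS frame-sync checks are a simplified heuristic. The signatures themselves (RIFF/WAVE, fLaC, OggS, ID3) are the well-known file headers.

```rust
/// Container kinds recognized by this sketch (hypothetical, not an OxiWhisper type).
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum AudioContainer {
    Wav,
    Flac,
    Ogg, // Ogg can carry Vorbis or Opus streams
    Mp3,
    AacAdts,
}

/// Guess the container from the first bytes of a file.
fn sniff_audio(head: &[u8]) -> Option<AudioContainer> {
    if head.len() >= 12 && &head[0..4] == b"RIFF" && &head[8..12] == b"WAVE" {
        return Some(AudioContainer::Wav);
    }
    if head.starts_with(b"fLaC") {
        return Some(AudioContainer::Flac);
    }
    if head.starts_with(b"OggS") {
        return Some(AudioContainer::Ogg);
    }
    if head.starts_with(b"ID3") {
        // An ID3v2 tag usually precedes MP3 frames.
        return Some(AudioContainer::Mp3);
    }
    if head.len() >= 2 && head[0] == 0xFF {
        let b = head[1];
        // ADTS: 12 sync bits set and the 2-bit layer field equal to 00.
        if b & 0xF6 == 0xF0 {
            return Some(AudioContainer::AacAdts);
        }
        // MPEG audio frame: 11 sync bits set and a non-reserved layer field.
        if b & 0xE0 == 0xE0 && b & 0x06 != 0 {
            return Some(AudioContainer::Mp3);
        }
    }
    None
}

fn main() {
    println!("{:?}", sniff_audio(b"fLaC\0\0\0\x22"));
    println!("{:?}", sniff_audio(&[0xFF, 0xFB, 0x90, 0x00]));
}
```

Sniffing the header rather than trusting the file extension is what lets a single entry point accept a mislabeled or extension-less file.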

At model load time, WhisperModel::from_file() and the new from_file_mmap() inspect magic bytes and transparently accept both legacy GGML and modern GGUF; the API did not change, you simply point at a different path. from_file_mmap() uses memmap2 to back the weights with a memory mapping, which lowers peak RSS noticeably on large models.
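The format check can be pictured as reading one little-endian u32 from the start of the file. The sketch below is an illustration, not the code in src/model.rs: ModelFormat, detect_model_format, and detect_model_file are hypothetical names. The two constants are the commonly documented magics: GGUF files begin with the ASCII bytes "GGUF", and legacy whisper.cpp files begin with the u32 0x67676d6c ("ggml").

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical stand-in for whatever the loader uses internally.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ModelFormat {
    LegacyGgml,
    Gguf,
}

const GGUF_MAGIC: u32 = 0x4655_4747; // bytes "GGUF" read as little-endian u32
const GGML_MAGIC: u32 = 0x6767_6d6c; // "ggml", stored on disk as bytes "lmgg"

/// Classify a model by its first four bytes.
fn detect_model_format(magic: [u8; 4]) -> Option<ModelFormat> {
    match u32::from_le_bytes(magic) {
        GGUF_MAGIC => Some(ModelFormat::Gguf),
        GGML_MAGIC => Some(ModelFormat::LegacyGgml),
        _ => None,
    }
}

/// Read the magic from a file on disk and classify it.
fn detect_model_file(path: &Path) -> io::Result<Option<ModelFormat>> {
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(detect_model_format(magic))
}

fn main() -> io::Result<()> {
    // Pass a model path as the first argument to classify a real file.
    if let Some(path) = std::env::args().nth(1) {
        println!("{:?}", detect_model_file(Path::new(&path))?);
    }
    Ok(())
}
```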

Inside the decoder, the SDPA hot path was rewritten. The scalar triple-loops are gone, replaced by matrixmultiply::sgemm, and on the encoder side the attention scratch allocations were hoisted out of the per-head loops so they are not reallocated every head. With the optional parallel feature enabled, the per-head SDPA loops in the decoder and the encoder attention fan out across a rayon pool that threading::set_thread_count(n) configures; it is off by default precisely so WASM and single-threaded builds are untouched. The KV cache gained dtype control through KvCacheDtype { F32, VHalf, KvHalf }, so you can store keys and/or values in f16.
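As a rough picture of per-head fan-out, the sketch below runs scaled dot-product attention for each head on its own thread. It is an assumption-laden illustration: std::thread::scope stands in for the rayon pool, plain loops stand in for matrixmultiply::sgemm, and sdpa_head and sdpa_all_heads are hypothetical names with a head-major tensor layout chosen for simplicity.

```rust
use std::thread;

/// Scaled dot-product attention for one head.
/// q is [n_q * d], k and v are [n_k * d], all row-major; out is [n_q * d].
fn sdpa_head(q: &[f32], k: &[f32], v: &[f32], d: usize, out: &mut [f32]) {
    let n_q = q.len() / d;
    let n_k = k.len() / d;
    let scale = 1.0 / (d as f32).sqrt();
    // Scratch buffer allocated once per head, not once per query row.
    let mut scores = vec![0.0f32; n_k];
    for i in 0..n_q {
        let qi = &q[i * d..(i + 1) * d];
        let mut max = f32::NEG_INFINITY;
        for j in 0..n_k {
            let kj = &k[j * d..(j + 1) * d];
            let dot: f32 = qi.iter().zip(kj).map(|(a, b)| a * b).sum();
            scores[j] = dot * scale;
            max = max.max(scores[j]);
        }
        // Numerically stable softmax over the scores.
        let mut denom = 0.0f32;
        for s in scores.iter_mut() {
            *s = (*s - max).exp();
            denom += *s;
        }
        let oi = &mut out[i * d..(i + 1) * d];
        oi.fill(0.0);
        for j in 0..n_k {
            let w = scores[j] / denom;
            let vj = &v[j * d..(j + 1) * d];
            for (o, x) in oi.iter_mut().zip(vj) {
                *o += w * x;
            }
        }
    }
}

/// Run every head on its own scoped thread. Inputs are head-major and non-empty.
fn sdpa_all_heads(q: &[f32], k: &[f32], v: &[f32], n_heads: usize, d: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; q.len()];
    let (q_len, k_len) = (q.len() / n_heads, k.len() / n_heads);
    thread::scope(|s| {
        for (h, out_h) in out.chunks_mut(q_len).enumerate() {
            let qh = &q[h * q_len..(h + 1) * q_len];
            let kh = &k[h * k_len..(h + 1) * k_len];
            let vh = &v[h * k_len..(h + 1) * k_len];
            s.spawn(move || sdpa_head(qh, kh, vh, d, out_h));
        }
    });
    out
}

fn main() {
    // Two heads, one query row and two key/value rows each, d = 2.
    let q = vec![0.0f32; 4];
    let k = vec![1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let v = vec![1.0f32, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0];
    println!("{:?}", sdpa_all_heads(&q, &k, &v, 2, 2));
}
```

Heads are independent, so each thread writes a disjoint slice of the output and no locking is needed. A thread pool avoids the per-call spawn cost that this scoped-thread version pays.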

The biggest accuracy change is in alignment. align_tokens_dtw now implements a real Sakoe-Chiba-banded dynamic-programming DTW with traceback, replacing the old monotonic-peak approximation. The result is smoother, more accurate word timestamps when the cross-attention is noisy. The previous behavior lives on as align_tokens_monotonic_peak(), kept as a deprecated shim for SemVer.

The tokenizer also got more correct: parse_json_string now decodes UTF-16 surrogate pairs properly, so emoji, Mathematical Alphanumeric Symbols, and CJK Extension B characters survive a round trip through tokenizer.json, while a lone high surrogate such as \uD800 returns an error instead of being silently dropped.

Under the hood, quantize.rs was refactored into a src/quantize/ directory of 7 modules, each under 500 lines, with the public API preserved, and a new tests/ directory adds 5 integration binaries gated where appropriate by a test-utils feature for the synthetic model generator.
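For readers unfamiliar with the alignment technique named above, here is a minimal sketch of DTW restricted to a Sakoe-Chiba band, with traceback. It is a textbook formulation, not the body of align_tokens_dtw: banded_dtw is a hypothetical name, the band is centered on a rescaled diagonal (an assumption for non-square matrices), and the cost matrix is whatever the caller supplies, for example negated attention weights over tokens by frames.

```rust
/// DTW over an n x m row-major cost matrix, restricted to a band of half-width
/// `radius` around the rescaled diagonal. Returns the total cost and the path
/// from (0, 0) to (n - 1, m - 1), or None if the band makes the end unreachable.
fn banded_dtw(cost: &[f32], n: usize, m: usize, radius: usize) -> Option<(f32, Vec<(usize, usize)>)> {
    if n == 0 || m == 0 || cost.len() != n * m {
        return None;
    }
    let in_band = |i: usize, j: usize| -> bool {
        // Row i of a non-square matrix maps to column i * (m - 1) / (n - 1).
        let center = if n > 1 { i * (m - 1) / (n - 1) } else { 0 };
        j.abs_diff(center) <= radius
    };
    let inf = f32::INFINITY;
    let mut acc = vec![inf; n * m]; // accumulated cost; cells outside the band stay infinite
    for i in 0..n {
        for j in 0..m {
            if !in_band(i, j) {
                continue;
            }
            let best_prev = if i == 0 && j == 0 {
                0.0
            } else {
                let up = if i > 0 { acc[(i - 1) * m + j] } else { inf };
                let left = if j > 0 { acc[i * m + j - 1] } else { inf };
                let diag = if i > 0 && j > 0 { acc[(i - 1) * m + j - 1] } else { inf };
                up.min(left).min(diag)
            };
            acc[i * m + j] = cost[i * m + j] + best_prev;
        }
    }
    let total = acc[n * m - 1];
    if !total.is_finite() {
        return None;
    }
    // Traceback: walk from the end to the start along the cheapest predecessor.
    let (mut i, mut j) = (n - 1, m - 1);
    let mut path = vec![(i, j)];
    while i > 0 || j > 0 {
        let up = if i > 0 { acc[(i - 1) * m + j] } else { inf };
        let left = if j > 0 { acc[i * m + j - 1] } else { inf };
        let diag = if i > 0 && j > 0 { acc[(i - 1) * m + j - 1] } else { inf };
        if diag <= up && diag <= left {
            i -= 1;
            j -= 1;
        } else if up <= left {
            i -= 1;
        } else {
            j -= 1;
        }
        path.push((i, j));
    }
    path.reverse();
    Some((total, path))
}

fn main() {
    // Two tokens over four frames: token 0 fits frames 0..=2, token 1 fits frame 3.
    let cost = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 0.0];
    println!("{:?}", banded_dtw(&cost, 2, 4, 2));
}
```

The band bounds the work per row to about 2 * radius + 1 cells and forbids implausibly skewed alignments. If the radius is too small for the matrix shape, the end cell becomes unreachable, which this sketch reports as None.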

Getting Started

cargo add oxiwhisper

use oxiwhisper::{WhisperModel, TranscribeOptions};
use std::path::Path;

fn main() -> Result<(), oxiwhisper::OxiWhisperError> {
    // Same API for ggml-*.bin and *.gguf — the loader auto-detects from magic bytes.
    let model = WhisperModel::from_file_mmap(Path::new("ggml-base.gguf"))?;
    let audio = oxiwhisper::audio::load_audio(Path::new("podcast.mp3"))?; // needs audio-mp3 feature
    let result = model.transcribe_long_with_vad(&audio, &TranscribeOptions::default(), Default::default())?;
    println!("{}", result.text);
    Ok(())
}

Two things to notice: the path ends in .gguf and nothing else changed, and the audio is an .mp3 decoded through load_audio() with the audio-mp3 feature — no ffmpeg in sight. If you want the simplest possible long-audio call, model.transcribe_long(&audio, &opts) works just as well.

What’s New in 0.1.1

GGUF format support. WhisperModel::from_file() and from_file_mmap() auto-detect magic bytes and transparently accept both legacy GGML (ggml-*.bin) and modern GGUF (*.gguf). No API change — just change the path. (src/model.rs)
parallel feature (optional, not default). Per-head parallelism in the decoder SDPA loops and encoder attention via rayon; threading::set_thread_count(n) configures the pool. Off by default so WASM and single-threaded builds are unaffected. (src/threading.rs, src/decoder/sdpa.rs, src/encoder.rs)
Memory-mapped loading. WhisperModel::from_file_mmap() via memmap2 — lower peak RSS for large models.
f16 KV-cache. KvCacheDtype { F32, VHalf, KvHalf } for roughly 25–50% KV-cache memory savings.
More audio formats. load_audio() is a magic-byte auto-detecting loader; FLAC/OGG/MP3/AAC/Opus come from the audio-flac/audio-ogg/audio-mp3/audio-aac/audio-opus features (symphonia-backed; audio-all enables all).
True banded DTW. align_tokens_dtw now does Sakoe-Chiba-banded dynamic-programming DTW with traceback, replacing the monotonic-peak approximation, for smoother and more accurate word timestamps on noisy attention. align_tokens_monotonic_peak() is the renamed older path (a deprecated shim is kept for SemVer).
Changed. Decoder SDPA hot-path migrated from scalar triple-loops to matrixmultiply::sgemm; encoder attention scratch allocations hoisted out of the head loops; oxifft upgraded to 0.3.
Fixed. parse_json_string now correctly decodes UTF-16 surrogate pairs (emoji, Mathematical Alphanumeric Symbols, CJK Extension B) from tokenizer.json; a lone high surrogate (\uD800) now returns Err instead of being silently dropped. (src/tokenizer.rs)
Refactor. quantize.rs split into src/quantize/ (7 modules, each under 500 lines), all public API preserved; a new tests/ directory adds 5 integration binaries, and a new test-utils feature gates the synthetic model generator.
Known issue. With the onnx feature enabled, Cargo.lock transiently holds both oxifft 0.2.0 (via oxionnx-ops) and oxifft 0.3.0 (direct) until oxionnx-ops upgrades; there is zero impact when onnx is disabled, which is the default.
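The surrogate-pair fix listed above can be illustrated with the standard library alone. This is a sketch of the decoding rule, not the code in src/tokenizer.rs: decode_units is a hypothetical helper that takes the 16-bit values already parsed from \uXXXX escapes and either combines them into characters or reports the unpaired surrogate.

```rust
use std::char::decode_utf16;

/// Turn UTF-16 code units (as parsed from JSON \uXXXX escapes) into a String.
/// An unpaired surrogate is returned as the error value instead of being dropped.
fn decode_units(units: &[u16]) -> Result<String, u16> {
    decode_utf16(units.iter().copied())
        .collect::<Result<String, _>>()
        .map_err(|e| e.unpaired_surrogate())
}

fn main() {
    // "\uD83D\uDE00" in JSON is the single code point U+1F600.
    println!("{:?}", decode_units(&[0xD83D, 0xDE00]));
    // A lone high surrogate is rejected.
    println!("{:?}", decode_units(&[0xD800]));
}
```

Characters outside the Basic Multilingual Plane, such as U+1D400 or U+20000, always arrive as two escapes in JSON, which is why dropping or mis-pairing surrogates corrupts exactly the character classes the release notes mention.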

Tips

Point the loader at a .gguf and forget about it. from_file() and from_file_mmap() both auto-detect, so the same code reads legacy and modern checkpoints. Prefer from_file_mmap() on large models to keep peak RSS low.
Turn on parallel to use your cores. Build with features = ["parallel"] and call threading::set_thread_count(n) to spread decoder and encoder attention across the machine. Leave it off for WASM and single-threaded targets.
Shrink the KV cache with f16. Switch to KvCacheDtype::VHalf or KvCacheDtype::KvHalf to save roughly 25–50% of cache memory, which helps when long-audio decoding pushes memory.
Decode real-world audio without ffmpeg. Enable an audio-* feature (for example audio-mp3) and call load_audio() to transcribe FLAC, OGG, MP3, AAC, or Opus directly.
Word timestamps are Stable now. They use real banded DTW, so transcribe_timed() gives noticeably smoother alignment, especially on noisy attention.
Mind the onnx lockfile note. If you enable the onnx feature, expect a transient duplicate oxifft in Cargo.lock — it is harmless, and disabling onnx avoids it entirely.
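The 25–50% range in the f16 tip follows from simple arithmetic, sketched below. Everything here is an assumption for illustration: KvDtype is a local stand-in that mirrors the variant names from the release notes, VHalf is taken to mean "values in f16, keys in f32", and the layer, context, and state sizes are illustrative numbers rather than values read from OxiWhisper.

```rust
/// Local stand-in mirroring the variant names in the release notes.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KvDtype {
    F32,    // keys and values in f32
    VHalf,  // assumed: values in f16, keys in f32
    KvHalf, // assumed: keys and values in f16
}

/// Bytes needed for a self-attention KV cache of the given shape.
fn kv_cache_bytes(n_layers: usize, n_ctx: usize, n_state: usize, dtype: KvDtype) -> usize {
    let elems = n_layers * n_ctx * n_state; // elements in K, and again in V
    let (k_bytes, v_bytes) = match dtype {
        KvDtype::F32 => (4, 4),
        KvDtype::VHalf => (4, 2),
        KvDtype::KvHalf => (2, 2),
    };
    elems * (k_bytes + v_bytes)
}

fn main() {
    // Illustrative decoder shape: 6 layers, 448 positions, 512-wide state.
    let (layers, ctx, state) = (6, 448, 512);
    let base = kv_cache_bytes(layers, ctx, state, KvDtype::F32);
    for dtype in [KvDtype::F32, KvDtype::VHalf, KvDtype::KvHalf] {
        let bytes = kv_cache_bytes(layers, ctx, state, dtype);
        let saved = 100.0 * (base - bytes) as f64 / base as f64;
        println!("{:?}: {} bytes ({:.0}% saved)", dtype, bytes, saved);
    }
}
```

Under these assumptions, halving only the values removes 2 of every 8 bytes (25%), and halving both keys and values removes 4 of every 8 (50%), independent of model size.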

This is the foundation

OxiWhisper sits in the COOLJAPAN Pure Rust stack as its speech-recognition layer, and its anchor is still OxiFFT — the dependency that computes the log-mel spectrogram and now rides at version 0.3. SciRS2 and NumRS2 provide the numerical bedrock, and VoiRS, the audio and speech sibling, pairs naturally with OxiWhisper when a project needs text-to-speech on the same Pure Rust footing as speech-to-text. Looking ahead, GPU acceleration is the obvious next frontier for inference of this shape — the broader ecosystem has been building toward that with OxiCUDA — though OxiWhisper today depends on none of it and remains CPU-and-WASM Pure Rust. The direction is clear: a complete, sovereign audio pipeline with no foreign-language runtime anywhere in sight.

Repository: https://github.com/cool-japan/oxiwhisper

Star the repo if a Pure Rust Whisper that reads GGUF and decodes MP3 out of the box is your kind of tool.

Pure Rust speech recognition is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ April 26, 2026