OxiWhisper 0.1.1 Released — GGUF Support, Parallel Attention, and Stable Word Timestamps

OxiWhisper 0.1.1 adds transparent GGUF loading, an optional rayon-parallel attention feature, memory-mapped models, f16 KV-cache, FLAC/OGG/MP3/AAC/Opus decoding, and true banded-DTW word timestamps — still Pure Rust, no C or Python.

The Pure Rust Whisper engine just got a bigger appetite: GGUF models, more audio formats, and word timestamps you can trust.

Today we released OxiWhisper 0.1.1 — an incremental update that teaches the loader the modern GGUF format, adds optional multi-core attention, halves KV-cache memory with f16, and graduates word-level timestamps to Stable.

The ground rule has not moved. OxiWhisper is still a Pure Rust OpenAI Whisper inference engine with no C, no C++, and no Python in the build. While the alternatives — whisper.cpp as a vendored C++ tree, Python + PyTorch with libtorch behind it, and ONNX Runtime as a native provider — keep their foreign-language dependencies, OxiWhisper stays a single Rust crate that compiles to one static binary or to WebAssembly. 0.1.1 widens what it can read and how fast it can run without spending any of that sovereignty.

Why OxiWhisper 0.1.1

If 0.1.0 proved the engine was real, 0.1.1 is about meeting the formats and hardware people actually have. The headline is GGUF: the community has largely moved from the legacy ggml-*.bin files to *.gguf, and you should not have to think about which one you grabbed. There is also a simple performance gap — single-threaded decoding leaves cores idle on a laptop, and large models hold more RSS than they need to. This release closes both.

The numbers grew with the surface area:

All of that builds on the same foundation as before — the Arc copy-on-write KV cache that already saved about 4.5 GB of beam-search allocations, and quantized models where tiny still drops from roughly 150 MB to about 40 MB at Q4_0.

Technical Deep Dive

The architecture is unchanged; the pipeline still runs the genuine Whisper encoder-decoder:

Audio -> Mel Spectrogram (OxiFFT) -> Encoder (Conv + Transformer)
-> Decoder (Autoregressive + KV Cache + Beam Search) -> Tokenizer -> Text

What changed sits at several points along that path. At the very front, load_audio() in the audio layer is a new magic-byte auto-detecting loader: behind the symphonia-backed audio-flac, audio-ogg, audio-mp3, audio-aac, and audio-opus features (or audio-all for everything), it decodes those containers directly — no external ffmpeg, all still Rust. The mel front end remains OxiFFT, now upgraded to oxifft 0.3.
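As a rough illustration of what magic-byte auto-detection looks like, here is a minimal standalone sketch. It is not OxiWhisper's implementation: the `AudioKind` enum and `sniff_audio` function are hypothetical names, and the signatures checked are the publicly documented container markers (RIFF/WAVE, fLaC, OggS, ID3, MPEG and ADTS frame sync).

```rust
/// Hypothetical container kinds for this sketch; not OxiWhisper's own type.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum AudioKind {
    Wav,
    Flac,
    Ogg, // Vorbis and Opus streams are both commonly carried in Ogg
    Mp3,
    AacAdts,
    Unknown,
}

/// Guess the container from the first bytes of a file.
fn sniff_audio(header: &[u8]) -> AudioKind {
    // WAV: "RIFF" <4-byte size> "WAVE"
    if header.starts_with(b"RIFF") && header.get(8..12) == Some(&b"WAVE"[..]) {
        return AudioKind::Wav;
    }
    if header.starts_with(b"fLaC") {
        return AudioKind::Flac;
    }
    if header.starts_with(b"OggS") {
        return AudioKind::Ogg;
    }
    // MP3 files often begin with an ID3v2 tag.
    if header.starts_with(b"ID3") {
        return AudioKind::Mp3;
    }
    if header.len() >= 2 && header[0] == 0xFF {
        let b1 = header[1];
        // ADTS: 12 sync bits set and the 2-bit layer field equal to 00.
        if b1 & 0xF6 == 0xF0 {
            return AudioKind::AacAdts;
        }
        // MPEG audio frame: 11 sync bits set and a non-reserved layer field.
        if b1 & 0xE0 == 0xE0 && b1 & 0x06 != 0 {
            return AudioKind::Mp3;
        }
    }
    AudioKind::Unknown
}
```

Feeding the first dozen bytes of a file to `sniff_audio` is enough to choose a decoder. A production loader would likely probe further, for example to tell Opus from Vorbis inside an Ogg stream.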

At model load time, WhisperModel::from_file() and the new from_file_mmap() inspect magic bytes and transparently accept both legacy GGML and modern GGUF. The API did not change: you simply point at a different path. from_file_mmap() uses memmap2 to back the weights with a memory mapping, which lowers peak RSS noticeably on large models.
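To make the format check concrete, the sketch below distinguishes the two files by their first four bytes. The names `ModelFormat`, `classify_magic`, and `detect_model_format` are hypothetical and this is not OxiWhisper's loader. It assumes the conventional markers: the ASCII bytes "GGUF" for GGUF, and the little-endian u32 0x67676d6c used by legacy whisper.cpp ggml files.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;

/// Hypothetical format tag for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum ModelFormat {
    LegacyGgml,
    Gguf,
    Unknown,
}

/// Classify a model file from its first four bytes.
fn classify_magic(magic: [u8; 4]) -> ModelFormat {
    if &magic == b"GGUF" {
        return ModelFormat::Gguf;
    }
    // Assumption: legacy files store the u32 0x67676d6c in little-endian
    // order, so the on-disk bytes read "lmgg".
    if u32::from_le_bytes(magic) == 0x6767_6d6c {
        return ModelFormat::LegacyGgml;
    }
    ModelFormat::Unknown
}

/// Read only the 4-byte magic from disk and classify it.
fn detect_model_format(path: &Path) -> io::Result<ModelFormat> {
    let mut magic = [0u8; 4];
    File::open(path)?.read_exact(&mut magic)?;
    Ok(classify_magic(magic))
}
```

Because only four bytes are read before dispatching, the caller can keep a single entry point and never has to name the format explicitly.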

Inside the decoder, the SDPA hot path was rewritten. The scalar triple loops are gone, replaced by matrixmultiply::sgemm, and on the encoder side the attention scratch allocations were hoisted out of the per-head loops so they are no longer reallocated for every head. With the optional parallel feature enabled, the per-head SDPA loops in the decoder and the encoder attention fan out across a rayon pool that threading::set_thread_count(n) configures; it is off by default precisely so WASM and single-threaded builds are untouched. The KV cache gained dtype control through KvCacheDtype { F32, VHalf, KvHalf }, so you can store keys and/or values in f16.
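A back-of-the-envelope sketch shows where the f16 saving comes from. The enum below is redeclared locally and only mirrors the variant names in the post. Reading VHalf as "values in f16, keys in f32" and KvHalf as "both in f16" is an assumption drawn from those names, and `self_attn_cache_bytes` is a hypothetical helper that counts only one decoder self-attention cache.

```rust
/// Local stand-in that mirrors the variant names from the post.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum KvCacheDtype {
    F32,
    VHalf,  // assumed: keys f32, values f16
    KvHalf, // assumed: keys f16, values f16
}

impl KvCacheDtype {
    /// (bytes per key element, bytes per value element)
    fn elem_bytes(self) -> (usize, usize) {
        match self {
            KvCacheDtype::F32 => (4, 4),
            KvCacheDtype::VHalf => (4, 2),
            KvCacheDtype::KvHalf => (2, 2),
        }
    }
}

/// Bytes held by one self-attention KV cache at full context length.
fn self_attn_cache_bytes(
    layers: usize,
    ctx: usize,
    d_model: usize,
    dtype: KvCacheDtype,
) -> usize {
    // One key tensor and one value tensor of `elems` elements each.
    let elems = layers * ctx * d_model;
    let (k_bytes, v_bytes) = dtype.elem_bytes();
    elems * (k_bytes + v_bytes)
}
```

With Whisper tiny's published decoder shape (4 layers, 448 text positions, width 384), this gives 5,505,024 bytes at F32, 4,128,768 at VHalf, and 2,752,512 at KvHalf. The saving is multiplied under beam search, where several hypotheses each hold a cache.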

The biggest accuracy change is in alignment. align_tokens_dtw now implements a real Sakoe-Chiba-banded dynamic-programming DTW with traceback, replacing the old monotonic-peak approximation. The result is smoother, more accurate word timestamps when the cross-attention is noisy. The previous behavior lives on as align_tokens_monotonic_peak(), kept as a deprecated shim for SemVer.

The tokenizer also got more correct: parse_json_string now decodes UTF-16 surrogate pairs properly, so emoji, Mathematical Alphanumeric Symbols, and CJK Extension B characters survive a round trip through tokenizer.json, while a lone high surrogate such as \uD800 returns an error instead of being silently dropped.

Under the hood, quantize.rs was refactored into a src/quantize/ directory of 7 modules, each under 500 lines, with the public API preserved, and a new tests/ directory adds 5 integration binaries gated where appropriate by a test-utils feature for the synthetic model generator.
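For readers unfamiliar with the alignment technique, here is a minimal sketch of DTW with a Sakoe-Chiba band and traceback. It is not the code behind align_tokens_dtw: the function name `banded_dtw`, the cost-matrix layout (rows are tokens, columns are audio frames, lower is better, for example negated cross-attention), and the way the band is centred on the scaled diagonal are all assumptions made for illustration.

```rust
/// Minimum-cost monotonic path from (0, 0) to (n-1, m-1) through `cost`,
/// visiting only cells within `radius` columns of the scaled diagonal.
/// Returns None for empty or ragged input, or when the band is too
/// narrow for any path to reach the end.
fn banded_dtw(cost: &[Vec<f32>], radius: usize) -> Option<Vec<(usize, usize)>> {
    let n = cost.len();
    let m = cost.first()?.len();
    if m == 0 || cost.iter().any(|row| row.len() != m) {
        return None;
    }
    // Sakoe-Chiba band: column j must lie near the diagonal at row i.
    let in_band = |i: usize, j: usize| -> bool {
        let diag = if n > 1 { i * (m - 1) / (n - 1) } else { 0 };
        j.abs_diff(diag) <= radius
    };

    let mut acc = vec![vec![f32::INFINITY; m]; n];
    // Backpointers: 0 = diagonal, 1 = from row above, 2 = from column left.
    let mut back = vec![vec![0u8; m]; n];
    for i in 0..n {
        for j in 0..m {
            if !in_band(i, j) {
                continue;
            }
            if i == 0 && j == 0 {
                acc[0][0] = cost[0][0];
                continue;
            }
            let mut best = f32::INFINITY;
            let mut step = 0u8;
            if i > 0 && j > 0 && acc[i - 1][j - 1] < best {
                best = acc[i - 1][j - 1];
                step = 0;
            }
            if i > 0 && acc[i - 1][j] < best {
                best = acc[i - 1][j];
                step = 1;
            }
            if j > 0 && acc[i][j - 1] < best {
                best = acc[i][j - 1];
                step = 2;
            }
            if best.is_finite() {
                acc[i][j] = best + cost[i][j];
                back[i][j] = step;
            }
        }
    }
    if !acc[n - 1][m - 1].is_finite() {
        return None;
    }
    // Traceback from the end cell to the origin, then reverse.
    let (mut i, mut j) = (n - 1, m - 1);
    let mut path = vec![(i, j)];
    while i > 0 || j > 0 {
        match back[i][j] {
            0 => {
                i -= 1;
                j -= 1;
            }
            1 => i -= 1,
            _ => j -= 1,
        }
        path.push((i, j));
    }
    path.reverse();
    Some(path)
}
```

In the returned path, the frames paired with each token give that token's time span. The band keeps the search near the diagonal, which bounds the work per row and discourages implausible jumps when the attention map is noisy.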
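The surrogate-pair handling in the tokenizer can be illustrated in isolation. The sketch below is not parse_json_string itself. `decode_utf16_units` is a hypothetical helper that takes the 16-bit values already parsed from \uXXXX escapes and applies the standard UTF-16 rule: a high surrogate (0xD800..=0xDBFF) must be followed by a low surrogate (0xDC00..=0xDFFF), and anything else is an error.

```rust
/// Combine UTF-16 code units (as found in JSON \uXXXX escapes) into a String.
/// Unpaired surrogates produce an error rather than being dropped.
fn decode_utf16_units(units: &[u16]) -> Result<String, String> {
    let mut out = String::new();
    let mut i = 0;
    while i < units.len() {
        let u = units[i];
        match u {
            // High surrogate: must be followed by a low surrogate.
            0xD800..=0xDBFF => {
                let lo = match units.get(i + 1) {
                    Some(&lo) if (0xDC00..=0xDFFF).contains(&lo) => lo,
                    _ => return Err(format!("lone high surrogate U+{:04X}", u)),
                };
                let cp = 0x10000
                    + (((u as u32) - 0xD800) << 10)
                    + ((lo as u32) - 0xDC00);
                let ch = char::from_u32(cp)
                    .ok_or_else(|| format!("invalid code point U+{:X}", cp))?;
                out.push(ch);
                i += 2;
            }
            // A low surrogate with no preceding high surrogate.
            0xDC00..=0xDFFF => {
                return Err(format!("lone low surrogate U+{:04X}", u));
            }
            // Basic Multilingual Plane: the unit is the code point.
            _ => {
                let ch = char::from_u32(u as u32)
                    .ok_or_else(|| format!("invalid code unit U+{:04X}", u))?;
                out.push(ch);
                i += 1;
            }
        }
    }
    Ok(out)
}
```

For example, the escape sequence \uD83D\uDE00 arrives as the units 0xD83D and 0xDE00, which combine to U+1F600, while a bare 0xD800 is rejected.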

Getting Started

cargo add oxiwhisper

use oxiwhisper::{WhisperModel, TranscribeOptions};
use std::path::Path;

fn main() -> Result<(), oxiwhisper::OxiWhisperError> {
    // Same API for ggml-*.bin and *.gguf — the loader auto-detects from magic bytes.
    let model = WhisperModel::from_file_mmap(Path::new("ggml-base.gguf"))?;
    let audio = oxiwhisper::audio::load_audio(Path::new("podcast.mp3"))?; // needs audio-mp3 feature
    let result = model.transcribe_long_with_vad(&audio, &TranscribeOptions::default(), Default::default())?;
    println!("{}", result.text);
    Ok(())
}

Two things to notice: the path ends in .gguf and nothing else changed, and the audio is an .mp3 decoded through load_audio() with the audio-mp3 feature — no ffmpeg in sight. If you want the simplest possible long-audio call, model.transcribe_long(&audio, &opts) works just as well.

What’s New in 0.1.1

Tips

This is the foundation

OxiWhisper sits in the COOLJAPAN Pure Rust stack as its speech-recognition layer, and its anchor is still OxiFFT — the dependency that computes the log-mel spectrogram and now rides at version 0.3. SciRS2 and NumRS2 provide the numerical bedrock, and VoiRS, the audio and speech sibling, pairs naturally with OxiWhisper when a project needs text-to-speech on the same Pure Rust footing as speech-to-text. Looking ahead, GPU acceleration is the obvious next frontier for inference of this shape — the broader ecosystem has been building toward that with OxiCUDA — though OxiWhisper today depends on none of it and remains CPU-and-WASM Pure Rust. The direction is clear: a complete, sovereign audio pipeline with no foreign-language runtime anywhere in sight.

Repository: https://github.com/cool-japan/oxiwhisper

Star the repo if a Pure Rust Whisper that reads GGUF and decodes MP3 out of the box is your kind of tool.

Pure Rust speech recognition is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, April 26, 2026
