OxiWhisper 0.1.0 Released — Pure Rust Whisper Speech-to-Text, No C, No Python

Speech recognition has belonged to C++ and Python for far too long. Today that changes.

OxiWhisper 0.1.0 is out: a Pure Rust inference engine that loads pretrained OpenAI Whisper weights and transcribes audio to text with no C, C++, or Python in the build.

If you have ever shipped Whisper, you know the toolchain. The fast path is whisper.cpp, a C++ codebase you vendor, patch, and cross-compile per platform. The reference path is Python + PyTorch, which drags in libtorch, a CUDA stack, and an interpreter. The “portable” path is ONNX Runtime, another native C++ dependency with its own provider matrix. Every one of them is a foreign-language artifact bolted onto your Rust binary through FFI.

OxiWhisper takes a different position. No C. No C++. No Python. It is one Rust crate that compiles to a single static binary, or to WebAssembly with simd128 kernels, and runs Whisper inference end to end. To be precise about scope: OxiWhisper is an inference engine. It loads existing Whisper checkpoints and transcribes; it is not a trainer.

Why OxiWhisper 0.1.0

The incumbent pain is not accuracy — Whisper is excellent — it is delivery. A C++ dependency means a build system that fights you on every new target, a Python runtime means you ship an interpreter and a multi-gigabyte wheel, and an ONNX provider means matching native libraries to each OS. All three make WASM either painful or impossible.

OxiWhisper deletes that surface area, and it arrives surprisingly complete for a first release:

12,596 lines of Rust across 25 modules, Apache-2.0 licensed.
278 tests and 10 runnable examples.
Zero clippy warnings, zero doc warnings, every file under 2,000 lines, and no unwrap() in production code.
A real KV-cache design that uses Arc copy-on-write across beam-search hypotheses, saving roughly 4.5 GB of allocations versus naive per-beam copies (a measured number from the changelog).
Quantized models that are dramatically smaller — tiny drops from about 150 MB (f32) to roughly 40 MB (Q4_0) — with dequantization fused into the GEMV so you pay no separate unpack pass.
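
To make "dequantization fused into the GEMV" concrete, here is a minimal scalar sketch. It is not OxiWhisper's kernel, which is SIMD code. It assumes the GGML Q4_0 layout: 32 weights per block, one shared scale, 4-bit codes offset by 8, with the low nibbles holding the first 16 weights. The names Q4Block, dot_q4_0, and gemv_q4_0 are made up for this example, and the scale is stored as f32 rather than GGML's f16 so the code needs only std.

```rust
/// One Q4_0-style block: 32 weights stored as 4-bit codes plus a shared scale.
/// (Illustrative type; GGML stores the scale as f16, f32 is used here for simplicity.)
pub struct Q4Block {
    pub scale: f32,
    pub quants: [u8; 16], // two 4-bit codes per byte
}

pub const BLOCK_LEN: usize = 32;

/// Dot product of one quantized weight row with an f32 activation vector.
/// Each weight is decoded at the moment it is multiplied, so the row is
/// never unpacked into a temporary f32 buffer.
pub fn dot_q4_0(row: &[Q4Block], x: &[f32]) -> f32 {
    assert_eq!(x.len(), row.len() * BLOCK_LEN);
    let mut acc = 0.0f32;
    for (block, xs) in row.iter().zip(x.chunks_exact(BLOCK_LEN)) {
        // Low nibbles map to the first 16 activations, high nibbles to the last 16.
        let (first_half, second_half) = xs.split_at(BLOCK_LEN / 2);
        let mut block_sum = 0.0f32;
        for (j, &byte) in block.quants.iter().enumerate() {
            // Codes are unsigned 0..=15; subtracting 8 recentres them to -8..=7.
            let lo = (byte & 0x0F) as i32 - 8;
            let hi = (byte >> 4) as i32 - 8;
            block_sum += lo as f32 * first_half[j] + hi as f32 * second_half[j];
        }
        // The scale is applied once per block rather than once per weight.
        acc += block.scale * block_sum;
    }
    acc
}

/// Matrix-vector product: one fused dot product per output row.
pub fn gemv_q4_0(rows: &[Vec<Q4Block>], x: &[f32]) -> Vec<f32> {
    rows.iter().map(|row| dot_q4_0(row, x)).collect()
}
```

A SIMD version applies the same idea to many nibbles per instruction; what matters is that decode and multiply-accumulate happen in the same loop.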

Where each piece stands today:

Subsystem	Status	Tests
Core inference (encoder/decoder)	Stable	278 passing
Quantized inference (Q4_0/Q5_0/Q8_0)	Stable	40+
SIMD kernels (AVX2+FMA / NEON / simd128)	Stable	15+
Streaming	Stable	8+
Word-level timestamps (DTW)	Alpha	6
ONNX loading	Stable	13

Technical Deep Dive

OxiWhisper implements the real Whisper encoder-decoder transformer, and the pipeline mirrors the architecture exactly:

Audio (WAV/f32) -> Mel Spectrogram (OxiFFT) -> Encoder (Conv + Transformer)
-> Decoder (Autoregressive + KV Cache + Beam Search) -> Tokenizer -> Text

It starts with audio. The audio module is a pure-Rust WAV parser that handles PCM at 8, 16, 24, and 32 bits plus IEEE float, downmixes multi-channel input, and resamples to the 16 kHz Whisper expects. From there the mel and mel_filters modules compute the log-mel spectrogram — and this is where OxiFFT does the heavy lifting. The FFT that turns each audio frame into a spectrum is OxiFFT, a Pure Rust dependency; the mel front end is not an approximation, it is the genuine Whisper feature extraction built on a real FFT.
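
As a rough illustration of the first steps described above, the sketch below downmixes interleaved 16-bit PCM to mono f32 and counts how many spectrogram frames a clip produces. The function names are hypothetical and are not OxiWhisper's API. The constants (16 kHz sample rate, 160-sample hop, 30-second window) are the ones used by the reference Whisper front end.

```rust
/// Front-end constants used by the reference Whisper implementation.
pub const SAMPLE_RATE: usize = 16_000;
pub const HOP_LENGTH: usize = 160; // 10 ms between spectrogram frames
pub const CHUNK_SECONDS: usize = 30;

/// Convert interleaved 16-bit PCM to mono f32 in [-1.0, 1.0) by averaging
/// the channels of each sample frame. Trailing samples that do not form a
/// complete frame are ignored.
pub fn pcm16_to_mono_f32(interleaved: &[i16], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be positive");
    interleaved
        .chunks_exact(channels)
        .map(|frame| {
            let sum: f32 = frame.iter().map(|&s| s as f32 / 32768.0).sum();
            sum / channels as f32
        })
        .collect()
}

/// Number of mel frames produced for a given number of 16 kHz samples.
pub fn mel_frames(samples: usize) -> usize {
    samples / HOP_LENGTH
}
```

With these constants a full 30-second window is 480,000 samples and 3,000 mel frames, which is the fixed input length the Whisper encoder expects.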

The encoder module runs the convolutional stem followed by the transformer stack. The decoder runs autoregressively, and this is where the engineering shows: a KV cache keyed per beam through Arc copy-on-write, beam_search with configurable width, and decode_utils for the sampling machinery. Attention is computed with matrixmultiply::sgemm for both the QK^T scores and the scores@V product, while the per-element dot products in the quantized path use hand-written SIMD: AVX2+FMA on x86_64, NEON on aarch64, and simd128 on WASM. Tensors reshape zero-copy, and GELU, softmax, and layernorm all run in place against a pre-allocated InferenceBuffer, so steady-state decoding does almost no allocation.
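
The copy-on-write idea can be sketched with nothing but std::sync::Arc. The KvCache type below is a simplified stand-in, not OxiWhisper's actual structure: it holds one flat buffer each for keys and values, and relies on Arc::make_mut to copy a buffer only when a hypothesis writes to storage it still shares with another.

```rust
use std::sync::Arc;

/// Simplified per-hypothesis cache for one decoder layer (illustrative only).
#[derive(Clone)]
pub struct KvCache {
    keys: Arc<Vec<f32>>,
    values: Arc<Vec<f32>>,
}

impl KvCache {
    pub fn new() -> Self {
        Self { keys: Arc::new(Vec::new()), values: Arc::new(Vec::new()) }
    }

    /// Fork the cache for a new beam. This clones two Arc pointers,
    /// not the underlying tensors.
    pub fn fork(&self) -> Self {
        self.clone()
    }

    /// Append one decoding step. `Arc::make_mut` mutates in place when this
    /// hypothesis is the sole owner and copies the buffer first otherwise.
    pub fn append(&mut self, k: &[f32], v: &[f32]) {
        Arc::make_mut(&mut self.keys).extend_from_slice(k);
        Arc::make_mut(&mut self.values).extend_from_slice(v);
    }

    /// True while both buffers are still physically shared with `other`.
    pub fn shares_storage_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.keys, &other.keys) && Arc::ptr_eq(&self.values, &other.values)
    }

    /// Number of cached key elements.
    pub fn len(&self) -> usize {
        self.keys.len()
    }
}
```

Forking a beam is therefore O(1), and beams that get pruned before they write never pay for a copy. A production cache would probably share at a finer granularity than one buffer per layer, so treat this as the principle only.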

Decoding is rich. You get greedy decoding, beam search, and temperature sampling with top-k and top-p; automatic language detection across 99 languages; and timestamps, including word-level timestamps (still alpha) via DTW cross-attention alignment in the dtw module. The hallucination module applies compression-ratio filtering and no_repeat_ngram suppression to catch the runaway loops Whisper is prone to on silence, and previous-context conditioning plus initial_prompt let you steer the output. Models load from GGML files (ggml-tiny.bin and friends) with Q4_0/Q5_0/Q8_0 quantization handled by the quantize module; the optional onnx feature routes through oxionnx for ONNX checkpoints.
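
No-repeat n-gram suppression is easy to show in isolation. The sketch below implements the technique as it is commonly done; it is an assumption that OxiWhisper's no_repeat_ngram_size behaves the same way, and the function names are invented for this example. Before each decoding step it finds every token that would complete an n-gram already present in the output, then masks those tokens in the logits.

```rust
use std::collections::HashSet;

/// Tokens that, if emitted next, would repeat an n-gram already in `history`.
pub fn banned_next_tokens(history: &[u32], n: usize) -> HashSet<u32> {
    let mut banned = HashSet::new();
    if n == 0 || history.len() < n {
        return banned;
    }
    // The last n-1 tokens form the prefix that the next token would extend.
    let prefix = &history[history.len() - (n - 1)..];
    for window in history.windows(n) {
        // If an earlier n-gram starts with the same prefix, its final token is banned.
        if &window[..n - 1] == prefix {
            banned.insert(window[n - 1]);
        }
    }
    banned
}

/// Mask banned tokens so that neither greedy decoding nor sampling can pick them.
pub fn suppress(logits: &mut [f32], banned: &HashSet<u32>) {
    for &token in banned {
        if let Some(logit) = logits.get_mut(token as usize) {
            *logit = f32::NEG_INFINITY;
        }
    }
}
```

For example, with history [5, 6, 7, 5, 6] and n = 3, the prefix is [5, 6], the earlier trigram [5, 6, 7] matches it, and token 7 is banned for the next step.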

Supported model sizes:

Model	Parameters	f32 size	Q4_0 size
tiny	39M	~150 MB	~40 MB
base	74M	—	—
small	244M	—	—
medium	769M	—	—
large	1.5B	~6 GB	~1.5 GB
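
The f32 column follows from simple arithmetic: each parameter takes 4 bytes. The helper below is illustrative and not part of the crate. It reproduces the two listed figures and gives rough estimates for the rows the table leaves blank. These are back-of-the-envelope numbers, not measured file sizes, and real checkpoints carry some extra metadata.

```rust
/// Approximate f32 checkpoint size in megabytes (10^6 bytes),
/// given the parameter count in millions: 4 bytes per parameter.
pub fn f32_size_mb(params_millions: u64) -> u64 {
    params_millions * 4
}

/// Parameter counts in millions, taken from the table above.
pub const MODELS: [(&str, u64); 5] = [
    ("tiny", 39),
    ("base", 74),
    ("small", 244),
    ("medium", 769),
    ("large", 1500),
];

/// Pair each model name with its estimated f32 size in megabytes.
pub fn estimated_sizes() -> Vec<(&'static str, u64)> {
    MODELS.iter().map(|&(name, p)| (name, f32_size_mb(p))).collect()
}
```

This gives about 156 MB for tiny and 6,000 MB for large, matching the table, with base, small, and medium estimated at roughly 296 MB, 976 MB, and 3,076 MB.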

Getting Started

cargo add oxiwhisper

use oxiwhisper::{WhisperModel, TranscribeOptions};
use std::path::Path;

fn main() -> Result<(), oxiwhisper::OxiWhisperError> {
    // Load a pretrained Whisper checkpoint in GGML format.
    let model = WhisperModel::from_file(Path::new("ggml-tiny.bin"))?;
    // Parse the WAV file with the pure-Rust audio module.
    let audio = oxiwhisper::audio::load_wav(Path::new("audio.wav"))?;
    // Transcribe with the default decoding options.
    let text = model.transcribe(&audio, &TranscribeOptions::default())?;
    println!("{text}");
    Ok(())
}

That is the whole thing: load a GGML checkpoint, read a WAV, transcribe. No native libraries, no Python environment, no model server.

What’s inside

Everything that ships in 0.1.0:

Pure-Rust GGML loading and quantized inference — GGML model files with Q4_0/Q5_0/Q8_0, dequantized on the fly inside the GEMV.
OxiFFT-powered mel front end — the real log-mel spectrogram, FFT and all, with no C dependency.
Full encoder-decoder transformer — convolutional stem, transformer encoder, autoregressive decoder, and the Arc copy-on-write KV cache.
Rich decoding — beam search, greedy, temperature, top-k, top-p; 99-language auto-detection; and the hallucination filters (compression_ratio_threshold, no_repeat_ngram_size, suppress_tokens).
Long audio, VAD, streaming, and batch — transcribe_long() for clips over 30 s, RMS-energy VAD with an adaptive threshold and VAD-aware chunking, real-time stream(), and transcribe_batch().
Subtitle export — transcribe_to_srt() and transcribe_to_vtt() emit subtitle files directly.
Analysis APIs — model_stats(), encoder_output(), and mel_spectrogram() expose intermediate stages for inspection and tooling.
Optional onnx and serde — ONNX checkpoints via oxionnx, and serde for JSON-serializable output.
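
For readers unfamiliar with energy-based VAD, here is a minimal sketch of the idea behind the long-audio item above. It is not OxiWhisper's implementation. The adaptive rule used here, a multiple of the clip's mean frame energy with a fixed floor, is an assumption chosen for illustration, as are the function and parameter names.

```rust
/// Root-mean-square energy of one frame of samples.
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    (frame.iter().map(|s| s * s).sum::<f32>() / frame.len() as f32).sqrt()
}

/// Classify each frame as speech (true) or silence (false).
/// The threshold adapts to the clip: `ratio` times the mean frame energy,
/// but never below `min_threshold` (both are hypothetical tuning knobs).
pub fn detect_speech(
    samples: &[f32],
    frame_len: usize,
    ratio: f32,
    min_threshold: f32,
) -> Vec<bool> {
    assert!(frame_len > 0, "frame length must be positive");
    let energies: Vec<f32> = samples.chunks(frame_len).map(rms).collect();
    if energies.is_empty() {
        return Vec::new();
    }
    let mean = energies.iter().sum::<f32>() / energies.len() as f32;
    let threshold = (ratio * mean).max(min_threshold);
    energies.iter().map(|&e| e > threshold).collect()
}
```

VAD-aware chunking then amounts to placing chunk boundaries on frames marked as silence, so a 30-second window does not cut through the middle of a word.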

Tips

Just call transcribe() for short clips. Grab a Whisper GGML model such as ggml-tiny.bin, load a WAV, and you are done. For anything over 30 seconds, reach for transcribe_long() or transcribe_long_with_vad() so chunking and context carry-over are handled for you.
Quantize to shrink models. Q4_0/Q5_0/Q8_0 cut footprint hard — tiny goes from roughly 150 MB to about 40 MB — and the SIMD path dequantizes on the fly, so smaller does not mean a separate unpack cost.
Stream for real time. Use stream() with StreamTranscriber to feed arbitrary-sized chunks as audio arrives, instead of waiting for a complete file.
Set the language, or don’t. Pass TranscribeOptions { language: Some("ja"), ..Default::default() } to force a language; leave it None and the 99-language detector picks one for you.
Tune accuracy vs. speed. Raise beam_width for more careful decoding, or set temperature above zero (with top_k/top_p) for sampled output. initial_prompt, suppress_tokens, no_repeat_ngram_size, compression_ratio_threshold, and previous_tokens give you fine control over the result.
Export subtitles in one call. Use transcribe_to_srt() or transcribe_to_vtt(), and enable the serde feature when you want structured JSON with timestamps turned on.
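
To show what the temperature, top_k, and top_p knobs in the tips conventionally do, here is a small std-only sketch. It implements the standard definitions of temperature softmax followed by top-k and nucleus (top-p) filtering. It is an assumption that OxiWhisper's options follow these exact semantics, and the function names are illustrative.

```rust
/// Softmax over logits with a temperature. Lower temperatures sharpen the
/// distribution, higher ones flatten it. Requires temperature > 0.
pub fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(temperature > 0.0, "temperature must be positive");
    // Subtract the maximum logit for numerical stability.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Keep the `top_k` most likely tokens, then the smallest prefix of those
/// whose cumulative probability reaches `top_p`. Returns (token id,
/// renormalized probability) pairs, most likely first.
pub fn top_k_top_p(probs: &[f32], top_k: usize, top_p: f32) -> Vec<(usize, f32)> {
    let mut ranked: Vec<(usize, f32)> = probs.iter().copied().enumerate().collect();
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1)); // descending by probability
    ranked.truncate(top_k.max(1));

    let mut cumulative = 0.0f32;
    let mut keep = 0;
    for &(_, p) in &ranked {
        cumulative += p;
        keep += 1;
        if cumulative >= top_p {
            break;
        }
    }
    ranked.truncate(keep);

    // Renormalize so the surviving candidates sum to 1.
    let total: f32 = ranked.iter().map(|&(_, p)| p).sum();
    ranked.into_iter().map(|(i, p)| (i, p / total)).collect()
}
```

A sampler would then draw one token from the returned candidates with a random number generator. Greedy decoding is the limiting case where only the top candidate survives.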

This is the foundation

OxiWhisper does not stand alone — it is the speech-recognition layer of the COOLJAPAN Pure Rust stack. Its closest tie is OxiFFT, which computes the log-mel spectrogram at the front of the pipeline; without a real, Pure Rust FFT there is no Whisper front end, and OxiFFT is exactly that. Around it sit SciRS2 and NumRS2 for the numerical groundwork, and VoiRS as the audio and speech sibling — a natural companion when you want text-to-speech alongside speech-to-text. Together they make a fully sovereign audio pipeline with no foreign-language runtime anywhere in the build.

Repository: https://github.com/cool-japan/oxiwhisper

Star the repo if Pure Rust speech recognition is something you have been waiting for.

Pure Rust speech recognition is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ March 27, 2026