Speech recognition has belonged to C++ and Python for far too long. Today that changes.
Today we released OxiWhisper 0.1.0 — a Pure Rust inference engine that loads pretrained OpenAI Whisper weights and transcribes audio to text with zero C, C++, or Python in the build.
If you have ever shipped Whisper, you know the toolchain. The fast path is whisper.cpp, a C++ codebase you vendor, patch, and cross-compile per platform. The reference path is Python + PyTorch, which drags in libtorch, a CUDA stack, and an interpreter. The “portable” path is ONNX Runtime, another native C++ dependency with its own provider matrix. Every one of them is a foreign-language artifact bolted onto your Rust binary through FFI.
OxiWhisper takes a different position. No C. No C++. No Python. It is one Rust crate that compiles to a single static binary, or to WebAssembly with simd128 kernels, and runs Whisper inference end to end. To be precise about scope: OxiWhisper is an inference engine. It loads existing Whisper checkpoints and transcribes; it is not a trainer.
Why OxiWhisper 0.1.0
The incumbent pain is not accuracy — Whisper is excellent — it is delivery. A C++ dependency means a build system that fights you on every new target, a Python runtime means you ship an interpreter and a multi-gigabyte wheel, and an ONNX provider means matching native libraries to each OS. All three make WASM either painful or impossible.
OxiWhisper deletes that surface area, and it arrives surprisingly complete for a first release:
- 12,596 lines of Rust across 25 modules, Apache-2.0 licensed.
- 278 tests and 10 runnable examples.
- Zero clippy warnings, zero doc warnings, every file under 2,000 lines, and no `unwrap()` in production code.
- A real KV-cache design that uses `Arc` copy-on-write across beam-search hypotheses, saving roughly 4.5 GB of allocations versus naive per-beam copies (a measured number from the changelog).
- Quantized models that are dramatically smaller: `tiny` drops from about 150 MB (f32) to roughly 40 MB (Q4_0), with dequantization fused into the GEMV so you pay no separate unpack pass.
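To make "dequantization fused into the GEMV" concrete, here is a minimal scalar sketch, not OxiWhisper's actual kernel. It assumes the common GGML Q4_0 layout (32 weights per block, one scale, 4-bit codes offset by 8, low nibbles holding the first 16 elements and high nibbles the last 16). The names `Q4Block`, `dot_q4_0`, and `gemv_q4_0` are hypothetical, and the scale is kept as `f32` here although GGML stores it as f16.

```rust
/// One Q4_0-style block: 32 weights stored as 4-bit codes plus one scale.
/// Assumption: GGML keeps the scale as f16; f32 is used here to stay std-only.
pub struct Q4Block {
    pub scale: f32,
    pub quants: [u8; 16], // two 4-bit codes per byte
}

pub const BLOCK_SIZE: usize = 32;

/// Dot product of one quantized row with an f32 vector, dequantizing inline.
/// No temporary f32 copy of the row is ever materialized.
pub fn dot_q4_0(row: &[Q4Block], x: &[f32]) -> f32 {
    assert_eq!(x.len(), row.len() * BLOCK_SIZE);
    let mut acc = 0.0f32;
    for (block, xs) in row.iter().zip(x.chunks_exact(BLOCK_SIZE)) {
        let mut block_sum = 0.0f32;
        for (j, &byte) in block.quants.iter().enumerate() {
            // Low nibble maps to element j, high nibble to element j + 16.
            let lo = (byte & 0x0F) as i32 - 8;
            let hi = (byte >> 4) as i32 - 8;
            block_sum += lo as f32 * xs[j] + hi as f32 * xs[j + 16];
        }
        // The scale is applied once per block rather than once per weight.
        acc += block.scale * block_sum;
    }
    acc
}

/// GEMV: y[i] = dot(row_i, x), with each row stored as consecutive blocks.
pub fn gemv_q4_0(rows: &[Vec<Q4Block>], x: &[f32]) -> Vec<f32> {
    rows.iter().map(|row| dot_q4_0(row, x)).collect()
}
```

A SIMD version would process the same nibbles in wider lanes, but the structure (unpack, multiply-accumulate, scale once per block) stays the same.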
Where each piece stands today:
| Subsystem | Status | Tests |
|---|---|---|
| Core inference (encoder/decoder) | Stable | 278 passing |
| Quantized inference (Q4_0/Q5_0/Q8_0) | Stable | 40+ |
| SIMD kernels (AVX2+FMA / NEON / simd128) | Stable | 15+ |
| Streaming | Stable | 8+ |
| Word-level timestamps (DTW) | Alpha | 6 |
| ONNX loading | Stable | 13 |
Technical Deep Dive
OxiWhisper implements the real Whisper encoder-decoder transformer, and the pipeline mirrors the architecture exactly:
Audio (WAV/f32) -> Mel Spectrogram (OxiFFT) -> Encoder (Conv + Transformer)
-> Decoder (Autoregressive + KV Cache + Beam Search) -> Tokenizer -> Text
It starts with audio. The audio module is a pure-Rust WAV parser that handles PCM at 8, 16, 24, and 32 bits plus IEEE float, downmixes multi-channel input, and resamples to the 16 kHz Whisper expects. From there the mel and mel_filters modules compute the log-mel spectrogram — and this is where OxiFFT does the heavy lifting. The FFT that turns each audio frame into a spectrum is OxiFFT, a Pure Rust dependency; the mel front end is not an approximation, it is the genuine Whisper feature extraction built on a real FFT.
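The last step of the mel front end can be sketched in a few lines. After the FFT and the mel filterbank produce a power spectrogram, OpenAI's reference implementation clamps, takes log10, limits the dynamic range to 8 units below the peak, and rescales. The function below reproduces those steps; the name `log_mel_normalize` is hypothetical, and whether OxiWhisper's `mel` module uses exactly this shape is an assumption.

```rust
/// Log-compress and normalize a mel power spectrogram in place,
/// following the post-processing steps of OpenAI's reference Whisper code.
/// `mel_power` is the flattened (n_mels x n_frames) filterbank output.
pub fn log_mel_normalize(mel_power: &mut [f32]) {
    if mel_power.is_empty() {
        return;
    }
    // 1. Clamp away zeros so the logarithm stays finite, then take log10.
    for v in mel_power.iter_mut() {
        *v = v.max(1e-10).log10();
    }
    // 2. Find the global peak and derive a floor 8 log10 units below it.
    let peak = mel_power.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let floor = peak - 8.0;
    // 3. Apply the floor, then shift and scale into the range the model expects.
    for v in mel_power.iter_mut() {
        *v = (v.max(floor) + 4.0) / 4.0;
    }
}
```

Because the floor depends on the global peak, this step has to run over a whole chunk at once rather than frame by frame.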
The encoder module runs the convolutional stem followed by the transformer stack. The decoder runs autoregressively, and this is where the engineering shows: a KV Cache keyed per beam through Arc copy-on-write, beam_search with configurable width, and decode_utils for the sampling machinery. Attention is computed with matrixmultiply::sgemm for both the QK^T scores and the scores@V product, while the per-element dot products in the quantized path use hand-written SIMD — AVX2+FMA on x86_64, NEON on aarch64, and simd128 on WASM. Tensors reshape zero-copy, and GELU, softmax, and layernorm all run in place against a pre-allocated InferenceBuffer, so steady-state decoding does almost no allocation.
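The copy-on-write idea can be illustrated with nothing but the standard library. This is a sketch under assumptions, not the crate's real cache type: `LayerCache` and its methods are hypothetical, and the buffers are flat `Vec<f32>` for brevity. The point is that forking a beam only bumps a reference count, and a copy happens only when a beam that still shares storage appends to it.

```rust
use std::sync::Arc;

/// Hypothetical per-layer cache: keys and values as flat buffers behind `Arc`.
#[derive(Clone, Default)]
pub struct LayerCache {
    pub keys: Arc<Vec<f32>>,
    pub values: Arc<Vec<f32>>,
}

impl LayerCache {
    pub fn new() -> Self {
        Self::default()
    }

    /// Append one decoding step.
    /// `Arc::make_mut` clones the buffer only if another beam still shares it;
    /// a uniquely owned buffer is extended in place.
    pub fn append(&mut self, k: &[f32], v: &[f32]) {
        Arc::make_mut(&mut self.keys).extend_from_slice(k);
        Arc::make_mut(&mut self.values).extend_from_slice(v);
    }

    /// True while two beams still point at the same underlying storage.
    pub fn shares_storage_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.keys, &other.keys) && Arc::ptr_eq(&self.values, &other.values)
    }
}
```

Forking a hypothesis is then just `let child = parent.clone();`, which costs two reference-count increments until one of the two beams diverges.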
Decoding is rich. You get greedy, beam search, temperature sampling, top-k, and top-p; automatic language detection across 99 languages; timestamps, and word-level timestamps via DTW cross-attention alignment in the dtw module. The hallucination module applies compression-ratio filtering and no_repeat_ngram suppression to catch the runaway loops Whisper is prone to on silence, and previous-context conditioning plus initial_prompt let you steer the output. Models load from GGML files (ggml-tiny.bin and friends) with Q4_0/Q5_0/Q8_0 quantization handled by the quantize module; the optional onnx feature routes through oxionnx for ONNX checkpoints.
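As an illustration of what n-gram suppression does, the sketch below computes which next tokens would complete an n-gram that already occurred in the output. It follows the widely used formulation of this filter; the function name `banned_next_tokens` is hypothetical and the crate's internal implementation may differ.

```rust
use std::collections::HashSet;

/// Return the tokens that would complete an n-gram already present in `tokens`.
/// A decoder can set the logits of these tokens to negative infinity
/// before picking the next token.
pub fn banned_next_tokens(tokens: &[u32], n: usize) -> HashSet<u32> {
    let mut banned = HashSet::new();
    if n == 0 || tokens.len() < n {
        return banned;
    }
    // The last n-1 tokens are the prefix of the n-gram about to be completed.
    let prefix = &tokens[tokens.len() - (n - 1)..];
    for window in tokens.windows(n) {
        // If an earlier n-gram starts with the same prefix, ban its last token.
        if &window[..n - 1] == prefix {
            banned.insert(window[n - 1]);
        }
    }
    banned
}
```

With `tokens = [1, 2, 3, 1, 2]` and `n = 3`, the current prefix is `[1, 2]`, which was previously followed by `3`, so `3` is the only banned token.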
Supported model sizes:
| Model | Params | f32 | Q4_0 |
|---|---|---|---|
| tiny | 39M | ~150 MB | ~40 MB |
| base | 74M | — | — |
| small | 244M | — | — |
| medium | 769M | — | — |
| large | 1.5B | ~6 GB | ~1.5 GB |
Getting Started
cargo add oxiwhisper
use oxiwhisper::{WhisperModel, TranscribeOptions};
use std::path::Path;
fn main() -> Result<(), oxiwhisper::OxiWhisperError> {
    let model = WhisperModel::from_file(Path::new("ggml-tiny.bin"))?;
    let audio = oxiwhisper::audio::load_wav(Path::new("audio.wav"))?;
    let text = model.transcribe(&audio, &TranscribeOptions::default())?;
    println!("{text}");
    Ok(())
}
That is the whole thing: load a GGML checkpoint, read a WAV, transcribe. No native libraries, no Python environment, no model server.
What’s inside
Everything that ships in 0.1.0:
- Pure-Rust GGML loading and quantized inference: GGML model files with Q4_0/Q5_0/Q8_0, dequantized on the fly inside the GEMV.
- OxiFFT-powered mel front end: the real log-mel spectrogram, FFT and all, with no C dependency.
- Full encoder-decoder transformer: convolutional stem, transformer encoder, autoregressive decoder, and the `Arc` copy-on-write KV cache.
- Rich decoding: beam search, greedy, temperature, top-k, top-p; 99-language auto-detection; and the hallucination filters (`compression_ratio_threshold`, `no_repeat_ngram_size`, `suppress_tokens`).
- Long audio, VAD, streaming, and batch: `transcribe_long()` for clips over 30 s, RMS-energy VAD with an adaptive threshold and VAD-aware chunking, real-time `stream()`, and `transcribe_batch()`.
- Subtitle export: `transcribe_to_srt()` and `transcribe_to_vtt()` emit subtitle files directly.
- Analysis APIs: `model_stats()`, `encoder_output()`, and `mel_spectrogram()` expose intermediate stages for inspection and tooling.
- Optional `onnx` and `serde`: ONNX checkpoints via `oxionnx`, and `serde` for JSON-serializable output.
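For readers unfamiliar with RMS-energy VAD, here is a rough sketch of the idea. The source does not spell out how the adaptive threshold is derived, so the rule used here (a multiple of the quietest frame's RMS, with a fixed lower bound) is purely an assumption, as are the names `rms` and `detect_speech` and all parameter names.

```rust
/// Root-mean-square energy of one frame of samples.
pub fn rms(frame: &[f32]) -> f32 {
    if frame.is_empty() {
        return 0.0;
    }
    let sum_sq: f32 = frame.iter().map(|s| s * s).sum();
    (sum_sq / frame.len() as f32).sqrt()
}

/// Flag each frame as speech (true) or silence (false).
/// Assumed adaptation rule: threshold = max(min_threshold, ratio * noise floor),
/// where the noise floor is the RMS of the quietest frame in the input.
pub fn detect_speech(
    samples: &[f32],
    frame_len: usize,
    ratio: f32,
    min_threshold: f32,
) -> Vec<bool> {
    assert!(frame_len > 0);
    let energies: Vec<f32> = samples.chunks(frame_len).map(rms).collect();
    let noise_floor = energies.iter().copied().fold(f32::INFINITY, f32::min);
    let threshold = (ratio * noise_floor).max(min_threshold);
    energies.iter().map(|&e| e > threshold).collect()
}
```

VAD-aware chunking would then cut long audio at runs of `false` frames, so that a chunk boundary is less likely to fall in the middle of a word.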
Tips
- Just call `transcribe()` for short clips. Grab a Whisper GGML model such as `ggml-tiny.bin`, load a WAV, and you are done. For anything over 30 seconds, reach for `transcribe_long()` or `transcribe_long_with_vad()` so chunking and context carry-over are handled for you.
- Quantize to shrink models. Q4_0/Q5_0/Q8_0 cut footprint hard (`tiny` goes from roughly 150 MB to about 40 MB), and the SIMD path dequantizes on the fly, so smaller does not mean a separate unpack cost.
- Stream for real time. Use `stream()` with `StreamTranscriber` to feed arbitrary-sized chunks as audio arrives, instead of waiting for a complete file.
- Set the language, or don’t. Pass `TranscribeOptions { language: Some("ja"), .. }` to force a language; leave it `None` and the 99-language detector picks one for you.
- Tune accuracy vs. speed. Raise `beam_width` for more careful decoding, or set `temperature` above zero (with `top_k`/`top_p`) for sampled output. `initial_prompt`, `suppress_tokens`, `no_repeat_ngram_size`, `compression_ratio_threshold`, and `previous_tokens` give you fine control over the result.
- Export subtitles in one call with `transcribe_to_srt()` or `transcribe_to_vtt()`, and enable the `serde` feature when you want structured JSON with `timestamps` turned on.
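If you want to post-process timestamps yourself rather than use the built-in exporters, the SRT cue format is simple to produce. The helpers below are a standalone sketch with hypothetical names (`srt_timestamp`, `srt_cue`); they are not part of the OxiWhisper API and assume you already have start and end times in seconds.

```rust
/// Format a time in seconds as an SRT timestamp: HH:MM:SS,mmm.
pub fn srt_timestamp(seconds: f64) -> String {
    // Negative inputs are clamped to zero; milliseconds are rounded.
    let total_ms = (seconds.max(0.0) * 1000.0).round() as u64;
    let ms = total_ms % 1000;
    let s = (total_ms / 1000) % 60;
    let m = (total_ms / 60_000) % 60;
    let h = total_ms / 3_600_000;
    format!("{:02}:{:02}:{:02},{:03}", h, m, s, ms)
}

/// Render one numbered SRT cue, terminated by a newline.
/// Cues in a file are separated from each other by a blank line.
pub fn srt_cue(index: usize, start: f64, end: f64, text: &str) -> String {
    format!(
        "{}\n{} --> {}\n{}\n",
        index,
        srt_timestamp(start),
        srt_timestamp(end),
        text
    )
}
```

WebVTT timestamps look the same except that the millisecond separator is a period instead of a comma.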
This is the foundation
OxiWhisper does not stand alone — it is the speech-recognition layer of the COOLJAPAN Pure Rust stack. Its closest tie is OxiFFT, which computes the log-mel spectrogram at the front of the pipeline; without a real, Pure Rust FFT there is no Whisper front end, and OxiFFT is exactly that. Around it sit SciRS2 and NumRS2 for the numerical groundwork, and VoiRS as the audio and speech sibling — a natural companion when you want text-to-speech alongside speech-to-text. Together they make a fully sovereign audio pipeline with no foreign-language runtime anywhere in the build.
Repository: https://github.com/cool-japan/oxiwhisper
Star the repo if Pure Rust speech recognition is something you have been waiting for.
Pure Rust speech recognition is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ March 27, 2026