
OxiWhisper 0.1.0 Released — Pure Rust Whisper Speech-to-Text, No C, No Python

OxiWhisper 0.1.0 is a Pure Rust OpenAI Whisper inference engine — GGML loading, Q4_0/Q5_0/Q8_0 quantized inference, beam search, 99-language detection, streaming, and SRT/VTT export, with zero C/C++/Python dependencies.

Tags: release, oxiwhisper, whisper, speech-to-text, asr, rust, audio, transcription, pure-rust

Speech recognition has belonged to C++ and Python for far too long. Today that changes.

We have released OxiWhisper 0.1.0, a Pure Rust inference engine that loads pretrained OpenAI Whisper weights and transcribes audio to text with zero C, C++, or Python in the build.

If you have ever shipped Whisper, you know the toolchain. The fast path is whisper.cpp, a C++ codebase you vendor, patch, and cross-compile per platform. The reference path is Python + PyTorch, which drags in libtorch, a CUDA stack, and an interpreter. The “portable” path is ONNX Runtime, another native C++ dependency with its own provider matrix. Every one of them is a foreign-language artifact bolted onto your Rust binary through FFI.

OxiWhisper takes a different position. No C. No C++. No Python. It is one Rust crate that compiles to a single static binary, or to WebAssembly with simd128 kernels, and runs Whisper inference end to end. To be precise about scope: OxiWhisper is an inference engine. It loads existing Whisper checkpoints and transcribes; it is not a trainer.

Why OxiWhisper 0.1.0

The incumbent pain is not accuracy — Whisper is excellent — it is delivery. A C++ dependency means a build system that fights you on every new target, a Python runtime means you ship an interpreter and a multi-gigabyte wheel, and an ONNX provider means matching native libraries to each OS. All three make WASM either painful or impossible.

OxiWhisper deletes that surface area, and it arrives surprisingly complete for a first release. Here is where each piece stands today:

| Subsystem | Status | Tests |
| --- | --- | --- |
| Core inference (encoder/decoder) | Stable | 278 passing |
| Quantized inference (Q4_0/Q5_0/Q8_0) | Stable | 40+ |
| SIMD kernels (AVX2+FMA / NEON / simd128) | Stable | 15+ |
| Streaming | Stable | 8+ |
| Word-level timestamps (DTW) | Alpha | 6 |
| ONNX loading | Stable | 13 |

Technical Deep Dive

OxiWhisper implements the real Whisper encoder-decoder transformer, and the pipeline mirrors the architecture exactly:

Audio (WAV/f32) -> Mel Spectrogram (OxiFFT) -> Encoder (Conv + Transformer)
-> Decoder (Autoregressive + KV Cache + Beam Search) -> Tokenizer -> Text

It starts with audio. The audio module is a pure-Rust WAV parser that handles PCM at 8, 16, 24, and 32 bits plus IEEE float, downmixes multi-channel input, and resamples to the 16 kHz Whisper expects. From there the mel and mel_filters modules compute the log-mel spectrogram — and this is where OxiFFT does the heavy lifting. The FFT that turns each audio frame into a spectrum is OxiFFT, a Pure Rust dependency; the mel front end is not an approximation, it is the genuine Whisper feature extraction built on a real FFT.
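To make the downmix and resample steps concrete, here is a minimal sketch using only the standard library. The function names `downmix_to_mono` and `resample_linear` are hypothetical and are not OxiWhisper's API. Linear interpolation is also a deliberate simplification: the post does not say which resampling method the audio module uses, and production resamplers normally low-pass filter before changing rate.

```rust
/// Average interleaved multi-channel samples into a single mono channel.
/// Hypothetical helper for illustration; not part of OxiWhisper's API.
pub fn downmix_to_mono(interleaved: &[f32], channels: usize) -> Vec<f32> {
    assert!(channels > 0, "channel count must be non-zero");
    interleaved
        .chunks_exact(channels)
        .map(|frame| frame.iter().sum::<f32>() / channels as f32)
        .collect()
}

/// Resample mono audio with linear interpolation between neighbouring samples.
/// A simple stand-in: it does no anti-alias filtering.
pub fn resample_linear(input: &[f32], from_hz: u32, to_hz: u32) -> Vec<f32> {
    if input.is_empty() || from_hz == to_hz {
        return input.to_vec();
    }
    // Number of output samples covering the same duration.
    let out_len = (input.len() as u64 * to_hz as u64 / from_hz as u64) as usize;
    // Distance, in input samples, between consecutive output samples.
    let step = from_hz as f64 / to_hz as f64;
    let last = input.len() - 1;
    (0..out_len)
        .map(|i| {
            let pos = i as f64 * step;
            let idx = pos.floor() as usize;
            let frac = (pos - idx as f64) as f32;
            let a = input[idx.min(last)];
            let b = input[(idx + 1).min(last)];
            a + (b - a) * frac
        })
        .collect()
}
```

Under these assumptions, a stereo 44.1 kHz buffer would go through `downmix_to_mono(&samples, 2)` and then `resample_linear(&mono, 44_100, 16_000)` before the mel front end sees it.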

The encoder module runs the convolutional stem followed by the transformer stack. The decoder runs autoregressively, and this is where the engineering shows: a KV Cache keyed per beam through Arc copy-on-write, beam_search with configurable width, and decode_utils for the sampling machinery. Attention is computed with matrixmultiply::sgemm for both the QK^T scores and the scores@V product, while the per-element dot products in the quantized path use hand-written SIMD — AVX2+FMA on x86_64, NEON on aarch64, and simd128 on WASM. Tensors reshape zero-copy, and GELU, softmax, and layernorm all run in place against a pre-allocated InferenceBuffer, so steady-state decoding does almost no allocation.
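The per-beam copy-on-write idea can be sketched with `std::sync::Arc` alone. The `KvCache` type below is an illustrative assumption, not OxiWhisper's actual struct: forking a beam only clones the `Arc` handles, and `Arc::make_mut` copies the underlying buffer the first time a beam writes to storage it still shares with another beam.

```rust
use std::sync::Arc;

/// Illustrative per-beam cache; the layout is an assumption, not OxiWhisper's.
#[derive(Clone, Default)]
pub struct KvCache {
    keys: Arc<Vec<f32>>,
    values: Arc<Vec<f32>>,
}

impl KvCache {
    /// Forking a beam only bumps two reference counts; no tensor data is copied.
    pub fn fork(&self) -> Self {
        self.clone()
    }

    /// Append one step of keys and values. `Arc::make_mut` clones the buffer
    /// only if another beam still holds a reference to it (copy-on-write).
    pub fn append(&mut self, k: &[f32], v: &[f32]) {
        Arc::make_mut(&mut self.keys).extend_from_slice(k);
        Arc::make_mut(&mut self.values).extend_from_slice(v);
    }

    /// True while both caches still point at the same underlying buffers.
    pub fn shares_storage_with(&self, other: &Self) -> bool {
        Arc::ptr_eq(&self.keys, &other.keys) && Arc::ptr_eq(&self.values, &other.values)
    }

    /// Number of cached key elements.
    pub fn key_len(&self) -> usize {
        self.keys.len()
    }
}
```

The trade-off in this simple form is that the first write after a fork copies the whole buffer. A real engine might store the cache in chunks so that only the tail is duplicated.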

Decoding is rich. You get greedy, beam search, temperature sampling, top-k, and top-p; automatic language detection across 99 languages; timestamps, and word-level timestamps via DTW cross-attention alignment in the dtw module. The hallucination module applies compression-ratio filtering and no_repeat_ngram suppression to catch the runaway loops Whisper is prone to on silence, and previous-context conditioning plus initial_prompt let you steer the output. Models load from GGML files (ggml-tiny.bin and friends) with Q4_0/Q5_0/Q8_0 quantization handled by the quantize module; the optional onnx feature routes through oxionnx for ONNX checkpoints.
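As a rough illustration of n-gram repeat suppression, the sketch below bans any next token that would complete an n-gram already present in the output, by setting its logit to negative infinity. This is the general technique as commonly implemented in decoders. The function names are hypothetical, and the details of how OxiWhisper's hallucination module applies no_repeat_ngram are not given in this post.

```rust
/// Collect tokens that would complete an n-gram already present in `tokens`.
/// Hypothetical helper for illustration; not OxiWhisper's API.
pub fn banned_next_tokens(tokens: &[u32], n: usize) -> Vec<u32> {
    if n == 0 || tokens.len() < n {
        return Vec::new();
    }
    // The last n-1 generated tokens form the prefix of the next n-gram.
    let prefix = &tokens[tokens.len() - (n - 1)..];
    let mut banned: Vec<u32> = tokens
        .windows(n)
        .filter(|w| &w[..n - 1] == prefix)
        .map(|w| w[n - 1])
        .collect();
    banned.sort_unstable();
    banned.dedup();
    banned
}

/// Mask the banned tokens in place so sampling or beam search cannot pick them.
pub fn suppress_repeats(logits: &mut [f32], tokens: &[u32], n: usize) {
    for token in banned_next_tokens(tokens, n) {
        if let Some(logit) = logits.get_mut(token as usize) {
            *logit = f32::NEG_INFINITY;
        }
    }
}
```

With the history `[1, 2, 3, 1, 2]` and `n = 3`, the trigram `[1, 2, 3]` has already appeared, so token `3` is masked for the next step while every other logit is left untouched.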

Supported model sizes:

| Model | Params | f32 | Q4_0 |
| --- | --- | --- | --- |
| tiny | 39M | ~150 MB | ~40 MB |
| base | 74M | | |
| small | 244M | | |
| medium | 769M | | |
| large | 1.5B | ~6 GB | ~1.5 GB |
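For a sense of what Q4_0 storage means, here is a small dequantization sketch that follows the GGML Q4_0 block layout as I understand it: 32 weights share one scale, each weight is a 4-bit value offset by 8, and the low nibbles hold the first 16 weights while the high nibbles hold the last 16. The `Q4Block` type is hypothetical. It keeps the scale as `f32` for simplicity, whereas GGML files store it as a 16-bit float, and this is not a description of OxiWhisper's quantize module.

```rust
/// Number of weights per Q4_0 block in the GGML layout.
pub const Q4_0_BLOCK: usize = 32;

/// Illustrative block: one shared scale plus 32 packed 4-bit values.
/// Assumption: the scale is f32 here; GGML files store it as f16.
pub struct Q4Block {
    pub scale: f32,
    pub quants: [u8; Q4_0_BLOCK / 2],
}

/// Expand one block back to 32 f32 weights: w = (nibble - 8) * scale.
pub fn dequantize_q4_0(block: &Q4Block) -> [f32; Q4_0_BLOCK] {
    let mut out = [0.0f32; Q4_0_BLOCK];
    let half = Q4_0_BLOCK / 2;
    for (j, &byte) in block.quants.iter().enumerate() {
        // Low nibble -> weight j, high nibble -> weight j + 16.
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * block.scale;
        out[j + half] = hi as f32 * block.scale;
    }
    out
}
```

Each nibble maps to an integer in the range -8 to 7, so a block reproduces its weights only to within the resolution set by its scale. That is the accuracy cost paid for the smaller files.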

Getting Started

cargo add oxiwhisper

use oxiwhisper::{WhisperModel, TranscribeOptions};
use std::path::Path;

fn main() -> Result<(), oxiwhisper::OxiWhisperError> {
    // Load a pretrained GGML checkpoint from disk.
    let model = WhisperModel::from_file(Path::new("ggml-tiny.bin"))?;
    // Read the WAV file into the audio buffer the model expects.
    let audio = oxiwhisper::audio::load_wav(Path::new("audio.wav"))?;
    // Transcribe with the default decoding options.
    let text = model.transcribe(&audio, &TranscribeOptions::default())?;
    println!("{text}");
    Ok(())
}

That is the whole thing: load a GGML checkpoint, read a WAV, transcribe. No native libraries, no Python environment, no model server.

What’s inside

Everything that ships in 0.1.0:

Tips

This is the foundation

OxiWhisper does not stand alone — it is the speech-recognition layer of the COOLJAPAN Pure Rust stack. Its closest tie is OxiFFT, which computes the log-mel spectrogram at the front of the pipeline; without a real, Pure Rust FFT there is no Whisper front end, and OxiFFT is exactly that. Around it sit SciRS2 and NumRS2 for the numerical groundwork, and VoiRS as the audio and speech sibling — a natural companion when you want text-to-speech alongside speech-to-text. Together they make a fully sovereign audio pipeline with no foreign-language runtime anywhere in the build.

Repository: https://github.com/cool-japan/oxiwhisper

Star the repo if Pure Rust speech recognition is something you have been waiting for.

Pure Rust speech recognition is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ March 27, 2026
