VoiRS 0.1.0 Release Candidate 1 — Pure Rust Neural TTS, Voice Recognition & Sound Framework

The speech synthesis and voice AI foundation of the COOLJAPAN ecosystem just reached its first Release Candidate.

Today we released VoiRS 0.1.0 Release Candidate 1 — a complete, production-grade pure Rust framework for neural Text-to-Speech (TTS), Voice Recognition, and high-performance Sound processing.

No Python. No C++. No FFmpeg. No external model runtimes in the core.
No unsafe code in hot paths. No dependency hell.
Just clean, memory-safe, real-time neural speech that compiles to a single static binary (or WASM) and runs everywhere — from laptops to browsers to edge devices to cloud GPUs.

Why VoiRS 0.1.0 RC1 is a game changer

For years, state-of-the-art speech synthesis and voice AI meant depending on heavy Python stacks (Coqui TTS, Tortoise, Piper) or proprietary cloud services.

These tools are powerful but suffer from:

- Python interpreter overhead and slow inference
- Memory unsafety and complex C++/CUDA dependencies
- Vendor lock-in and latency issues
- Difficulty in offline, WASM, or embedded deployment
- Lack of full training pipelines in a single safe language

VoiRS 0.1.0 RC1 ends all of that.

It delivers real-time performance while being 100% memory-safe and fully portable.
Notable results:

- Real-time factor (RTF, synthesis time divided by audio duration; lower is faster): ≤ 0.3× on consumer CPUs
- RTF on GPU (RTX 4080): ≤ 0.05× (0.04× demonstrated)
- Streaming synthesis with low-latency chunked audio
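
For readers who want to check RTF numbers on their own hardware, here is a minimal sketch of the arithmetic. It uses only the Rust standard library; the function names `audio_duration` and `real_time_factor` are made up for this post and are not part of the VoiRS API.

```rust
use std::time::Duration;

/// Duration of a mono PCM buffer at the given sample rate.
/// Returns `None` for a sample rate of zero, where the duration is undefined.
fn audio_duration(num_samples: usize, sample_rate_hz: u32) -> Option<Duration> {
    if sample_rate_hz == 0 {
        return None;
    }
    Some(Duration::from_secs_f64(
        num_samples as f64 / f64::from(sample_rate_hz),
    ))
}

/// Real-time factor: wall-clock synthesis time divided by the duration of
/// the audio that was produced. Values below 1.0 are faster than real time.
/// Returns `None` when the audio duration is zero.
fn real_time_factor(synthesis_time: Duration, audio: Duration) -> Option<f64> {
    if audio.is_zero() {
        return None;
    }
    Some(synthesis_time.as_secs_f64() / audio.as_secs_f64())
}
```

In practice you would wrap the synthesis call with `std::time::Instant::now()` and `elapsed()`, then pass the elapsed time and the length of the returned buffer to these helpers. Under this definition, an RTF of 0.3 means one second of audio takes about 300 ms to synthesize.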

Technical Deep Dive: How We Built a Production-Grade Neural Speech Stack in Pure Rust

The architecture unifies high-performance crates from the COOLJAPAN ecosystem into a clean, end-to-end pipeline:

Core Pipeline
Text → G2P (pluggable: Phonetisaurus, OpenJTalk, Neural) → Acoustic Model (VITS / FastSpeech2) → Vocoder (HiFi-GAN + DiffWave) → Audio (WAV/OGG).
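
To make the shape of that pipeline concrete, here is a rough sketch of how three pluggable stages can be composed behind traits. Every name in it (`G2p`, `AcousticModel`, `Vocoder`, `Pipeline`, and the toy stand-in stages) is hypothetical and chosen for illustration; the real VoiRS crates may expose a different API.

```rust
// Hypothetical stage interfaces for illustration; not the VoiRS API.
trait G2p {
    /// Text to a phoneme sequence.
    fn phonemes(&self, text: &str) -> Vec<String>;
}

trait AcousticModel {
    /// Phonemes to mel-spectrogram frames (one Vec<f32> per frame).
    fn mel_frames(&self, phonemes: &[String]) -> Vec<Vec<f32>>;
}

trait Vocoder {
    /// Mel frames to PCM samples.
    fn waveform(&self, mel: &[Vec<f32>]) -> Vec<f32>;
}

/// Generic over its stages, so any G2P, acoustic model, or vocoder
/// that implements the matching trait can be swapped in.
struct Pipeline<G, A, V> {
    g2p: G,
    acoustic: A,
    vocoder: V,
}

impl<G: G2p, A: AcousticModel, V: Vocoder> Pipeline<G, A, V> {
    fn synthesize(&self, text: &str) -> Vec<f32> {
        let phonemes = self.g2p.phonemes(text);
        let mel = self.acoustic.mel_frames(&phonemes);
        self.vocoder.waveform(&mel)
    }
}

// Toy stand-ins that only demonstrate the data flow; they emit silence.
struct CharG2p;
impl G2p for CharG2p {
    fn phonemes(&self, text: &str) -> Vec<String> {
        // Treats every alphabetic character as one "phoneme".
        text.chars()
            .filter(|c| c.is_alphabetic())
            .map(|c| c.to_string())
            .collect()
    }
}

struct SilentAcoustic {
    mel_bins: usize,
}
impl AcousticModel for SilentAcoustic {
    fn mel_frames(&self, phonemes: &[String]) -> Vec<Vec<f32>> {
        // One all-zero frame per phoneme.
        phonemes.iter().map(|_| vec![0.0; self.mel_bins]).collect()
    }
}

struct SilentVocoder {
    hop_size: usize,
}
impl Vocoder for SilentVocoder {
    fn waveform(&self, mel: &[Vec<f32>]) -> Vec<f32> {
        // `hop_size` zero samples per mel frame.
        vec![0.0; mel.len() * self.hop_size]
    }
}
```

Building a `Pipeline` from `CharG2p`, `SilentAcoustic { mel_bins: 80 }`, and `SilentVocoder { hop_size: 256 }` and calling `synthesize("hi there")` walks the text through all three stages. The appeal of static dispatch here is that swapping a stage is a type change with no runtime indirection.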
Neural Models (0.1.0 RC1 highlights)
- Full VITS + HiFi-GAN inference
- New DiffWave vocoder training pipeline with gradient-based updates and SafeTensors checkpoints that save the actual trained parameters (370 parameters, 30 MB)
- ONNX Runtime integration for Kokoro-82M (9 languages, 54 voices) — zero Python required
Advanced Features
- Streaming synthesis (chunk-based low-latency)
- SSML support
- Multilingual (20+ languages; production-ready English/Japanese, beta Spanish/French/German/Mandarin)
- Automatic IPA generation via eSpeak NG backend
Hardware & Interop
- GPU acceleration (CUDA on Linux/Windows, Metal on macOS)
- WASM target for browser-native synthesis
- FFI bindings (C, PyO3 Python, NAPI Node.js, Unity/Unreal plugins)
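
The chunk-based streaming idea listed under Advanced Features can be sketched in a few lines. This is an assumption-laden illustration, not the VoiRS implementation: it splits text at sentence punctuation and synthesizes each chunk lazily, so the first audio is available before the rest of the text is processed. `split_sentences` and `stream_synthesis` are hypothetical names, and the synthesizer is passed in as a plain closure.

```rust
/// Splits text into sentence-like chunks, keeping the terminating
/// '.', '!' or '?' with each chunk and dropping empty pieces.
fn split_sentences(text: &str) -> Vec<&str> {
    text.split_inclusive(|c: char| matches!(c, '.' | '!' | '?'))
        .map(str::trim)
        .filter(|s| !s.is_empty())
        .collect()
}

/// Returns a lazy iterator of audio chunks, one per sentence.
/// `synth` is only called when the consumer asks for the next chunk,
/// so playback of chunk N can overlap with synthesis of chunk N + 1.
fn stream_synthesis<'a, S>(text: &'a str, mut synth: S) -> impl Iterator<Item = Vec<f32>> + 'a
where
    S: FnMut(&str) -> Vec<f32> + 'a,
{
    split_sentences(text).into_iter().map(move |s| synth(s))
}
```

A consumer would loop over the iterator and push each `Vec<f32>` to an audio sink as it arrives. A real streaming engine would likely chunk at a finer granularity than sentences and smooth the chunk boundaries; both are left out of this sketch.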

Key Rust advantages:

- 100% Pure Rust core (SciRS2/NumRS2 for DSP and linear algebra)
- SIMD optimizations throughout
- SafeTensors for production model persistence
- No-unwrap policy, with zero Clippy warnings and zero test failures enforced
- 7 crates with clean separation (voirs-g2p, voirs-acoustic, voirs-vocoder, voirs-dataset, voirs-cli, etc.)
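
As an illustration of what a no-unwrap policy looks like in day-to-day code, here is a small sketch. It assumes the policy is backed by Clippy's `unwrap_used` and `expect_used` lints, which is one common way to do it; how VoiRS actually enforces the policy is not stated here. `ConfigError` and `parse_sample_rate` are hypothetical names.

```rust
use std::fmt;
use std::num::ParseIntError;

/// Hypothetical error type for a missing or malformed setting.
#[derive(Debug)]
enum ConfigError {
    MissingSampleRate,
    InvalidSampleRate(ParseIntError),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::MissingSampleRate => write!(f, "sample rate not set"),
            ConfigError::InvalidSampleRate(e) => write!(f, "invalid sample rate: {e}"),
        }
    }
}

impl std::error::Error for ConfigError {}

/// With these lints denied, `cargo clippy` rejects `.unwrap()` and
/// `.expect()` in this function, so every failure must be returned
/// to the caller as a typed error.
#[deny(clippy::unwrap_used, clippy::expect_used)]
fn parse_sample_rate(raw: Option<&str>) -> Result<u32, ConfigError> {
    let raw = raw.ok_or(ConfigError::MissingSampleRate)?;
    raw.trim()
        .parse::<u32>()
        .map_err(ConfigError::InvalidSampleRate)
}
```

The same lints can be denied for a whole crate, which turns the policy from a convention into a build failure.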

What’s inside 0.1.0 RC1 (released March 26, 2026)

- Full DiffWave training pipeline with gradient-based learning and SafeTensors checkpoints
- Kokoro-82M ONNX multilingual TTS integration
- Streaming synthesis and SSML support stabilized
- WASM + GPU backends production-ready
- CLI tool (voirs-cli) for synthesis, training, and voice management
- Production readiness confirmed with comprehensive tests and benchmarks
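
Since SafeTensors checkpoints come up repeatedly, a quick look at the container format may help. A SafeTensors file starts with an 8-byte little-endian length, followed by a JSON header of that length describing each tensor (dtype, shape, byte offsets), followed by the raw tensor bytes. The sketch below extracts the JSON header using only the standard library. `safetensors_header` is a hypothetical helper written for this post, not a function from VoiRS or the `safetensors` crate.

```rust
/// Returns the JSON header of a SafeTensors buffer as text.
/// Layout: [u64 little-endian header length][JSON header][tensor data].
fn safetensors_header(bytes: &[u8]) -> Result<&str, String> {
    if bytes.len() < 8 {
        return Err("buffer is shorter than the 8-byte length prefix".to_string());
    }
    let mut len_le = [0u8; 8];
    len_le.copy_from_slice(&bytes[..8]);
    let header_len = u64::from_le_bytes(len_le);

    // Compare as u64 so an oversized length cannot wrap on 32-bit targets.
    let available = (bytes.len() - 8) as u64;
    if header_len > available {
        return Err(format!(
            "header claims {header_len} bytes but only {available} are present"
        ));
    }
    let end = 8 + header_len as usize;
    std::str::from_utf8(&bytes[8..end]).map_err(|e| e.to_string())
}
```

Feeding the first bytes of a checkpoint file to this function and printing the result is a handy way to inspect which tensors a checkpoint contains without loading any weights. For real use, a proper SafeTensors parser also validates the offsets and dtypes.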

This is the foundation

VoiRS is now the official speech synthesis and voice AI backend for the entire COOLJAPAN stack (total ecosystem: 21M+ SLoC Rust, 597 crates, 40+ production-grade libraries):

- SciRS2 / NumRS2: all DSP, linear algebra, and neural operations
- OxiMedia: real-time audio/video pipelines and avatar voice sync
- OptiRS: training optimizers for custom voice models
- ToRSh / OxiRAG: conversational voice RAG and agent audio
- OxiHuman: biomechanical voice animation and lip-sync
- OxiLean: planned future integration for formally verified TTS pipelines

Repository: https://github.com/cool-japan/voirs

Star the repo if you want real-time, memory-safe, sovereign neural speech synthesis without Python or cloud dependencies.

The era of “just pip install TTS” with all its overhead is over.

Pure Rust neural TTS, voice recognition, and sound processing is here — fast, safe, multilingual, and sovereign.

— KitaSan at COOLJAPAN OÜ March 26, 2026