COOLJAPAN
2026-03-26

VoiRS 0.1.0 Release Candidate 1 — Pure Rust Neural TTS, Voice Recognition & Sound Framework

Production-grade pure Rust Text-to-Speech (TTS), Voice Recognition, and Sound framework. VITS + HiFi-GAN/DiffWave vocoders, real-time ≤0.05× RTF on GPU, streaming synthesis, SSML, 20+ languages, ONNX/Kokoro-82M support, SafeTensors checkpoints. Full integration with SciRS2/NumRS2. WASM, GPU (CUDA/Metal), Python/FFI bindings. The sovereign speech AI layer for the entire COOLJAPAN ecosystem (now 21M+ SLoC total).

The speech synthesis and voice AI foundation of the COOLJAPAN ecosystem just reached its first Release Candidate.

Today we released VoiRS 0.1.0 Release Candidate 1 — a complete, production-grade pure Rust framework for neural Text-to-Speech (TTS), Voice Recognition, and high-performance Sound processing.

No Python. No C++. No FFmpeg. No external model runtimes.
No unsafe code in hot paths. No dependency hell.
Just clean, memory-safe, real-time neural speech that compiles to a single static binary (or WASM) and runs everywhere — from laptops to browsers to edge devices to cloud GPUs.

Why VoiRS 0.1.0 RC1 is a game changer

For years, state-of-the-art speech synthesis and voice AI meant depending on heavy Python stacks (Coqui TTS, Tortoise, Piper) or proprietary cloud services.

These tools are powerful but suffer from:

VoiRS 0.1.0 RC1 ends all of that.

It delivers real-time performance while being 100% memory-safe and fully portable.
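For readers new to the metric: the real-time factor (RTF) quoted in the summary is synthesis time divided by the duration of the audio produced, so ≤0.05× means ten seconds of speech take at most half a second to generate. The sketch below only illustrates that arithmetic. The function names `real_time_factor` and `audio_duration` are made up for this example and are not part of the VoiRS API.

```rust
use std::time::Duration;

/// Real-time factor: time spent synthesizing divided by the duration of
/// the audio that was produced. Values below 1.0 are faster than real time.
/// (Illustrative helper, not a VoiRS function.)
pub fn real_time_factor(synthesis_time: Duration, audio_duration: Duration) -> f64 {
    synthesis_time.as_secs_f64() / audio_duration.as_secs_f64()
}

/// Duration of a mono PCM buffer at the given sample rate.
/// (Illustrative helper, not a VoiRS function.)
pub fn audio_duration(num_samples: usize, sample_rate_hz: u32) -> Duration {
    Duration::from_secs_f64(num_samples as f64 / sample_rate_hz as f64)
}
```

In a benchmark you would time the synthesis call with `std::time::Instant`, compute the audio duration from the number of samples returned, and pass both to `real_time_factor`.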
Notable results:

Technical Deep Dive: How We Built a Production-Grade Neural Speech Stack in Pure Rust

The architecture unifies high-performance crates from the COOLJAPAN ecosystem into a clean, end-to-end pipeline:

  1. Core Pipeline
    Text → G2P (pluggable: Phonetisaurus, OpenJTalk, Neural) → Acoustic Model (VITS / FastSpeech2) → Vocoder (HiFi-GAN or DiffWave) → Audio (WAV/OGG).

  2. Neural Models (0.1.0 RC1 highlights)

    • Full VITS + HiFi-GAN inference
    • New DiffWave vocoder training pipeline with real gradient updates and parameter saving to SafeTensors checkpoints (370 parameters, 30 MB)
    • ONNX Runtime integration for Kokoro-82M (9 languages, 54 voices), with zero Python required

  3. Advanced Features

    • Streaming synthesis (chunk-based low-latency)
    • SSML support
    • Multilingual (20+ languages; production-ready English/Japanese, beta Spanish/French/German/Mandarin)
    • Automatic IPA generation via eSpeak NG backend

  4. Hardware & Interop

    • GPU acceleration (CUDA on Linux/Windows, Metal on macOS)
    • WASM target for browser-native synthesis
    • FFI bindings (C, PyO3 Python, NAPI Node.js, Unity/Unreal plugins)
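
To make the core pipeline (item 1) and chunk-based streaming (item 3) concrete, here is a minimal Rust sketch of how such stages could compose. Every name in it (`G2p`, `AcousticModel`, `Vocoder`, `Pipeline`, `synthesize_streaming`, and the `Toy*` stages) is hypothetical and chosen for illustration. None of it is taken from the VoiRS API, and the toy stages only stand in for real models.

```rust
// Hypothetical stage traits. Illustrative only, NOT the VoiRS API.

/// Grapheme-to-phoneme stage: text in, phoneme symbols out.
pub trait G2p {
    fn phonemize(&self, text: &str) -> Vec<String>;
}

/// Acoustic model stage: phonemes in, mel-spectrogram frames out.
pub trait AcousticModel {
    fn mel_frames(&self, phonemes: &[String]) -> Vec<Vec<f32>>;
}

/// Vocoder stage: mel frames in, PCM samples out.
pub trait Vocoder {
    fn vocode(&self, mel: &[Vec<f32>]) -> Vec<f32>;
}

/// Text -> G2P -> acoustic model -> vocoder, with pluggable stages.
pub struct Pipeline<G, A, V> {
    pub g2p: G,
    pub acoustic: A,
    pub vocoder: V,
}

impl<G: G2p, A: AcousticModel, V: Vocoder> Pipeline<G, A, V> {
    /// Whole-utterance synthesis: run the three stages back to back.
    pub fn synthesize(&self, text: &str) -> Vec<f32> {
        let phonemes = self.g2p.phonemize(text);
        let mel = self.acoustic.mel_frames(&phonemes);
        self.vocoder.vocode(&mel)
    }

    /// Chunk-based streaming: vocode `frames_per_chunk` mel frames at a
    /// time and hand each block of samples to `sink` as soon as it exists.
    pub fn synthesize_streaming(
        &self,
        text: &str,
        frames_per_chunk: usize,
        mut sink: impl FnMut(&[f32]),
    ) {
        let phonemes = self.g2p.phonemize(text);
        let mel = self.acoustic.mel_frames(&phonemes);
        // `chunks` panics on a size of 0, so clamp to at least 1.
        for chunk in mel.chunks(frames_per_chunk.max(1)) {
            let samples = self.vocoder.vocode(chunk);
            sink(&samples);
        }
    }
}

// Toy stand-ins so the sketch compiles and runs without any model.

/// Treats every non-whitespace character as one "phoneme".
pub struct CharG2p;
impl G2p for CharG2p {
    fn phonemize(&self, text: &str) -> Vec<String> {
        text.chars()
            .filter(|c| !c.is_whitespace())
            .map(|c| c.to_string())
            .collect()
    }
}

/// Emits two single-bin frames per phoneme, valued by phoneme index.
pub struct ToyAcoustic;
impl AcousticModel for ToyAcoustic {
    fn mel_frames(&self, phonemes: &[String]) -> Vec<Vec<f32>> {
        phonemes
            .iter()
            .enumerate()
            .flat_map(|(i, _)| vec![vec![i as f32]; 2])
            .collect()
    }
}

/// Emits four samples per frame, each equal to the frame's first bin.
pub struct ToyVocoder;
impl Vocoder for ToyVocoder {
    fn vocode(&self, mel: &[Vec<f32>]) -> Vec<f32> {
        mel.iter()
            .flat_map(|frame| std::iter::repeat(frame[0]).take(4))
            .collect()
    }
}
```

With the toy stages, synthesizing "hi" produces 2 phonemes, 4 frames, and 16 samples, and streaming with 3 frames per chunk delivers the same samples in two callbacks. The toy vocoder is stateless per frame, which is why the chunked output matches the whole-utterance output exactly. A real neural vocoder generally needs some overlapping context at chunk boundaries, which this sketch deliberately leaves out.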

Key Rust advantages:

What’s inside 0.1.0 RC1 (released March 26)

This is the foundation

VoiRS is now the official speech synthesis and voice AI backend for the entire COOLJAPAN stack (total ecosystem: 21M+ SLoC Rust, 597 crates, 40+ production-grade libraries):

Repository: https://github.com/cool-japan/voirs

Star the repo if you want real-time, memory-safe, sovereign neural speech synthesis without Python or cloud dependencies.

The era of “just pip install TTS” with all its overhead is over.

Pure Rust neural TTS, voice recognition, and sound processing is here — fast, safe, multilingual, and sovereign.

KitaSan at COOLJAPAN OÜ, March 26, 2026