MeCrab 0.1.0 Released — A Pure Rust MeCab, Japanese Morphological Analysis Without the C Toolchain

MeCab is the backbone of Japanese NLP. It is also a C++ library you have to build, link, and trust. MeCrab is a pure Rust replacement — and today it ships its first release.

MeCrab 0.1.0 is a high-performance, thread-safe morphological analyzer written entirely in Rust and compatible with MeCab dictionaries in the IPADIC format.

For anyone who has wired Japanese text into a pipeline, the shape of the problem is familiar: you reach for MeCab, and you inherit a C++ build, a system library to locate, and FFI bindings that fight you on every platform. MeCrab takes the dictionaries you already have and reads them from pure Rust. The result is the same morphological analysis — segmentation into morphemes, part-of-speech tagging, readings — with none of the native-toolchain friction.

No C. No C++. No libmecab to find on the system. MeCrab is pure Rust: it parses your IPADIC dictionary directly, runs the Viterbi lattice in safe code, and compiles to a single static binary (and to WASM, and to a Python extension). Thread safety is not a runtime promise bolted on afterward — it falls out of Rust’s ownership model, so concurrent analysis is safe by construction.
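
To make "safe by construction" a little more concrete, here is a minimal sketch using only the standard library. The ToyAnalyzer type and its whitespace-based tokenize method are hypothetical stand-ins, not MeCrab's API. The point is only that a value which is read-only after construction can be borrowed by several threads at once, and that the compiler rejects the program if the type is not Sync.

```rust
use std::thread;

/// Hypothetical stand-in for an analyzer: immutable once constructed.
pub struct ToyAnalyzer {
    stop_words: Vec<String>,
}

impl ToyAnalyzer {
    pub fn new(stop_words: &[&str]) -> Self {
        Self {
            stop_words: stop_words.iter().map(|s| s.to_string()).collect(),
        }
    }

    /// Takes `&self`, so any number of threads may call it concurrently.
    /// Splits on whitespace and drops stop words (a toy, not morphological analysis).
    pub fn tokenize<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_whitespace()
            .filter(|w| !self.stop_words.iter().any(|s| s == w))
            .collect()
    }
}

/// Analyze each document on its own scoped thread, sharing one analyzer by reference.
/// Results come back in the same order as `docs`.
pub fn analyze_all<'a>(analyzer: &ToyAnalyzer, docs: &[&'a str]) -> Vec<Vec<&'a str>> {
    thread::scope(|scope| {
        let handles: Vec<_> = docs
            .iter()
            .map(|&doc| scope.spawn(move || analyzer.tokenize(doc)))
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().expect("worker thread panicked"))
            .collect()
    })
}
```

No lock and no unsafe block appears anywhere: sharing works because every method takes a shared reference, which is the property the paragraph above describes.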

Why MeCrab matters

MeCab is excellent and battle-tested. The pain was never the algorithm — it was everything around it. MeCrab keeps the algorithm and removes the friction:

Compatible with the dictionaries you have. MeCrab reads MeCab IPADIC-format dictionaries. You point it at ipadic-utf8 and it works — no re-training, no conversion step.
Zero-copy dictionary loading. Dictionaries are memory-mapped, so startup does not pay to copy hundreds of thousands of entries into the heap — the OS pages them in on demand.
A SIMD-accelerated Viterbi. The lattice search — the heart of morphological analysis — uses AVX2-accelerated cost computation, with parallel batch processing for throughput on large corpora.
Live dictionary updates. Add or remove words at runtime without restarting the process — useful when your domain vocabulary (product names, jargon) is not in IPADIC.
One core, three targets. The same core ships as a native binary, a WASM module, and a Python extension, so the analyzer that runs in your backend can also run in the browser.
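
For readers who have not met the lattice search before, the sketch below shows the idea in plain scalar Rust. It is a toy under stated assumptions, not MeCrab's implementation: there is no SIMD, the Entry type, the three classes (0 for sentence boundaries, 1 for nouns, 2 for particles), and every cost in toy_model are invented for illustration, and real IPADIC entries carry separate left and right context ids.

```rust
/// One toy dictionary entry: surface form, coarse class id, and word cost.
#[derive(Clone, Copy)]
pub struct Entry {
    pub surface: &'static str,
    pub class: usize,
    pub cost: i32,
}

/// A lattice node: which entry it is, the best total cost of reaching it,
/// and a back-pointer (end offset, node index) to its best predecessor.
#[derive(Clone, Copy)]
struct Node {
    entry: usize,
    total: i32,
    prev: Option<(usize, usize)>,
}

/// Invented dictionary and connection matrix (`conn[left_class][right_class]`).
pub fn toy_model() -> (Vec<Entry>, Vec<Vec<i32>>) {
    let dict = vec![
        Entry { surface: "すもも", class: 1, cost: 10 },
        Entry { surface: "もも", class: 1, cost: 10 },
        Entry { surface: "うち", class: 1, cost: 10 },
        Entry { surface: "も", class: 2, cost: 2 },
        Entry { surface: "の", class: 2, cost: 2 },
    ];
    // Rows and columns: 0 = BOS/EOS, 1 = noun, 2 = particle.
    let conn = vec![vec![0, 0, 10], vec![0, 20, 0], vec![10, 0, 20]];
    (dict, conn)
}

/// Minimum-cost segmentation of `text`, or `None` if the dictionary cannot cover it.
pub fn viterbi(text: &str, dict: &[Entry], conn: &[Vec<i32>]) -> Option<(Vec<&'static str>, i32)> {
    let n = text.len();
    // ends[p] holds every node whose surface ends at byte offset p.
    let mut ends: Vec<Vec<Node>> = vec![Vec::new(); n + 1];
    for start in 0..n {
        // Skip non-boundaries and offsets that no path reaches.
        if !text.is_char_boundary(start) || (start > 0 && ends[start].is_empty()) {
            continue;
        }
        for (ei, e) in dict.iter().enumerate() {
            if e.surface.is_empty() || !text[start..].starts_with(e.surface) {
                continue;
            }
            // Pick the cheapest predecessor, including the connection cost.
            let best = if start == 0 {
                Some((conn[0][e.class], None))
            } else {
                ends[start]
                    .iter()
                    .enumerate()
                    .map(|(ni, p)| (p.total + conn[dict[p.entry].class][e.class], Some((start, ni))))
                    .min_by_key(|&(cost, _)| cost)
            };
            if let Some((cost, prev)) = best {
                ends[start + e.surface.len()].push(Node { entry: ei, total: cost + e.cost, prev });
            }
        }
    }
    // Close the lattice with the end-of-sentence connection, then walk back.
    let (mut cur, total) = ends[n]
        .iter()
        .enumerate()
        .map(|(ni, p)| (Some((n, ni)), p.total + conn[dict[p.entry].class][0]))
        .min_by_key(|&(_, cost)| cost)?;
    let mut path = Vec::new();
    while let Some((pos, ni)) = cur {
        let node = ends[pos][ni];
        path.push(dict[node.entry].surface);
        cur = node.prev;
    }
    path.reverse();
    Some((path, total))
}
```

With these invented costs, which penalize noun-noun and particle-particle adjacency, the classic sentence すもももももももものうち comes out as すもも / も / もも / も / もも / の / うち. Change the connection matrix and the preferred segmentation changes with it, which is why the cost computation is the hot loop worth accelerating.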

This is a 0.1.0 — an early but solid first release. The core is real and tested: roughly 11,000 lines of Rust, 174 passing tests, 4 fuzz targets, and zero clippy warnings.

Technical Deep Dive: the workspace

MeCrab is organized as a focused Cargo workspace, so you take only the weight you need:

mecrab — the core runtime library. Deliberately lightweight: memory-mapped IPADIC loading, the Viterbi lattice, the public MeCrab analyzer type, and supporting modules for streaming and phonetic transduction. This is the crate you embed.
kizame — the CLI (KizaMe, 刻め!). The command-line front-end. Lightweight by default — cargo install kizame gives you parsing, wakati output, and JSON without dragging in the heavy data pipeline.
mecrab-word2vec — word embeddings. A pure-Rust Word2Vec implementation with Hogwild! lock-free parallelization, for training vectors over a tokenized corpus.
mecrab-builder — the dictionary pipeline. The heavier crate (it pulls in tokio and an HTTP client) for building semantically-enriched dictionaries — kept out of the default install so the common case stays small.

The split matters: the runtime analyzer that most users embed depends on a minimal set of crates (memmap2, byteorder, encoding_rs, and yada for the double-array trie), while the corpus-building machinery lives in mecrab-builder and stays behind the opt-in builder feature, so it never bloats a simple tokenization dependency.
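
The memory-mapped loading in the core crate is easier to picture with a small example. The sketch below is an illustration only: the record layout (a little-endian u16 length followed by UTF-8 bytes) is invented here and is not MeCab's binary dictionary format, and WordView, WordIter, and encode are hypothetical names. What it does show is the zero-copy pattern, where every word handed out is a borrowed slice pointing into the original byte buffer.

```rust
use std::str;

/// Borrowed view over a serialized word list.
/// Hypothetical layout: repeated records of [u16 LE length][UTF-8 bytes].
pub struct WordView<'a> {
    bytes: &'a [u8],
}

impl<'a> WordView<'a> {
    pub fn new(bytes: &'a [u8]) -> Self {
        Self { bytes }
    }

    /// Iterate over words as `&str` slices that borrow from the buffer.
    pub fn iter(&self) -> WordIter<'a> {
        WordIter { rest: self.bytes }
    }
}

pub struct WordIter<'a> {
    rest: &'a [u8],
}

impl<'a> Iterator for WordIter<'a> {
    type Item = &'a str;

    /// Yields the next word; stops at a truncated record or invalid UTF-8.
    fn next(&mut self) -> Option<&'a str> {
        if self.rest.len() < 2 {
            return None;
        }
        let len = u16::from_le_bytes([self.rest[0], self.rest[1]]) as usize;
        let body = self.rest.get(2..2 + len)?;
        self.rest = &self.rest[2 + len..];
        // Validates the bytes in place; no allocation and no copy.
        str::from_utf8(body).ok()
    }
}

/// Helper that builds a buffer in the hypothetical layout (words must fit in u16).
pub fn encode(words: &[&str]) -> Vec<u8> {
    let mut out = Vec::new();
    for w in words {
        out.extend_from_slice(&(w.len() as u16).to_le_bytes());
        out.extend_from_slice(w.as_bytes());
    }
    out
}
```

Here the bytes live in a Vec, but the view only needs a byte slice. A memmap2 mapping dereferences to a byte slice as well, so the same borrowing pattern applies to a mapped file, with the OS paging data in as it is touched.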

Getting Started

Install the KizaMe CLI:

cargo install kizame

Initialize a dictionary (it will locate a system IPADIC), then parse:

# Find and register a system IPADIC dictionary
kizame dict init

# Parse the classic ambiguous sentence
echo "すもももももももものうち" | kizame

# Space-separated tokens (wakati)
echo "日本語の形態素解析" | kizame -w

# JSON output
echo "東京都" | kizame -O json

Or embed the analyzer in Rust:

[dependencies]
mecrab = "0.1"

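use mecrab::MeCrab;

// `?` only works inside a function that returns a Result.
// Box<dyn Error> is a generic choice here; it assumes MeCrab's error type implements std::error::Error.
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mecrab = MeCrab::new()?;
    let result = mecrab.parse("すもももももももものうち")?;
    println!("{}", result);

    // Add a domain-specific word at runtime, with no restart
    mecrab.add_word("ChatGPT", "チャットジーピーティー", "チャットジーピーティー", 5000);
    Ok(())
}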
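
Note that add_word above is called on a binding that is not declared mut, which suggests some form of interior mutability. How MeCrab does this is not described here, so the following is only a sketch of one common approach: a user-word overlay behind a read-write lock, consulted alongside the immutable system dictionary. Overlay, UserWord, remove_word, and lookup are hypothetical names, not MeCrab's API.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Data stored for a user-supplied word (fields are illustrative).
#[derive(Clone, Debug, PartialEq)]
pub struct UserWord {
    pub reading: String,
    pub cost: i32,
}

/// Hypothetical overlay dictionary that can change while analysis is running.
#[derive(Default)]
pub struct Overlay {
    words: RwLock<HashMap<String, UserWord>>,
}

impl Overlay {
    /// Takes `&self`: the write lock supplies the mutability, so callers
    /// can share the overlay across threads without `&mut` access.
    pub fn add_word(&self, surface: &str, reading: &str, cost: i32) {
        let mut map = self.words.write().expect("lock poisoned");
        map.insert(
            surface.to_string(),
            UserWord { reading: reading.to_string(), cost },
        );
    }

    /// Removes a word; returns true if it was present.
    pub fn remove_word(&self, surface: &str) -> bool {
        self.words
            .write()
            .expect("lock poisoned")
            .remove(surface)
            .is_some()
    }

    /// Read path: many lookups may hold the read lock at the same time.
    pub fn lookup(&self, surface: &str) -> Option<UserWord> {
        self.words
            .read()
            .expect("lock poisoned")
            .get(surface)
            .cloned()
    }
}
```

The design choice worth noticing is that readers pay only for a shared lock, so occasional updates do not serialize the much more frequent lookups.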

What’s inside

MeCab IPADIC compatibility — reads standard IPADIC-format dictionaries directly from pure Rust.
Memory-mapped, zero-copy loading — dictionaries are mapped, not copied, into the address space.
SIMD-accelerated Viterbi — AVX2-accelerated lattice cost computation with parallel batch processing.
Live dictionary updates — add/remove words at runtime without a restart.
Streaming processing — sentence-boundary detection for analyzing large text.
Text normalization — NFKC, width conversion, case folding.
Phonetic transduction — Kana ↔ Romaji, X-SAMPA, and IPA conversion.
Pure-Rust Word2Vec — Hogwild!-parallelized embedding training in mecrab-word2vec.
Cross-platform — native binaries, WASM, and Python bindings from one core.
The KizaMe CLI — install with cargo install kizame; lightweight by default.
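
Of the features above, streaming is the one most easily shown in a few lines. The sketch below is a simplified illustration, not MeCrab's streaming module: SentenceBuffer is a hypothetical type, and it assumes that only 。, ！, and ？ end a sentence. It buffers incoming chunks and releases text only once a full sentence is available, so an analyzer never sees a sentence cut in half.

```rust
/// Sentence terminators assumed by this sketch.
const TERMINATORS: [char; 3] = ['。', '！', '？'];

/// Accumulates incoming text chunks and releases complete sentences.
#[derive(Default)]
pub struct SentenceBuffer {
    pending: String,
}

impl SentenceBuffer {
    /// Appends a chunk and returns every sentence completed so far,
    /// each including its terminator and trimmed of surrounding whitespace.
    pub fn push(&mut self, chunk: &str) -> Vec<String> {
        self.pending.push_str(chunk);
        let mut done = Vec::new();
        while let Some(idx) = self.pending.find(|c: char| TERMINATORS.contains(&c)) {
            // `idx` is the terminator's byte offset; step past its full UTF-8 width.
            let width = self.pending[idx..].chars().next().map_or(0, char::len_utf8);
            let sentence: String = self.pending.drain(..idx + width).collect();
            let trimmed = sentence.trim();
            if !trimmed.is_empty() {
                done.push(trimmed.to_string());
            }
        }
        done
    }

    /// Returns whatever is left at end of input, if anything.
    pub fn finish(self) -> Option<String> {
        let rest = self.pending.trim();
        if rest.is_empty() {
            None
        } else {
            Some(rest.to_string())
        }
    }
}
```

Feeding chunks from a file or socket into push and passing each returned sentence to the analyzer keeps memory bounded by the longest sentence rather than by the size of the input.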

Tips

Start with kizame dict init. It locates a system IPADIC so you do not have to pass -d on every invocation. To override, use kizame -d /var/lib/mecab/dic/ipadic-utf8 parse.
Use -w for tokenization, -O json for downstream tooling. Wakati output is the right shape for search indexing; JSON carries the part-of-speech and feature fields your pipeline probably wants.
Embed the lightweight mecrab crate, not the builder. For runtime analysis you only need mecrab; cargo install kizame stays small because the heavy Wikidata pipeline is gated behind --features builder.
Add domain vocabulary at runtime. add_word(...) injects product names and jargon that IPADIC has never heard of — no dictionary rebuild, no process restart.
Turn on parallel for batch corpora. mecrab = { version = "0.1", features = ["json", "parallel"] } enables parallel batch processing when you are analyzing many documents at once.

This is the foundation

MeCrab is the first natural-language tool in the COOLJAPAN ecosystem — a Pure Rust stack that already includes OxiBLAS and OxiCode for numerics and serialization, alongside the broader scientific-computing work. Japanese morphological analysis is a load-bearing primitive for everything downstream — search, embeddings, language modeling — and getting it into safe, dependency-light Rust is the groundwork that the rest of that work can stand on. This 0.1.0 is the start of that line.

Repository: https://github.com/cool-japan/mecrab

Star the repo if you have ever lost an afternoon to building MeCab from source. Pure Rust Japanese NLP is here — fast, safe, and free of the C toolchain.

— KitaSan at COOLJAPAN OÜ January 6, 2026