A 1.7B ternary model is only as fast as the work you can skip — so 0.1.3 teaches OxiBonsai to skip the parts it has already done.
Today we released OxiBonsai 0.1.3 — a serving-efficiency release that adds a prefix-cache-aware inference engine, runtime tokenizer auto-detection, and a GPU weight cache, so repeated requests reuse work instead of redoing it.
No llama.cpp. No BLAS. No C, no C++, no Fortran, no FFI. OxiBonsai is the Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai family — the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128) — running on CPU (AVX2 / AVX-512 / NEON / WASM SIMD), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC). The previous release, 0.1.2, opened the door to ONNX-quantized weights by letting oxibonsai convert --onnx ingest onnx-community Ternary models (MatMulNBits, bits=2) and repack them as GGUF. 0.1.3 turns its attention from getting models in to serving them well.
Why OxiBonsai 0.1.3 matters
The fastest token is the one you never have to compute. Chat and agent workloads spend a surprising fraction of their compute re-processing the same prefix over and over: a shared system prompt, a tool schema, a few-shot preamble. Every request re-runs prefill across those identical tokens before it gets to the part that actually differs.
0.1.3 closes that gap with a prefix-cache-aware inference engine. When a new request shares a prefix with a cached one, the engine reuses the existing KV-cache for the common span and only computes the suffix. The hard part is not the speedup — it’s making sure the speedup is free of side effects. A cache that produces even slightly different logits than a cold run is a correctness bug waiting to surface in production. So this release ships with a guarantee, not just an optimization: the cached path matches the cold-cache path byte-for-byte.
Two more pieces of friction get removed in the same release. Tokenizer auto-detection means OxiBonsai now figures out at runtime whether a model ships a SentencePiece tokenizer or a HuggingFace tokenizers JSON, and bridges to the right one — no manual flag, no guessing. And on the GPU side, a new weight cache uploads each layer’s weights to the device once and reuses them across every decode step, instead of re-staging them token after token.
Technical Deep Dive
PrefixCachedEngine — KV reuse with cold/warm parity. The new engine in oxibonsai-runtime tracks cached prefixes and serves a matching request by replaying the stored KV-cache for the shared span, then running prefill only over the divergent suffix. The release’s headline correctness fix is exactly here: the cached path’s output now matches the cold-cache path byte-for-byte in tests. That parity is what makes prefix caching safe to leave on — a warm hit and a cold miss for the same final prompt produce identical tokens, so caching never changes what the model says, only how fast it says it. The server and engine were refactored to thread the prefix-cache plumbing through the request path.
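As a rough illustration of the idea (not the actual PrefixCachedEngine code), the sketch below models a KV entry as a single number that depends on every token before it. The names `ToyPrefixCache`, `prefill`, and `common_prefix_len` are hypothetical and chosen only for this example; the real engine stores per-layer key/value tensors and may track many cached prefixes rather than one.

```rust
/// Toy stand-in for one token's KV-cache entry. A real entry holds per-layer
/// key/value tensors; here it is one number that depends on the whole prefix
/// so far, which is the property that makes prefix reuse valid.
type KvEntry = u64;

/// Hypothetical "prefill": appends one entry to `kv` per token in `tokens`.
fn prefill(kv: &mut Vec<KvEntry>, tokens: &[u32]) {
    for &t in tokens {
        let prev = kv.last().copied().unwrap_or(0xcbf29ce484222325);
        kv.push((prev ^ u64::from(t)).wrapping_mul(0x100000001b3));
    }
}

/// Length of the shared leading span of two token sequences.
fn common_prefix_len(a: &[u32], b: &[u32]) -> usize {
    a.iter().zip(b).take_while(|(x, y)| x == y).count()
}

/// Illustrative single-entry prefix cache: remembers the last prompt and its KV.
#[derive(Default)]
struct ToyPrefixCache {
    tokens: Vec<u32>,
    kv: Vec<KvEntry>,
}

impl ToyPrefixCache {
    /// Returns the KV state for `prompt` and the number of tokens actually computed.
    fn run(&mut self, prompt: &[u32]) -> (Vec<KvEntry>, usize) {
        let shared = common_prefix_len(&self.tokens, prompt);
        // Reuse the cached entries for the shared span, then compute only the suffix.
        let mut kv = self.kv[..shared].to_vec();
        prefill(&mut kv, &prompt[shared..]);
        self.tokens = prompt.to_vec();
        self.kv = kv.clone();
        (kv, prompt.len() - shared)
    }
}
```

Running two prompts that share a four-token system prefix through `ToyPrefixCache::run` computes only the differing suffix on the second call, and the resulting state equals what a cold `prefill` over the full second prompt produces. That equality is the toy analogue of the cold/warm parity described above.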
Tokenizer auto-detection + bridge. OxiBonsai detects the tokenizer flavor at runtime — SentencePiece versus HF tokenizers — and routes through a tokenizer bridge so the rest of the stack stays agnostic. oxibonsai-tokenizer got upgrades to support this, plus new CLI tokenizer-management commands. Pairing the detector with the existing tokenizer download flow means you can fetch and use a model’s tokenizer without telling OxiBonsai which kind it is.
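One plausible way to tell the two flavors apart is to look at the first byte of the tokenizer file. The sketch below is an assumption about how such a detector could work, not OxiBonsai's actual logic: `TokenizerFlavor` and `detect_flavor` are hypothetical names, and the heuristic relies on a HuggingFace tokenizers file being a JSON object and a SentencePiece model being a serialized protobuf that typically begins with the byte 0x0A.

```rust
/// Hypothetical result of sniffing a tokenizer file.
#[derive(Debug, PartialEq, Eq)]
enum TokenizerFlavor {
    HfJson,
    SentencePiece,
    Unknown,
}

/// Guess the tokenizer flavor from the raw file bytes (assumed heuristic).
fn detect_flavor(bytes: &[u8]) -> TokenizerFlavor {
    // Ignore a UTF-8 byte-order mark if one is present.
    let body = bytes.strip_prefix(&[0xEF_u8, 0xBB, 0xBF]).unwrap_or(bytes);
    match body.first().copied() {
        // A JSON object opens with '{'.
        Some(b'{') => TokenizerFlavor::HfJson,
        // Protobuf tag for field 1 with a length-delimited payload.
        Some(0x0A) => TokenizerFlavor::SentencePiece,
        _ => TokenizerFlavor::Unknown,
    }
}
```

A bridge could then dispatch on the returned variant so the rest of the stack never sees which backend was chosen. A production detector would likely validate more than one byte before committing to a parser.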
GPU weight cache for BonsaiModel. Both the Metal and CUDA full-forward paths now share a gpu_cache.rs module that holds device-resident weights. Weights are uploaded a single time and reused across decode steps, so per-token decode stops paying repeated host→device transfer cost. To support this cleanly, the full-forward layer was split into dedicated forward_metal.rs and forward_cuda.rs modules with the shared cache between them.
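The upload-once pattern can be sketched in a few lines. Everything here is illustrative: `DeviceBuffer`, `ToyWeightCache`, and `decode_steps` are hypothetical names, and the "device" is an ordinary `Vec<u8>` standing in for a Metal or CUDA allocation. The sketch makes no claim about what gpu_cache.rs actually contains.

```rust
use std::collections::HashMap;

/// Stand-in for a device-resident buffer; a real one would wrap a GPU allocation.
struct DeviceBuffer {
    data: Vec<u8>,
}

/// Illustrative upload-once cache keyed by layer index.
#[derive(Default)]
struct ToyWeightCache {
    buffers: HashMap<usize, DeviceBuffer>,
    uploads: usize, // counts simulated host-to-device transfers
}

impl ToyWeightCache {
    /// Returns the cached buffer for `layer`, uploading only on first use.
    fn get_or_upload(&mut self, layer: usize, host_weights: &[u8]) -> &DeviceBuffer {
        let uploads = &mut self.uploads;
        self.buffers.entry(layer).or_insert_with(|| {
            *uploads += 1;
            DeviceBuffer { data: host_weights.to_vec() }
        })
    }
}

/// Simulates `steps` decode steps over all layers; returns total bytes read.
fn decode_steps(cache: &mut ToyWeightCache, layers: &[Vec<u8>], steps: usize) -> usize {
    let mut bytes_read = 0;
    for _ in 0..steps {
        for (i, weights) in layers.iter().enumerate() {
            bytes_read += cache.get_or_upload(i, weights).data.len();
        }
    }
    bytes_read
}
```

With two layers and five decode steps, `uploads` ends at two rather than ten: every step after the first reads weights that are already resident.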
CUDA ternary encode kernels. oxibonsai-kernels gains CUDA ternary (TQ2) encoding kernels along the encode_ternary path, plus a full-forward CUDA fused layer, extending the native NVRTC backend’s ternary coverage. This is the same fused-full-forward design that drives OxiBonsai’s GPU throughput: a single command buffer per token rather than one submission per GEMV. On a Ternary-Bonsai-1.7B model, the fused paths reach roughly 50 tok/s on Apple Silicon Metal and roughly 21.9 tok/s on an RTX 3060 with CUDA.
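To make "ternary encoding" concrete, here is a minimal CPU sketch that packs weights from {-1, 0, +1} into 2 bits each, four per byte. It is a generic illustration only: the code mapping, the bit order, and the names `pack_ternary` and `unpack_ternary` are assumptions, and the real TQ2_0_g128 format also groups weights into blocks of 128 with additional per-group data that this sketch omits.

```rust
/// Packs ternary weights into 2-bit codes, four per byte (assumed layout).
/// Returns `None` if any value is outside {-1, 0, 1}.
fn pack_ternary(values: &[i8]) -> Option<Vec<u8>> {
    let mut out = vec![0u8; (values.len() + 3) / 4];
    for (i, &v) in values.iter().enumerate() {
        // Map -1, 0, +1 to the codes 0, 1, 2.
        let code = match v {
            -1 => 0u8,
            0 => 1,
            1 => 2,
            _ => return None,
        };
        out[i / 4] |= code << ((i % 4) * 2);
    }
    Some(out)
}

/// Reverses `pack_ternary` for the first `len` weights.
/// Panics if `len` exceeds the number of packed slots.
fn unpack_ternary(packed: &[u8], len: usize) -> Vec<i8> {
    (0..len)
        .map(|i| ((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as i8 - 1)
        .collect()
}
```

Six weights fit in two bytes, and a round trip through `pack_ternary` and `unpack_ternary` returns the original values. A GPU kernel would perform the same per-element mapping in parallel across threads.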
Plumbing and policy. This release bumps the oxifft workspace dependency to 0.3 and updates the CUDA dependency (cudarc) for cleaner CUDA 11.x / 12.x compatibility. The download_ternary.sh script now downloads shards in parallel with clearer error messages, and .gitattributes enforces LF line endings on shell scripts. Internally, oxibonsai-model::model::types (1857 lines) was refactored into a types/ directory — mod.rs, forward_cuda.rs, forward_metal.rs, gpu_cache.rs — keeping every source file under the 2000-line ceiling.
Getting Started
```sh
cargo install oxibonsai-cli   # installs the `oxibonsai` binary (Rust 1.86+)
```
Grab a model and its tokenizer, then start the OpenAI-compatible server:
```sh
# Fetch + convert a ternary model to GGUF (also downloads the tokenizer)
./scripts/download_ternary.sh 1.7b

# Serve it: repeated requests that share a prefix reuse the cached KV-cache
oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf --host 127.0.0.1 --port 8080
```
Send two chat completions that share the same system prompt and watch the second one skip the shared prefill:
```sh
curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "messages": [
    {"role": "system", "content": "You are a terse assistant for the COOLJAPAN ecosystem."},
    {"role": "user", "content": "Explain ternary quantization in one sentence."}
  ]
}'
# Re-send with the SAME system prompt and a new user turn: the shared prefix is reused.
```
No tokenizer flag is required: OxiBonsai auto-detects SentencePiece vs. HF tokenizers at load time.
What’s New in 0.1.3
- Prefix-cache-aware inference engine (`PrefixCachedEngine`): reuses KV-cache across requests, computing only the divergent suffix of a shared prefix.
- Byte-identical cold/warm parity: the cached path now matches the cold-cache path byte-for-byte, so prefix caching never changes the output, only the latency.
- Runtime tokenizer auto-detection: SentencePiece vs. HuggingFace `tokenizers` is detected automatically via a tokenizer bridge; no manual flag.
- GPU weight cache for `BonsaiModel`: weights upload to the device once and are reused across every decode step (Metal and CUDA, via a shared `gpu_cache.rs`).
- CUDA ternary encode kernels: new `encode_ternary` path and a full-forward CUDA fused layer in `oxibonsai-kernels`.
- `oxibonsai-tokenizer` upgrades + new CLI tokenizer-management commands.
- Refactor and deps: full-forward split into `forward_metal.rs` / `forward_cuda.rs`; `oxifft` → 0.3; `cudarc` updated for CUDA 11.x / 12.x; `download_ternary.sh` parallel downloads; `.gitattributes` LF enforcement; `model::types` split under the 2000-line policy.
Tips
- Lean on shared system prompts. Put the stable, identical part of every request (system prompt, tool schema, few-shot preamble) first. The `PrefixCachedEngine` reuses the KV-cache for that shared span across requests, so chat and agent loops pay prefill on the common prefix once.
- Trust the warm path. Because the cached path is byte-identical to a cold run, you can leave prefix caching on in production without worrying that a cache hit will subtly change a model’s answer versus a cache miss.
- Drop the tokenizer flag. Tokenizer auto-detection removes the SentencePiece-vs-HF config decision. Fetch a tokenizer with `oxibonsai tokenizer download` and let the runtime pick the right backend at load time; reach for the new CLI tokenizer-management commands when you want to inspect or manage one explicitly.
- Keep multi-step decode on the GPU. With the GPU weight cache, long generations and multi-step decode amortize a single weight upload across every step, so the longer the decode, the more the one-time upload pays off. Build with `--features metal` (Apple Silicon) or `--features native-cuda` (NVIDIA) to enable the fused full-forward path that uses it.
- Refresh your CUDA build. The `cudarc` update improves CUDA 11.x / 12.x compatibility, and the new ternary `encode_ternary` kernels extend native NVRTC coverage; rebuild your `native-cuda` binary on this release to pick both up.
This is the foundation
OxiBonsai is built entirely on the COOLJAPAN ecosystem — SciRS2 for tensor and activation primitives, OxiBLAS for GEMM/GEMV math, OxiFFT (now 0.3) for optional RoPE acceleration, OxiARC for Pure Rust compression, and OxiONNX for ingesting ONNX-quantized weights — to serve PrismML’s Bonsai 1-bit and ternary models. Every default-feature dependency is Pure Rust: zero C/C++/Fortran, zero FFI. The GPU backends (metal, native-cuda) are opt-in features that bring in vendor drivers, while the default path stays sovereign top to bottom.
Repository: https://github.com/cool-japan/oxibonsai
Star the repo if you want serving infrastructure that gets faster by skipping redundant work — without ever changing the answer.
Pure Rust sovereign sub-2-bit inference is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ May 3, 2026