OxiBonsai 0.1.3 Released — Prefix-Cache-Aware Serving with Byte-Identical Warm Paths

OxiBonsai 0.1.3 makes sub-2-bit serving smarter: a prefix-cache-aware engine that reuses KV-cache across requests with byte-identical cold/warm parity, runtime tokenizer auto-detection, and a GPU weight cache that uploads once. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem.


A 1.7B ternary model is only as fast as the work you can skip — so 0.1.3 teaches OxiBonsai to skip the parts it has already done.

Today we released OxiBonsai 0.1.3 — a serving-efficiency release that adds a prefix-cache-aware inference engine, runtime tokenizer auto-detection, and a GPU weight cache, so repeated requests reuse work instead of redoing it.

No llama.cpp. No BLAS. No C, no C++, no Fortran, no FFI. OxiBonsai is the Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai family, covering the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128), and running on CPU (AVX2 / AVX-512 / NEON / WASM SIMD), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC). The previous release, 0.1.2, opened the door to ONNX-quantized weights: its `oxibonsai convert --onnx` command ingests onnx-community Ternary models (MatMulNBits, bits=2) and repacks them as GGUF. 0.1.3 turns its attention from getting models in to serving them well.

Why OxiBonsai 0.1.3 matters

The fastest token is the one you never have to compute. Chat and agent workloads spend a surprising fraction of their compute re-processing the same prefix over and over: a shared system prompt, a tool schema, a few-shot preamble. Every request re-runs prefill across those identical tokens before it gets to the part that actually differs.

0.1.3 closes that gap with a prefix-cache-aware inference engine. When a new request shares a prefix with a cached one, the engine reuses the existing KV-cache for the common span and only computes the suffix. The hard part is not the speedup — it’s making sure the speedup is free of side effects. A cache that produces even slightly different logits than a cold run is a correctness bug waiting to surface in production. So this release ships with a guarantee, not just an optimization: the cached path matches the cold-cache path byte-for-byte.

Two more pieces of friction get removed in the same release. Tokenizer auto-detection means OxiBonsai now figures out at runtime whether a model ships a SentencePiece tokenizer or a HuggingFace tokenizers JSON, and bridges to the right one — no manual flag, no guessing. And on the GPU side, a new weight cache uploads each layer’s weights to the device once and reuses them across every decode step, instead of re-staging them token after token.

Technical Deep Dive

PrefixCachedEngine — KV reuse with cold/warm parity. The new engine in oxibonsai-runtime tracks cached prefixes and serves a matching request by replaying the stored KV-cache for the shared span, then running prefill only over the divergent suffix. The release’s headline correctness fix is exactly here: the cached path’s output now matches the cold-cache path byte-for-byte in tests. That parity is what makes prefix caching safe to leave on — a warm hit and a cold miss for the same final prompt produce identical tokens, so caching never changes what the model says, only how fast it says it. The server and engine were refactored to thread the prefix-cache plumbing through the request path.
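To make the reuse-then-suffix flow concrete, here is a minimal sketch in plain Rust. It is not the PrefixCachedEngine implementation: `KvEntry`, `PrefixCache`, `prefill`, and `serve` are hypothetical names invented for illustration, and the "KV state" is a toy record per token rather than real attention tensors. What it shows is the invariant the release talks about: a warm request that reuses the shared span ends up with exactly the same state as a cold run over the full prompt.

```rust
/// Hypothetical stand-in for one token's cached key/value state.
#[derive(Clone, Debug, PartialEq)]
pub struct KvEntry {
    pub token: u32,
    pub position: usize,
}

/// Illustrative prefix cache: remembers the KV state of every prompt served so far.
#[derive(Default)]
pub struct PrefixCache {
    entries: Vec<(Vec<u32>, Vec<KvEntry>)>,
}

impl PrefixCache {
    /// Returns the length of the longest prefix shared with any cached prompt,
    /// together with a copy of the KV entries covering that shared span.
    pub fn lookup(&self, prompt: &[u32]) -> (usize, Vec<KvEntry>) {
        let mut best: (usize, Vec<KvEntry>) = (0, Vec::new());
        for (tokens, kv) in &self.entries {
            let shared = tokens
                .iter()
                .zip(prompt)
                .take_while(|(a, b)| a == b)
                .count();
            if shared > best.0 {
                best = (shared, kv[..shared].to_vec());
            }
        }
        best
    }

    pub fn insert(&mut self, prompt: Vec<u32>, kv: Vec<KvEntry>) {
        self.entries.push((prompt, kv));
    }
}

/// Toy "prefill": produces one KV entry per token, starting at position `start`.
pub fn prefill(tokens: &[u32], start: usize) -> Vec<KvEntry> {
    tokens
        .iter()
        .enumerate()
        .map(|(i, &token)| KvEntry { token, position: start + i })
        .collect()
}

/// Serves a prompt by reusing cached KV for the shared prefix and running
/// prefill only over the divergent suffix. Returns the full KV state and the
/// number of tokens that actually had to be computed.
pub fn serve(cache: &mut PrefixCache, prompt: &[u32]) -> (Vec<KvEntry>, usize) {
    let (shared, mut kv) = cache.lookup(prompt);
    let suffix = &prompt[shared..];
    kv.extend(prefill(suffix, shared));
    cache.insert(prompt.to_vec(), kv.clone());
    (kv, suffix.len())
}
```

Serving `[1, 2, 3, 4]` and then `[1, 2, 3, 9, 10]` through this sketch computes four tokens for the first request and only two for the second, and the second request's state equals `prefill(&[1, 2, 3, 9, 10], 0)`. A real engine has more to get right (eviction, and numerical parity of the attention math itself), which is exactly what the byte-for-byte tests are for.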

Tokenizer auto-detection + bridge. OxiBonsai detects the tokenizer flavor at runtime — SentencePiece versus HF tokenizers — and routes through a tokenizer bridge so the rest of the stack stays agnostic. oxibonsai-tokenizer got upgrades to support this, plus new CLI tokenizer-management commands. Pairing the detector with the existing tokenizer download flow means you can fetch and use a model’s tokenizer without telling OxiBonsai which kind it is.
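The post does not spell out how the detector decides, so the following is only a sketch of one plausible content-sniffing heuristic, not the oxibonsai-tokenizer logic. `TokenizerKind` and `detect_tokenizer` are hypothetical names. The assumptions: an HF tokenizers file is UTF-8 JSON whose first non-whitespace character is `{`, while a SentencePiece model is a binary protobuf that typically begins with byte 0x0A (field 1, length-delimited).

```rust
/// Hypothetical classification of a tokenizer file.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TokenizerKind {
    SentencePiece,
    HfTokenizersJson,
    Unknown,
}

/// Guesses the tokenizer flavor from the raw file bytes.
/// This is an illustrative heuristic, not OxiBonsai's actual detector.
pub fn detect_tokenizer(bytes: &[u8]) -> TokenizerKind {
    if bytes.is_empty() {
        return TokenizerKind::Unknown;
    }
    match std::str::from_utf8(bytes) {
        // Valid UTF-8 that opens a JSON object: treat as an HF tokenizers file.
        Ok(text) if text.trim_start().starts_with('{') => TokenizerKind::HfTokenizersJson,
        // Otherwise, assume a SentencePiece protobuf if the first byte is the
        // tag for field 1 with wire type 2 (0x0A).
        _ if bytes[0] == 0x0A => TokenizerKind::SentencePiece,
        _ => TokenizerKind::Unknown,
    }
}
```

The result of a check like this is what lets a bridge pick the right backend while the rest of the stack only ever sees token IDs.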

GPU weight cache for BonsaiModel. Both the Metal and CUDA full-forward paths now share a gpu_cache.rs module that holds device-resident weights. Weights are uploaded a single time and reused across decode steps, so per-token decode stops paying repeated host→device transfer cost. To support this cleanly, the full-forward layer was split into dedicated forward_metal.rs and forward_cuda.rs modules with the shared cache between them.
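The upload-once idea can be sketched without any GPU at all. The code below is an illustration under stated assumptions, not the contents of gpu_cache.rs: `DeviceBuffer`, `FakeDevice`, and `GpuWeightCache` are hypothetical types, and the "device" merely counts uploads so the effect of the cache is observable.

```rust
use std::collections::HashMap;

/// Hypothetical handle to a device-resident buffer.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeviceBuffer {
    pub id: u64,
    pub len: usize,
}

/// Fake device that counts host-to-device uploads instead of performing them.
#[derive(Default)]
pub struct FakeDevice {
    pub uploads: usize,
    next_id: u64,
}

impl FakeDevice {
    pub fn upload(&mut self, host: &[u8]) -> DeviceBuffer {
        self.uploads += 1;
        self.next_id += 1;
        DeviceBuffer { id: self.next_id, len: host.len() }
    }
}

/// Illustrative weight cache keyed by layer index.
#[derive(Default)]
pub struct GpuWeightCache {
    resident: HashMap<usize, DeviceBuffer>,
}

impl GpuWeightCache {
    /// Returns the device buffer for `layer`, uploading `host` only on first use.
    pub fn get_or_upload(
        &mut self,
        layer: usize,
        host: &[u8],
        device: &mut FakeDevice,
    ) -> DeviceBuffer {
        *self
            .resident
            .entry(layer)
            .or_insert_with(|| device.upload(host))
    }
}
```

Running three decode steps over a two-layer model through `get_or_upload` leaves `device.uploads` at 2 rather than 6: every step after the first gets the same handles back with no transfer.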

CUDA ternary encode kernels. oxibonsai-kernels gains CUDA ternary (TQ2) encoding kernels along the encode_ternary path, plus a full-forward CUDA fused layer, extending the native NVRTC backend’s ternary coverage. This is the same fused-full-forward design that drives OxiBonsai’s GPU throughput: a single command buffer per token rather than one submission per GEMV. On a Ternary-Bonsai-1.7B model, the fused paths reach roughly 50 tok/s on Apple Silicon Metal and about 21.9 tok/s on an RTX 3060 with CUDA.
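For readers new to ternary encoding, here is a CPU reference sketch of the general idea: each weight in {-1, 0, +1} fits in a 2-bit code, so four weights share a byte. The bit layout and code mapping below are assumptions chosen for clarity and are not claimed to match the actual TQ2_0_g128 format or the CUDA kernels, which also carry per-group scales. `pack_ternary_reference` and `unpack_ternary_reference` are hypothetical names.

```rust
/// Packs ternary weights (-1, 0, +1) into 2-bit codes, four per byte.
/// Assumed mapping for illustration: -1 -> 0, 0 -> 1, +1 -> 2.
/// Lower bit pairs of each byte hold earlier weights.
pub fn pack_ternary_reference(weights: &[i8]) -> Vec<u8> {
    let mut out = vec![0u8; (weights.len() + 3) / 4];
    for (i, &w) in weights.iter().enumerate() {
        assert!((-1..=1).contains(&w), "weight must be -1, 0, or +1");
        let code = (w + 1) as u8; // 0, 1, or 2
        out[i / 4] |= code << ((i % 4) * 2);
    }
    out
}

/// Inverse of `pack_ternary_reference` for the first `len` weights.
pub fn unpack_ternary_reference(packed: &[u8], len: usize) -> Vec<i8> {
    (0..len)
        .map(|i| ((packed[i / 4] >> ((i % 4) * 2)) & 0b11) as i8 - 1)
        .collect()
}
```

Under this layout a group of 128 weights occupies 32 bytes, that is, 2 bits per weight before any scale metadata is added.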

Plumbing and policy. This release bumps the oxifft workspace dependency to 0.3 and updates the CUDA dependency (cudarc) for cleaner CUDA 11.x / 12.x compatibility. The download_ternary.sh script now downloads shards in parallel with clearer error messages, and .gitattributes enforces LF line endings on shell scripts. Internally, oxibonsai-model::model::types (1857 lines) was refactored into a types/ directory — mod.rs, forward_cuda.rs, forward_metal.rs, gpu_cache.rs — keeping every source file under the 2000-line ceiling.

Getting Started

cargo install oxibonsai-cli           # installs the `oxibonsai` binary (Rust 1.86+)

Grab a model and its tokenizer, then start the OpenAI-compatible server:

# Fetch + convert a ternary model to GGUF (also downloads the tokenizer)
./scripts/download_ternary.sh 1.7b

# Serve it — repeated requests that share a prefix reuse the cached KV-cache
oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf --host 127.0.0.1 --port 8080

Send two chat completions that share the same system prompt and watch the second one skip the shared prefill:

curl -s http://127.0.0.1:8080/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "messages": [
    {"role": "system", "content": "You are a terse assistant for the COOLJAPAN ecosystem."},
    {"role": "user", "content": "Explain ternary quantization in one sentence."}
  ]
}'
# Re-send with the SAME system prompt and a new user turn — the shared prefix is reused.

No tokenizer flag is required: OxiBonsai auto-detects SentencePiece vs. HF tokenizers at load time.

This is the foundation

OxiBonsai is built entirely on the COOLJAPAN ecosystem — SciRS2 for tensor and activation primitives, OxiBLAS for GEMM/GEMV math, OxiFFT (now 0.3) for optional RoPE acceleration, OxiARC for Pure Rust compression, and OxiONNX for ingesting ONNX-quantized weights — to serve PrismML’s Bonsai 1-bit and ternary models. Every default-feature dependency is Pure Rust: zero C/C++/Fortran, zero FFI. The GPU backends (metal, native-cuda) are opt-in features that bring in vendor drivers, while the default path stays sovereign top to bottom.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you want serving infrastructure that gets faster by skipping redundant work — without ever changing the answer.

Pure Rust sub-2-bit inference is here: fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ May 3, 2026
