OxiBonsai 0.2.0 Released — Concurrent /serve, Byte-Identical CPU↔Metal, and Reproducible Images

One engine could answer one request at a time. Today it answers many — and proves, at startup, that the GPU gives exactly the same bytes as the CPU.

Today we released OxiBonsai 0.2.0 — the opener of the 0.2 series, where /serve learns to scale across concurrent requests and correctness gets locked down end to end.

No llama.cpp. No BLAS. No C, no C++, no Fortran runtime. OxiBonsai is the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family — the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128) — running on CPU (AVX2/AVX-512/NEON/WASM SIMD), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC). Everything sits on the COOLJAPAN ecosystem: SciRS2, OxiBLAS, OxiFFT, OxiARC, OxiONNX.

Why OxiBonsai 0.2.0 matters

The 0.1 series proved the model could run anywhere. The 0.2 series is about making it run under load, without lying about it.

Until now, /serve was backed by a single InferenceEngine. One request held the engine; the next waited. The naive fix, spinning up N engines, was a trap: each replica would clone the model’s token-embedding table, roughly 1.16 GB apiece, so four replicas meant about 4.64 GB of embedding tables (4 × 1.16 GB), roughly 3.5 GB of it pure duplication serving identical lookups. And there was a second, quieter gap: nothing at startup actually proved that the Metal path produced the same answer as the CPU path. “Probably the same” is not a guarantee you want under your inference SLA.

0.2.0 closes both. The serving path is now pooled and memory-frugal, and every startup carries a hard parity check between backends. The imaging side builds on the previous release: 0.1.5 (2026-06-02) shipped the oxibonsai-image crate, the first pure-Rust FLUX.2-Klein 4B text-to-image pipeline (DiT + VAE + Text Encoder), with Metal flash-attention and the oxibonsai image CLI. 0.2.0 takes that pipeline to NVIDIA and makes its output reproducible.

Technical Deep Dive

EnginePool — concurrency without duplication. The new oxibonsai-runtime module engine_pool.rs introduces EnginePool, a Vec<InferenceEngine> fronted by an Arc<Semaphore> whose permits gate how many requests run at once. The CPU tier spins up N replicas that all share a single Arc<[f32]> token-embedding table — so instead of paying ~1.16 GB per replica, the whole pool pays it once. The GPU tier is deliberately clamped to a single replica, because the Metal backend is a process-global singleton and cannot be safely multiplied. A new build_pool_from_gguf API constructs the pool straight from a GGUF file, and the default CPU pool size is min(4, num_cpus).
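The pooling idea can be sketched with the standard library alone. This is a minimal sketch, not the code in engine_pool.rs: the blocking Semaphore, the with_engine helper, the hidden field, and the Mutex around the idle list are assumptions made for illustration, and the real InferenceEngine holds far more than an embedding table.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Blocking counting semaphore (assumption: a std-only stand-in for whatever
/// semaphore type the real pool uses).
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }
    pub fn acquire(&self) {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.available.wait(free).unwrap();
        }
        *free -= 1;
    }
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Hypothetical stand-in engine: it only carries the shared table.
pub struct InferenceEngine {
    pub embeddings: Arc<[f32]>,
    pub hidden: usize,
}

impl InferenceEngine {
    /// Row lookup into the shared token-embedding table.
    pub fn embed(&self, token: usize) -> &[f32] {
        let start = token * self.hidden;
        &self.embeddings[start..start + self.hidden]
    }
}

pub struct EnginePool {
    idle: Mutex<Vec<InferenceEngine>>,
    permits: Arc<Semaphore>,
}

impl EnginePool {
    pub fn new(table: Vec<f32>, hidden: usize, replicas: usize) -> Self {
        // One allocation; every replica gets a reference-counted handle to it.
        let shared: Arc<[f32]> = Arc::from(table);
        let engines: Vec<InferenceEngine> = (0..replicas)
            .map(|_| InferenceEngine { embeddings: Arc::clone(&shared), hidden })
            .collect();
        EnginePool { idle: Mutex::new(engines), permits: Arc::new(Semaphore::new(replicas)) }
    }

    /// Wait for a permit, borrow an idle engine, run `f`, then hand both back.
    /// (A production version would release the permit on panic as well.)
    pub fn with_engine<R>(&self, f: impl FnOnce(&InferenceEngine) -> R) -> R {
        self.permits.acquire();
        let engine = self.idle.lock().unwrap().pop().expect("a permit implies an idle engine");
        let out = f(&engine);
        self.idle.lock().unwrap().push(engine);
        self.permits.release();
        out
    }
}

/// The documented default: min(4, num_cpus).
pub fn default_cpu_pool_size() -> usize {
    let cpus = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    cpus.min(4)
}
```

Cloning an Arc<[f32]> copies a pointer and bumps a counter, so growing the pool adds engines without adding another copy of the table.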

CPU↔Metal byte-identical parity guard. At startup, InferenceEngine now runs the same greedy decode on both the CPU and Metal backends and asserts the outputs are byte-identical — not “close,” not “within tolerance,” identical. We validated this on a real 1.7B Ternary-Bonsai model. The payoff is conceptual as much as practical: the GPU path is no longer a separate code path you have to trust; it is provably the CPU path, accelerated.
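In code, a guard of this kind comes down to decoding greedily on two backends and comparing raw bytes rather than applying a tolerance. The sketch below assumes a hypothetical Backend trait and a ToyBackend used only to exercise it; neither is the engine's real interface.

```rust
/// Hypothetical backend abstraction for illustration only.
pub trait Backend {
    /// Next-token logits given the tokens so far.
    fn logits(&self, tokens: &[u32]) -> Vec<f32>;
}

/// Greedy decode: always take the arg-max logit; the lowest index wins ties.
pub fn greedy_decode(backend: &dyn Backend, prompt: &[u32], steps: usize) -> Vec<u32> {
    let mut tokens = prompt.to_vec();
    for _ in 0..steps {
        let logits = backend.logits(&tokens);
        if logits.is_empty() {
            break;
        }
        let mut best = 0;
        for (i, &v) in logits.iter().enumerate() {
            if v > logits[best] {
                best = i;
            }
        }
        tokens.push(best as u32);
    }
    tokens
}

fn to_bytes(tokens: &[u32]) -> Vec<u8> {
    tokens.iter().flat_map(|t| t.to_le_bytes()).collect()
}

/// Byte-for-byte comparison of two greedy decodes; no tolerance involved.
pub fn check_parity(
    reference: &dyn Backend,
    candidate: &dyn Backend,
    prompt: &[u32],
    steps: usize,
) -> Result<(), String> {
    let expected = to_bytes(&greedy_decode(reference, prompt, steps));
    let actual = to_bytes(&greedy_decode(candidate, prompt, steps));
    if expected == actual {
        return Ok(());
    }
    let at = expected
        .iter()
        .zip(&actual)
        .position(|(a, b)| a != b)
        .unwrap_or(expected.len().min(actual.len()));
    Err(format!("backend outputs diverge at byte {at}"))
}

/// Toy deterministic backend; `skew` lets us fake a misbehaving one.
pub struct ToyBackend {
    pub skew: f32,
}

impl Backend for ToyBackend {
    fn logits(&self, tokens: &[u32]) -> Vec<f32> {
        let last = tokens.last().copied().unwrap_or(0);
        let base = (tokens.len() as u32 + last) * 7;
        (0..8u32)
            .map(|i| {
                let v = ((base + i * 5) % 11) as f32;
                if i == 0 { v + self.skew } else { v }
            })
            .collect()
    }
}
```

Two ToyBackend values with the same skew pass the check; giving one a large skew forces a different first token and the check returns an error, which is the behavior a startup guard would turn into a hard failure.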

CUDA imagen backend — a parity-first port. oxibonsai-kernels and oxibonsai-image gain a native CUDA GPU backend for image generation on Linux and Windows, authored as a blind mirror of the Metal path: plain FP32, cap-of-8-safe, purely additive — the Metal path’s bytes are unchanged. It lands in three acceleration tiers: (a) a gemm_tq2 retile from 32×32 to 128×128 transposed-shared tiles, ≈6× on that kernel; (b) warp-cooperative flash-attention splitting head_dim across lanes with __shfl_xor_sync, ≈6.3× on that kernel; and (c) a Stage-0 context_embedder GPU port (encode_gemm_f32) at 59× on that operation. End to end at steps=4 on an A4000-class GPU, this moves ~101s down to ~31.7s — about 3.2×. Honest caveat: compile and the cos≥0.999 parity validation are deferred to real Linux/CUDA hardware. This is a parity-first port, not a hardware-validated one yet.
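The warp-cooperative reduction in tier (b) is easier to see outside CUDA. The sketch below simulates one 32-lane warp on the CPU in Rust: each lane accumulates a slice of head_dim, then an XOR butterfly (the data movement that __shfl_xor_sync provides on the GPU) leaves the full sum in every lane. The strided lane assignment is an assumption; the post does not say how the real kernel partitions head_dim.

```rust
pub const WARP: usize = 32;

/// XOR-butterfly all-reduce over one simulated warp. At each step, lane `l`
/// adds the value held by lane `l ^ offset`, as a shuffle-xor would deliver it.
pub fn warp_allreduce_sum(mut lanes: [f32; WARP]) -> [f32; WARP] {
    let mut offset = WARP / 2;
    while offset > 0 {
        // Snapshot first: on hardware every lane reads its partner's old value.
        let prev = lanes;
        for lane in 0..WARP {
            lanes[lane] = prev[lane] + prev[lane ^ offset];
        }
        offset /= 2;
    }
    lanes
}

/// One q.k attention score with head_dim split across lanes.
/// Assumed layout: lane `l` owns elements l, l + 32, l + 64, ...
pub fn warp_dot(q: &[f32], k: &[f32]) -> f32 {
    assert_eq!(q.len(), k.len());
    let mut partial = [0.0f32; WARP];
    for (i, (a, b)) in q.iter().zip(k).enumerate() {
        partial[i % WARP] += a * b;
    }
    // After the butterfly every lane holds the same total; read lane 0.
    warp_allreduce_sum(partial)[0]
}
```

The butterfly takes five steps for 32 lanes (offsets 16, 8, 4, 2, 1) and needs no shared memory, which is the usual reason to prefer it for per-row reductions.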

Text Encoder, 4-bit native. oxibonsai-image adds open_mlx_4bit, which loads a native 2.1 GB MLX 4-bit safetensors file directly — replacing the previous 15 GB f32 .npy dump and pulling the real Bonsai-Image footprint down to ≈3.5 GB. A parity gate, te_parity, holds it to cos≥0.999999 against the MLX oracle. Flip it on with the OXI_TE_4BIT environment variable.
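A cosine gate like te_parity can be sketched in a few lines. The function names and the choice to accumulate in f64 are assumptions for illustration, not the crate's actual test code; accumulating in f64 matters here because a 0.999999 threshold sits close to the limit of f32 precision.

```rust
/// Cosine similarity between two activation vectors, accumulated in f64.
/// Returns None for mismatched lengths, empty input, or a zero-norm vector.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// True when the candidate output stays within `threshold` of the oracle.
pub fn passes_parity(candidate: &[f32], oracle: &[f32], threshold: f64) -> bool {
    matches!(cosine_similarity(candidate, oracle), Some(c) if c >= threshold)
}
```

The two thresholds quoted in this post differ by three orders of magnitude in allowed error: a vector pair such as [1.0, 0.0] and [1.0, 0.01] clears cos≥0.999 but fails cos≥0.999999.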

Reproducible images via Threefry. oxibonsai image now takes --seed, backed by an MLX-exact Threefry-2×32 RNG port. --seed 42 is byte-exact against the official mflux reference, so the same prompt and seed give you the same PNG, every run. The relevant env vars are OXI_DIT_GGUF, OXI_VAE_WEIGHTS, OXI_TE_4BIT, and OXI_TE_TOKENIZER_DIR.
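Threefry-2×32 itself is a small counter-based generator from the Random123 family, which is why a seed can pin the output exactly. The sketch below implements the standard 20-round block function. The seed-to-key split and the plain incrementing counter in random_words are assumptions for illustration and are not claimed to match MLX's stream layout, so this snippet alone will not reproduce mflux output.

```rust
/// Rotation schedule for Threefry-2x32, alternating every four rounds.
const ROTATIONS: [[u32; 4]; 2] = [[13, 15, 26, 6], [17, 29, 16, 24]];
/// Key-schedule parity constant for the 32-bit variant.
const KS_PARITY: u32 = 0x1BD1_1BDA;

/// Threefry-2x32 with 20 rounds: a keyed bijection on a 64-bit counter.
pub fn threefry2x32(key: [u32; 2], counter: [u32; 2]) -> [u32; 2] {
    let ks = [key[0], key[1], key[0] ^ key[1] ^ KS_PARITY];
    let mut x0 = counter[0].wrapping_add(ks[0]);
    let mut x1 = counter[1].wrapping_add(ks[1]);
    for block in 0..5usize {
        for &r in &ROTATIONS[block % 2] {
            x0 = x0.wrapping_add(x1);
            x1 = x1.rotate_left(r) ^ x0;
        }
        // Key injection after every four rounds.
        x0 = x0.wrapping_add(ks[(block + 1) % 3]);
        x1 = x1
            .wrapping_add(ks[(block + 2) % 3])
            .wrapping_add(block as u32 + 1);
    }
    [x0, x1]
}

/// Deterministic stream of `n` words from a 64-bit seed.
/// Assumption: key = (high 32 bits, low 32 bits), counter = (0, i).
pub fn random_words(seed: u64, n: usize) -> Vec<u32> {
    let key = [(seed >> 32) as u32, seed as u32];
    let mut out = Vec::with_capacity(n);
    let mut i = 0u32;
    while out.len() < n {
        let [a, b] = threefry2x32(key, [0, i]);
        out.push(a);
        if out.len() < n {
            out.push(b);
        }
        i = i.wrapping_add(1);
    }
    out
}
```

Because the generator is a pure function of key and counter, there is no hidden state to drift between runs or platforms; byte-exactness against a reference then depends only on using the same key derivation and counter layout as that reference.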

Getting Started

Install the CLI:

cargo install oxibonsai-cli

Serve a model — now backed by the pooled engine, OpenAI-compatible REST:

oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf

Generate a reproducible image:

oxibonsai image --prompt "a tiny bonsai tree in a ceramic pot" --seed 42 --out bonsai.png

You can still drive single completions with oxibonsai run --model … --prompt "…" and hold a session with oxibonsai chat --model …, but the spotlight in 0.2.0 is the pooled serve and reproducible image --seed.

What’s New in 0.2.0

Concurrent engine pool for /serve. EnginePool over Arc<Semaphore> permits; CPU replicas share one Arc<[f32]> embedding table; GPU clamped to one replica; build_pool_from_gguf; default pool size min(4, num_cpus).
CPU↔Metal parity guard at startup. Byte-identical greedy output across backends, validated on a real 1.7B Ternary-Bonsai model.
CUDA imagen backend. Parity-first FP32 mirror of the Metal path across three tiers (gemm_tq2 ≈6×, flash-attention ≈6.3×, context_embedder 59×); ~101s → ~31.7s (3.2×) at steps=4 on A4000-class hardware. Validation pending Linux/CUDA.
4-bit Text Encoder. open_mlx_4bit loads native 2.1 GB MLX safetensors (down from 15 GB f32), ≈3.5 GB total footprint, te_parity cos≥0.999999, via OXI_TE_4BIT.
--seed reproducibility. MLX-exact Threefry-2×32 RNG; --seed 42 byte-exact vs mflux.
Stable toolchain for oxibonsai-kernels. No more nightly requirement — a build.rs nightly-detect plus an aarch64_prefetch! macro (no-op on stable, active prefetch on nightly) builds cleanly on stable 1.86+ and nightly, with zero new warnings.
Streaming temperature fix. Temperature was being silently dropped from SamplingParams on the streaming completions path (completions.rs); it is now threaded through to sampling correctly.
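The stable-toolchain item above describes a prefetch macro that vanishes on stable. One way such a macro could be structured is sketched below. Everything beyond the macro name is an assumption: the oxi_nightly cfg name (imagined as emitted by build.rs via a cargo:rustc-cfg line), the prefetch_read helper, and the choice of intrinsic are illustrative, and the nightly branch would additionally need the relevant unstable feature gate enabled at the crate root.

```rust
// Nightly + aarch64: issue a real prefetch hint.
// Assumption: build.rs sets `oxi_nightly` when it detects a nightly compiler.
#[cfg(all(oxi_nightly, target_arch = "aarch64"))]
#[inline(always)]
pub fn prefetch_read(p: *const u8) {
    use core::arch::aarch64::{_prefetch, _PREFETCH_LOCALITY3, _PREFETCH_READ};
    // A prefetch is only a hint; it does not dereference the pointer.
    unsafe { _prefetch(p as *const i8, _PREFETCH_READ, _PREFETCH_LOCALITY3) }
}

// Everywhere else (including stable): compile to nothing.
#[cfg(not(all(oxi_nightly, target_arch = "aarch64")))]
#[inline(always)]
pub fn prefetch_read(_p: *const u8) {}

/// Call sites stay identical on both toolchains.
macro_rules! aarch64_prefetch {
    ($ptr:expr) => {
        prefetch_read($ptr as *const u8)
    };
}

/// Example hot loop: hint the line a few elements ahead, then do the work.
pub fn sum_with_prefetch(data: &[f32]) -> f32 {
    const AHEAD: usize = 16; // illustrative distance, not a tuned value
    let mut acc = 0.0;
    for (i, &v) in data.iter().enumerate() {
        if i + AHEAD < data.len() {
            aarch64_prefetch!(data.as_ptr().wrapping_add(i + AHEAD));
        }
        acc += v;
    }
    acc
}
```

Gating at the item level keeps the unstable intrinsic out of the stable build entirely, so the same source compiles on both toolchains and only the generated code differs.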

Tips

Tune the CPU pool to your box. The pool defaults to min(4, num_cpus) replicas. On a many-core server you can lift concurrency; on a small machine, fewer replicas keep memory pressure low.
Lean on the shared embedding table. The whole point of EnginePool is that N CPU replicas share one Arc<[f32]> table instead of N copies of ~1.16 GB. More replicas cost permits and activations, not another full embedding table.
Use --seed for repeatable images. --seed 42 is byte-exact against mflux; pin a seed when you need the exact same PNG across runs, machines, or CI.
Set OXI_TE_4BIT to shrink the encoder. It loads the native 2.1 GB MLX 4-bit text encoder directly (≈3.5 GB total footprint) instead of the 15 GB f32 path, while staying within te_parity cos≥0.999999.
Build on stable now. oxibonsai-kernels no longer needs nightly: stable 1.86+ works out of the box. The aarch64_prefetch! macro compiles to a no-op there, so switch to nightly only if you want active prefetch.
NVIDIA users: try the CUDA imagen backend. On Linux or Windows, the new CUDA path accelerates image generation as a parity-first FP32 mirror of Metal (hardware parity validation pending).

This is the foundation

OxiBonsai is the inference end of the COOLJAPAN ecosystem — sub-2-bit Bonsai models from PrismML, served and rendered on top of SciRS2, OxiBLAS, OxiFFT, OxiARC, and OxiONNX, with no FFI and no C/C++/Fortran runtime underneath any of it. 0.2.0 turns that foundation into something you can put concurrent traffic on and still reason about, byte for byte.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you believe inference should be concurrent, reproducible, and sovereign — without borrowing a single line of C.

Pure Rust sub-2-bit inference that scales and proves itself is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ June 3, 2026