One engine could answer one request at a time. Today it answers many — and proves, at startup, that the GPU gives exactly the same bytes as the CPU.
Today we released OxiBonsai 0.2.0 — the opener of the 0.2 series, where /serve learns to scale across concurrent requests and correctness gets locked down end to end.
No llama.cpp. No BLAS. No C, no C++, no Fortran runtime. OxiBonsai is the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family — the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128) — running on CPU (AVX2/AVX-512/NEON/WASM SIMD), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC). Everything sits on the COOLJAPAN ecosystem: SciRS2, OxiBLAS, OxiFFT, OxiARC, OxiONNX.
Why OxiBonsai 0.2.0 matters
The 0.1 series proved the model could run anywhere. The 0.2 series is about making it run under load, without lying about it.
Until now, /serve was backed by a single InferenceEngine. One request held the engine; the next waited. The naive fix, spinning up N engines, was a trap: each replica would clone the model’s token-embedding table, roughly 1.16 GB apiece, so four replicas meant about 4.6 GB of embedding tables (4 × 1.16 GB ≈ 4.64 GB), three of them redundant copies doing identical lookups. And there was a second, quieter gap: nothing at startup actually proved that the Metal path produced the same answer as the CPU path. “Probably the same” is not a guarantee you want under your inference SLA.
0.2.0 closes both. The serving path is now pooled and memory-frugal, and every startup carries a hard parity check between backends. As a bonus from the imaging side, the predecessor release earns its keep: 0.1.5 (2026-06-02) shipped the oxibonsai-image crate — the first pure-Rust FLUX.2-Klein 4B text-to-image pipeline (DiT + VAE + Text Encoder) with Metal flash-attention and the oxibonsai image CLI. 0.2.0 takes that pipeline to NVIDIA and makes its output reproducible.
Technical Deep Dive
EnginePool — concurrency without duplication. The new oxibonsai-runtime module engine_pool.rs introduces EnginePool, a Vec<InferenceEngine> fronted by an Arc<Semaphore> whose permits gate how many requests run at once. The CPU tier spins up N replicas that all share a single Arc<[f32]> token-embedding table — so instead of paying ~1.16 GB per replica, the whole pool pays it once. The GPU tier is deliberately clamped to a single replica, because the Metal backend is a process-global singleton and cannot be safely multiplied. A new build_pool_from_gguf API constructs the pool straight from a GGUF file, and the default CPU pool size is min(4, num_cpus).
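To make the sharing concrete, here is a minimal sketch of the pattern described above, using only the standard library. It is not the engine_pool.rs source: Engine, Pool, build_pool, and the hand-rolled Semaphore are hypothetical stand-ins (the post does not say which crate provides the real Semaphore), and the real InferenceEngine holds far more than an embedding handle.

```rust
use std::sync::{Arc, Condvar, Mutex};

/// Hypothetical stand-in for InferenceEngine: it only keeps a handle to the
/// shared token-embedding table.
pub struct Engine {
    pub embeddings: Arc<[f32]>,
}

/// Minimal counting semaphore built on std (an assumption for illustration).
pub struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    pub fn new(permits: usize) -> Self {
        Semaphore { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Block until a permit is free, then take it.
    pub fn acquire(&self) {
        let mut free = self.permits.lock().unwrap();
        while *free == 0 {
            free = self.available.wait(free).unwrap();
        }
        *free -= 1;
    }

    /// Return a permit and wake one waiting request.
    pub fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Hypothetical pool: N replicas gated by N permits.
pub struct Pool {
    pub engines: Vec<Engine>,
    pub gate: Arc<Semaphore>,
}

pub fn build_pool(table: Vec<f32>, requested: usize, gpu: bool) -> Pool {
    // The GPU tier is clamped to one replica; the CPU tier gets `requested`.
    let replicas = if gpu { 1 } else { requested.max(1) };
    // One allocation for the table; every replica clones the Arc, not the data.
    let shared: Arc<[f32]> = Arc::from(table);
    let engines = (0..replicas)
        .map(|_| Engine { embeddings: Arc::clone(&shared) })
        .collect();
    Pool { engines, gate: Arc::new(Semaphore::new(replicas)) }
}
```

In this sketch, Arc::ptr_eq between any two replicas' tables returns true, and Arc::strong_count equals the replica count, which is the property that keeps the table cost flat as the pool grows.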
CPU↔Metal byte-identical parity guard. At startup, InferenceEngine now runs the same greedy decode on both the CPU and Metal backends and asserts the outputs are byte-identical — not “close,” not “within tolerance,” identical. We validated this on a real 1.7B Ternary-Bonsai model. The payoff is conceptual as much as practical: the GPU path is no longer a separate code path you have to trust; it is provably the CPU path, accelerated.
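The shape of such a guard can be sketched in a few lines. This is an illustration under stated assumptions, not the InferenceEngine code: check_parity is a hypothetical name, and the two backends are modeled as plain closures that map a prompt and a token budget to generated token IDs.

```rust
/// Run the same decode on two backends and require identical token output.
/// `cpu` and `gpu` are hypothetical stand-ins for the real backend calls.
pub fn check_parity<A, B>(
    prompt: &[u32],
    max_new_tokens: usize,
    cpu: A,
    gpu: B,
) -> Result<(), String>
where
    A: Fn(&[u32], usize) -> Vec<u32>,
    B: Fn(&[u32], usize) -> Vec<u32>,
{
    let reference = cpu(prompt, max_new_tokens);
    let candidate = gpu(prompt, max_new_tokens);

    // Exact equality on token IDs: no tolerance, no "close enough".
    if reference == candidate {
        return Ok(());
    }

    // Report the first position where the two sequences differ.
    let first_diff = reference
        .iter()
        .zip(&candidate)
        .position(|(a, b)| a != b)
        .unwrap_or_else(|| reference.len().min(candidate.len()));
    Err(format!("backend outputs diverge at token index {first_diff}"))
}
```

Greedy decoding matters here because it is deterministic: with sampling disabled, any difference between the two sequences points at the backends rather than at the RNG.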
CUDA imagen backend — a parity-first port. oxibonsai-kernels and oxibonsai-image gain a native CUDA GPU backend for image generation on Linux and Windows, authored as a blind mirror of the Metal path: plain FP32, cap-of-8-safe, purely additive — the Metal path’s bytes are unchanged. It lands in three acceleration tiers: (a) a gemm_tq2 retile from 32×32 to 128×128 transposed-shared tiles, ≈6× on that kernel; (b) warp-cooperative flash-attention splitting head_dim across lanes with __shfl_xor_sync, ≈6.3× on that kernel; and (c) a Stage-0 context_embedder GPU port (encode_gemm_f32) at 59× on that operation. End to end at steps=4 on an A4000-class GPU, this moves ~101s down to ~31.7s — about 3.2×. Honest caveat: compile and the cos≥0.999 parity validation are deferred to real Linux/CUDA hardware. This is a parity-first port, not a hardware-validated one yet.
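For readers unfamiliar with the shuffle pattern in tier (b), the following is a CPU-side Rust simulation of a 32-lane XOR butterfly reduction, the exchange that __shfl_xor_sync performs in hardware. It is not the OxiBonsai kernel: warp_xor_sum and warp_dot are hypothetical names, and the strided assignment of head_dim indices to lanes is an assumption made for illustration.

```rust
const WARP: usize = 32;

/// Simulate a warp-wide butterfly all-reduce. At each step, lane `i` adds the
/// value held by lane `i ^ offset`; after log2(32) = 5 steps every lane holds
/// the full sum.
pub fn warp_xor_sum(mut lanes: [f32; WARP]) -> [f32; WARP] {
    let mut offset = WARP / 2;
    while offset > 0 {
        // Snapshot first: on a GPU all lanes exchange values simultaneously.
        let prev = lanes;
        for i in 0..WARP {
            lanes[i] = prev[i] + prev[i ^ offset];
        }
        offset /= 2;
    }
    lanes
}

/// Dot product with the head dimension split across lanes: lane `i`
/// accumulates indices i, i + 32, i + 64, ... and the butterfly combines
/// the per-lane partial sums.
pub fn warp_dot(q: &[f32], k: &[f32]) -> f32 {
    assert_eq!(q.len(), k.len());
    let mut partial = [0.0f32; WARP];
    for (idx, (a, b)) in q.iter().zip(k).enumerate() {
        partial[idx % WARP] += a * b;
    }
    warp_xor_sum(partial)[0]
}
```

The appeal of the butterfly is that the reduction stays in registers: no shared-memory round trip is needed to combine the per-lane partial dot products.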
Text Encoder, 4-bit native. oxibonsai-image adds open_mlx_4bit, which loads a native 2.1 GB MLX 4-bit safetensors file directly — replacing the previous 15 GB f32 .npy dump and pulling the real Bonsai-Image footprint down to ≈3.5 GB. A parity gate, te_parity, holds it to cos≥0.999999 against the MLX oracle. Flip it on with the OXI_TE_4BIT environment variable.
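A cosine gate of this kind is simple to express. The sketch below is an assumption about its general shape, not the te_parity source: cosine and passes_gate are hypothetical names, and accumulating in f64 is a choice made here so the comparison against a threshold as tight as 0.999999 is not dominated by f32 rounding.

```rust
/// Cosine similarity between two embedding vectors, accumulated in f64.
pub fn cosine(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    // A zero vector yields NaN here, which fails any `>=` comparison below.
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// True when the candidate output is within the cosine threshold of the oracle.
pub fn passes_gate(candidate: &[f32], oracle: &[f32], threshold: f64) -> bool {
    cosine(candidate, oracle) >= threshold
}
```

A call such as passes_gate(&ours, &oracle, 0.999999) would mirror the threshold quoted above; the looser cos≥0.999 figure for the CUDA port would use the same function with a different constant.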
Reproducible images via Threefry. oxibonsai image now takes --seed, backed by an MLX-exact Threefry-2×32 RNG port. --seed 42 is byte-exact against the official mflux reference, so the same prompt and seed give you the same PNG, every run. The relevant env vars are OXI_DIT_GGUF, OXI_VAE_WEIGHTS, OXI_TE_4BIT, and OXI_TE_TOKENIZER_DIR.
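For context, the core of Threefry-2×32 is a small counter-based block function. The sketch below follows the published 20-round formulation from the Random123 family; it is not the OxiBonsai port, the function name threefry2x32 is chosen here for illustration, and how MLX derives the key from --seed or maps counters to latent noise is not shown and not assumed.

```rust
/// Rotation constants for the 2x32 variant, applied in alternating groups of four.
const ROT_A: [u32; 4] = [13, 15, 26, 6];
const ROT_B: [u32; 4] = [17, 29, 16, 24];

/// Threefry-2x32 with 20 rounds: maps a (key, counter) pair to two
/// pseudo-random 32-bit words. The same inputs always give the same output.
pub fn threefry2x32(key: [u32; 2], counter: [u32; 2]) -> [u32; 2] {
    // Extended key schedule: the third word folds in a fixed parity constant.
    let ks = [key[0], key[1], key[0] ^ key[1] ^ 0x1BD1_1BDA];

    let mut x0 = counter[0].wrapping_add(ks[0]);
    let mut x1 = counter[1].wrapping_add(ks[1]);

    // Five groups of four rounds, with a key injection after each group.
    for group in 0..5usize {
        let rotations = if group % 2 == 0 { ROT_A } else { ROT_B };
        for r in rotations {
            // One round: add, rotate, xor.
            x0 = x0.wrapping_add(x1);
            x1 = x1.rotate_left(r) ^ x0;
        }
        let step = group + 1;
        x0 = x0.wrapping_add(ks[step % 3]);
        x1 = x1
            .wrapping_add(ks[(step + 1) % 3])
            .wrapping_add(step as u32);
    }
    [x0, x1]
}
```

Because the generator is a pure function of key and counter, there is no hidden state to drift between runs or machines, which is what makes byte-exact reproduction against a reference implementation feasible in the first place.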
Getting Started
Install the CLI:
cargo install oxibonsai-cli
Serve a model — now backed by the pooled engine, OpenAI-compatible REST:
oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf
Generate a reproducible image:
oxibonsai image --prompt "a tiny bonsai tree in a ceramic pot" --seed 42 --out bonsai.png
You can still drive single completions with oxibonsai run --model … --prompt "…" and hold a session with oxibonsai chat --model …, but the spotlight in 0.2.0 is the pooled serve and reproducible image --seed.
What’s New in 0.2.0
- Concurrent engine pool for /serve. EnginePool over Arc<Semaphore> permits; CPU replicas share one Arc<[f32]> embedding table; GPU clamped to one replica; build_pool_from_gguf; default pool size min(4, num_cpus).
- CPU↔Metal parity guard at startup. Byte-identical greedy output across backends, validated on a real 1.7B Ternary-Bonsai model.
- CUDA imagen backend. Parity-first FP32 mirror of the Metal path across three tiers (gemm_tq2 ≈6×, flash-attention ≈6.3×, context_embedder 59×); ~101s → ~31.7s (3.2×) at steps=4 on A4000-class hardware. Validation pending Linux/CUDA.
- 4-bit Text Encoder. open_mlx_4bit loads native 2.1 GB MLX safetensors (down from 15 GB f32), ≈3.5 GB total footprint, te_parity cos≥0.999999, via OXI_TE_4BIT.
- --seed reproducibility. MLX-exact Threefry-2×32 RNG; --seed 42 byte-exact vs mflux.
- Stable toolchain for oxibonsai-kernels. No more nightly requirement: a build.rs nightly-detect plus an aarch64_prefetch! macro (no-op on stable, active prefetch on nightly) builds cleanly on stable 1.86+ and nightly, with zero new warnings.
- Streaming temperature fix. Temperature was being silently dropped from SamplingParams on the streaming completions path (completions.rs); it is now threaded through to sampling correctly.
Tips
- Tune the CPU pool to your box. The pool defaults to min(4, num_cpus) replicas. On a many-core server you can lift concurrency; on a small machine, fewer replicas keep memory pressure low.
- Lean on the shared embedding table. The whole point of EnginePool is that N CPU replicas share one Arc<[f32]> table instead of N copies of ~1.16 GB. More replicas cost permits and activations, not another full embedding table.
- Use --seed for repeatable images. --seed 42 is byte-exact against mflux; pin a seed when you need the exact same PNG across runs, machines, or CI.
- Set OXI_TE_4BIT to shrink the encoder. It loads the native 2.1 GB MLX 4-bit text encoder directly (≈3.5 GB total footprint) instead of the 15 GB f32 path, while staying within te_parity cos≥0.999999.
- Build on stable now. oxibonsai-kernels no longer needs nightly; stable 1.86+ works out of the box, prefetch and all.
- NVIDIA users: try the CUDA imagen backend. On Linux or Windows, the new CUDA path accelerates image generation as a parity-first FP32 mirror of Metal (hardware parity validation pending).
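To reason about the first two tips, it can help to see the sizing rule and the memory arithmetic side by side. The sketch below is an assumption-laden illustration: default_pool_size_for, default_pool_size, and embedding_table_gb are hypothetical helpers, std::thread::available_parallelism stands in for whatever core-count source the crate actually uses, and 1.16 GB is the approximate table size quoted in this post.

```rust
use std::thread::available_parallelism;

/// Approximate size of one token-embedding table, as quoted in the post.
const TABLE_GB: f64 = 1.16;

/// The documented default, min(4, num_cpus), with a floor of one replica.
pub fn default_pool_size_for(cores: usize) -> usize {
    cores.max(1).min(4)
}

/// Same rule, using the core count reported by the standard library.
pub fn default_pool_size() -> usize {
    let cores = available_parallelism().map(|n| n.get()).unwrap_or(1);
    default_pool_size_for(cores)
}

/// Embedding-table memory for a pool: one table when shared through an Arc,
/// one table per replica when each engine clones its own copy.
pub fn embedding_table_gb(replicas: usize, shared: bool) -> f64 {
    if shared {
        TABLE_GB
    } else {
        TABLE_GB * replicas as f64
    }
}
```

With four replicas, the unshared layout comes to roughly 4.64 GB of embedding tables while the shared layout stays at about 1.16 GB, so raising the replica count mainly trades CPU time and activation memory for concurrency.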
This is the foundation
OxiBonsai is the inference end of the COOLJAPAN ecosystem — sub-2-bit Bonsai models from PrismML, served and rendered on top of SciRS2, OxiBLAS, OxiFFT, OxiARC, and OxiONNX, with no FFI and no C/C++/Fortran runtime underneath any of it. 0.2.0 turns that foundation into something you can put concurrent traffic on and still reason about, byte for byte.
Repository: https://github.com/cool-japan/oxibonsai
Star the repo if you believe inference should be concurrent, reproducible, and sovereign — without borrowing a single line of C.
Pure Rust sub-2-bit inference that scales and proves itself is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ June 3, 2026