COOLJAPAN

OxiBonsai 0.2.0 Released — Concurrent /serve, Byte-Identical CPU↔Metal, and Reproducible Images

OxiBonsai 0.2.0 opens the 0.2 series: a concurrent engine pool that shares one 1.16 GB embedding table across replicas, a CPU↔Metal byte-identical parity guard, a parity-first CUDA imagen backend (about 3.2× faster end to end, from ~101s to ~31.7s on an A4000-class GPU), --seed byte-exact reproducible images, and a stable-toolchain build. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem.

release oxibonsai llm inference pure-rust quantization serving concurrency reproducibility cuda metal

One engine could answer one request at a time. Today it answers many — and proves, at startup, that the GPU gives exactly the same bytes as the CPU.

Today we released OxiBonsai 0.2.0 — the opener of the 0.2 series, where /serve learns to scale across concurrent requests and correctness gets locked down end to end.

No llama.cpp. No BLAS. No C, no C++, no Fortran runtime. OxiBonsai is the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family — the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128) — running on CPU (AVX2/AVX-512/NEON/WASM SIMD), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC). Everything sits on the COOLJAPAN ecosystem: SciRS2, OxiBLAS, OxiFFT, OxiARC, OxiONNX.

Why OxiBonsai 0.2.0 matters

The 0.1 series proved the model could run anywhere. The 0.2 series is about making it run under load, without lying about it.

Until now, /serve was backed by a single InferenceEngine. One request held the engine; the next waited. The naive fix, spinning up N engines, was a trap: each replica would clone the model’s token-embedding table, roughly 1.16 GB apiece, so four replicas meant about 4.6 GB of embedding tables, some 3.5 GB of it pure duplication, all doing identical lookups. And there was a second, quieter gap: nothing at startup actually proved that the Metal path produced the same answer as the CPU path. “Probably the same” is not a guarantee you want under your inference SLA.

0.2.0 closes both. The serving path is now pooled and memory-frugal, and every startup carries a hard parity check between backends. On the imaging side, 0.2.0 builds directly on its predecessor: 0.1.5 (2026-06-02) shipped the oxibonsai-image crate, the first pure-Rust FLUX.2-Klein 4B text-to-image pipeline (DiT + VAE + Text Encoder), with Metal flash-attention and the oxibonsai image CLI. 0.2.0 takes that pipeline to NVIDIA and makes its output reproducible.

Technical Deep Dive

EnginePool — concurrency without duplication. The new oxibonsai-runtime module engine_pool.rs introduces EnginePool, a Vec<InferenceEngine> fronted by an Arc<Semaphore> whose permits gate how many requests run at once. The CPU tier spins up N replicas that all share a single Arc<[f32]> token-embedding table — so instead of paying ~1.16 GB per replica, the whole pool pays it once. The GPU tier is deliberately clamped to a single replica, because the Metal backend is a process-global singleton and cannot be safely multiplied. A new build_pool_from_gguf API constructs the pool straight from a GGUF file, and the default CPU pool size is min(4, num_cpus).
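
The sharing idea above can be sketched with the standard library alone. This is a minimal illustration, not OxiBonsai's engine_pool.rs: Replica, Pool, with_replica, and lookup_sum are hypothetical names, and because std has no semaphore, the permit gate is folded into an idle list guarded by a Mutex and Condvar (the number of idle replicas plays the role of the permit count). Only the Arc<[f32]> sharing and the min(4, num_cpus) default come from the text.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::thread;

/// Hypothetical stand-in for one engine replica; only the shared table matters here.
struct Replica {
    id: usize,
    embeddings: Arc<[f32]>, // one allocation, shared by every replica
}

impl Replica {
    /// Toy "inference": sum one embedding row of width `dim`.
    fn lookup_sum(&self, token: usize, dim: usize) -> f32 {
        self.embeddings[token * dim..(token + 1) * dim].iter().sum()
    }
}

/// Hypothetical pool: idle replicas wait in a list; an empty list blocks callers.
struct Pool {
    idle: Mutex<Vec<Replica>>,
    available: Condvar,
}

impl Pool {
    fn new(embeddings: Arc<[f32]>, size: usize) -> Self {
        let idle = (0..size)
            .map(|id| Replica { id, embeddings: Arc::clone(&embeddings) })
            .collect();
        Pool { idle: Mutex::new(idle), available: Condvar::new() }
    }

    /// Borrow a replica for the duration of `f`, blocking while none is idle.
    /// Sketch only: a panic inside `f` would leak the replica.
    fn with_replica<T>(&self, f: impl FnOnce(&Replica) -> T) -> T {
        let mut idle = self.idle.lock().unwrap();
        while idle.is_empty() {
            idle = self.available.wait(idle).unwrap();
        }
        let replica = idle.pop().expect("list is non-empty after the wait loop");
        drop(idle); // release the lock while the request runs
        let out = f(&replica);
        self.idle.lock().unwrap().push(replica);
        self.available.notify_one();
        out
    }
}

/// Default CPU pool size as described in the text: min(4, num_cpus).
fn default_pool_size() -> usize {
    let cpus = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    cpus.min(4)
}

fn main() {
    let (vocab, dim) = (8usize, 4usize);
    let table: Arc<[f32]> = (0..vocab * dim).map(|i| i as f32).collect();
    let pool = Pool::new(Arc::clone(&table), default_pool_size());

    // More requests than replicas: the extras wait for a free replica.
    thread::scope(|s| {
        for token in 0..vocab {
            let pool = &pool;
            s.spawn(move || {
                let (id, sum) = pool.with_replica(|r| (r.id, r.lookup_sum(token, dim)));
                println!("token {token} served by replica {id}: {sum}");
            });
        }
    });

    // One table, `size + 1` handles to it; no per-replica copy was made.
    println!("handles to the shared table: {}", Arc::strong_count(&table));
}
```

Cloning an Arc only bumps a reference count, so adding a replica costs a pointer rather than another copy of the table.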

CPU↔Metal byte-identical parity guard. At startup, InferenceEngine now runs the same greedy decode on both the CPU and Metal backends and asserts the outputs are byte-identical: not “close,” not “within tolerance,” identical. We validated this on a real 1.7B Ternary-Bonsai model. The payoff is conceptual as much as practical: the GPU path is no longer a separate code path you have to take on trust; every startup checks that it reproduces the CPU path's output exactly.
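
To make "byte-identical" concrete, here is a small sketch of a bitwise comparison over two f32 buffers. The text does not say whether the guard compares token IDs, logits, or something else, so treating the outputs as f32 slices is an assumption, and first_bit_mismatch is a hypothetical helper rather than part of OxiBonsai. The point it illustrates is that a byte-level check compares bit patterns, which is stricter than ==: 0.0 and -0.0 compare equal as floats but differ in their bytes.

```rust
/// Return the index of the first element whose bit pattern differs,
/// or the shorter length if one buffer is a strict prefix of the other.
/// `None` means the two buffers are byte-identical.
fn first_bit_mismatch(cpu: &[f32], gpu: &[f32]) -> Option<usize> {
    let common = cpu.len().min(gpu.len());
    for i in 0..common {
        // Compare raw bits, not float values: no tolerance, no NaN special-casing.
        if cpu[i].to_bits() != gpu[i].to_bits() {
            return Some(i);
        }
    }
    if cpu.len() != gpu.len() {
        return Some(common);
    }
    None
}

fn main() {
    let cpu = [1.0f32, 0.5, 0.0];
    let same = [1.0f32, 0.5, 0.0];
    let signed_zero = [1.0f32, 0.5, -0.0];

    println!("identical buffers: {:?}", first_bit_mismatch(&cpu, &same));
    // 0.0 == -0.0 as floats, yet the sign bit differs, so the guard trips.
    println!("signed zero:       {:?}", first_bit_mismatch(&cpu, &signed_zero));
}
```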

CUDA imagen backend — a parity-first port. oxibonsai-kernels and oxibonsai-image gain a native CUDA GPU backend for image generation on Linux and Windows, authored as a blind mirror of the Metal path: plain FP32, cap-of-8-safe, purely additive — the Metal path’s bytes are unchanged. It lands in three acceleration tiers: (a) a gemm_tq2 retile from 32×32 to 128×128 transposed-shared tiles, ≈6× on that kernel; (b) warp-cooperative flash-attention splitting head_dim across lanes with __shfl_xor_sync, ≈6.3× on that kernel; and (c) a Stage-0 context_embedder GPU port (encode_gemm_f32) at 59× on that operation. End to end at steps=4 on an A4000-class GPU, this moves ~101s down to ~31.7s — about 3.2×. Honest caveat: compile and the cos≥0.999 parity validation are deferred to real Linux/CUDA hardware. This is a parity-first port, not a hardware-validated one yet.
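
The reduction pattern behind tier (b) can be illustrated on the CPU. The sketch below simulates in Rust what a 32-lane warp does with __shfl_xor_sync: each lane accumulates a partial dot product over its slice of head_dim, then a butterfly exchange with XOR offsets 16, 8, 4, 2, 1 leaves the full sum in every lane. It is a simulation under assumptions (a strided split across lanes, a plain dot product rather than the full attention kernel), and butterfly_all_reduce and warp_dot are hypothetical names, not the project's CUDA code.

```rust
const LANES: usize = 32; // threads in one CUDA warp

/// Simulated XOR-shuffle butterfly: after log2(32) = 5 steps, every lane
/// holds the sum of all 32 inputs.
fn butterfly_all_reduce(mut lanes: [f32; LANES]) -> [f32; LANES] {
    let mut offset = LANES / 2;
    while offset > 0 {
        let prev = lanes; // snapshot: all lanes exchange simultaneously
        for lane in 0..LANES {
            // Lane `lane` reads the value held by lane `lane ^ offset`.
            lanes[lane] = prev[lane] + prev[lane ^ offset];
        }
        offset /= 2;
    }
    lanes
}

/// Dot product with head_dim split across lanes (element i goes to lane i % 32).
fn warp_dot(q: &[f32], k: &[f32]) -> f32 {
    assert_eq!(q.len(), k.len());
    let mut partial = [0.0f32; LANES];
    for (i, (a, b)) in q.iter().zip(k).enumerate() {
        partial[i % LANES] += a * b;
    }
    butterfly_all_reduce(partial)[0]
}

fn main() {
    // head_dim = 128, so each simulated lane owns 4 elements.
    let q: Vec<f32> = (0..128).map(|i| i as f32).collect();
    let k = vec![1.0f32; 128];
    let serial: f32 = q.iter().zip(&k).map(|(a, b)| a * b).sum();
    println!("warp-style: {}, serial: {}", warp_dot(&q, &k), serial);
}
```

On real hardware the exchange happens in registers, with no shared-memory round trip.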

Text Encoder, 4-bit native. oxibonsai-image adds open_mlx_4bit, which loads a native 2.1 GB MLX 4-bit safetensors file directly — replacing the previous 15 GB f32 .npy dump and pulling the real Bonsai-Image footprint down to ≈3.5 GB. A parity gate, te_parity, holds it to cos≥0.999999 against the MLX oracle. Flip it on with the OXI_TE_4BIT environment variable.
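
A cosine gate of the kind described here (cos≥0.999999 for te_parity, cos≥0.999 for the CUDA port) is easy to sketch. The code below is an illustration, not the te_parity implementation: cosine_similarity and passes_gate are hypothetical helpers, and accumulating in f64 is a choice made for this sketch so that rounding in the sums does not eat into a threshold that sits six nines from 1.0.

```rust
/// Cosine similarity of two equal-length vectors, accumulated in f64.
/// Returns `None` for mismatched lengths, empty input, or a zero vector.
fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Gate: pass only if the similarity is defined and at least `min_cos`.
fn passes_gate(candidate: &[f32], oracle: &[f32], min_cos: f64) -> bool {
    matches!(cosine_similarity(candidate, oracle), Some(c) if c >= min_cos)
}

fn main() {
    let oracle = [3.0f32, 4.0, 0.0];
    let close = [3.0f32, 4.0, 0.001];
    let off = [4.0f32, 3.0, 0.0];

    println!("close: {:?}", cosine_similarity(&close, &oracle));
    println!("off:   {:?}", cosine_similarity(&off, &oracle));
    println!("close passes 0.999999: {}", passes_gate(&close, &oracle, 0.999999));
    println!("off passes 0.999:      {}", passes_gate(&off, &oracle, 0.999));
}
```

Cosine similarity ignores overall scale, so a gate like this catches direction errors but not a uniform rescaling of the output; that is one reason it is a looser guarantee than the byte-identical check used for CPU↔Metal.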

Reproducible images via Threefry. oxibonsai image now takes --seed, backed by an MLX-exact Threefry-2×32 RNG port. --seed 42 is byte-exact against the official mflux reference, so the same prompt and seed give you the same PNG, every run. The relevant env vars are OXI_DIT_GGUF, OXI_VAE_WEIGHTS, OXI_TE_4BIT, and OXI_TE_TOKENIZER_DIR.
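
For readers unfamiliar with Threefry, the core is a small counter-based block function: the same (key, counter) pair always yields the same two 32-bit words, with no hidden state. The sketch below follows the standard Threefry-2x32 construction with 20 rounds as published in Random123; it is not OxiBonsai's port, and it leaves out everything MLX-specific (how --seed becomes a key, how keys are split, how bits become Gaussian noise). The key [0, 42] in main is purely illustrative.

```rust
/// Rotation constants for Threefry-2x32, used in alternating groups of four rounds.
const ROTATIONS: [[u32; 4]; 2] = [[13, 15, 26, 6], [17, 29, 16, 24]];
/// Key-schedule parity constant from the Threefish/Threefry design.
const PARITY: u32 = 0x1BD1_1BDA;

/// One Threefry-2x32 block (20 rounds): maps (key, counter) to two 32-bit words.
fn threefry2x32(key: [u32; 2], counter: [u32; 2]) -> [u32; 2] {
    let ks = [key[0], key[1], key[0] ^ key[1] ^ PARITY];
    let mut x0 = counter[0].wrapping_add(ks[0]);
    let mut x1 = counter[1].wrapping_add(ks[1]);

    // Five groups of four rounds, with a key injection after each group.
    for group in 1..=5u32 {
        let g = group as usize;
        for &rot in &ROTATIONS[(g - 1) % 2] {
            x0 = x0.wrapping_add(x1);
            x1 = x1.rotate_left(rot) ^ x0;
        }
        x0 = x0.wrapping_add(ks[g % 3]);
        x1 = x1.wrapping_add(ks[(g + 1) % 3]).wrapping_add(group);
    }
    [x0, x1]
}

fn main() {
    // Assumption for illustration only: the real seed-to-key mapping is MLX's.
    let key = [0u32, 42];

    // Stateless: the stream is just the block function over successive counters.
    for i in 0..4u32 {
        let [a, b] = threefry2x32(key, [0, i]);
        println!("counter {i}: {a:08x} {b:08x}");
    }

    // Same key and counter, same words, on every run and every platform.
    assert_eq!(threefry2x32(key, [0, 0]), threefry2x32(key, [0, 0]));
}
```

Because the function uses only 32-bit integer adds, rotates, and XORs, its output does not depend on floating-point behavior, which makes cross-backend reproducibility tractable at the noise-generation step.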

Getting Started

Install the CLI:

cargo install oxibonsai-cli

Serve a model — now backed by the pooled engine, OpenAI-compatible REST:

oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf

Generate a reproducible image:

oxibonsai image --prompt "a tiny bonsai tree in a ceramic pot" --seed 42 --out bonsai.png

You can still drive single completions with oxibonsai run --model … --prompt "…" and hold a session with oxibonsai chat --model …, but the spotlight in 0.2.0 is the pooled serve and reproducible image --seed.

What’s New in 0.2.0

Tips

This is the foundation

OxiBonsai is the inference end of the COOLJAPAN ecosystem — sub-2-bit Bonsai models from PrismML, served and rendered on top of SciRS2, OxiBLAS, OxiFFT, OxiARC, and OxiONNX, with no FFI and no C/C++/Fortran runtime underneath any of it. 0.2.0 turns that foundation into something you can put concurrent traffic on and still reason about, byte for byte.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you believe inference should be concurrent, reproducible, and sovereign — without borrowing a single line of C.

Pure Rust sub-2-bit inference that scales and proves itself is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, June 3, 2026
