OxiBonsai 0.2.1 Released — Minutes-Long Numeric Tests, Now Fast (and a VAE File Fix)

The fastest way to make a heavy-numeric Rust workspace fun again is two profile settings in Cargo.toml, and that is the headline of this release.

Today we released OxiBonsai 0.2.1, a quality-of-life and correctness release that cuts minutes from the test suite, fixes a VAE precheck that wrongly rejected a valid .safetensors file, and points the docs at the correct HuggingFace asset layout.

OxiBonsai is the first Pure Rust, zero-FFI, C/C++/Fortran-free inference engine for PrismML’s sub-2-bit Bonsai models — the 1-bit (Q1_0_g128) and ternary (TQ2_0_g128) lines — and, since 0.1.5, the Bonsai-Image text-to-image pipeline. The previous release, 0.2.0, brought a concurrent engine pool for /serve and a native CUDA imagen backend. 0.2.1 is pure polish on top of that.

No llama.cpp. No BLAS. No C/C++/Fortran. Everything below is Rust, all the way down to the DEFLATE that writes the PNG.

Why OxiBonsai 0.2.1 matters

If you have ever worked on a heavy-numeric Rust workspace, you know the trap: cargo test builds with opt-level = 0, and your real numeric tests — the parity golden references, the model forward passes — run completely unoptimized. In OxiBonsai that meant the DiT-shape joint-attention CPU reference (~14.5 GFLOP) and the speculative decoder’s ~240 forward passes over a 151,936-row vocabulary took minutes, every run. That tax lands on every contributor, every CI job, every time.

0.2.1 removes it with a profile change, not a code change — and does so without touching float results. The VAE fix and the doc corrections close two papercuts that bit anyone trying to follow the Bonsai-Image getting-started path.

Technical Deep Dive

The profile opt-level trick. OxiBonsai’s Cargo.toml now sets:

[profile.test]
opt-level = 2

[profile.dev.package."*"]
opt-level = 3

[profile.test] opt-level = 2 optimizes everything Cargo compiles under the test profile: the test binaries and the workspace crates they link against, such as oxibonsai-model and oxibonsai-kernels. [profile.dev.package."*"] raises optimization for dependencies; Cargo documents the "*" spec as matching every package that is not a workspace member, so it covers the external crates, and the test profile inherits it from dev. Together, the numeric kernels are compiled and autovectorized instead of running as scalar opt-level = 0 code. Crucially, plain cargo build and cargo run still compile the crate you are actively editing at the dev default of opt-level = 0, so incremental compiles of your own code remain fast: you optimize the heavy dependencies once, then iterate cheaply.

The important caveat, and the reason this is safe: float results are unchanged. Rust does not enable fast-math, so a higher opt-level does not reassociate floating-point reductions. The parity gates (cosine ≥ 0.999) stay bit-stable across opt-level settings — you get the speed without paying for it in numerical drift.
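To make the caveat concrete, here is a minimal sketch of why summation order matters and what a cosine gate checks. The function names and the gate helper are illustrative assumptions, not OxiBonsai's actual test code; only the 0.999 threshold comes from the text above.

```rust
/// Sum strictly left to right, in the order the source code spells out.
/// Without fast-math the compiler must preserve this order at every opt-level.
fn sum_left_to_right(xs: &[f32]) -> f32 {
    xs.iter().fold(0.0f32, |acc, &x| acc + x)
}

/// Cosine similarity between two equal-length vectors, accumulated in f64.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len(), "vectors must have the same length");
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (f64::from(x), f64::from(y));
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    dot / (norm_a.sqrt() * norm_b.sqrt())
}

/// Hypothetical parity gate using the threshold quoted in this post.
fn passes_parity_gate(reference: &[f32], candidate: &[f32]) -> bool {
    cosine_similarity(reference, candidate) >= 0.999
}
```

With f32, summing [1.0e8, 1.0, -1.0e8] left to right yields 0.0, because the 1.0 is absorbed by the large term, while [1.0e8, -1.0e8, 1.0] yields 1.0. A compiler that was free to reorder the additions could turn one result into the other, which is the kind of drift the absence of fast-math rules out.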

VAE precheck now accepts a file (#9). The text-to-image precheck in oxibonsai-image’s src/pipeline.rs used is_dir(), so a valid .safetensors file passed via --vae / OXI_VAE_WEIGHTS was rejected with “VAE weights dir not found” before any loading — even though VaeWeights::open and the docs both accept a file. The precheck now accepts a file or a directory (is_file() || is_dir()), the error wording is corrected, the stale doc-comment is fixed, and test_issue_9_* regression tests lock the behavior in.
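The shape of the fix can be sketched in a few lines. This is an illustration under assumptions, not the code in src/pipeline.rs: the function name, the signature, and the error wording are hypothetical, and only the is_file() || is_dir() condition comes from the description above.

```rust
use std::path::Path;

/// Hypothetical precheck: accept a weights file or a weights directory,
/// and reject only paths that are neither.
fn precheck_vae_path(path: &Path) -> Result<(), String> {
    if path.is_file() || path.is_dir() {
        Ok(())
    } else {
        // Illustrative wording; the real message in OxiBonsai may differ.
        Err(format!("VAE weights not found: {}", path.display()))
    }
}
```

The old behavior corresponds to testing path.is_dir() alone, which is why a perfectly valid file was turned away before the loader ever saw it.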

Pure-Rust PNG backend tracked forward. oxiarc-deflate is bumped 0.3.2 → 0.3.3 per the COOLJAPAN Latest-crates policy. It is the Pure-Rust DEFLATE backend behind Bonsai-Image’s PNG output; the substantive 0.3.3 OxiARC fixes land in sibling crates, so for OxiBonsai this is a version-tracking bump with no change to DEFLATE/PNG behavior.

Getting Started

Nothing new to install — if you are already on OxiBonsai, you are set. The change here is that the docs now point at the correct HuggingFace layout. hf download preserves the repo subfolder, so the files land under the paths you pass to the CLI:

pip install huggingface_hub      # the only non-Rust step — download only

# DiT, text encoder + tokenizer, and the bundled VAE all live in ONE repo:
hf download prism-ml/bonsai-image-ternary-4B-mlx-2bit \
    transformer-packed-mflux/diffusion_pytorch_model.safetensors \
    text_encoder-mlx-4bit/model.safetensors text_encoder-mlx-4bit/tokenizer.json \
    vae/diffusion_pytorch_model.safetensors --local-dir ./bonsai

# Convert the DiT to GGUF (Pure Rust; defaults to tq2_0_g128):
cargo run -p oxibonsai-model --example mlx_image_convert --release -- \
    ./bonsai/transformer-packed-mflux/diffusion_pytorch_model.safetensors ./bonsai-dit.gguf

Then generate an image, passing the VAE as a single .safetensors file — the thing 0.2.1 makes work:

oxibonsai image \
    --prompt "a tiny bonsai tree in a ceramic pot" --out bonsai.png \
    --dit ./bonsai-dit.gguf \
    --te  ./bonsai/text_encoder-mlx-4bit/model.safetensors \
    --vae ./bonsai/vae/diffusion_pytorch_model.safetensors \
    --seed 42 --steps 4
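
If a command fails with a missing-file error, it can help to confirm that the download landed where the commands above expect. The following is a small sketch, not part of OxiBonsai: the constant and function names are hypothetical, and the relative paths are the four passed to hf download above.

```rust
use std::path::{Path, PathBuf};

/// The four files requested from the repo, relative to --local-dir.
const EXPECTED_ASSETS: [&str; 4] = [
    "transformer-packed-mflux/diffusion_pytorch_model.safetensors",
    "text_encoder-mlx-4bit/model.safetensors",
    "text_encoder-mlx-4bit/tokenizer.json",
    "vae/diffusion_pytorch_model.safetensors",
];

/// Return every expected asset that is not present as a regular file
/// under `local_dir` (for example ./bonsai).
fn missing_assets(local_dir: &Path) -> Vec<PathBuf> {
    EXPECTED_ASSETS
        .iter()
        .map(|rel| local_dir.join(rel))
        .filter(|path| !path.is_file())
        .collect()
}
```

Calling missing_assets(Path::new("./bonsai")) after the download should return an empty list; anything it does return is a path to re-fetch.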

What’s New in 0.2.1

Optimized test / dev compile profiles. [profile.test] opt-level = 2 and [profile.dev.package."*"] opt-level = 3 make test binaries and all dependencies optimized and autovectorized — turning minutes-long numeric test runs fast, while the crate you are editing stays at opt-level = 0 for fast incremental builds. Float parity is bit-stable (no fast-math, no reduction reassociation).
VAE .safetensors file fix (#9). --vae / OXI_VAE_WEIGHTS now accepts a single .safetensors file, not just a directory; the precheck no longer rejects valid files with a “dir not found” error. Regression-tested.
Corrected HuggingFace asset paths (#8). The DiT lives under transformer-packed-mflux/; the text encoder and tokenizer ship inside the main prism-ml/bonsai-image-ternary-4B-mlx-2bit repo under text_encoder-mlx-4bit/ (the standalone prism-ml/text_encoder-mlx-4bit repo does not exist). docs/IMAGEN.md and the image-crate README now match the real hf download layout, with the bundled-vs-gated FLUX.2-dev VAE choice clarified. Verified against the live HF API.
oxiarc-deflate 0.3.2 → 0.3.3 — Latest-crates version tracking for the Pure-Rust DEFLATE/PNG backend.

Tips

Borrow the profile trick for any heavy-numeric Rust workspace. Add [profile.test] opt-level = 2 plus [profile.dev.package."*"] opt-level = 3 and your dependency kernels get autovectorized while your edit-crate stays fast to recompile. Remember the caveat that makes it safe: Rust has no fast-math, so this does not reassociate float reductions — your parity/bit-exactness checks stay stable.
Pass a single .safetensors file to --vae. As of 0.2.1 you can point --vae (or OXI_VAE_WEIGHTS) straight at vae/diffusion_pytorch_model.safetensors — no directory needed. The loader auto-detects: a .safetensors file uses the native loader, a directory selects the legacy .npy reader.
Grab everything from one repo. The DiT, the 4-bit text encoder, the tokenizer, and the bundled (non-gated) VAE all live in prism-ml/bonsai-image-ternary-4B-mlx-2bit — a single hf download fetches them all, no HuggingFace login required.
Follow docs/IMAGEN.md for the exact paths. It now matches the live HF layout end-to-end, including the bundled-VAE Option A vs. gated FLUX.2-dev Option B distinction.
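
The auto-detection described in the --vae tip can be pictured roughly as follows. This is a sketch under assumptions rather than the logic inside VaeWeights::open: the enum, the function name, and the choice to return None for any other kind of path are all hypothetical.

```rust
use std::path::Path;

/// Hypothetical label for which reader a --vae path would select.
#[derive(Debug, PartialEq)]
enum VaeSource {
    /// A single .safetensors file, handled by the native loader.
    SafetensorsFile,
    /// A directory, handled by the legacy .npy reader.
    LegacyNpyDir,
}

/// Classify a path the way the tip describes. Paths that are neither a
/// directory nor a .safetensors file yield None (an assumption of this sketch).
fn detect_vae_source(path: &Path) -> Option<VaeSource> {
    if path.is_dir() {
        return Some(VaeSource::LegacyNpyDir);
    }
    let is_safetensors = path
        .extension()
        .map_or(false, |ext| ext == "safetensors");
    if path.is_file() && is_safetensors {
        Some(VaeSource::SafetensorsFile)
    } else {
        None
    }
}
```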

This is the foundation

OxiBonsai is built entirely on the COOLJAPAN ecosystem — SciRS2 for numerics, OxiBLAS for linear algebra, OxiFFT for transforms, OxiARC (via oxiarc-deflate) for the Pure-Rust PNG path, and OxiONNX for model ingestion — serving the PrismML Bonsai sub-2-bit model family with zero FFI and no C/Fortran runtime.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you want sub-2-bit inference with a test suite that respects your time and a docs path that just works.

Pure Rust sovereign AI inference is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ June 6, 2026