OxiBonsai 0.1.5 Released — OxiBonsai Goes Multimodal: a Pure-Rust FLUX.2-Klein Text-to-Image Pipeline

OxiBonsai stopped being a thing that only writes words. Today it draws.

Today we released OxiBonsai 0.1.5 — introducing the new oxibonsai-image crate, a complete Pure-Rust text-to-image pipeline for PrismML Bonsai-Image (FLUX.2-Klein 4B) that turns a prompt into a PNG without a single line of Python at runtime.

No PyTorch. No diffusers. No Python at inference time. No C, no C++, no Fortran. No llama.cpp. No BLAS. Even the PNG encoder is Pure Rust, written on top of OxiARC.

OxiBonsai is the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family — the 1-bit Q1_0_g128 line and the ternary TQ2_0_g128 line — running on CPU SIMD, Apple Silicon Metal, and NVIDIA CUDA. Until now it was an LLM-only engine; 0.1.4 hardened it into a production runtime with controllers, observability, K-quant/FP8 families, CUDA batch prefill, and constrained decoding. 0.1.5 makes it multimodal.

Why OxiBonsai 0.1.5 matters

Generating an image today means standing up a stack: PyTorch, diffusers, a CUDA C++ kernel library, a Python interpreter, and a pile of native wheels that fight your toolchain. The model is small; the dependency tree is enormous.

OxiBonsai 0.1.5 collapses all of that into one Pure-Rust binary. The oxibonsai-image crate is the first pure-Rust, C/C++/Fortran-free, zero-FFI implementation of the Bonsai-Image FLUX.2-Klein text-to-image pipeline, built entirely on the COOLJAPAN ecosystem. Every stage — text encoder, diffusion transformer, VAE decoder — is parity-validated against the MLX reference at cos≥0.999. It runs on Apple Silicon Metal by default, falls back cleanly to CPU, and never asks for Python at runtime.

Technical Deep Dive

The whole pipeline is a single straight line from prompt to pixels:

prompt
  │
  ▼  Text Encoder ── Qwen3-4B, 4-bit (open_mlx_4bit)
  │
  ▼  DiT ── FLUX.2-Klein ternary, TQ2_0_g128 (Flux2Transformer2DModel)
  │
  ▼  VAE decoder ── AutoencoderKLFlux2 (Pure-Rust Conv2d)
  │
  ▼  PNG ── oxiarc-deflate (Pure-Rust DEFLATE)
  │
  ▼  out.png

Every model stage in that line — text encoder, DiT, and VAE — is independently parity-validated at cos≥0.999 against the MLX reference. Nothing here is a Pure-Rust approximation that quietly drifts; each block is held to the original numerics.

The DiT — `Flux2Transformer2DModel`

The diffusion transformer is the heart of the generator: 5 double-stream blocks followed by 20 single-stream blocks, all carrying ternary TQ2_0_g128 weights.

Ternary is the whole point here. A diffusion transformer is normally a multi-gigabyte fp16 model. Carrying its weights at sub-2-bit ternary is what lets Bonsai-Image fit alongside the text encoder inside a real ~3.5 GB footprint instead of a workstation’s worth of VRAM.
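
To make the ternary idea concrete, here is a minimal sketch of a dot product over ternary-coded weights. It assumes one f32 scale per group of 128 weights (suggested by the g128 suffix) and stores each code as a plain i8 in {-1, 0, +1}. The real TQ2_0_g128 bit packing is not reproduced here, and the name `ternary_dot` is hypothetical rather than an OxiBonsai API.

```rust
/// Number of weights sharing one scale (assumed from the `g128` suffix).
const GROUP: usize = 128;

/// Dot product of ternary-coded weights with an f32 activation vector.
/// `codes` holds -1, 0 or +1 (unpacked for clarity); `scales[g]` is the
/// scale of group `g`, so each weight is one of {-s, 0, +s}.
fn ternary_dot(codes: &[i8], scales: &[f32], x: &[f32]) -> f32 {
    assert_eq!(codes.len(), x.len());
    assert_eq!(codes.len(), scales.len() * GROUP);
    let mut acc = 0.0f32;
    for (g, &s) in scales.iter().enumerate() {
        let lo = g * GROUP;
        // Inside a group no multiplication is needed: only add, subtract, or skip.
        let mut group_sum = 0.0f32;
        for i in lo..lo + GROUP {
            match codes[i] {
                1 => group_sum += x[i],
                -1 => group_sum -= x[i],
                _ => {}
            }
        }
        // One multiply per group applies the shared scale.
        acc += s * group_sum;
    }
    acc
}
```

The point of the sketch is the cost model: the inner loop is additions only, and the scale is applied once per 128 weights.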

Latents are 128-channel. The block internals are a faithful FLUX.2-Klein port:

4-axis RoPE for position encoding across the latent grid.
AdaLN modulation to inject the conditioning signal.
A fused gate+up+SwiGLU MLP.
Joint attention across the image and text streams in the double blocks, single-stream attention thereafter.
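
Two of the block internals listed above are small enough to sketch directly. The code below is an illustration under assumptions, not the crate's implementation: it assumes AdaLN takes an already-normalized activation and applies `x * (1 + scale) + shift`, and that the fused projection lays out its output as the gate half followed by the up half. The function names are hypothetical.

```rust
/// AdaLN-style modulation of an already-normalized activation:
/// the conditioning signal supplies a per-channel `shift` and `scale`.
fn modulate(x_norm: &[f32], shift: &[f32], scale: &[f32]) -> Vec<f32> {
    assert_eq!(x_norm.len(), shift.len());
    assert_eq!(x_norm.len(), scale.len());
    x_norm
        .iter()
        .zip(shift.iter().zip(scale))
        .map(|(&x, (&sh, &sc))| x * (1.0 + sc) + sh)
        .collect()
}

/// SiLU activation: x * sigmoid(x).
fn silu(x: f32) -> f32 {
    x / (1.0 + (-x).exp())
}

/// SwiGLU applied to the output of one fused gate+up projection.
/// Assumed layout: first half is the gate, second half is the up branch.
fn swiglu_fused(gate_up: &[f32]) -> Vec<f32> {
    assert!(gate_up.len() % 2 == 0);
    let (gate, up) = gate_up.split_at(gate_up.len() / 2);
    gate.iter().zip(up).map(|(&g, &u)| silu(g) * u).collect()
}
```

Fusing gate and up into one projection means a single matrix multiply feeds both branches, which is why the activation reads from one buffer.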

Two kernels make it fast on Apple Silicon:

Ternary GEMM v10 (in oxibonsai-kernels) uses f16-D-EXACT staging — each code maps to a value in {−s, 0, +s} — with a vectorized dequant-scatter. It runs ~1.89× over the v9 kernel end-to-end on the DiT under Metal.
A dedicated flash-attention Metal kernel (dit_attention_flash.rs) does simdgroup f32 MACs with a flash-v2 online softmax. It clocks 59ms vs 323ms per step against the CPU rayon+NEON reference — a 5.47× speedup — and it is default-on for DiT inference.
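
The online softmax mentioned above can be illustrated on the CPU for a single query. This is a scalar sketch of the general flash-attention-style recurrence, not a port of dit_attention_flash.rs: it processes one key at a time rather than tiles, and `attend_online` is a hypothetical name.

```rust
/// Single-query attention with a streaming ("online") softmax.
/// `keys` and `values` are row-major with `d` elements per row; at least
/// one key is expected. Scores are never materialized as a full vector.
fn attend_online(q: &[f32], keys: &[f32], values: &[f32], d: usize) -> Vec<f32> {
    assert_eq!(q.len(), d);
    assert_eq!(keys.len(), values.len());
    let n = keys.len() / d;
    let scale = 1.0 / (d as f32).sqrt();
    let mut m = f32::NEG_INFINITY; // running maximum of the scores
    let mut l = 0.0f32; // running sum of exp(score - m)
    let mut acc = vec![0.0f32; d]; // running weighted sum of value rows
    for j in 0..n {
        let k = &keys[j * d..(j + 1) * d];
        let s: f32 = q.iter().zip(k).map(|(a, b)| a * b).sum::<f32>() * scale;
        let m_new = m.max(s);
        // Rescale the old state to the new maximum (this is 0 on the first key).
        let correction = (m - m_new).exp();
        let p = (s - m_new).exp();
        l = l * correction + p;
        let v = &values[j * d..(j + 1) * d];
        for (a, &vi) in acc.iter_mut().zip(v) {
            *a = *a * correction + p * vi;
        }
        m = m_new;
    }
    acc.iter().map(|a| a / l).collect()
}
```

Because every exponent is taken relative to the running maximum, large scores do not overflow, and the result matches a two-pass softmax up to rounding.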

Correctness is pinned by the dit_parity gate: 59 taps, all required to hold cos≥0.999 against the reference. That matters because ternary weights are exactly where a Pure-Rust port could silently lose precision; with the 59-tap gate in place, that kind of drift fails the check instead of shipping unnoticed.
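
For readers unfamiliar with cosine-similarity gates, the check itself is simple. The sketch below assumes each tap is a pair of flattened tensors (ours, reference) and that every tap must meet the threshold. The names `cosine_similarity` and `passes_parity` are illustrative and do not come from the OxiBonsai test suite.

```rust
/// Cosine similarity of two equally sized tensors, accumulated in f64.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f64 {
    assert_eq!(a.len(), b.len());
    let (mut dot, mut na, mut nb) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        na += x * x;
        nb += y * y;
    }
    dot / (na.sqrt() * nb.sqrt())
}

/// A gate passes only if every tap meets the threshold.
/// A zero-norm tensor yields NaN, which compares false and so fails the gate.
fn passes_parity(taps: &[(Vec<f32>, Vec<f32>)], threshold: f64) -> bool {
    taps.iter()
        .all(|(ours, reference)| cosine_similarity(ours, reference) >= threshold)
}
```

Under this scheme the DiT gate would call `passes_parity(&taps, 0.999)` over its 59 taps.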

The VAE decoder — `AutoencoderKLFlux2`

The decoder is a from-scratch Pure-Rust convolutional network. Conv2d is implemented as im2col + GEMM, and the rest of the stack is built up layer by layer:

GroupNorm(32) and SiLU activations.
ResNet blocks with layers_per_block=2.
A 4-stage upsample with channel widths 128 / 256 / 512 / 512.
A final post_quant_conv and a patch[2,2] unpack, running entirely in fp32 (force_upcast).
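
The im2col + GEMM formulation of Conv2d can be sketched in a few dozen lines. This version makes simplifying assumptions that may differ from the crate: a single CHW image, stride 1, square kernels, zero padding, weights in [O, I, kH, kW] order, and a naive triple-loop GEMM. Function names are hypothetical.

```rust
/// Unfold a CHW image into a [c*k*k, out_h*out_w] matrix (stride 1, zero padding `pad`).
fn im2col(input: &[f32], c: usize, h: usize, w: usize, k: usize, pad: usize) -> (Vec<f32>, usize, usize) {
    let out_h = h + 2 * pad + 1 - k;
    let out_w = w + 2 * pad + 1 - k;
    let mut cols = vec![0.0f32; c * k * k * out_h * out_w];
    for ci in 0..c {
        for ky in 0..k {
            for kx in 0..k {
                let row = (ci * k + ky) * k + kx;
                for oy in 0..out_h {
                    for ox in 0..out_w {
                        // Source coordinates in the unpadded image; they may land in the padding.
                        let iy = (oy + ky) as isize - pad as isize;
                        let ix = (ox + kx) as isize - pad as isize;
                        if iy >= 0 && (iy as usize) < h && ix >= 0 && (ix as usize) < w {
                            cols[row * out_h * out_w + oy * out_w + ox] =
                                input[(ci * h + iy as usize) * w + ix as usize];
                        }
                    }
                }
            }
        }
    }
    (cols, out_h, out_w)
}

/// Conv2d as one matrix multiply: weights [o, c*k*k] times the im2col matrix.
fn conv2d(input: &[f32], weight: &[f32], c: usize, h: usize, w: usize, o: usize, k: usize, pad: usize) -> Vec<f32> {
    let (cols, out_h, out_w) = im2col(input, c, h, w, k, pad);
    let (rows, n) = (c * k * k, out_h * out_w);
    assert_eq!(weight.len(), o * rows);
    let mut out = vec![0.0f32; o * n];
    for oi in 0..o {
        for r in 0..rows {
            let wv = weight[oi * rows + r];
            for j in 0..n {
                out[oi * n + j] += wv * cols[r * n + j];
            }
        }
    }
    out
}
```

The trade-off is memory: the unfolded matrix is k*k times larger than the input, which is the cost an implicit-GEMM path avoids.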

This is the stage that turns latents back into pixels, so it runs in full fp32 and is held to the same parity bar as the rest of the pipeline.

On Metal the VAE is default-on and drops decode time from 22.5s to 6.9s — about 3.2× over CPU. An implicit-GEMM, im2col-free convolution (vae_conv_implicit.rs, routed in encode_conv2d_f32 for kernels k≥3) trimmed it further, from 9.1s to 6.9s. You can opt out with OXI_VAE_GPU=0. The vae_parity gate holds 11 taps at cos≥0.999.

A native VAE safetensors loader (src/vae/safetensors.rs) reads FLUX.2 .safetensors directly: bf16→f32 is lossless (a pure bit-shift, zero rounding), and conv weights are transposed [O,I,kH,kW]→[O,kH,kW,I] on load. That eliminates the old Python .npy export step entirely.
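
Both load-time transforms are easy to show in isolation. The sketch below is not the code in src/vae/safetensors.rs; it only illustrates why the bf16 widening is exact (bf16 is the top 16 bits of an f32) and what the [O,I,kH,kW]→[O,kH,kW,I] index mapping looks like. Function names are hypothetical.

```rust
/// Widen a bf16 bit pattern to f32. bf16 is the upper half of an f32,
/// so shifting left by 16 reproduces the value exactly, with no rounding.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Reorder conv weights from [O, I, kH, kW] to [O, kH, kW, I].
fn oihw_to_ohwi(src: &[f32], o: usize, i: usize, kh: usize, kw: usize) -> Vec<f32> {
    assert_eq!(src.len(), o * i * kh * kw);
    let mut dst = vec![0.0f32; src.len()];
    for oo in 0..o {
        for ii in 0..i {
            for y in 0..kh {
                for x in 0..kw {
                    let from = ((oo * i + ii) * kh + y) * kw + x;
                    let to = ((oo * kh + y) * kw + x) * i + ii;
                    dst[to] = src[from];
                }
            }
        }
    }
    dst
}
```

With the input channel innermost, the values a kernel position needs across channels sit contiguously in memory.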

Operationally, that is a real simplification. The old path needed Python and a conversion step just to stage the weights; now the engine ingests the original checkpoint as shipped, and there is no offline export to keep in sync.

The text encoder — Qwen3-4B, 4-bit

The prompt encoder is Qwen3-4B loaded through open_mlx_4bit. It reads native 2.1 GB MLX 4-bit safetensors directly — down from a 15 GB f32 .npy dump — and dequantizes the mlx-packed-affine 4-bit weights to f32 on demand.
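
As a rough picture of what on-demand affine dequantization involves, the sketch below unpacks 4-bit codes and applies a per-group scale and bias. The details are assumptions for illustration only: eight codes per u32 with the least-significant nibble first, `w = scale * q + bias`, and a caller-supplied group size. It is not the open_mlx_4bit implementation, and `dequant_affine_4bit` is a hypothetical name.

```rust
/// Dequantize packed 4-bit affine weights to f32.
/// Assumed layout: 8 codes per u32, least-significant nibble first;
/// each group of `group_size` weights shares one scale and one bias.
fn dequant_affine_4bit(packed: &[u32], scales: &[f32], biases: &[f32], group_size: usize) -> Vec<f32> {
    let n = packed.len() * 8;
    assert_eq!(n, scales.len() * group_size);
    assert_eq!(scales.len(), biases.len());
    let mut out = Vec::with_capacity(n);
    for (wi, &word) in packed.iter().enumerate() {
        for nib in 0..8 {
            let idx = wi * 8 + nib;
            // Extract the 4-bit code, an integer in 0..=15.
            let q = ((word >> (4 * nib)) & 0xF) as f32;
            let g = idx / group_size;
            out.push(scales[g] * q + biases[g]);
        }
    }
    out
}
```

Dequantizing per call rather than up front is what keeps the resident weights at their 4-bit size.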

The real Bonsai-Image footprint lands at roughly 3.5 GB. To use the 4-bit encoder, set OXI_TE_4BIT to the MLX 4-bit safetensors.

The te_parity gate is the strictest in the pipeline: cos≥0.999999 against the MLX oracle. The text encoder sets the conditioning for everything downstream, so a near-bit-exact match here is what keeps the rest of the pipeline faithful to the reference.

Scheduler, RNG, and PNG

Sampling uses a flow-match Euler scheduler with dynamic μ-shift (a sequence-length-dependent exponential time-shift), native init noise, img_ids/txt_ids, and sigmas/timesteps generation. Noise comes from an MLX-exact Threefry-2×32 RNG port (src/sample/mlx_rng.rs) with 5 rounds, exact rotation constants, and per-round key injection; with --seed 42 the noise byte-matches the official mflux reference.
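
The scheduler arithmetic can be sketched compactly. The code below follows the commonly used exponential time-shift, σ' = e^μ / (e^μ + (1/σ − 1)), and a plain Euler update x ← x + (σ_next − σ)·v. How μ is derived from the sequence length is left to the caller, and the function names are hypothetical, so treat this as an illustration rather than the crate's scheduler.

```rust
/// Exponential time-shift: pushes sigmas toward 1 for mu > 0; identity for mu = 0.
fn time_shift(mu: f32, sigma: f32) -> f32 {
    let e = mu.exp();
    e / (e + (1.0 / sigma - 1.0))
}

/// Sigma schedule: `steps` shifted values starting at 1, plus a terminal 0.
fn shifted_sigmas(steps: usize, mu: f32) -> Vec<f32> {
    let mut sigmas: Vec<f32> = (0..steps)
        .map(|i| {
            // Linear grid 1, 1 - 1/steps, ..., 1/steps before shifting.
            let t = 1.0 - i as f32 / steps as f32;
            time_shift(mu, t)
        })
        .collect();
    sigmas.push(0.0);
    sigmas
}

/// One Euler step along the predicted velocity `v`.
fn euler_step(x: &mut [f32], v: &[f32], sigma: f32, sigma_next: f32) {
    for (xi, &vi) in x.iter_mut().zip(v) {
        *xi += (sigma_next - sigma) * vi;
    }
}
```

A sampling loop would walk `shifted_sigmas(steps, mu).windows(2)` and call the model once per pair to obtain `v`.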

The final image is written by oxiarc-deflate, a Pure-Rust DEFLATE encoder from the COOLJAPAN ecosystem (no flate2, no zstd, no zip), and the output is parity-validated against a reference PNG at 512×512.

A CUDA imagen backend also lands here for Linux/Windows, authored as a blind mirror of the Metal path — parity-first plain-FP32, additive, leaving the Metal bytes unchanged. An early steps=4 benchmark on an A4000-class GPU projects ~101s → ~31.7s (3.2×); full compile and cos≥0.999 CUDA parity validation are deferred to CUDA hardware, so treat this as the backend landing rather than a GA number. CUDA imagen GA is the story for a later release.

Getting Started

cargo install oxibonsai-cli           # installs the `oxibonsai` binary

# Configure the imagen assets once (copy the template, then edit paths):
cp .env.example .env
#   OXI_DIT_GGUF=...         # FLUX.2-Klein ternary DiT (GGUF, TQ2_0_g128)
#   OXI_VAE_WEIGHTS=...      # AutoencoderKLFlux2 .safetensors (or a .npy dir)
#   OXI_TE_4BIT=...          # Qwen3-4B text encoder, 2.1 GB MLX 4-bit safetensors
#   OXI_TE_TOKENIZER_DIR=... # Qwen3 tokenizer dir

# Generate a 512×512 PNG (Metal default-on; ~52–62s on an M3, steps=4):
oxibonsai image --prompt "a tiny bonsai tree in a ceramic pot" --out bonsai.png

docs/IMAGEN.md walks through fetching the checkpoints from HuggingFace — the only non-Rust step in the whole flow, and it is used purely to download weights. All conversion and all inference are Pure Rust.

What’s New in 0.1.5

Here is everything that landed in this release:

New oxibonsai-image crate: the first Pure-Rust, zero-FFI FLUX.2-Klein text-to-image pipeline.
DiT Flux2Transformer2DModel: 5 double + 20 single blocks, ternary TQ2_0_g128, 128-channel latents, 4-axis RoPE, AdaLN, fused SwiGLU; dit_parity 59 taps at cos≥0.999.
VAE decoder AutoencoderKLFlux2 with Pure-Rust Conv2d, plus Metal GPU VAE (default-on) at ~3.2× and an implicit-GEMM conv path.
Native VAE safetensors loader: reads FLUX.2 .safetensors directly, lossless bf16→f32, no Python .npy export.
Qwen3-4B 4-bit text encoder (open_mlx_4bit): 2.1 GB MLX 4-bit weights, ~3.5 GB footprint, te_parity cos≥0.999999.
Flash-attention Metal kernel (dit_attention_flash.rs): 5.47× over the CPU reference, default-on.
Ternary GEMM v10 kernel: f16-D-EXACT staging + vectorized dequant-scatter, ~1.89× over v9.
Flow-match Euler scheduler with dynamic μ-shift, plus the MLX-exact Threefry-2×32 RNG.
Pure-Rust PNG output via oxiarc-deflate, parity-validated at 512×512.
.env / dotenvy auto-load on every CLI subcommand; precedence CLI flag > shell env > .env > default.
New docs: docs/IMAGEN.md (assets, env, CLI, perf, parity) and docs/CLI.md (exhaustive flag/env reference).
oxibonsai-kernels now builds cleanly on stable Rust 1.86+ and nightly; build.rs gates the aarch64 prefetch intrinsic behind cfg(nightly_aarch64_prefetch).
Fixes: read head_dim from GGUF metadata for Qwen3-4B (it sets head_dim=128 explicitly, while 2560/32=80 — deriving it caused a LinearTernary shape mismatch loading Ternary-Bonsai-4B.gguf); the temperature-discard bug on streaming completions; and hf download arg order in scripts.
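
The head_dim fix in the last item comes down to preferring an explicit metadata value over a derived one. A minimal sketch, with a hypothetical helper name and the GGUF lookup itself left out:

```rust
/// Pick the attention head dimension: use the explicit metadata value when
/// present, and fall back to hidden_size / n_heads only when it is absent.
fn resolve_head_dim(explicit: Option<usize>, hidden_size: usize, n_heads: usize) -> usize {
    explicit.unwrap_or(hidden_size / n_heads)
}
```

For Qwen3-4B, `resolve_head_dim(Some(128), 2560, 32)` returns 128, whereas the derived fallback alone would give 80 and produce the shape mismatch described above.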

Tips

Metal is default-on. When debugging — or on a machine without Metal — opt out per stage: OXI_VAE_GPU=0 runs the VAE on CPU, and OXI_DIT_ATTN_GPU=0 forces the CPU flash-attention fallback for the DiT.
Set OXI_TE_4BIT to the 2.1 GB MLX 4-bit safetensors so the text encoder loads 2.1 GB of weights instead of the old 15 GB f32 .npy, which keeps the whole pipeline at roughly 3.5 GB.
Point OXI_VAE_WEIGHTS straight at the FLUX.2 .safetensors file — the native loader reads it directly and skips the Python .npy export step entirely.
Drop a .env in your project (or any parent directory) so you can omit the flags on every run; precedence is CLI flag > shell env > .env > built-in default.
Keep steps low (e.g. 4) for fast iteration. Seeds exist (--seed, backed by the MLX-exact Threefry RNG) if you want to revisit a composition, but full byte-exact reproducibility lands in a later release, so lean on them lightly for now.
Read docs/IMAGEN.md for end-to-end asset acquisition and the full perf/parity tables before your first generation.
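
The precedence rule in the tips above (CLI flag > shell env > .env > built-in default) maps naturally onto chained `Option`s. This is a sketch of the ordering only; it does not use dotenvy or the real CLI parser, and `resolve` is a hypothetical helper.

```rust
/// Resolve one setting from its four possible sources, highest priority first.
fn resolve(cli: Option<&str>, shell_env: Option<&str>, dotenv: Option<&str>, default: &str) -> String {
    cli.or(shell_env).or(dotenv).unwrap_or(default).to_string()
}
```

For example, `resolve(None, Some("0"), Some("1"), "1")` yields "0": the shell environment wins over the .env file when no flag is given.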

This is the foundation

OxiBonsai rides on the COOLJAPAN ecosystem: SciRS2 for the numerics, OxiBLAS for the linear algebra, OxiFFT, OxiARC (which powers the Pure-Rust PNG/DEFLATE encoder), and OxiONNX. It already served PrismML’s sub-2-bit Bonsai LLMs; with 0.1.5 it also serves PrismML Bonsai-Image. One Pure-Rust engine now spans both text and image — multimodal sovereign inference, no Python in the loop, no FFI under the hood.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if a single Pure-Rust binary that both writes and draws is the future you want to build on.

Pure Rust sovereign inference — for words and for pixels — is here: fast, safe, sovereign, and now multimodal.

— KitaSan at COOLJAPAN OÜ June 2, 2026