OxiBonsai 0.2.2 Released — An Interactive Image REPL with Inline Terminal Rendering

Until now, generating a single image meant reloading multi-gigabyte weights from scratch. Now you load them once, type prompts, and watch the pictures appear right in your terminal.

Today we released OxiBonsai 0.2.2, which adds an interactive image REPL that keeps the model resident across renders and draws straight into the terminal.

No llama.cpp. No BLAS. No C, no C++, no Fortran runtime. OxiBonsai is the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family — the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128) — running on CPU SIMD (AVX2/AVX-512/NEON/WASM), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC), all on top of SciRS2, OxiBLAS, OxiFFT, OxiARC, and OxiONNX. Even the terminal graphics encoder in this release is Pure Rust.

Why OxiBonsai 0.2.2 matters

The previous release, 0.2.1, was a quiet DX-and-hardening pass: optimized test/dev profiles and two fixes from the field — a VAE .safetensors-file path bug (#9) and clearer HuggingFace asset-path docs (#8). It made the imaging path smoother to live with. 0.2.2 changes how you actually use it.

Until now, oxibonsai image was a one-shot command: every prompt paid the full cold-start price. Loading the ternary DiT, the VAE, and the Qwen3-4B text encoder — then dequantising the encoder’s weights — is the expensive part of generating an image. Doing it once is fine. Doing it on every prompt while you’re dialing in a composition is a tax on iteration: you sit through a multi-gigabyte warm-up to find out a seed was wrong, fix the seed, and pay the warm-up again.

oxibonsai repl removes that tax. It loads everything once into a resident ImageSession, then renders as many prompts as you like against the already-warm weights. Per-prompt time collapses to just compute — text-encode, DiT sampling, VAE decode — because the load and the encoder warm-up are behind you. And on a kitty-graphics terminal like Ghostty, you never leave the shell to look at the result: the PNG is painted inline, in place.

Technical Deep Dive

ImageSession — load once, render many. The new session (oxibonsai-image, driven from oxibonsai-cli) owns the DiT, the VAE, and the text encoder for its whole lifetime. The first render pays the cold cost; every render after that reuses the resident weights. Each render returns a RenderOutcome carrying StageTimings, so you get the per-stage wall-clock split — how long text-encode, sampling, and VAE decode each took — instead of one opaque total. When you’re tuning, that breakdown tells you which stage your :steps or :size change actually moved.
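To make the load-once, render-many shape concrete, here is a minimal sketch of a session that times each stage. Only the type names ImageSession, RenderOutcome, and StageTimings come from the release; every field, the load() and render() signatures, and the fake arithmetic standing in for the real stages are assumptions made for illustration.

```rust
use std::time::{Duration, Instant};

/// Per-stage wall-clock split. Field names are assumptions for this sketch.
#[derive(Debug, Default, Clone, Copy)]
pub struct StageTimings {
    pub text_encode: Duration,
    pub sampling: Duration,
    pub vae_decode: Duration,
}

impl StageTimings {
    /// Sum of the three stages.
    pub fn total(&self) -> Duration {
        self.text_encode + self.sampling + self.vae_decode
    }
}

/// One render: interleaved RGB bytes plus the timing breakdown.
pub struct RenderOutcome {
    pub rgb8: Vec<u8>,
    pub timings: StageTimings,
}

/// Stand-in for a session that owns its weights for its whole lifetime.
pub struct ImageSession {
    weights: Vec<f32>, // hypothetical placeholder for DiT + VAE + text encoder
    pub renders: usize,
}

impl ImageSession {
    /// Pays the (here: trivial) cold-start cost exactly once.
    pub fn load() -> Self {
        ImageSession { weights: vec![0.5; 1024], renders: 0 }
    }

    /// Reuses the resident weights and times each stage separately.
    pub fn render(&mut self, prompt: &str, width: usize, height: usize) -> RenderOutcome {
        let mut timings = StageTimings::default();

        let t = Instant::now();
        let embedding: f32 = prompt.bytes().map(|b| b as f32).sum(); // fake text-encode
        timings.text_encode = t.elapsed();

        let t = Instant::now();
        let latent: f32 = self.weights.iter().sum::<f32>() + embedding; // fake sampling
        timings.sampling = t.elapsed();

        let t = Instant::now();
        let value = (latent as u64 % 256) as u8; // fake VAE decode
        let rgb8 = vec![value; width * height * 3];
        timings.vae_decode = t.elapsed();

        self.renders += 1;
        RenderOutcome { rgb8, timings }
    }
}
```

Calling ImageSession::load() once and then render() in a loop is the whole pattern: the session outlives every prompt, and each outcome carries its own split rather than a single total.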

Resident text encoder — TeWeights::set_resident. The text encoder is the heaviest warm-up: its dequantised f32 weights are roughly 16 GB. The one-shot CLI deliberately keeps those off the heap between forwards to preserve a low-RAM profile, re-dequantising as needed. The REPL is the opposite trade: TeWeights::set_resident(true) tells the Mlx4bit source to cache the dequantised f32 tensors across forwards, so each subsequent prompt skips re-dequantisation entirely. It is off by default — the one-shot path stays lean — and on for the REPL, where you’ve opted into trading RAM for iteration speed.
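The RAM-for-speed trade can be pictured as an opt-in cache in front of the dequantiser. The sketch below is not the TeWeights implementation: QuantSource, its fields, the nibble packing, and the single per-tensor scale are all hypothetical, chosen only to show how a resident flag changes what happens on the second forward.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical quantised weight source with an opt-in f32 cache.
pub struct QuantSource {
    packed: HashMap<String, (Vec<u8>, f32)>, // name -> (4-bit codes, two per byte; scale)
    resident: bool,
    cache: HashMap<String, Arc<[f32]>>,
    pub dequant_calls: usize, // counts real dequantisation work
}

impl QuantSource {
    pub fn new() -> Self {
        QuantSource {
            packed: HashMap::new(),
            resident: false, // off by default: the low-RAM profile
            cache: HashMap::new(),
            dequant_calls: 0,
        }
    }

    pub fn insert(&mut self, name: &str, packed: Vec<u8>, scale: f32) {
        self.packed.insert(name.to_string(), (packed, scale));
    }

    /// Turning residency off also releases anything already cached.
    pub fn set_resident(&mut self, on: bool) {
        self.resident = on;
        if !on {
            self.cache.clear();
        }
    }

    /// Assumed layout: low nibble first, value = code * scale.
    fn dequantise(packed: &[u8], scale: f32) -> Vec<f32> {
        let mut out = Vec::with_capacity(packed.len() * 2);
        for &byte in packed {
            out.push((byte & 0x0F) as f32 * scale);
            out.push((byte >> 4) as f32 * scale);
        }
        out
    }

    /// Returns the f32 tensor, dequantising only on a cache miss.
    pub fn tensor(&mut self, name: &str) -> Option<Arc<[f32]>> {
        if let Some(hit) = self.cache.get(name) {
            return Some(Arc::clone(hit)); // resident path: no dequant work
        }
        let (packed, scale) = self.packed.get(name)?;
        let data: Arc<[f32]> = Self::dequantise(packed, *scale).into();
        self.dequant_calls += 1;
        if self.resident {
            self.cache.insert(name.to_string(), Arc::clone(&data));
        }
        Some(data)
    }
}
```

With residency off, the returned Arc is the only owner, so the f32 data is freed as soon as the caller drops it; with residency on, the cache keeps a second owner alive and later calls return a cheap clone of the pointer.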

Inline images via a pure-Rust kitty graphics protocol. A new src/cli/term.rs implements the kitty graphics protocol — including a pure-Rust base64 encoder — to transmit a PNG straight into the terminal’s scrollback. kitty_supported() auto-detects a capable terminal via GHOSTTY_*, TERM, and TERM_PROGRAM. On Ghostty the rendered image shows up inline; on terminals without graphics support, the PNG falls back cleanly to a file (optionally opened in a viewer). No external image viewer, no ImageMagick, no C — the whole display path is Rust.
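For readers who have not seen the kitty graphics protocol, the following sketch shows the general idea of sending a PNG with no dependencies: base64-encode the bytes, split the result into chunks of at most 4096 characters, and wrap each chunk in an APC escape. It is an illustration of the protocol, not the contents of src/cli/term.rs; the function names are hypothetical.

```rust
const B64: &[u8; 64] = b"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/// Standard base64 (RFC 4648) with '=' padding.
pub fn base64_encode(input: &[u8]) -> String {
    let mut out = String::with_capacity((input.len() + 2) / 3 * 4);
    for chunk in input.chunks(3) {
        let b0 = chunk[0] as u32;
        let b1 = chunk.get(1).copied().unwrap_or(0) as u32;
        let b2 = chunk.get(2).copied().unwrap_or(0) as u32;
        let n = (b0 << 16) | (b1 << 8) | b2;
        out.push(B64[(n >> 18) as usize & 63] as char);
        out.push(B64[(n >> 12) as usize & 63] as char);
        if chunk.len() > 1 {
            out.push(B64[(n >> 6) as usize & 63] as char);
        } else {
            out.push('=');
        }
        if chunk.len() > 2 {
            out.push(B64[n as usize & 63] as char);
        } else {
            out.push('=');
        }
    }
    out
}

/// Builds the escape sequences that transmit and display a PNG.
/// a=T means "transmit and display", f=100 means "payload is PNG",
/// m=1 means "more chunks follow", m=0 marks the final chunk.
pub fn kitty_png_escapes(png: &[u8]) -> Vec<String> {
    let b64 = base64_encode(png);
    let chunks: Vec<&str> = b64
        .as_bytes()
        .chunks(4096) // a multiple of 4, as the protocol requires for non-final chunks
        .map(|c| std::str::from_utf8(c).expect("base64 output is ASCII"))
        .collect();
    let last = chunks.len().saturating_sub(1);
    chunks
        .iter()
        .enumerate()
        .map(|(i, chunk)| {
            let more = if i < last { 1 } else { 0 };
            if i == 0 {
                // Only the first chunk carries the full control data.
                format!("\x1b_Ga=T,f=100,m={more};{chunk}\x1b\\")
            } else {
                format!("\x1b_Gm={more};{chunk}\x1b\\")
            }
        })
        .collect()
}
```

Writing the returned strings to stdout in order is enough for a supporting terminal to decode the PNG and place it at the cursor.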

Byte-identical pixels from both code paths. The CHW→HWC, f32→u8 conversion at the end of the pipeline is now a single shared helper, decoded_chw_to_rgb8 (pub(crate) in oxibonsai-image/pipeline.rs), called by both text_to_image (the one-shot path) and ImageSession::render (the REPL path). Sharing the conversion guarantees the REPL produces byte-identical pixels to the one-shot command — the resident path is not a second, drifting implementation; it is the same path, kept warm.
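As a rough picture of what such a conversion involves, the sketch below turns a planar CHW f32 buffer into interleaved HWC bytes. It is not the body of decoded_chw_to_rgb8: the function name is hypothetical, and the value range (inputs in [0.0, 1.0], clamped, then rounded) is an assumption, since the release notes do not state how the decoded values are scaled.

```rust
/// Planar CHW f32 -> interleaved HWC u8 for a 3-channel image.
/// Assumption: values are nominally in [0.0, 1.0]; anything outside is clamped.
pub fn chw_to_rgb8_sketch(chw: &[f32], height: usize, width: usize) -> Vec<u8> {
    let plane = height * width;
    assert_eq!(chw.len(), 3 * plane, "expected exactly 3 channels");
    let mut out = vec![0u8; 3 * plane];
    for c in 0..3 {
        for i in 0..plane {
            // Source index walks one channel plane; destination interleaves R, G, B.
            let v = chw[c * plane + i].clamp(0.0, 1.0);
            out[i * 3 + c] = (v * 255.0).round() as u8;
        }
    }
    out
}
```

Rounding and clamping are exactly the kind of detail where two hand-written copies tend to diverge by one bit, which is why routing both call sites through one function is what makes the byte-identical guarantee hold.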

Documented GPU flags, and a CUDA parity probe. The .env.example now documents the three image-generation GPU switches: OXI_DIT_ATTN_GPU (DiT flash-attention, default-ON on Apple Silicon), OXI_VAE_GPU (VAE convolutions, default-ON on Apple Silicon), and OXI_TE_GPU (text-encoder GPU — default-OFF, because CPU SIMD wins on Apple Silicon, though it may help on Windows/NVIDIA CUDA). On the kernels side, oxibonsai-kernels gains an isolated cuda_tq2_gemv_parity.rs probe behind cfg(feature = "cuda") for validating TQ2 GEMV output on Blackwell-class GPUs. And oxionnx-proto ticks 0.1.3 → 0.1.4.
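The default-per-platform behaviour of the three switches can be sketched as below. The variable names are the documented ones; everything else is an assumption: the accepted spellings ("1", "true", "on" and their opposites), the helper names, and in particular the choice to default OXI_DIT_ATTN_GPU and OXI_VAE_GPU to off away from Apple Silicon, which the release notes do not specify.

```rust
/// Hypothetical parser: an unset or unrecognised value falls back to the default.
pub fn flag_enabled(raw: Option<&str>, default_on: bool) -> bool {
    let value = raw.map(|s| s.trim().to_ascii_lowercase());
    match value.as_deref() {
        Some("1" | "true" | "on") => true,
        Some("0" | "false" | "off") => false,
        _ => default_on,
    }
}

#[derive(Debug, PartialEq)]
pub struct GpuFlags {
    pub dit_attn: bool,
    pub vae: bool,
    pub text_encoder: bool,
}

/// `get` abstracts the environment lookup so the logic is testable.
pub fn gpu_flags(get: impl Fn(&str) -> Option<String>, apple_silicon: bool) -> GpuFlags {
    GpuFlags {
        // Documented as default-ON on Apple Silicon; off elsewhere is assumed here.
        dit_attn: flag_enabled(get("OXI_DIT_ATTN_GPU").as_deref(), apple_silicon),
        vae: flag_enabled(get("OXI_VAE_GPU").as_deref(), apple_silicon),
        // Documented as default-OFF.
        text_encoder: flag_enabled(get("OXI_TE_GPU").as_deref(), false),
    }
}
```

In a real program the lookup would be something like gpu_flags(|k| std::env::var(k).ok(), cfg!(all(target_os = "macos", target_arch = "aarch64"))).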

Getting Started

Install the CLI:

cargo install oxibonsai-cli

Start a resident image REPL (model paths resolve flag → env → default, exactly like oxibonsai image):

oxibonsai repl --seed 42 --steps 4 --width 512 --height 512
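The flag → env → default precedence mentioned above amounts to a first-non-empty lookup. A minimal sketch follows; the function name is hypothetical, the paths in any usage are placeholders, and treating an empty string as "unset" is an assumption rather than documented behaviour.

```rust
use std::path::PathBuf;

/// Picks the first available source: CLI flag, then environment value, then default.
pub fn resolve_model_path(flag: Option<&str>, env_value: Option<&str>, default: &str) -> PathBuf {
    flag.filter(|s| !s.is_empty())
        .or(env_value.filter(|s| !s.is_empty()))
        .map(PathBuf::from)
        .unwrap_or_else(|| PathBuf::from(default))
}
```

Because the REPL and the one-shot command share the same resolution order, a setup that already works for oxibonsai image should need no extra flags here.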

Then iterate. A bare line is a prompt; :-prefixed lines are commands:

oxibonsai> :fast
oxibonsai> a tiny bonsai tree in a ceramic pot
oxibonsai> :seed 7
oxibonsai> a tiny bonsai tree in a ceramic pot
oxibonsai> :hq
oxibonsai> a tiny bonsai tree in a ceramic pot

:fast drops to a snappy 2-step 384×384 preview; once a prompt and seed look right, :hq finalizes at 8 steps and 512×512. On Ghostty each render appears inline; elsewhere it lands in a PNG. The one-shot oxibonsai image --prompt "…" --seed 42 --out bonsai.png is still there for scripts and CI — but the REPL is where 0.2.2 wants you to live while you’re composing.
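The prompt-versus-command split is simple enough to sketch. The code below is an illustrative parser, not the REPL's actual one: the Input and Settings types are hypothetical, and :size, :out, and :open are left out because their argument syntax is not shown in this post. The :fast and :hq presets use the step counts and sizes quoted above.

```rust
#[derive(Debug, PartialEq)]
pub enum Input {
    Prompt(String),
    Steps(u32),
    Seed(u64),
    Fast,
    Hq,
    Help,
    Quit,
    Empty,
    Unknown(String),
}

/// A bare line is a prompt; a line starting with ':' is a command.
pub fn parse_line(line: &str) -> Input {
    let line = line.trim();
    if line.is_empty() {
        return Input::Empty;
    }
    let Some(rest) = line.strip_prefix(':') else {
        return Input::Prompt(line.to_string());
    };
    let mut parts = rest.split_whitespace();
    let cmd = parts.next().unwrap_or("");
    let arg = parts.next();
    match (cmd, arg) {
        ("fast", None) => Input::Fast,
        ("hq", None) => Input::Hq,
        ("help", None) => Input::Help,
        ("quit", None) => Input::Quit,
        ("steps", Some(n)) => match n.parse() {
            Ok(v) => Input::Steps(v),
            Err(_) => Input::Unknown(line.to_string()),
        },
        ("seed", Some(n)) => match n.parse() {
            Ok(v) => Input::Seed(v),
            Err(_) => Input::Unknown(line.to_string()),
        },
        _ => Input::Unknown(line.to_string()),
    }
}

#[derive(Debug, Clone, PartialEq)]
pub struct Settings {
    pub steps: u32,
    pub seed: u64,
    pub width: u32,
    pub height: u32,
}

impl Settings {
    /// Applies a command to the settings; returns the prompt if the input was one.
    pub fn apply(&mut self, input: Input) -> Option<String> {
        match input {
            Input::Prompt(p) => return Some(p),
            Input::Steps(n) => self.steps = n,
            Input::Seed(s) => self.seed = s,
            Input::Fast => {
                self.steps = 2;
                self.width = 384;
                self.height = 384;
            }
            Input::Hq => {
                self.steps = 8;
                self.width = 512;
                self.height = 512;
            }
            Input::Help | Input::Quit | Input::Empty | Input::Unknown(_) => {}
        }
        None
    }
}
```

Keeping settings separate from prompts is what lets the same prompt line be re-entered unchanged while :seed, :fast, and :hq vary around it.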

What’s New in 0.2.2

oxibonsai repl — resident interactive image REPL. ImageSession loads the DiT, VAE, and text encoder once and renders many prompts without re-paying the load/dequant cost. StageTimings and RenderOutcome surface per-stage wall-clock splits. Runtime commands: :steps, :seed, :size, :fast (2-step 384×384 preview), :hq (8-step 512×512), :out, :open, :help, :quit.
TeWeights::set_resident(on: bool). Controls whether the Mlx4bit source caches dequantised f32 tensors (~16 GB) across forwards. Off by default to preserve the one-shot low-RAM profile; on for the REPL.
Kitty graphics protocol support (src/cli/term.rs). Pure-Rust base64 encoder plus inline PNG display for Ghostty; kitty_supported() auto-detects via GHOSTTY_* / TERM / TERM_PROGRAM.
GPU acceleration flags documented in .env.example. OXI_DIT_ATTN_GPU (flash-attention, default-ON on Apple Silicon), OXI_VAE_GPU (convolutions, default-ON on Apple Silicon), OXI_TE_GPU (text-encoder GPU, default-OFF — CPU SIMD wins on Apple Silicon; may help on Windows/NVIDIA CUDA).
CUDA TQ2 GEMV parity test (oxibonsai-kernels). Isolated cuda_tq2_gemv_parity.rs probe for Blackwell GPU output validation, behind cfg(feature = "cuda").
Shared decoded_chw_to_rgb8 helper. The CHW→HWC f32→u8 conversion is now shared by both text_to_image and ImageSession::render, guaranteeing byte-identical pixels from both paths.
oxionnx-proto bumped 0.1.3 → 0.1.4.

Tips

Keep the encoder resident for fast iteration. On a high-RAM machine, the REPL’s resident text encoder (TeWeights::set_resident, ~16 GB cached) means each prompt after the first skips re-dequantisation. That is exactly the trade you want when you’re cycling through prompts and seeds — RAM in exchange for not re-paying warm-up.
Work the :fast → :hq loop. Compose with :fast (2 steps, 384×384) to find the prompt and seed cheaply, then :hq (8 steps, 512×512) once to finalize. You spend your steps where they matter and skip slow renders of the wrong image.
Change :seed to explore, then lock it. :seed N re-rolls the same prompt; when one lands, keep the seed and switch to :hq. Pairs naturally with 0.2.0’s byte-exact --seed, so the finalized image is reproducible later.
Flip the GPU flags to match your platform. On Apple Silicon, OXI_DIT_ATTN_GPU and OXI_VAE_GPU are already on and OXI_TE_GPU is off (CPU SIMD wins there). On Windows/NVIDIA CUDA, try turning OXI_TE_GPU on — that’s the one switch most likely to help off the Mac.
Read StageTimings when tuning. Each RenderOutcome reports the per-stage split, so you can see whether a :steps change hit sampling or a :size change hit VAE decode — and tune the stage that’s actually costing you.
Run in Ghostty for the inline workflow. On a kitty-graphics terminal the image appears in place via the pure-Rust protocol: no external viewer, no leaving the shell. Elsewhere it falls back to a file; enter the command :open on to have it opened in a viewer.
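The inline-or-file decision in the last tip depends on detecting a capable terminal from the environment. The sketch below shows one plausible rule set over GHOSTTY_*, TERM, and TERM_PROGRAM; it is a guess at the shape of such a check, not the logic of kitty_supported(), and the specific values matched ("kitty" or "ghostty" inside TERM, "ghostty" as TERM_PROGRAM) are assumptions.

```rust
/// Hypothetical detection over an iterator of (name, value) environment pairs.
pub fn kitty_supported_sketch<I, K, V>(vars: I) -> bool
where
    I: IntoIterator<Item = (K, V)>,
    K: AsRef<str>,
    V: AsRef<str>,
{
    vars.into_iter().any(|(k, v)| {
        let (k, v) = (k.as_ref(), v.as_ref());
        // Any GHOSTTY_-prefixed variable is taken as a sign of Ghostty.
        k.starts_with("GHOSTTY_")
            || (k == "TERM" && (v.contains("kitty") || v.contains("ghostty")))
            || (k == "TERM_PROGRAM" && v.eq_ignore_ascii_case("ghostty"))
    })
}
```

Passing std::env::vars() checks the real environment; when the function returns false, the caller writes the PNG to a file instead of emitting escape sequences the terminal would not understand.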

This is the foundation

OxiBonsai is the inference end of the COOLJAPAN ecosystem — sub-2-bit Bonsai models from PrismML, served and rendered on top of SciRS2, OxiBLAS, OxiFFT, OxiARC, and OxiONNX, with no FFI and no C/C++/Fortran runtime anywhere underneath. 0.2.2 extends that all the way to the terminal: the model is resident, the iteration loop is interactive, and even the pixels reach your screen through Pure Rust.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you believe generating an image should be an interactive loop in your own terminal — fast, reproducible, and sovereign, without a line of C.

Pure Rust sub-2-bit image generation that loads once and draws inline is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ June 8, 2026