
OxiBonsai 0.1.1 Released — Sub-2-Bit Inference Goes GPU, and the Ternary Line Lands

Five days after its 1-bit debut, OxiBonsai grows GPUs: a native CUDA NVRTC backend (~21.9 tok/s on Ternary-Bonsai-1.7B, RTX 3060) and a fused Metal full-forward path (~50 tok/s, ~13x speedup) — plus the new ternary TQ2_0_g128 quant family, with NEON/AVX2/AVX-512 GEMV so it flies on CPU too. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem, still with no llama.cpp, no BLAS, no C/Fortran.


The bonsai just sprouted a GPU branch — and a second quant family to grow on.

Today we released OxiBonsai 0.1.1 — GPU acceleration arrives via a native CUDA NVRTC backend and a fused Metal full-forward path, and the new ternary TQ2_0_g128 quant line lands alongside the original 1-bit models.

If you missed the debut: OxiBonsai (オキシ盆栽) is the Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai family — quantizing 8B-class Qwen3 weights down past two bits and running them with no external runtime. The 0.1.0 release proved the 1-bit Q1_0_g128 line works end to end on CPU. This one makes it fast.

No llama.cpp. No BLAS. No C/C++/Fortran runtime.
And now that reach extends all the way to the accelerator — native CUDA via NVRTC, native Metal on Apple Silicon — still with no cuBLAS, no precompiled .cu blobs, no C library underneath any of it.
Everything compiles to a single static binary.

Why OxiBonsai 0.1.1 matters

The debut answered a question: can you really run a sub-2-bit LLM with nothing but safe Rust under it? Yes — on CPU, end to end, no FFI.

This release answers the next two: can it be fast, and can you trade a little memory for a little more quality? Also yes.

GPU acceleration is the headline. The fused Metal path lifts Ternary-Bonsai-1.7B from a NEON CPU baseline of roughly 7–8 tok/s to around 50 tok/s (best observed ~57) — a ~13x speedup — by collapsing an entire token’s worth of work into a single GPU command buffer. On NVIDIA, the native CUDA NVRTC backend runs the same model at ~21.9 tok/s on an RTX 3060 (CUDA 12.8), compiling its own kernels at runtime with no cuBLAS in sight.

The second answer is the ternary line. TQ2_0_g128 packs weights into {-1, 0, +1} at ~1.585 bits each (log2 3 ≈ 1.585, the information content of one three-valued weight), which sits between the 1-bit line and a full two bits. That buys a few extra benchmark points over the 1-bit models for a modest memory cost. Same architecture, same runtime, same tokenizer, same server. Just point --model at a different file.
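To make the ternary idea concrete, here is a minimal sketch of group-wise ternary quantization and the dot product it enables. It is not OxiBonsai's actual code or on-disk layout: the group size of 128 is an assumption read off the "g128" suffix, the absmean rounding rule is a common choice rather than a confirmed one, the weights are left unpacked as i8 for readability, and the names TernaryGroup, quantize_group, and dot_group are hypothetical.

```rust
/// Assumed group size, inferred from the "g128" suffix (not confirmed by the post).
const GROUP: usize = 128;

/// One quantized group: 128 ternary weights sharing a single f32 scale.
/// The weights are kept unpacked as i8 here; a real format would pack them.
struct TernaryGroup {
    scale: f32,
    w: [i8; GROUP], // every entry is -1, 0, or +1
}

/// Quantize one group with an absmean rule (an illustrative assumption):
/// scale = mean(|x|), then round x / scale and clamp into {-1, 0, +1}.
fn quantize_group(x: &[f32; GROUP]) -> TernaryGroup {
    let mean_abs = x.iter().map(|v| v.abs()).sum::<f32>() / GROUP as f32;
    let scale = if mean_abs > 0.0 { mean_abs } else { 1.0 };
    let mut w = [0i8; GROUP];
    for (q, v) in w.iter_mut().zip(x.iter()) {
        *q = (*v / scale).round().clamp(-1.0, 1.0) as i8;
    }
    TernaryGroup { scale, w }
}

/// Dot product of one group with an activation slice. The inner loop only
/// adds or subtracts; the single multiply by the scale happens at the end.
fn dot_group(g: &TernaryGroup, act: &[f32; GROUP]) -> f32 {
    let mut acc = 0.0f32;
    for (q, a) in g.w.iter().zip(act.iter()) {
        match *q {
            1 => acc += *a,
            -1 => acc -= *a,
            _ => {} // zero weights contribute nothing
        }
    }
    acc * g.scale
}
```

The point of the sketch is the inner loop: with weights restricted to three values, a GEMV row reduces to additions and subtractions plus one scale multiply per group, which is what makes the format friendly to SIMD and GPU kernels alike.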

Technical Deep Dive

Two quant families now ship side by side: the 1-bit Q1_0_g128 line and the ternary TQ2_0_g128 line. Both sit on the same Qwen3 backbone (GQA, SwiGLU, RoPE, RMSNorm), which is why the runtime, tokenizer, and server stay identical across them.

Native CUDA via NVRTC. The CUDA backend uses NVRTC — NVIDIA’s runtime compiler — to build its kernels on the fly. There are no precompiled .cu blobs shipped in the crate, and there is no cuBLAS dependency: the fused full-forward path for both Q1 and TQ2 is hand-written and compiled at launch. On an RTX 3060 with CUDA 12.8, Ternary-Bonsai-1.7B runs at ~21.9 tok/s.

Fused Metal full-forward. The Metal path is where the ~13x speedup lives. Instead of submitting one GPU command per GEMV — the death-by-a-thousand-submissions pattern — 0.1.1 encodes an entire token’s forward pass into a single command buffer: roughly 14 dispatches per layer × N layers, all in one submission. Per layer that chain is RMSNorm → fused QKV GEMV → QK-norm + RoPE → KV-store → batched attention → attention-output GEMV + residual → FFN RMSNorm → gate + up GEMV → SwiGLU → down GEMV + residual. TQ2_0_g128 on Metal uses per-kernel dispatch plus a blocks_as_bytes_ternary zero-copy upload, so the packed weights move to the GPU without a re-layout pass. Result: ~50 tok/s on Ternary-Bonsai-1.7B (best ~57).
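The difference between per-operation submission and a single fused command buffer can be illustrated with a small CPU-only mock. This is a sketch of the submission pattern only, not the Metal API and not OxiBonsai's code: MockCommandBuffer, MockQueue, Stage, and both submit functions are hypothetical, nothing is executed on a GPU, and each named stage is modelled as exactly one dispatch (the real path reports roughly 14 dispatches per layer, so some stages evidently expand to more than one).

```rust
/// The per-layer stages, in the order the post lists them.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    RmsNorm,
    FusedQkvGemv,
    QkNormRope,
    KvStore,
    BatchedAttention,
    AttnOutGemvResidual,
    FfnRmsNorm,
    GateUpGemv,
    SwiGlu,
    DownGemvResidual,
}

const LAYER_CHAIN: [Stage; 10] = [
    Stage::RmsNorm,
    Stage::FusedQkvGemv,
    Stage::QkNormRope,
    Stage::KvStore,
    Stage::BatchedAttention,
    Stage::AttnOutGemvResidual,
    Stage::FfnRmsNorm,
    Stage::GateUpGemv,
    Stage::SwiGlu,
    Stage::DownGemvResidual,
];

/// Stand-in for a GPU command buffer: it records dispatches, runs nothing.
#[derive(Default)]
struct MockCommandBuffer {
    dispatches: Vec<(usize, Stage)>, // (layer index, stage)
}

/// Stand-in for a GPU queue that counts how often work is submitted.
#[derive(Default)]
struct MockQueue {
    submissions: usize,
    dispatches: usize,
}

impl MockQueue {
    fn commit(&mut self, cb: MockCommandBuffer) {
        self.submissions += 1;
        self.dispatches += cb.dispatches.len();
    }
}

/// One command buffer per operation: many small submissions per token.
fn submit_per_op(queue: &mut MockQueue, n_layers: usize) {
    for layer in 0..n_layers {
        for stage in LAYER_CHAIN {
            let mut cb = MockCommandBuffer::default();
            cb.dispatches.push((layer, stage));
            queue.commit(cb);
        }
    }
}

/// Fused pattern: encode every layer's chain, then submit exactly once.
fn submit_fused(queue: &mut MockQueue, n_layers: usize) {
    let mut cb = MockCommandBuffer::default();
    for layer in 0..n_layers {
        for stage in LAYER_CHAIN {
            cb.dispatches.push((layer, stage));
        }
    }
    queue.commit(cb);
}
```

Both functions enqueue the same number of dispatches; only the number of submissions differs. In the mock that is just a counter, but on real hardware each submission carries CPU-to-GPU overhead, which is the cost the fused path removes.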

Ternary on the CPU, too. GPUs are optional. The ternary line ships with three SIMD tiers for its TQ2 GEMV — NEON, AVX2, and AVX-512 — so it is quick even with no accelerator present. On Apple Silicon the NEON baseline sits around 7–8 tok/s. On x86-64, the right tier is chosen at runtime via is_x86_feature_detected! (AVX-512 → AVX2 → scalar fallback), so a single binary stays safe everywhere it lands.
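The runtime tier selection described above might look roughly like the following. This is a sketch under assumptions, not the crate's actual dispatch code: the SimdTier enum and detect_tier function are hypothetical names, and checking only the "avx512f" subset is a guess, since the post does not say which AVX-512 extensions the kernel needs. The detection macros themselves are standard library items.

```rust
/// Which GEMV kernel tier to use. Names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SimdTier {
    Avx512,
    Avx2,
    Neon,
    Scalar,
}

/// Pick the widest tier the running CPU supports, falling back to scalar.
fn detect_tier() -> SimdTier {
    // On x86-64 the check happens at runtime, so one binary can carry
    // all tiers and still run on CPUs without AVX-512 or AVX2.
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return SimdTier::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return SimdTier::Avx2;
        }
    }
    // On AArch64 (including Apple Silicon) NEON is the SIMD tier.
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return SimdTier::Neon;
        }
    }
    SimdTier::Scalar
}
```

A caller would typically run detect_tier once at startup, cache the result, and branch to the matching kernel; the SIMD kernels themselves would be compiled with #[target_feature] so they are only ever entered after the check has passed.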

Getting Started

Install the CLI, fetch and convert a ternary model, and run it:

# Install — provides the `oxibonsai` binary
cargo install oxibonsai-cli

# Fetch + convert a ternary model (HF safetensors → GGUF).
# Also downloads the tokenizer. Prerequisite: pip install huggingface_hub
./scripts/download_ternary.sh 1.7b

# Equivalent manual conversion, if you already have the unpacked safetensors:
# oxibonsai convert --quant tq2_0_g128 \
#   --from <unpacked-safetensors-dir> \
#   --to models/Ternary-Bonsai-1.7B.gguf

# Run it
oxibonsai run \
  --model models/Ternary-Bonsai-1.7B.gguf \
  --prompt "Explain ternary quantization in one sentence." \
  --max-tokens 256 \
  --temperature 0.7 \
  --top-p 0.9

On Apple Silicon, Metal turns on automatically — no flag needed; the fused TQ2 full-forward path is used as soon as you run on a Mac. For the 1-bit line, grab the tokenizer separately with oxibonsai tokenizer download, which fetches models/tokenizer.json.

What’s New in 0.1.1

Added

Changed

Fixed

Tips

This is the foundation

OxiBonsai sits on the COOLJAPAN ecosystem all the way down.

And it serves PrismML’s Bonsai model family — the 1-bit Q1_0_g128 line and the new ternary TQ2_0_g128 line. Every default-feature dependency is Pure Rust: zero C/C++/Fortran, zero FFI. The GPU backends (metal, native-cuda) are opt-in features — present when you want them, absent (and silent) when you don’t.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you want sub-2-bit LLM inference that runs on your laptop’s CPU, your Mac’s GPU, and your NVIDIA card — with nothing but safe Rust underneath.

Pure Rust sovereign sub-2-bit inference is here — and now it’s fast: CPU, Metal, and CUDA, one binary, no C in sight.

KitaSan at COOLJAPAN OÜ, April 18, 2026
