The bonsai just sprouted a GPU branch — and a second quant family to grow on.
Today we released OxiBonsai 0.1.1 — GPU acceleration arrives via a native CUDA NVRTC backend and a fused Metal full-forward path, and the new ternary TQ2_0_g128 quant line lands alongside the original 1-bit models.
If you missed the debut: OxiBonsai (オキシ盆栽) is the Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai family — quantizing 8B-class Qwen3 weights down past two bits and running them with no external runtime. The 0.1.0 release proved the 1-bit Q1_0_g128 line works end to end on CPU. This one makes it fast.
No llama.cpp. No BLAS. No C/C++/Fortran runtime.
And now that reach extends all the way to the accelerator — native CUDA via NVRTC, native Metal on Apple Silicon — still with no cuBLAS, no precompiled .cu blobs, no C library underneath any of it.
Everything compiles to a single static binary.
Why OxiBonsai 0.1.1 matters
The debut answered a question: can you really run a sub-2-bit LLM with nothing but safe Rust under it? Yes — on CPU, end to end, no FFI.
This release answers the next two: can it be fast, and can you trade a little memory for a little more quality? Also yes.
GPU acceleration is the headline. The fused Metal path lifts Ternary-Bonsai-1.7B from a NEON CPU baseline of roughly 7–8 tok/s to around 50 tok/s (best observed ~57) — a ~13x speedup — by collapsing an entire token’s worth of work into a single GPU command buffer. On NVIDIA, the native CUDA NVRTC backend runs the same model at ~21.9 tok/s on an RTX 3060 (CUDA 12.8), compiling its own kernels at runtime with no cuBLAS in sight.
The second answer is the ternary line. TQ2_0_g128 packs weights into {-1, 0, +1} at ~1.585 bits each — a hair under two bits, a hair over one — buying a few extra benchmark points over the 1-bit models for a modest memory cost. Same architecture, same runtime, same tokenizer, same server. Just point --model at a different file.
Technical Deep Dive
Two quant families now ship, side by side, both over the same Qwen3 backbone (GQA, SwiGLU, RoPE, RMSNorm) — which is why the runtime, tokenizer, and server stay identical across them:
- 1-bit Q1_0_g128: 1.0 bit per weight, 128-weight blocks, a single FP16 group scale per block. This is the original Bonsai-8B line: the smallest possible footprint, ~1.15 GB at 8B scale.
- Ternary TQ2_0_g128: ~1.585 bits of information per weight (log2 3), stored as 2-bit codes. 128 weights pack into 34 bytes (32 bytes of codes plus a 2-byte FP16 scale), encoding {-1, 0, +1}. The packing maps 0b00 → −1, 0b01 → 0, 0b10 → +1, and 0b11 → 0 as well. This is the Ternary-Bonsai-8B / 4B / 1.7B line. Ternary costs roughly +600 MB at 8B scale versus 1-bit, and returns a few benchmark points for it.
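To make the ternary packing concrete, here is a minimal Rust sketch of unpacking one 34-byte block. The code-to-value mapping and the block size come from the description above; the byte layout is an assumption made for illustration (codes first, four per byte, lowest bits first, with the FP16 scale in the last two bytes, little-endian), and the names `unpack_block` and `decode_code` are hypothetical rather than OxiBonsai's actual API.

```rust
/// Weights per block and packed block size, as stated in the post.
const BLOCK_WEIGHTS: usize = 128;
const BLOCK_BYTES: usize = 34;
/// 128 two-bit codes occupy 32 bytes; the remaining 2 bytes hold the scale.
const CODE_BYTES: usize = BLOCK_WEIGHTS / 4;

/// Map a 2-bit code to its ternary value (mapping taken from the post).
fn decode_code(code: u8) -> i8 {
    match code & 0b11 {
        0b00 => -1,
        0b10 => 1,
        _ => 0, // 0b01 and 0b11 both decode to 0
    }
}

/// Unpack one block into 128 ternary values plus the raw FP16 scale bits.
/// ASSUMPTION: codes come first, four per byte, lowest bits first, and the
/// last two bytes are the little-endian FP16 scale. The real layout may differ.
fn unpack_block(block: &[u8; BLOCK_BYTES]) -> ([i8; BLOCK_WEIGHTS], u16) {
    let mut out = [0i8; BLOCK_WEIGHTS];
    for (i, slot) in out.iter_mut().enumerate() {
        let byte = block[i / 4];
        let shift = (i % 4) * 2;
        *slot = decode_code(byte >> shift);
    }
    let scale_bits = u16::from_le_bytes([block[CODE_BYTES], block[CODE_BYTES + 1]]);
    (out, scale_bits)
}

/// Stored bits per weight for this block size, scale included.
fn stored_bits_per_weight() -> f64 {
    (BLOCK_BYTES * 8) as f64 / BLOCK_WEIGHTS as f64
}
```

With these numbers, `stored_bits_per_weight()` works out to 2.125: the ~1.585 figure is the information content of a three-valued weight, while the block as described spends two bits per code plus the shared scale.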
Native CUDA via NVRTC. The CUDA backend uses NVRTC — NVIDIA’s runtime compiler — to build its kernels on the fly. There are no precompiled .cu blobs shipped in the crate, and there is no cuBLAS dependency: the fused full-forward path for both Q1 and TQ2 is hand-written and compiled at launch. On an RTX 3060 with CUDA 12.8, Ternary-Bonsai-1.7B runs at ~21.9 tok/s.
Fused Metal full-forward. The Metal path is where the ~13x speedup lives. Instead of submitting one GPU command per GEMV — the death-by-a-thousand-submissions pattern — 0.1.1 encodes an entire token’s forward pass into a single command buffer: roughly 14 dispatches per layer × N layers, all in one submission. Per layer that chain is RMSNorm → fused QKV GEMV → QK-norm + RoPE → KV-store → batched attention → attention-output GEMV + residual → FFN RMSNorm → gate + up GEMV → SwiGLU → down GEMV + residual. TQ2_0_g128 on Metal uses per-kernel dispatch plus a blocks_as_bytes_ternary zero-copy upload, so the packed weights move to the GPU without a re-layout pass. Result: ~50 tok/s on Ternary-Bonsai-1.7B (best ~57).
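The difference between per-dispatch submission and a single command buffer can be modeled without touching a GPU. The sketch below is a toy model, not the Metal backend: `CommandBuffer` and `Queue` are hypothetical stand-ins that only count work, the stage list reuses the ten named steps of the per-layer chain (the real path issues roughly 14 dispatches per layer), and the layer count passed in is a placeholder.

```rust
/// One step of the per-layer chain (stage names follow the post).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    RmsNorm,
    FusedQkvGemv,
    QkNormRope,
    KvStore,
    BatchedAttention,
    AttnOutGemvResidual,
    FfnRmsNorm,
    GateUpGemv,
    SwiGlu,
    DownGemvResidual,
}

const LAYER_CHAIN: [Stage; 10] = [
    Stage::RmsNorm,
    Stage::FusedQkvGemv,
    Stage::QkNormRope,
    Stage::KvStore,
    Stage::BatchedAttention,
    Stage::AttnOutGemvResidual,
    Stage::FfnRmsNorm,
    Stage::GateUpGemv,
    Stage::SwiGlu,
    Stage::DownGemvResidual,
];

/// Hypothetical stand-in for a GPU command buffer: it only records work.
#[derive(Default)]
struct CommandBuffer {
    dispatches: Vec<(usize, Stage)>,
}

/// Hypothetical stand-in for a GPU queue: it counts submissions and dispatches.
#[derive(Default)]
struct Queue {
    submissions: usize,
    dispatched: usize,
}

impl Queue {
    fn submit(&mut self, cb: CommandBuffer) {
        self.submissions += 1;
        self.dispatched += cb.dispatches.len();
    }
}

/// One submission per dispatch: the pattern the fused path avoids.
fn forward_per_dispatch(queue: &mut Queue, layers: usize) {
    for layer in 0..layers {
        for &stage in &LAYER_CHAIN {
            let mut cb = CommandBuffer::default();
            cb.dispatches.push((layer, stage));
            queue.submit(cb);
        }
    }
}

/// Encode every layer's chain into one buffer, then submit once per token.
fn forward_fused(queue: &mut Queue, layers: usize) {
    let mut cb = CommandBuffer::default();
    for layer in 0..layers {
        for &stage in &LAYER_CHAIN {
            cb.dispatches.push((layer, stage));
        }
    }
    queue.submit(cb);
}
```

Both functions issue the same dispatches; only the number of submissions changes, and with it the per-submission overhead paid on every token.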
Ternary on the CPU, too. GPUs are optional. The ternary line ships with three SIMD tiers for its TQ2 GEMV — NEON, AVX2, and AVX-512 — so it is quick even with no accelerator present. On Apple Silicon the NEON baseline sits around 7–8 tok/s. On x86-64, the right tier is chosen at runtime via is_x86_feature_detected! (AVX-512 → AVX2 → scalar fallback), so a single binary stays safe everywhere it lands.
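Runtime tier selection of this kind can be sketched with the standard library's feature-detection macros. This is an illustration under assumptions, not OxiBonsai's dispatcher: the `SimdTier` enum, the function names, and the choice to probe `avx512f` and `avx2` specifically are hypothetical, and the scalar routine takes already-unpacked ternary weights to keep the example short.

```rust
/// Hypothetical tier enum; the real crate's types may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SimdTier {
    Avx512,
    Avx2,
    Neon,
    Scalar,
}

/// Pick the widest tier the running CPU supports, checked at runtime.
/// ASSUMPTION: "avx512f" and "avx2" are the features being probed.
fn detect_tier() -> SimdTier {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512f") {
            return SimdTier::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return SimdTier::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return SimdTier::Neon;
        }
    }
    SimdTier::Scalar
}

/// Scalar fallback for one ternary dot product: weights in {-1, 0, +1}
/// turn every multiply into an add, a subtract, or a skip.
fn ternary_dot_scalar(weights: &[i8], x: &[f32], scale: f32) -> f32 {
    let mut acc = 0.0f32;
    for (&w, &xi) in weights.iter().zip(x) {
        match w {
            1 => acc += xi,
            -1 => acc -= xi,
            _ => {}
        }
    }
    acc * scale
}
```

A dispatcher would call `detect_tier()` once at startup and route each GEMV to the matching kernel, falling back to something like `ternary_dot_scalar` when no vector tier is available.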
Getting Started
Install the CLI, fetch and convert a ternary model, and run it:
# Install: provides the `oxibonsai` binary
cargo install oxibonsai-cli

# Fetch + convert a ternary model (HF safetensors → GGUF).
# Also downloads the tokenizer. Prerequisite: pip install huggingface_hub
./scripts/download_ternary.sh 1.7b

# Equivalent manual conversion, if you already have the unpacked safetensors:
# oxibonsai convert --quant tq2_0_g128 \
#   --from <unpacked-safetensors-dir> \
#   --to models/Ternary-Bonsai-1.7B.gguf

# Run it
oxibonsai run \
  --model models/Ternary-Bonsai-1.7B.gguf \
  --prompt "Explain ternary quantization in one sentence." \
  --max-tokens 256 \
  --temperature 0.7 \
  --top-p 0.9
On Apple Silicon, Metal turns on automatically — no flag needed; the fused TQ2 full-forward path is used as soon as you run on a Mac. For the 1-bit line, grab the tokenizer separately with oxibonsai tokenizer download, which fetches models/tokenizer.json.
What’s New in 0.1.1
Added
- Native CUDA NVRTC backend with a fused Q1 + TQ2 full-forward path: ~21.9 tok/s on Ternary-Bonsai-1.7B (RTX 3060, CUDA 12.8).
- Fused Metal TQ2 full-forward: a single GPU command buffer per token, ~50 tok/s on Ternary-Bonsai-1.7B (a ~13x speedup).
- Ternary CPU SIMD tiers: NEON / AVX2 / AVX-512 TQ2 GEMV.
- TQ2_0_g128 support in the Metal backend (per-kernel dispatch + blocks_as_bytes_ternary zero-copy upload).
- scripts/bench_ternary.sh: CPU vs Metal throughput benchmark (3-run average + best).
- scripts/download_ternary.sh: fetch and convert safetensors → GGUF.
Changed
- Version bumped to 0.1.1.
- Internal dependency versions aligned across the workspace.
- CUDA full-forward layer parameter handling refactored for cleaner weight management.
- Workspace Cargo.toml files unified on workspace dependencies for better crate compatibility.
Fixed
- Workspace version consistency across all subcrates.
- blocks_as_bytes import gating for broader feature-flag compatibility.
Tips
- Measure both paths. Run ./scripts/bench_ternary.sh to compare CPU and Metal throughput head to head; it reports a 3-run average plus the best run, so you see steady-state and peak.
- On a Mac, do nothing special. Metal is auto-detected on Apple Silicon. There is no flag to flip: just run on the Mac and the fused TQ2 full-forward path engages.
- On NVIDIA, use the NVRTC backend. Build and run with the native-cuda feature; it runtime-compiles its own kernels, so there is no cuBLAS and no precompiled blob to manage.
- Ternary is fast on CPU, too. The new NEON / AVX2 / AVX-512 TQ2 GEMV tiers mean you do not need a GPU to get usable speed. The correct tier is selected at runtime, so one binary is safe on every x86-64 machine.
- Choose your quant family by budget. Reach for 1-bit Bonsai-8B (~1.15 GB) when footprint is everything; reach for ternary TQ2_0_g128 when you want a few extra benchmark points. The CLI is identical; only the --model path changes.
- For reproducible output, go greedy. Run with --temperature 0. CPU and Metal then produce byte-identical text.
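For readers wondering why greedy decoding is reproducible, the sketch below shows the usual meaning of a zero temperature: skip random sampling and take the highest-scoring token. It assumes that convention and is not taken from OxiBonsai's sampler; the `argmax` helper and its tie-breaking rule are illustrative.

```rust
/// Greedy decoding step: return the index of the largest logit.
/// Ties resolve to the lowest index and NaN entries are skipped, so the
/// same logits always yield the same token.
fn argmax(logits: &[f32]) -> Option<usize> {
    let mut best: Option<(usize, f32)> = None;
    for (i, &v) in logits.iter().enumerate() {
        if v.is_nan() {
            continue;
        }
        match best {
            // Keep the current best unless the new value is strictly larger.
            Some((_, b)) if v <= b => {}
            _ => best = Some((i, v)),
        }
    }
    best.map(|(i, _)| i)
}
```

Because no random number generator is involved, the output depends only on the logits, which is what makes a cross-backend comparison of generated text meaningful.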
This is the foundation
OxiBonsai sits on the COOLJAPAN ecosystem all the way down:
- SciRS2 — tensor primitives.
- OxiBLAS — GEMM/GEMV plus the 1-bit and ternary compute kernels.
- OxiFFT — optional RoPE acceleration.
- NumRS2 — the N-dimensional array backend.
And it serves PrismML’s Bonsai model family — the 1-bit Q1_0_g128 line and the new ternary TQ2_0_g128 line. Every default-feature dependency is Pure Rust: zero C/C++/Fortran, zero FFI. The GPU backends (metal, native-cuda) are opt-in features — present when you want them, absent (and silent) when you don’t.
Repository: https://github.com/cool-japan/oxibonsai
Star the repo if you want sub-2-bit LLM inference that runs on your laptop’s CPU, your Mac’s GPU, and your NVIDIA card — with nothing but safe Rust underneath.
Pure Rust sovereign sub-2-bit inference is here — and now it’s fast: CPU, Metal, and CUDA, one binary, no C in sight.
— KitaSan at COOLJAPAN OÜ April 18, 2026