OxiBonsai 0.1.1 Released — Sub-2-Bit Inference Goes GPU, and the Ternary Line Lands

The bonsai just sprouted a GPU branch — and a second quant family to grow on.

Today we released OxiBonsai 0.1.1 — GPU acceleration arrives via a native CUDA NVRTC backend and a fused Metal full-forward path, and the new ternary TQ2_0_g128 quant line lands alongside the original 1-bit models.

If you missed the debut: OxiBonsai (オキシ盆栽) is the Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai family — quantizing 8B-class Qwen3 weights down past two bits and running them with no external runtime. The 0.1.0 release proved the 1-bit Q1_0_g128 line works end to end on CPU. This one makes it fast.

No llama.cpp. No BLAS. No C/C++/Fortran runtime.
And now that reach extends all the way to the accelerator — native CUDA via NVRTC, native Metal on Apple Silicon — still with no cuBLAS, no precompiled .cu blobs, no C library underneath any of it.
Everything compiles to a single static binary.

Why OxiBonsai 0.1.1 matters

The debut answered a question: can you really run a sub-2-bit LLM with nothing but safe Rust under it? Yes — on CPU, end to end, no FFI.

This release answers the next two: can it be fast, and can you trade a little memory for a little more quality? Also yes.

GPU acceleration is the headline. The fused Metal path lifts Ternary-Bonsai-1.7B from a NEON CPU baseline of roughly 7–8 tok/s to around 50 tok/s (best observed ~57) — a ~13x speedup — by collapsing an entire token’s worth of work into a single GPU command buffer. On NVIDIA, the native CUDA NVRTC backend runs the same model at ~21.9 tok/s on an RTX 3060 (CUDA 12.8), compiling its own kernels at runtime with no cuBLAS in sight.

The second answer is the ternary line. TQ2_0_g128 packs weights into {-1, 0, +1} at ~1.585 bits each (log2(3), the information content of a three-valued weight), which sits between the 1-bit line and a full two bits. That buys a few extra benchmark points over the 1-bit models for a modest memory cost. Same architecture, same runtime, same tokenizer, same server. Just point --model at a different file.

Technical Deep Dive

Two quant families now ship, side by side, both over the same Qwen3 backbone (GQA, SwiGLU, RoPE, RMSNorm) — which is why the runtime, tokenizer, and server stay identical across them:

1-bit Q1_0_g128: 1.0 bit per weight, 128-weight blocks, a single FP16 group scale per block. This is the original Bonsai-8B line, with the smallest possible footprint: ~1.15 GB at 8B scale.
Ternary TQ2_0_g128: ~1.585 bits of information per weight, encoding {-1, 0, +1}. Each block packs 128 weights into 34 bytes: 32 bytes of 2-bit codes plus a 2-byte FP16 scale. The packing maps 0b00 → −1, 0b01 → 0, 0b10 → +1, and 0b11 → 0 as well. This is the Ternary-Bonsai-8B / 4B / 1.7B line. Ternary costs roughly +600 MB at 8B scale versus 1-bit, and returns a few benchmark points for it.
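To make the ternary block format concrete, here is a minimal Rust sketch of dequantizing one TQ2_0_g128 block. The 2-bit code mapping and the 34-byte block size come from the description above. Everything else is an assumption made for illustration: that the 32 code bytes come first, that each byte holds four codes starting from its lowest bits, and that the FP16 scale is stored little-endian in the last two bytes. The function names are hypothetical and are not OxiBonsai's API.

```rust
/// Weights per block, from the "g128" suffix.
const GROUP: usize = 128;
/// 128 two-bit codes = 32 bytes, plus a 2-byte FP16 scale = 34 bytes.
const TQ2_BLOCK_BYTES: usize = GROUP / 4 + 2;

/// Convert IEEE 754 binary16 bits to f32 (handles subnormals, inf, NaN).
fn f16_to_f32(h: u16) -> f32 {
    let sign = ((h >> 15) & 1) as u32;
    let exp = ((h >> 10) & 0x1f) as u32;
    let frac = (h & 0x3ff) as u32;
    let bits = match (exp, frac) {
        (0, 0) => sign << 31, // signed zero
        (0, _) => {
            // Subnormal half: value = frac * 2^-24.
            let v = (frac as f32) * (1.0 / 16_777_216.0);
            return if sign == 1 { -v } else { v };
        }
        (0x1f, _) => (sign << 31) | 0x7f80_0000 | (frac << 13), // inf or NaN
        _ => (sign << 31) | ((exp + 112) << 23) | (frac << 13), // rebias 15 -> 127
    };
    f32::from_bits(bits)
}

/// Map a 2-bit code to a ternary value: 0b00 -> -1, 0b01 -> 0, 0b10 -> +1, 0b11 -> 0.
fn decode_code(code: u8) -> i8 {
    match code & 0b11 {
        0b00 => -1,
        0b10 => 1,
        _ => 0,
    }
}

/// Dequantize one block into 128 f32 weights.
/// ASSUMED layout: bytes 0..32 hold the codes (four per byte, lowest bits first),
/// bytes 32..34 hold the little-endian FP16 scale.
fn dequantize_tq2_block(block: &[u8; TQ2_BLOCK_BYTES]) -> [f32; GROUP] {
    let scale = f16_to_f32(u16::from_le_bytes([block[32], block[33]]));
    let mut out = [0.0f32; GROUP];
    for (i, w) in out.iter_mut().enumerate() {
        let code = (block[i / 4] >> ((i % 4) * 2)) & 0b11;
        *w = decode_code(code) as f32 * scale;
    }
    out
}
```

A production kernel would not materialize the f32 weights like this; it would fold the sign selection and the scale into the GEMV itself. The sketch only shows what the packed bytes mean.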

Native CUDA via NVRTC. The CUDA backend uses NVRTC — NVIDIA’s runtime compiler — to build its kernels on the fly. There are no precompiled .cu blobs shipped in the crate, and there is no cuBLAS dependency: the fused full-forward path for both Q1 and TQ2 is hand-written and compiled at launch. On an RTX 3060 with CUDA 12.8, Ternary-Bonsai-1.7B runs at ~21.9 tok/s.

Fused Metal full-forward. The Metal path is where the ~13x speedup lives. Instead of submitting one GPU command per GEMV — the death-by-a-thousand-submissions pattern — 0.1.1 encodes an entire token’s forward pass into a single command buffer: roughly 14 dispatches per layer × N layers, all in one submission. Per layer that chain is RMSNorm → fused QKV GEMV → QK-norm + RoPE → KV-store → batched attention → attention-output GEMV + residual → FFN RMSNorm → gate + up GEMV → SwiGLU → down GEMV + residual. TQ2_0_g128 on Metal uses per-kernel dispatch plus a blocks_as_bytes_ternary zero-copy upload, so the packed weights move to the GPU without a re-layout pass. Result: ~50 tok/s on Ternary-Bonsai-1.7B (best ~57).
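The difference between per-kernel submission and a fused command buffer can be modeled without touching a GPU. The sketch below is a toy accounting model, not Metal code: CommandBuffer and Queue are hypothetical stand-ins that only count work, the kernel names are made up from the ten stages listed above (the real path issues roughly 14 dispatches per layer), and the layer count passed in is an arbitrary placeholder.

```rust
/// Hypothetical stand-in for a GPU command buffer: it only records kernel names.
#[derive(Default)]
struct CommandBuffer {
    dispatches: Vec<&'static str>,
}

/// Hypothetical stand-in for a GPU queue that counts submissions and dispatches.
#[derive(Default)]
struct Queue {
    submissions: usize,
    dispatched: usize,
}

impl Queue {
    fn submit(&mut self, cb: CommandBuffer) {
        self.submissions += 1;
        self.dispatched += cb.dispatches.len();
    }
}

/// The per-layer chain named in the text, one illustrative kernel name per stage.
const LAYER_CHAIN: [&str; 10] = [
    "rmsnorm",
    "fused_qkv_gemv",
    "qk_norm_rope",
    "kv_store",
    "batched_attention",
    "attn_out_gemv_residual",
    "ffn_rmsnorm",
    "gate_up_gemv",
    "swiglu",
    "down_gemv_residual",
];

/// One submission per kernel: the pattern the fused path avoids.
fn per_kernel_token(queue: &mut Queue, layers: usize) {
    for _ in 0..layers {
        for kernel in LAYER_CHAIN {
            let mut cb = CommandBuffer::default();
            cb.dispatches.push(kernel);
            queue.submit(cb);
        }
    }
}

/// Every dispatch for the token is encoded into one buffer and submitted once.
fn fused_token(queue: &mut Queue, layers: usize) {
    let mut cb = CommandBuffer::default();
    for _ in 0..layers {
        cb.dispatches.extend(LAYER_CHAIN);
    }
    queue.submit(cb);
}
```

With 28 layers as the placeholder, per_kernel_token makes 280 submissions while fused_token makes one, and both run the same 280 dispatches. The GPU does the same arithmetic either way; what shrinks is the per-submission overhead paid for every token.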

Ternary on the CPU, too. GPUs are optional. The ternary line ships with three SIMD tiers for its TQ2 GEMV — NEON, AVX2, and AVX-512 — so it is quick even with no accelerator present. On Apple Silicon the NEON baseline sits around 7–8 tok/s. On x86-64, the right tier is chosen at runtime via is_x86_feature_detected! (AVX-512 → AVX2 → scalar fallback), so a single binary stays safe everywhere it lands.
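The runtime tier selection can be sketched with the standard library's feature-detection macros. This is a minimal illustration, assuming "avx512f" is the feature that gates the AVX-512 tier; the SimdTier enum and both function names are hypothetical, and every tier falls through to the scalar kernel here, where a real build would call a kernel compiled with the matching #[target_feature].

```rust
/// Hypothetical tier enum mirroring the NEON / AVX2 / AVX-512 / scalar split.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SimdTier {
    Avx512,
    Avx2,
    Neon,
    Scalar,
}

/// Pick the best tier the running CPU supports: AVX-512, then AVX2, then scalar.
fn detect_tier() -> SimdTier {
    #[cfg(target_arch = "x86_64")]
    {
        // ASSUMPTION: "avx512f" stands in for whatever AVX-512 subset the kernel needs.
        if std::arch::is_x86_feature_detected!("avx512f") {
            return SimdTier::Avx512;
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return SimdTier::Avx2;
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return SimdTier::Neon;
        }
    }
    SimdTier::Scalar
}

/// Scalar reference: a ternary dot product needs no multiplies, only add, subtract, or skip.
fn ternary_dot_scalar(weights: &[i8], x: &[f32]) -> f32 {
    weights
        .iter()
        .zip(x)
        .map(|(&w, &v)| match w {
            1 => v,
            -1 => -v,
            _ => 0.0,
        })
        .sum()
}

/// Dispatch on the detected tier. In this sketch every tier uses the scalar kernel.
fn ternary_dot(weights: &[i8], x: &[f32]) -> f32 {
    match detect_tier() {
        SimdTier::Avx512 | SimdTier::Avx2 | SimdTier::Neon | SimdTier::Scalar => {
            ternary_dot_scalar(weights, x)
        }
    }
}
```

Because the check happens at runtime rather than at compile time, the same binary never executes an instruction the host CPU lacks. In practice the detected tier would likely be cached once at startup instead of being queried on every call.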

Getting Started

Install the CLI, fetch and convert a ternary model, and run it:

# Install — provides the `oxibonsai` binary
cargo install oxibonsai-cli

# Fetch + convert a ternary model (HF safetensors → GGUF).
# Also downloads the tokenizer. Prerequisite: pip install huggingface_hub
./scripts/download_ternary.sh 1.7b

# Equivalent manual conversion, if you already have the unpacked safetensors:
# oxibonsai convert --quant tq2_0_g128 \
#   --from <unpacked-safetensors-dir> \
#   --to models/Ternary-Bonsai-1.7B.gguf

# Run it
oxibonsai run \
  --model models/Ternary-Bonsai-1.7B.gguf \
  --prompt "Explain ternary quantization in one sentence." \
  --max-tokens 256 \
  --temperature 0.7 \
  --top-p 0.9

On Apple Silicon, Metal turns on automatically — no flag needed; the fused TQ2 full-forward path is used as soon as you run on a Mac. For the 1-bit line, grab the tokenizer separately with oxibonsai tokenizer download, which fetches models/tokenizer.json.

What’s New in 0.1.1

Added

Native CUDA NVRTC backend with a fused Q1 + TQ2 full-forward path — ~21.9 tok/s on Ternary-Bonsai-1.7B (RTX 3060, CUDA 12.8).
Fused Metal TQ2 full-forward — a single GPU command buffer per token, ~50 tok/s on Ternary-Bonsai-1.7B (a ~13x speedup).
Ternary CPU SIMD tiers: NEON / AVX2 / AVX-512 TQ2 GEMV.
TQ2_0_g128 support in the Metal backend (per-kernel dispatch + blocks_as_bytes_ternary zero-copy upload).
scripts/bench_ternary.sh — CPU vs Metal throughput benchmark (3-run average + best).
scripts/download_ternary.sh — fetch and convert safetensors → GGUF.

Changed

Version bumped to 0.1.1.
Internal dependency versions aligned across the workspace.
CUDA full-forward layer parameter handling refactored for cleaner weight management.
Workspace Cargo.toml files unified on workspace dependencies for better crate compatibility.

Fixed

Workspace version consistency across all subcrates.
blocks_as_bytes import gating for broader feature-flag compatibility.

Tips

Measure both paths. Run ./scripts/bench_ternary.sh to compare CPU and Metal throughput head to head — it reports a 3-run average plus the best run, so you see steady-state and peak.
On a Mac, do nothing special. Metal is auto-detected on Apple Silicon. There is no flag to flip — just run on the Mac and the fused TQ2 full-forward path engages.
On NVIDIA, use the NVRTC backend. Build and run with the native-cuda feature; it runtime-compiles its own kernels, so there is no cuBLAS and no precompiled blob to manage.
Ternary is fast on CPU, too. The new NEON / AVX2 / AVX-512 TQ2 GEMV tiers mean you do not need a GPU to get usable speed. The correct tier is selected at runtime, so one binary is safe on every x86-64 machine.
Choose your quant family by budget. Reach for 1-bit Bonsai-8B (~1.15 GB) when footprint is everything; reach for ternary TQ2_0_g128 when you want a few extra benchmark points. The CLI is identical — only the --model path changes.
For reproducible output, go greedy. Run with --temperature 0. CPU and Metal then produce byte-identical text.
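Why greedy decoding is reproducible is easiest to see in code. The sketch below is a generic illustration of temperature and top-p sampling, not OxiBonsai's sampler: all function names are hypothetical, and treating a temperature of 0 as plain argmax is the common convention assumed here. Logits are assumed to be non-empty.

```rust
/// Index of the largest logit; ties resolve to the lowest index.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, &l) in logits.iter().enumerate() {
        if l > logits[best] {
            best = i;
        }
    }
    best
}

/// Softmax over logits scaled by a temperature greater than zero.
fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&l| ((l - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Nucleus: the smallest set of tokens, taken in descending probability,
/// whose cumulative mass reaches top_p.
fn nucleus(probs: &[f32], top_p: f32) -> Vec<usize> {
    let mut order: Vec<usize> = (0..probs.len()).collect();
    order.sort_by(|&a, &b| probs[b].total_cmp(&probs[a]));
    let mut mass = 0.0;
    let mut keep = Vec::new();
    for i in order {
        keep.push(i);
        mass += probs[i];
        if mass >= top_p {
            break;
        }
    }
    keep
}

/// Tokens the sampler may pick next. Temperature 0 leaves exactly one candidate,
/// so no random number is ever drawn.
fn candidates(logits: &[f32], temperature: f32, top_p: f32) -> Vec<usize> {
    if temperature <= 0.0 {
        return vec![argmax(logits)];
    }
    nucleus(&softmax(logits, temperature), top_p)
}
```

For logits [1.0, 3.0, 2.0], a temperature of 0 always yields token 1, while a temperature of 1.0 with a top-p of 0.9 leaves tokens 1 and 2 in play and the choice between them depends on the random draw.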

This is the foundation

OxiBonsai sits on the COOLJAPAN ecosystem all the way down:

SciRS2 — tensor primitives.
OxiBLAS — GEMM/GEMV plus the 1-bit and ternary compute kernels.
OxiFFT — optional RoPE acceleration.
NumRS2 — the N-dimensional array backend.

And it serves PrismML’s Bonsai model family — the 1-bit Q1_0_g128 line and the new ternary TQ2_0_g128 line. Every default-feature dependency is Pure Rust: zero C/C++/Fortran, zero FFI. The GPU backends (metal, native-cuda) are opt-in features — present when you want them, absent (and silent) when you don’t.
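Opt-in backends of this kind are usually wired up with Cargo features and conditional compilation. The sketch below assumes the feature names metal and native-cuda mentioned above map directly onto cfg checks; the function itself is hypothetical and only reports which backends a given build contains.

```rust
/// List the backends compiled into this build.
/// ASSUMPTION: the Cargo features are named exactly "metal" and "native-cuda".
fn compiled_backends() -> Vec<&'static str> {
    // The CPU path is always present.
    let mut backends = vec!["cpu"];
    // cfg! is resolved at compile time, so a build without the feature
    // contains no trace of the corresponding backend.
    if cfg!(feature = "metal") {
        backends.push("metal");
    }
    if cfg!(feature = "native-cuda") {
        backends.push("native-cuda");
    }
    backends
}
```

In a default-feature build this returns only "cpu". Real backend code would typically sit behind #[cfg(feature = "...")] on whole modules, so that the GPU dependencies are never even compiled unless requested.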

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you want sub-2-bit LLM inference that runs on your laptop’s CPU, your Mac’s GPU, and your NVIDIA card — with nothing but safe Rust underneath.

Pure Rust sovereign sub-2-bit inference is here — and now it’s fast: CPU, Metal, and CUDA, one binary, no C in sight.

— KitaSan at COOLJAPAN OÜ April 18, 2026