OxiBonsai 0.1.0 Released — The World's First Pure Rust 1-Bit LLM Inference Engine

An 8-billion-parameter model that weighs about 1 bit per parameter — and runs from a single Rust binary with zero C anywhere in the stack.

Today we released OxiBonsai 0.1.0 — the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family, debuting native support for the 1-bit line (Q1_0_g128).

No llama.cpp. No BLAS. No C, C++, or Fortran. No -sys crates, no system libraries, no patent-encumbered kernels. Just a memory-safe inference engine that compiles to one static binary and runs the same everywhere — on a laptop CPU, on a server, or in a browser tab via WASM. This is the foundation of sovereign AI inference for the COOLJAPAN ecosystem, and it starts here.

Why sub-2-bit matters

Modern open-weight LLMs are extraordinary, but they are also heavy. An 8B model in FP16 is ~16 GB of weights — too large to comfortably fit a laptop, an edge device, or a browser. The industry’s answer has been quantization: shrink each weight from 16 bits down to 8, 4, or even 2. But almost every fast quantized runtime in the world is built on the same C/C++ foundation (llama.cpp, GGML, BLAS), with all the build complexity, memory-unsafety, and supply-chain risk that entails.

PrismML’s Bonsai family pushes quantization to the limit: roughly 1 bit per weight. An 8B Bonsai model is just ~1.15 GB on disk — small enough to curl in seconds and load on almost anything. That changes what’s possible: real language-model inference on commodity hardware, offline, with no GPU required.

OxiBonsai is the engine that makes those models run — and it does it in 100% Pure Rust. To our knowledge, it is the first C/C++/Fortran-free, zero-FFI inference engine for the Bonsai 1-bit family. Memory safety isn’t a footnote here; it’s the whole point. When your inference kernels are the trusted core of an AI system, “no segfaults, no buffer overruns, no undefined behavior” is a security property, not a nicety.

What `Q1_0_g128` actually means

The Q1_0_g128 format is the heart of the 0.1.0 release, so it’s worth unpacking the name:

Q1 — 1 bit per weight. Each weight is stored as a single bit and expands to a signed value at inference time. This is what gets an 8B model down to ~1.15 GB.
_0 — the base, first-generation encoding revision of the 1-bit format.
g128 — a group size of 128. Weights are quantized in blocks of 128, and each block carries its own FP16 scale factor. Grouping is the trick that keeps 1-bit quantization accurate: instead of one global scale for an entire tensor, every 128-weight block gets a locally-fitted scale, so the dynamic range of each region of the matrix is preserved.

OxiBonsai parses this format straight from GGUF files with a streaming parser — it reads tensors as it goes rather than slurping the whole file into memory, so model load is fast and memory-light. On top of that sit hand-written 1-bit kernels for the three operations that dominate transformer inference: dequantization, GEMV (matrix-vector, the decode-step workhorse), and GEMM (matrix-matrix, for prefill).

Technical Deep Dive

OxiBonsai 0.1.0 is a full inference stack, not just a kernel library. The pieces that ship in this release:

1-bit GGUF loader. A streaming Q1_0_g128 parser reads grouped weights and their FP16 block scales directly from GGUF, with no intermediate dequantized copy held in memory.
Optimized 1-bit kernels. Dedicated dequantization, GEMV, and GEMM paths for the 1-bit format — the inner loops that decide whether the whole thing is fast or not.
SIMD auto-dispatch. The kernels are accelerated with AVX2, AVX-512, NEON, and WASM SIMD. The best instruction set is chosen at runtime via CPU feature detection, so a single x86-64 binary safely uses AVX-512 on a Xeon, AVX2 on a consumer laptop, and falls back to the scalar reference path on anything older — no recompile, no SIGILL.
Rayon-parallel dispatch. Kernel work is fanned out across cores with Rayon, so wider machines simply go faster.
Qwen3 transformer with paged KV-cache. The model layer implements the full Qwen3 decoder architecture — GQA attention, SwiGLU feed-forward, RoPE positional encoding, RMSNorm — with a paged KV-cache so long contexts don’t fragment memory.
Autoregressive runtime. A high-level inference runtime drives token-by-token generation, with sampling strategies covering greedy, top-k, top-p (nucleus), and temperature.
OpenAI-compatible server. A REST API server exposes chat completions, completions, and embeddings endpoints, with streaming token output over SSE — a drop-in replacement target for tools that already speak the OpenAI API.

Rounding out 0.1.0: a Pure Rust BPE tokenizer, a RAG pipeline with chunking and similarity search, a model evaluation framework (accuracy and perplexity metrics), speculative decoding support, a WASM compilation target, and cross-platform builds for macOS, Linux, Windows, and WASM — all covered by a suite of 140 tests.

Getting Started

Install the CLI (Rust 1.86+):

cargo install oxibonsai-cli        # installs the `oxibonsai` binary

Grab a model and the tokenizer:

# 1-bit Bonsai-8B — ~1.15 GB pre-quantized GGUF, single curl
mkdir -p models
curl -L -o models/Bonsai-8B.gguf \
  https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf

# Tokenizer (~2.7 MB, pulled from Qwen/Qwen3-8B on HuggingFace)
oxibonsai tokenizer download       # saves to models/tokenizer.json

Run inference:

oxibonsai run --model models/Bonsai-8B.gguf \
  --prompt "Explain quantum computing in simple terms" \
  --max-tokens 512 --temperature 0.7 --top-p 0.9

Or stand up an OpenAI-compatible server:

oxibonsai serve --model models/Bonsai-8B.gguf \
  --host 127.0.0.1 --port 8080

…and point any OpenAI client at http://127.0.0.1:8080/v1.

What’s New in 0.1.0

This is the very first OxiBonsai release. Everything is new:

Pure Rust 1-bit LLM inference engine for PrismML Bonsai models.
GGUF Q1_0_g128 support with a streaming parser.
Optimized 1-bit kernels — dequantization, GEMV, and GEMM.
SIMD acceleration across AVX2, AVX-512, NEON, and WASM SIMD.
Rayon-parallel kernel dispatch.
Qwen3 transformer implementation with a paged KV-cache.
Autoregressive generation runtime with greedy / top-k / top-p / temperature sampling.
OpenAI-compatible REST server — chat completions, completions, embeddings, plus SSE streaming.
Pure Rust BPE tokenizer, a RAG pipeline, and a model evaluation framework.
Speculative decoding and a WASM build target.
Cross-platform: macOS, Linux, Windows, WASM — backed by 140 tests.

Tips

Let SIMD pick itself. You don’t choose an instruction set — the dispatcher detects AVX-512, AVX2, or NEON at runtime and uses the widest one your CPU supports, falling back to scalar automatically. The same binary is fast and safe across every machine.
Compile for the browser. Because the kernels include a WASM SIMD path, OxiBonsai targets wasm32 — a sub-2-bit LLM running entirely client-side, with no inference server behind it.
Use it as a drop-in OpenAI endpoint. oxibonsai serve speaks the OpenAI REST shape (/v1/chat/completions, /v1/completions, /v1/embeddings). Most existing OpenAI SDKs and tools just need their base URL repointed at your local server.
Stream tokens for responsive UIs. The server emits tokens over SSE as they’re generated, so chat interfaces feel live instead of waiting for the full completion.
Dial in your sampling. For deterministic output use --temperature 0 (greedy); for general chat, --temperature 0.7 --top-p 0.9 is a solid default; raise --top-p and temperature for more creative generation.
Scale with cores. Kernel dispatch is Rayon-parallel, so more CPU cores translate directly into more tokens per second — no flags required.

This is the foundation

OxiBonsai is built entirely on the COOLJAPAN ecosystem — SciRS2, OxiBLAS, and OxiFFT — which is exactly why it can run a real LLM with no C, C++, or Fortran anywhere in the dependency tree. It is the sovereign inference layer for PrismML’s Bonsai models, and the start of a sub-2-bit AI stack that belongs to you, not to a pile of -sys crates and system libraries.

This 0.1.0 release lands the 1-bit (Q1_0_g128) line. It is the seed; much more is coming.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you want a future where running your own language model means one static binary and nothing else.

Pure Rust sovereign sub-2-bit inference is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ April 13, 2026