An 8-billion-parameter model that weighs about 1 bit per parameter — and runs from a single Rust binary with zero C anywhere in the stack.
Today we released OxiBonsai 0.1.0 — the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family, debuting native support for the 1-bit line (Q1_0_g128).
No llama.cpp. No BLAS. No C, C++, or Fortran. No -sys crates, no system libraries, no patent-encumbered kernels. Just a memory-safe inference engine that compiles to one static binary and runs the same everywhere — on a laptop CPU, on a server, or in a browser tab via WASM. This is the foundation of sovereign AI inference for the COOLJAPAN ecosystem, and it starts here.
Why sub-2-bit matters
Modern open-weight LLMs are extraordinary, but they are also heavy. An 8B model in FP16 is ~16 GB of weights — too large to comfortably fit a laptop, an edge device, or a browser. The industry’s answer has been quantization: shrink each weight from 16 bits down to 8, 4, or even 2. But almost every fast quantized runtime in the world is built on the same C/C++ foundation (llama.cpp, GGML, BLAS), with all the build complexity, memory-unsafety, and supply-chain risk that entails.
PrismML’s Bonsai family pushes quantization to the limit: roughly 1 bit per weight. An 8B Bonsai model is just ~1.15 GB on disk — small enough to curl in seconds and load on almost anything. That changes what’s possible: real language-model inference on commodity hardware, offline, with no GPU required.
OxiBonsai is the engine that makes those models run — and it does it in 100% Pure Rust. To our knowledge, it is the first C/C++/Fortran-free, zero-FFI inference engine for the Bonsai 1-bit family. Memory safety isn’t a footnote here; it’s the whole point. When your inference kernels are the trusted core of an AI system, “no segfaults, no buffer overruns, no undefined behavior” is a security property, not a nicety.
What Q1_0_g128 actually means
The Q1_0_g128 format is the heart of the 0.1.0 release, so it’s worth unpacking the name:
Q1— 1 bit per weight. Each weight is stored as a single bit and expands to a signed value at inference time. This is what gets an 8B model down to ~1.15 GB._0— the base, first-generation encoding revision of the 1-bit format.g128— a group size of 128. Weights are quantized in blocks of 128, and each block carries its own FP16 scale factor. Grouping is the trick that keeps 1-bit quantization accurate: instead of one global scale for an entire tensor, every 128-weight block gets a locally-fitted scale, so the dynamic range of each region of the matrix is preserved.
OxiBonsai parses this format straight from GGUF files with a streaming parser — it reads tensors as it goes rather than slurping the whole file into memory, so model load is fast and memory-light. On top of that sit hand-written 1-bit kernels for the three operations that dominate transformer inference: dequantization, GEMV (matrix-vector, the decode-step workhorse), and GEMM (matrix-matrix, for prefill).
Technical Deep Dive
OxiBonsai 0.1.0 is a full inference stack, not just a kernel library. The pieces that ship in this release:
-
1-bit GGUF loader. A streaming
Q1_0_g128parser reads grouped weights and their FP16 block scales directly from GGUF, with no intermediate dequantized copy held in memory. -
Optimized 1-bit kernels. Dedicated dequantization, GEMV, and GEMM paths for the 1-bit format — the inner loops that decide whether the whole thing is fast or not.
-
SIMD auto-dispatch. The kernels are accelerated with AVX2, AVX-512, NEON, and WASM SIMD. The best instruction set is chosen at runtime via CPU feature detection, so a single x86-64 binary safely uses AVX-512 on a Xeon, AVX2 on a consumer laptop, and falls back to the scalar reference path on anything older — no recompile, no
SIGILL. -
Rayon-parallel dispatch. Kernel work is fanned out across cores with Rayon, so wider machines simply go faster.
-
Qwen3 transformer with paged KV-cache. The model layer implements the full Qwen3 decoder architecture — GQA attention, SwiGLU feed-forward, RoPE positional encoding, RMSNorm — with a paged KV-cache so long contexts don’t fragment memory.
-
Autoregressive runtime. A high-level inference runtime drives token-by-token generation, with sampling strategies covering greedy, top-k, top-p (nucleus), and temperature.
-
OpenAI-compatible server. A REST API server exposes chat completions, completions, and embeddings endpoints, with streaming token output over SSE — a drop-in replacement target for tools that already speak the OpenAI API.
Rounding out 0.1.0: a Pure Rust BPE tokenizer, a RAG pipeline with chunking and similarity search, a model evaluation framework (accuracy and perplexity metrics), speculative decoding support, a WASM compilation target, and cross-platform builds for macOS, Linux, Windows, and WASM — all covered by a suite of 140 tests.
Getting Started
Install the CLI (Rust 1.86+):
cargo install oxibonsai-cli # installs the `oxibonsai` binary
Grab a model and the tokenizer:
# 1-bit Bonsai-8B — ~1.15 GB pre-quantized GGUF, single curl
mkdir -p models
curl -L -o models/Bonsai-8B.gguf \
https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf
# Tokenizer (~2.7 MB, pulled from Qwen/Qwen3-8B on HuggingFace)
oxibonsai tokenizer download # saves to models/tokenizer.json
Run inference:
oxibonsai run --model models/Bonsai-8B.gguf \
--prompt "Explain quantum computing in simple terms" \
--max-tokens 512 --temperature 0.7 --top-p 0.9
Or stand up an OpenAI-compatible server:
oxibonsai serve --model models/Bonsai-8B.gguf \
--host 127.0.0.1 --port 8080
…and point any OpenAI client at http://127.0.0.1:8080/v1.
What’s New in 0.1.0
This is the very first OxiBonsai release. Everything is new:
- Pure Rust 1-bit LLM inference engine for PrismML Bonsai models.
- GGUF
Q1_0_g128support with a streaming parser. - Optimized 1-bit kernels — dequantization, GEMV, and GEMM.
- SIMD acceleration across AVX2, AVX-512, NEON, and WASM SIMD.
- Rayon-parallel kernel dispatch.
- Qwen3 transformer implementation with a paged KV-cache.
- Autoregressive generation runtime with greedy / top-k / top-p / temperature sampling.
- OpenAI-compatible REST server — chat completions, completions, embeddings, plus SSE streaming.
- Pure Rust BPE tokenizer, a RAG pipeline, and a model evaluation framework.
- Speculative decoding and a WASM build target.
- Cross-platform: macOS, Linux, Windows, WASM — backed by 140 tests.
Tips
- Let SIMD pick itself. You don’t choose an instruction set — the dispatcher detects AVX-512, AVX2, or NEON at runtime and uses the widest one your CPU supports, falling back to scalar automatically. The same binary is fast and safe across every machine.
- Compile for the browser. Because the kernels include a WASM SIMD path, OxiBonsai targets
wasm32— a sub-2-bit LLM running entirely client-side, with no inference server behind it. - Use it as a drop-in OpenAI endpoint.
oxibonsai servespeaks the OpenAI REST shape (/v1/chat/completions,/v1/completions,/v1/embeddings). Most existing OpenAI SDKs and tools just need their base URL repointed at your local server. - Stream tokens for responsive UIs. The server emits tokens over SSE as they’re generated, so chat interfaces feel live instead of waiting for the full completion.
- Dial in your sampling. For deterministic output use
--temperature 0(greedy); for general chat,--temperature 0.7 --top-p 0.9is a solid default; raise--top-pand temperature for more creative generation. - Scale with cores. Kernel dispatch is Rayon-parallel, so more CPU cores translate directly into more tokens per second — no flags required.
This is the foundation
OxiBonsai is built entirely on the COOLJAPAN ecosystem — SciRS2, OxiBLAS, and OxiFFT — which is exactly why it can run a real LLM with no C, C++, or Fortran anywhere in the dependency tree. It is the sovereign inference layer for PrismML’s Bonsai models, and the start of a sub-2-bit AI stack that belongs to you, not to a pile of -sys crates and system libraries.
This 0.1.0 release lands the 1-bit (Q1_0_g128) line. It is the seed; much more is coming.
Repository: https://github.com/cool-japan/oxibonsai
Star the repo if you want a future where running your own language model means one static binary and nothing else.
Pure Rust sovereign sub-2-bit inference is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ April 13, 2026