COOLJAPAN
← All posts

OxiBonsai 0.1.0 Released — The World's First Pure Rust 1-Bit LLM Inference Engine

An 8B-parameter language model at roughly 1 bit per weight, running from a single static Rust binary with no llama.cpp, no BLAS, no C/C++/Fortran. OxiBonsai 0.1.0 debuts sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem — SIMD-accelerated, Rayon-parallel, and OpenAI-compatible out of the box.

release oxibonsai llm inference pure-rust quantization 1-bit gguf simd wasm

An 8-billion-parameter model that weighs about 1 bit per parameter — and runs from a single Rust binary with zero C anywhere in the stack.

Today we released OxiBonsai 0.1.0 — the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family, debuting native support for the 1-bit line (Q1_0_g128).

No llama.cpp. No BLAS. No C, C++, or Fortran. No -sys crates, no system libraries, no patent-encumbered kernels. Just a memory-safe inference engine that compiles to one static binary and runs the same everywhere — on a laptop CPU, on a server, or in a browser tab via WASM. This is the foundation of sovereign AI inference for the COOLJAPAN ecosystem, and it starts here.

Why sub-2-bit matters

Modern open-weight LLMs are extraordinary, but they are also heavy. An 8B model in FP16 is ~16 GB of weights — too large to comfortably fit a laptop, an edge device, or a browser. The industry’s answer has been quantization: shrink each weight from 16 bits down to 8, 4, or even 2. But almost every fast quantized runtime in the world is built on the same C/C++ foundation (llama.cpp, GGML, BLAS), with all the build complexity, memory-unsafety, and supply-chain risk that entails.

PrismML’s Bonsai family pushes quantization to the limit: roughly 1 bit per weight. An 8B Bonsai model is just ~1.15 GB on disk — small enough to curl in seconds and load on almost anything. That changes what’s possible: real language-model inference on commodity hardware, offline, with no GPU required.

OxiBonsai is the engine that makes those models run — and it does it in 100% Pure Rust. To our knowledge, it is the first C/C++/Fortran-free, zero-FFI inference engine for the Bonsai 1-bit family. Memory safety isn’t a footnote here; it’s the whole point. When your inference kernels are the trusted core of an AI system, “no segfaults, no buffer overruns, no undefined behavior” is a security property, not a nicety.

What Q1_0_g128 actually means

The Q1_0_g128 format is the heart of the 0.1.0 release, so it’s worth unpacking the name:

OxiBonsai parses this format straight from GGUF files with a streaming parser — it reads tensors as it goes rather than slurping the whole file into memory, so model load is fast and memory-light. On top of that sit hand-written 1-bit kernels for the three operations that dominate transformer inference: dequantization, GEMV (matrix-vector, the decode-step workhorse), and GEMM (matrix-matrix, for prefill).

Technical Deep Dive

OxiBonsai 0.1.0 is a full inference stack, not just a kernel library. The pieces that ship in this release:

  1. 1-bit GGUF loader. A streaming Q1_0_g128 parser reads grouped weights and their FP16 block scales directly from GGUF, with no intermediate dequantized copy held in memory.

  2. Optimized 1-bit kernels. Dedicated dequantization, GEMV, and GEMM paths for the 1-bit format — the inner loops that decide whether the whole thing is fast or not.

  3. SIMD auto-dispatch. The kernels are accelerated with AVX2, AVX-512, NEON, and WASM SIMD. The best instruction set is chosen at runtime via CPU feature detection, so a single x86-64 binary safely uses AVX-512 on a Xeon, AVX2 on a consumer laptop, and falls back to the scalar reference path on anything older — no recompile, no SIGILL.

  4. Rayon-parallel dispatch. Kernel work is fanned out across cores with Rayon, so wider machines simply go faster.

  5. Qwen3 transformer with paged KV-cache. The model layer implements the full Qwen3 decoder architecture — GQA attention, SwiGLU feed-forward, RoPE positional encoding, RMSNorm — with a paged KV-cache so long contexts don’t fragment memory.

  6. Autoregressive runtime. A high-level inference runtime drives token-by-token generation, with sampling strategies covering greedy, top-k, top-p (nucleus), and temperature.

  7. OpenAI-compatible server. A REST API server exposes chat completions, completions, and embeddings endpoints, with streaming token output over SSE — a drop-in replacement target for tools that already speak the OpenAI API.

Rounding out 0.1.0: a Pure Rust BPE tokenizer, a RAG pipeline with chunking and similarity search, a model evaluation framework (accuracy and perplexity metrics), speculative decoding support, a WASM compilation target, and cross-platform builds for macOS, Linux, Windows, and WASM — all covered by a suite of 140 tests.

Getting Started

Install the CLI (Rust 1.86+):

cargo install oxibonsai-cli        # installs the `oxibonsai` binary

Grab a model and the tokenizer:

# 1-bit Bonsai-8B — ~1.15 GB pre-quantized GGUF, single curl
mkdir -p models
curl -L -o models/Bonsai-8B.gguf \
  https://huggingface.co/prism-ml/Bonsai-8B-gguf/resolve/main/Bonsai-8B.gguf

# Tokenizer (~2.7 MB, pulled from Qwen/Qwen3-8B on HuggingFace)
oxibonsai tokenizer download       # saves to models/tokenizer.json

Run inference:

oxibonsai run --model models/Bonsai-8B.gguf \
  --prompt "Explain quantum computing in simple terms" \
  --max-tokens 512 --temperature 0.7 --top-p 0.9

Or stand up an OpenAI-compatible server:

oxibonsai serve --model models/Bonsai-8B.gguf \
  --host 127.0.0.1 --port 8080

…and point any OpenAI client at http://127.0.0.1:8080/v1.

What’s New in 0.1.0

This is the very first OxiBonsai release. Everything is new:

Tips

This is the foundation

OxiBonsai is built entirely on the COOLJAPAN ecosystemSciRS2, OxiBLAS, and OxiFFT — which is exactly why it can run a real LLM with no C, C++, or Fortran anywhere in the dependency tree. It is the sovereign inference layer for PrismML’s Bonsai models, and the start of a sub-2-bit AI stack that belongs to you, not to a pile of -sys crates and system libraries.

This 0.1.0 release lands the 1-bit (Q1_0_g128) line. It is the seed; much more is coming.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you want a future where running your own language model means one static binary and nothing else.

Pure Rust sovereign sub-2-bit inference is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ April 13, 2026

↑ Back to all posts