COOLJAPAN

OxiBonsai 0.1.4 Released — Production-Grade Sovereign Serving: Self-Tuning Runtime, Prometheus + X-Request-ID Observability, FP8 & K-Quant, and Grammar-Constrained Output

OxiBonsai 0.1.4 makes Pure Rust sub-2-bit inference production-grade for serving: adaptive KV-cache compression and adaptive speculative decoding that self-tune under load, full Prometheus observability with per-request X-Request-ID tracing, new FP8 and K-quant GGUF model support, and grammar-constrained decoding for guaranteed-valid JSON — sovereign AI inference for the COOLJAPAN ecosystem.

release oxibonsai llm inference pure-rust quantization observability prometheus structured-decoding fp8

The engine that serves itself: a sub-2-bit inference runtime that tunes its own KV-cache and draft length under load, reports every request to Prometheus, and can guarantee the JSON it emits is valid — all in Pure Rust.

Today we released OxiBonsai 0.1.4 — a production-grade serving upgrade that turns the fast KV-cache reuse of 0.1.3 into a self-tuning, fully observable runtime with new FP8 and K-quant model support and grammar-constrained decoding.

No llama.cpp. No BLAS. No C, no C++, no Fortran. OxiBonsai 0.1.4 remains the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family — the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128) — running on CPU SIMD (AVX2/AVX-512/NEON/WASM), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC), built end-to-end on the COOLJAPAN ecosystem.

Why OxiBonsai 0.1.4 matters

Version 0.1.3 made KV-cache reuse fast with the PrefixCachedEngine, tokenizer auto-detect, and GPU weight caching. But fast is not the same as production-ready. Serving real traffic means the engine has to behave when load spikes, when caches fill, and when downstream systems need to know exactly what happened to each request.

0.1.4 closes that gap. The engine now self-tunes under load — KV-cache precision and speculative-decoding draft length adapt automatically instead of being fixed knobs. It is fully observable — every request flows through Prometheus gauges and a per-request X-Request-ID tracing span. It expands the supported formats — standard FP8 and K-quant GGUF models load and run alongside the sub-2-bit Bonsai formats. And it can guarantee structurally valid output — a grammar-constrained sampler that can force always-valid JSON at the token level. That is the difference between a fast research engine and one you can put behind a load balancer.

Technical Deep Dive

Theme 1 — Production runtime controllers and observability

The new controllers live in oxibonsai-runtime.

They are designed to react to live workload signals rather than static configuration.
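To make "reacting to live workload signals" concrete, here is a minimal sketch of a pressure controller that smooths samples with an exponentially weighted moving average (EWMA) and switches levels with hysteresis. It is not the code in oxibonsai-runtime: the type PressureController, the smoothing factor, the lower threshold, and the one-step-per-sample rule are all assumptions made for illustration. Only the FP16 / Q8 / Q4 ladder and the 0.80 escalation point come from this post.

```rust
/// Precision ladder named in this post (FP16, Q8, Q4).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Level {
    Fp16,
    Q8,
    Q4,
}

/// Hypothetical controller for illustration; not the real `KvCachePolicy`.
pub struct PressureController {
    alpha: f64,        // EWMA smoothing factor (assumed value chosen by caller)
    up: f64,           // escalate when smoothed pressure rises above this
    down: f64,         // relax only when smoothed pressure falls below this
    ewma: Option<f64>, // None until the first sample arrives
    level: Level,
}

impl PressureController {
    pub fn new(alpha: f64, up: f64, down: f64) -> Self {
        assert!(down < up, "hysteresis needs down < up");
        Self { alpha, up, down, ewma: None, level: Level::Fp16 }
    }

    /// Feed one raw pressure sample in [0, 1] and return the resulting level.
    pub fn observe(&mut self, pressure: f64) -> Level {
        let smoothed = match self.ewma {
            None => pressure, // seed the average with the first sample
            Some(prev) => self.alpha * pressure + (1.0 - self.alpha) * prev,
        };
        self.ewma = Some(smoothed);
        // Move at most one step per sample; stay put inside the dead band.
        self.level = match self.level {
            Level::Fp16 if smoothed > self.up => Level::Q8,
            Level::Q8 if smoothed > self.up => Level::Q4,
            Level::Q8 if smoothed < self.down => Level::Fp16,
            Level::Q4 if smoothed < self.down => Level::Q8,
            other => other,
        };
        self.level
    }
}
```

The gap between the two thresholds is what prevents flapping: a smoothed value that sits between them leaves the current level unchanged, so a brief dip after a spike does not immediately undo the escalation.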

New Prometheus gauges in InferenceMetrics expose the live picture:

The engine surface grew to match:

On the HTTP side, GET /admin/workload-stats returns a JSON snapshot of two things:

The OpenAI server now honors an X-Request-ID header. Client-supplied IDs in UUID 8-4-4-4-12 form or as 32-character hex are echoed back verbatim, while an absent or malformed header triggers an auto-generated RequestId. Both streaming and non-streaming responses carry the header, and server tracing spans record request_id for end-to-end correlation. The header constant is REQUEST_ID_HEADER, with helpers resolve_request_id(&HeaderMap) and request_id_header_map(RequestId). See examples/runtime_controllers.rs and the criterion microbenchmarks in benches/controllers_bench.rs.
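The accept-or-regenerate rule can be sketched in plain Rust. This is an illustration of the two accepted shapes, not the server's code: the function names below are hypothetical, the sketch works on an optional string rather than a HeaderMap, and accepting both upper- and lowercase hex digits is an assumption.

```rust
/// True if `s` is exactly 32 hex characters.
fn is_hex32(s: &str) -> bool {
    s.len() == 32 && s.bytes().all(|b| b.is_ascii_hexdigit())
}

/// True if `s` is hex in hyphen-separated groups of 8-4-4-4-12 characters.
fn is_uuid_form(s: &str) -> bool {
    let groups: Vec<&str> = s.split('-').collect();
    let lens = [8usize, 4, 4, 4, 12];
    groups.len() == lens.len()
        && groups
            .iter()
            .zip(lens)
            .all(|(g, n)| g.len() == n && g.bytes().all(|b| b.is_ascii_hexdigit()))
}

/// Hypothetical resolver: echo a well-formed client id verbatim,
/// otherwise call `fallback` to produce a fresh one.
pub fn resolve_id(client: Option<&str>, fallback: impl FnOnce() -> String) -> String {
    match client {
        Some(id) if is_hex32(id) || is_uuid_form(id) => id.to_string(),
        _ => fallback(),
    }
}
```

Passing the generator as a closure keeps the sketch free of any random-number dependency; a real server would plug its own id generator in at that point.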

Theme 2 — New quantization families

OxiBonsai now loads and runs standard FP8 and K-quant GGUF models in addition to the sub-2-bit Bonsai formats, with kernels spanning oxibonsai-kernels, oxibonsai-model, and oxibonsai-core.

FP8 family. The BlockFP8E4M3 and BlockFP8E5M2 block types pack 32 one-byte FP8 weights plus a 2-byte FP16 scale into 34-byte blocks (32 + 2 = 34), with bit-exact IEEE-754-style encode/decode using round-to-nearest-even (RNE) rounding.
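For readers unfamiliar with FP8, the sketch below decodes a single E4M3 byte and applies a per-block scale. It assumes the common OCP convention for E4M3 (bias 7, no infinities, the all-ones pattern is NaN) and assumes that dequantization is simply scale times decoded weight. The post does not spell out either detail, and the function names are hypothetical. Converting the stored FP16 scale to f32 is left out to keep the example short.

```rust
/// Block arithmetic from the post: 32 one-byte weights plus a 2-byte FP16 scale.
pub const WEIGHTS_PER_BLOCK: usize = 32;
pub const BLOCK_BYTES: usize = WEIGHTS_PER_BLOCK + 2; // 34

/// Exact power of two built from the f32 bit pattern; valid for -126 <= e <= 127.
fn pow2(e: i32) -> f32 {
    f32::from_bits(((e + 127) as u32) << 23)
}

/// Decode one E4M3 byte: 1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7.
/// Assumes the OCP convention: no infinities, exponent 1111 with mantissa 111 is NaN.
pub fn decode_e4m3(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let man_bits = byte & 0x07;
    if exp == 0x0F && man_bits == 0x07 {
        return f32::NAN;
    }
    let man = man_bits as f32 / 8.0;
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading 1, exponent fixed at 1 - bias = -6.
        man * pow2(-6)
    } else {
        (1.0 + man) * pow2(exp - 7)
    };
    sign * magnitude
}

/// Dequantize one block, given its scale already converted to f32 (assumed scheme).
pub fn dequant_block(
    scale: f32,
    weights: &[u8; WEIGHTS_PER_BLOCK],
) -> [f32; WEIGHTS_PER_BLOCK] {
    let mut out = [0.0f32; WEIGHTS_PER_BLOCK];
    for (o, &w) in out.iter_mut().zip(weights.iter()) {
        *o = scale * decode_e4m3(w);
    }
    out
}
```

Under these assumptions the largest finite E4M3 magnitude is 448 (byte 0x7E), which is why a per-block scale is needed to cover the range of real weight tensors.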

K-quant family. Standard K-quant GGUF formats now run with dedicated GPU kernels.

All batch kernels use a “cap-of-8” outer-loop pattern, so arbitrary batch sizes are processed correctly with no silent truncation past eight columns.
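The idea behind the pattern can be shown with a CPU stand-in: an inner routine that handles at most eight columns, driven by an outer loop that walks the whole batch in chunks. The dot-product kernel, the function names, and the data layout below are illustrative assumptions, not the actual GPU kernels.

```rust
/// Maximum number of columns the inner kernel handles per call (the "cap").
const CAP: usize = 8;

/// Stand-in inner kernel: dot product of one weight row with up to CAP columns.
/// Each column in `cols` has the same length as `row`.
fn kernel_up_to_cap(row: &[f32], cols: &[Vec<f32>], out: &mut [f32]) {
    assert!(cols.len() <= CAP && cols.len() == out.len());
    for (o, col) in out.iter_mut().zip(cols) {
        *o = row.iter().zip(col).map(|(w, x)| w * x).sum();
    }
}

/// Outer loop: walk the batch in chunks of at most CAP columns, so columns
/// nine and beyond are processed by further kernel calls instead of dropped.
pub fn batched_dot(row: &[f32], batch: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![0.0f32; batch.len()];
    for (cols, outs) in batch.chunks(CAP).zip(out.chunks_mut(CAP)) {
        kernel_up_to_cap(row, cols, outs);
    }
    out
}
```

With a batch of 11 columns this makes two inner calls, one with eight columns and one with three, and every output slot is written.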

Theme 3 — Constrained and structured decoding

0.1.4 lets you constrain the sampler at the token level so the output is structurally valid by construction.
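Token-level constraint generally means masking: before a token is chosen, every token the grammar would reject in the current state is excluded. The sketch below shows that mechanism with greedy selection and a toy balanced-brace grammar. The three-token vocabulary, the BraceGrammar type, and constrained_argmax are hypothetical names for illustration and say nothing about how OxiBonsai's grammar engine is structured.

```rust
/// Pick the highest-logit token among those the grammar currently allows.
/// Returns None if the grammar allows no token at all.
pub fn constrained_argmax(logits: &[f32], allowed: impl Fn(usize) -> bool) -> Option<usize> {
    logits
        .iter()
        .enumerate()
        .filter(|(id, _)| allowed(*id))
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(id, _)| id)
}

/// Toy grammar over a 3-token vocabulary: 0 = "{", 1 = "}", 2 = end-of-sequence.
/// It accepts only balanced braces and keeps the nesting depth as its state.
pub struct BraceGrammar {
    depth: usize,
}

impl BraceGrammar {
    pub fn new() -> Self {
        Self { depth: 0 }
    }

    /// Which tokens keep the output a valid prefix of a balanced string?
    pub fn allows(&self, token: usize) -> bool {
        match token {
            0 => true,            // opening is always fine
            1 => self.depth > 0,  // closing needs something open
            2 => self.depth == 0, // may stop only when balanced
            _ => false,
        }
    }

    /// Update the state after a token has been emitted.
    /// Callers must pass only tokens for which `allows` returned true.
    pub fn advance(&mut self, token: usize) {
        match token {
            0 => self.depth += 1,
            1 => self.depth -= 1,
            _ => {}
        }
    }
}
```

Because invalid tokens are never eligible, the output is valid by construction rather than by retrying or post-validating. A JSON grammar follows the same shape with a richer state than a single depth counter.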

Every source file in the workspace stays well under the 2000-line-per-file policy, so the controllers, kernels, and grammar engine remain easy to read and audit.

Getting Started

Install the CLI, which provides the oxibonsai binary:

cargo install oxibonsai-cli   # installs the `oxibonsai` binary

Wire up the self-tuning controllers in a few lines:

use oxibonsai_runtime::{KvCachePolicy, AdaptiveLookahead, AdaptiveLookaheadConfig};

// KV cache policy: FP16 ↔ Q8 ↔ Q4 driven by EWMA pressure with hysteresis.
let kv = KvCachePolicy::default();
let level = kv.observe(0.92);  // → escalates to Q8 once smoothed pressure crosses 0.80

// Speculative-decoding draft length: continuously updated from acceptance EWMA.
let mut k = AdaptiveLookahead::new(AdaptiveLookaheadConfig::default());
k.observe_step(5, 4);  // proposed=5, accepted=4 → k drifts toward 5
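For intuition about how a draft length can follow an acceptance EWMA, here is a self-contained sketch. It is not the AdaptiveLookahead implementation: the DraftLength type, the starting acceptance of 0.5, and the linear mapping from acceptance rate to draft length are assumptions chosen to keep the example small.

```rust
/// Hypothetical draft-length tuner for illustration; not the real `AdaptiveLookahead`.
pub struct DraftLength {
    alpha: f64,      // EWMA smoothing factor (assumed)
    min_k: usize,    // shortest draft ever proposed
    max_k: usize,    // longest draft ever proposed
    acceptance: f64, // smoothed accepted / proposed ratio
}

impl DraftLength {
    pub fn new(alpha: f64, min_k: usize, max_k: usize) -> Self {
        assert!(min_k >= 1 && min_k <= max_k);
        // Start from a neutral 50% acceptance estimate (assumption).
        Self { alpha, min_k, max_k, acceptance: 0.5 }
    }

    /// Record one speculative step and return the draft length to use next.
    pub fn observe_step(&mut self, proposed: usize, accepted: usize) -> usize {
        if proposed > 0 {
            let ratio = accepted.min(proposed) as f64 / proposed as f64;
            self.acceptance = self.alpha * ratio + (1.0 - self.alpha) * self.acceptance;
        }
        self.k()
    }

    /// Map the smoothed acceptance rate linearly onto [min_k, max_k].
    pub fn k(&self) -> usize {
        let span = (self.max_k - self.min_k) as f64;
        self.min_k + (self.acceptance * span).round() as usize
    }
}
```

The trade-off this captures: when most drafted tokens are accepted, longer drafts amortize more work per verification pass, and when acceptance drops, shorter drafts waste less compute on tokens that will be thrown away.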

Run the full controllers example:

cargo run --example runtime_controllers

And start the production serving entry point:

oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf

What’s New in 0.1.4

Tips

This is the foundation

OxiBonsai is built entirely on the COOLJAPAN ecosystem — SciRS2, OxiBLAS, OxiFFT, OxiARC, and OxiONNX — to run PrismML’s Bonsai sub-2-bit models with zero FFI and zero C/C++/Fortran runtime. 0.1.4 takes that sovereign stack from fast to production-grade: self-tuning, observable, format-rich, and structurally safe, with no foreign code anywhere in the default build.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you want production-grade sovereign serving — an inference engine that tunes itself, reports everything, and answers to no foreign runtime.

Pure Rust inference, ready for real traffic, is here: fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, May 16, 2026
