The engine that serves itself: a sub-2-bit inference runtime that tunes its own KV-cache and draft length under load, reports every request to Prometheus, and can guarantee the JSON it emits is valid — all in Pure Rust.
Today we released OxiBonsai 0.1.4 — a production-grade serving upgrade that turns the fast KV-cache reuse of 0.1.3 into a self-tuning, fully observable runtime with new FP8 and K-quant model support and grammar-constrained decoding.
No llama.cpp. No BLAS. No C, no C++, no Fortran. OxiBonsai 0.1.4 remains the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family — the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128) — running on CPU SIMD (AVX2/AVX-512/NEON/WASM), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC), built end-to-end on the COOLJAPAN ecosystem.
Why OxiBonsai 0.1.4 matters
Version 0.1.3 made KV-cache reuse fast with the PrefixCachedEngine, tokenizer auto-detect, and GPU weight caching. But fast is not the same as production-ready. Serving real traffic means the engine has to behave when load spikes, when caches fill, and when downstream systems need to know exactly what happened to each request.
0.1.4 closes that gap. The engine now self-tunes under load — KV-cache precision and speculative-decoding draft length adapt automatically instead of being fixed knobs. It is fully observable — every request flows through Prometheus gauges and a per-request X-Request-ID tracing span. It expands the supported formats — standard FP8 and K-quant GGUF models load and run alongside the sub-2-bit Bonsai formats. And it can guarantee structurally valid output — a grammar-constrained sampler that can force always-valid JSON at the token level. That is the difference between a fast research engine and one you can put behind a load balancer.
Technical Deep Dive
Theme 1 — Production runtime controllers and observability
The new controllers live in oxibonsai-runtime.
They are designed to react to live workload signals rather than static configuration.
- KvCacheCompressionPolicy adapts KV-cache precision across FP16 → Q8 → Q4 based on cache-pressure thresholds. Pressure is tracked with an EWMA and gated by explicit hysteresis so the tier does not thrash near a boundary. The README and API also expose it under the short name KvCachePolicy, whose .observe(pressure) method returns a KvCacheLevel.
- AdaptiveLookahead is a speculative-decoding draft-length controller. It updates the lookahead k from a running EWMA of accepted-tokens-per-step, clamped to a configurable [min, max] window.
- SpeculativeDecoder gained a with_adaptive(...) constructor that refreshes draft k after each step, fed by .observe_step(proposed, accepted).
- RequestRateTracker records per-request EMA tokens/sec, p50/p95 inter-token latency, and queue-wait time, all surfaced through InferenceMetrics.
- RequestRateAggregator rolls per-request snapshots into the workload gauges.
- RequestId is a UUIDv4-style 128-bit hex identifier produced by a deterministic xorshift64-based generator, from the oxibonsai_runtime::request_id module. It carries as_bytes() / from_bytes() (big-endian) and as_uuid() / from_uuid() for tracing-span correlation.
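To make the EWMA-plus-hysteresis idea concrete, here is a minimal sketch of a tier controller. It is not the crate's implementation: the type names (Tier, TierPolicy), the smoothing factor, and every threshold except the 0.80 escalation point quoted in Getting Started are assumptions chosen for illustration.

```rust
/// Hypothetical tier enum; the real crate exposes `KvCacheLevel`.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Tier {
    Fp16,
    Q8,
    Q4,
}

/// Illustrative EWMA + hysteresis controller (assumed design, not the crate's code).
pub struct TierPolicy {
    alpha: f32,     // EWMA smoothing factor in (0, 1]
    up: [f32; 2],   // escalate thresholds: Fp16 -> Q8, Q8 -> Q4
    down: [f32; 2], // relax thresholds:    Q8 -> Fp16, Q4 -> Q8
    smoothed: f32,  // smoothed cache pressure in [0, 1]
    tier: Tier,
}

impl TierPolicy {
    pub fn new() -> Self {
        Self {
            alpha: 0.5,
            up: [0.80, 0.92],   // 0.92 is an assumed value
            down: [0.65, 0.85], // assumed; each sits below its `up` partner
            smoothed: 0.0,
            tier: Tier::Fp16,
        }
    }

    /// Feeds one raw pressure sample and returns the (possibly new) tier.
    /// The tier moves at most one step per sample.
    pub fn observe(&mut self, pressure: f32) -> Tier {
        let p = pressure.clamp(0.0, 1.0);
        self.smoothed = self.alpha * p + (1.0 - self.alpha) * self.smoothed;
        let s = self.smoothed;
        self.tier = match self.tier {
            Tier::Fp16 if s >= self.up[0] => Tier::Q8,
            Tier::Q8 if s >= self.up[1] => Tier::Q4,
            Tier::Q8 if s < self.down[0] => Tier::Fp16,
            Tier::Q4 if s < self.down[1] => Tier::Q8,
            unchanged => unchanged,
        };
        self.tier
    }

    pub fn smoothed(&self) -> f32 {
        self.smoothed
    }
}
```

Because the relax threshold sits below the escalate threshold, a smoothed pressure that hovers between 0.65 and 0.80 leaves the tier where it is, which is the anti-thrash behavior the hysteresis is there to provide.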
New Prometheus gauges in InferenceMetrics expose the live picture:
- oxibonsai_request_tokens_per_second
- oxibonsai_inter_token_latency_p50_seconds
- oxibonsai_inter_token_latency_p95_seconds
- oxibonsai_queue_wait_seconds
- oxibonsai_kv_cache_compression_level
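For readers who have not scraped gauges before, the sketch below renders name/value pairs in the Prometheus text exposition format. The render_gauges helper is hypothetical and any values passed to it are made up; only the gauge names come from the release. How the engine encodes the compression level as a number is not stated here, so the sketch does not assume it.

```rust
use std::fmt::Write;

/// Renders (name, value) pairs as Prometheus text-format gauges.
/// Hypothetical helper for illustration; not part of the OxiBonsai API.
pub fn render_gauges(gauges: &[(&str, f64)]) -> String {
    let mut out = String::new();
    for (name, value) in gauges {
        // Writing into a String cannot fail, so the Result is ignored.
        let _ = writeln!(out, "# TYPE {name} gauge");
        let _ = writeln!(out, "{name} {value}");
    }
    out
}
```

Each gauge becomes a `# TYPE ... gauge` line followed by a sample line, which is what a scraper would see on the metrics endpoint.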
The engine surface grew to match:
- InferenceEngine::generate_tracked(&[u32], usize, &mut RequestRateTracker)
- InferenceEngine::generate_with_request_id(RequestId, &[u32], usize) -> (Vec<u32>, RequestRateTracker)
- InferenceEngine::set_rate_aggregator(Arc<RequestRateAggregator>)
On the HTTP side, GET /admin/workload-stats returns a JSON snapshot of two things:
- the RequestRateAggregator (TBT p50/p95, EWMA tokens/sec, queue-wait, completed requests)
- the KvCachePolicy state (current tier, smoothed pressure, transition counters)
The OpenAI server now honors an X-Request-ID header. Client-supplied ids in UUID 8-4-4-4-12 form or as 32-char hex are echoed back verbatim, while absent or malformed headers trigger an auto-generated RequestId. Both streaming and non-streaming responses carry the header, and server tracing spans record request_id for end-to-end correlation. The header constant is REQUEST_ID_HEADER, with helpers resolve_request_id(&HeaderMap) and request_id_header_map(RequestId). See examples/runtime_controllers.rs and the criterion microbenchmarks in benches/controllers_bench.rs.
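The header rules above can be sketched as a small parser plus a deterministic generator. Everything here is illustrative: parse_request_id, XorShift64, and to_hex are hypothetical names, the shift constants are the common 13/7/17 xorshift64 variant rather than anything confirmed for OxiBonsai, and the sketch does not set UUIDv4 version bits.

```rust
/// Parses a client-supplied id in 32-char hex or UUID 8-4-4-4-12 form.
/// Returns None for anything else, the case where a server would
/// generate a fresh id instead.
pub fn parse_request_id(s: &str) -> Option<u128> {
    let hex: String = match s.len() {
        32 => s.to_owned(),
        36 => {
            let b = s.as_bytes();
            // Dashes must sit exactly at the 8-4-4-4-12 boundaries.
            if [8, 13, 18, 23].iter().any(|&i| b[i] != b'-') {
                return None;
            }
            s.chars().filter(|&c| c != '-').collect()
        }
        _ => return None,
    };
    if hex.len() != 32 || !hex.bytes().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u128::from_str_radix(&hex, 16).ok()
}

/// Deterministic xorshift64 generator (assumed 13/7/17 variant).
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // State must be non-zero, otherwise xorshift stays at zero forever.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Two 64-bit draws concatenated into one 128-bit id.
    pub fn next_id(&mut self) -> u128 {
        let hi = self.next_u64() as u128;
        let lo = self.next_u64() as u128;
        (hi << 64) | lo
    }
}

/// 32-char lowercase hex form; `id.to_be_bytes()` gives the big-endian bytes.
pub fn to_hex(id: u128) -> String {
    format!("{id:032x}")
}
```

A resolver in this style would try parse_request_id on the header value first and fall back to next_id() when the header is absent or malformed, which matches the behavior described above.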
Theme 2 — New quantization families
OxiBonsai now loads and runs standard FP8 and K-quant GGUF models in addition to the sub-2-bit Bonsai formats, with kernels spanning oxibonsai-kernels, oxibonsai-model, and oxibonsai-core.
FP8 family. The BlockFP8E4M3 and BlockFP8E5M2 block types pack 32 weights plus an FP16 scale into 34-byte blocks, with bit-exact IEEE-754-style encode/decode using RNE rounding.
- They map to PrismML’s FP8 extension GGUF type IDs 43 (F8_E4M3) and 44 (F8_E5M2) at roughly 8.5 bits per weight.
- Reference scalar kernels behind the Fp8Kernel trait are joined by Metal GEMV and Metal batch-prefill kernels (AoS 34-byte blocks, one simdgroup per output row).
- CUDA batch-prefill arrives via try_cuda_prefill_fp8.
- FP8 models route through batch GEMM on CUDA/Metal with a sequential GEMV fallback.
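The 8.5 bits-per-weight figure follows from the block layout: 34 bytes × 8 bits ÷ 32 weights. The sketch below decodes a single E4M3 byte and dequantizes one block's weights. It assumes the OCP "FN" convention for E4M3 (bias 7, no infinities, a single NaN pattern per sign) and assumes dequantization is a plain multiply by the block scale; neither detail is confirmed by the release, and the function names are hypothetical. The FP16 scale is passed in as an f32 to keep the example short.

```rust
pub const BLOCK_WEIGHTS: usize = 32;
/// 32 one-byte FP8 weights plus a 2-byte FP16 scale.
pub const BLOCK_BYTES: usize = BLOCK_WEIGHTS + 2;

/// Decodes one FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits, bias 7.
/// Assumes the OCP "FN" variant, where S.1111.111 is NaN and there is no infinity.
pub fn decode_e4m3(b: u8) -> f32 {
    let sign = if b & 0x80 != 0 { -1.0f32 } else { 1.0 };
    let exp = ((b >> 3) & 0x0F) as i32;
    let man_bits = b & 0x07;
    if exp == 0x0F && man_bits == 0x07 {
        return f32::NAN;
    }
    let man = man_bits as f32 / 8.0;
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading 1, exponent fixed at 1 - bias.
        man * 2f32.powi(1 - 7)
    } else {
        (1.0 + man) * 2f32.powi(exp - 7)
    };
    sign * magnitude
}

/// Dequantizes the 32 weight bytes of one block (assumed: value = decode * scale).
pub fn dequantize_block(weights: &[u8; BLOCK_WEIGHTS], scale: f32) -> [f32; BLOCK_WEIGHTS] {
    let mut out = [0.0f32; BLOCK_WEIGHTS];
    for (o, &w) in out.iter_mut().zip(weights.iter()) {
        *o = decode_e4m3(w) * scale;
    }
    out
}

/// Storage cost per weight for the block layout described above.
pub fn bits_per_weight() -> f32 {
    (BLOCK_BYTES * 8) as f32 / BLOCK_WEIGHTS as f32
}
```

Under these assumptions the largest finite E4M3 magnitude is 448 (byte 0x7E), which is why a per-block scale is needed to cover the range of real weight tensors.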
K-quant family. Standard K-quant GGUF formats now run with dedicated GPU kernels.
- Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, and Q8_K gain CUDA GEMV kernels and CUDA batch-prefill (try_cuda_prefill_k_quant, 18 NVRTC kernels, 3 per format).
- Q4_0/Q8_0 also get CUDA full-forward dispatch plus batch prefill (try_cuda_prefill_q_std).
- K-quant batch GEMM requires hidden_size % 256 == 0.
All batch kernels use a “cap-of-8” outer-loop pattern, so arbitrary batch sizes are processed correctly with no silent truncation past eight columns.
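One way to read the "cap-of-8" pattern is as a host-side loop that tiles the batch into groups of at most eight columns, with the final partial tile handled like any other. The CPU sketch below illustrates that reading only; batched_gemv and tile_count are hypothetical names, and the real GPU kernels are not shown.

```rust
/// Maximum number of batch columns handled per tile.
pub const CAP: usize = 8;

/// Number of tiles needed to cover `batch` columns (rounds up).
pub fn tile_count(batch: usize) -> usize {
    (batch + CAP - 1) / CAP
}

/// Computes outputs[b][r] = dot(weights[r], inputs[b]) for every batch column b.
pub fn batched_gemv(weights: &[Vec<f32>], inputs: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let mut outputs: Vec<Vec<f32>> = Vec::with_capacity(inputs.len());
    // The outer loop walks the batch in tiles of at most CAP columns.
    // `chunks` also yields the last, shorter tile, so nothing is dropped.
    for tile in inputs.chunks(CAP) {
        // Stand-in for one kernel launch that handles up to CAP columns.
        for x in tile {
            let y: Vec<f32> = weights
                .iter()
                .map(|row| row.iter().zip(x).map(|(w, v)| w * v).sum::<f32>())
                .collect();
            outputs.push(y);
        }
    }
    outputs
}
```

With a batch of 19 columns this runs three tiles (8, 8, and 3), and all 19 outputs are produced.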
Theme 3 — Constrained and structured decoding
0.1.4 lets you constrain the sampler at the token level so the output is structurally valid by construction.
- AllowListConstraint restricts output to a finite set of token-id sequences, ideal for multiple-choice forced answers.
- SequenceConstraint forces output to follow a specific token-id sequence exactly.
- LengthConstraint enforces hard [min_len, max_len] output-length bounds with an optional stop_token.
- A new BNF grammar engine in the grammar/ module provides full context-free-grammar support. Its components:
  - an Earley recognizer that handles arbitrary CFGs (including left-recursive and ambiguous grammars) through set-based memoization;
  - a hand-rolled BNF text parser covering alternation, recursion, comments, line continuation, and escape sequences;
  - GrammarConstraint, which implements the TokenConstraint trait;
  - pre-computed FIRST sets that give O(1) next-byte lookahead, exposed directly through the next_byte_set() API;
  - pre-canned grammars for arithmetic, aⁿbⁿ, CSV rows, and minimal JSON.
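Token-level constraining generally comes down to two steps: ask the constraint which token ids may come next, then mask every other logit before sampling. The sketch below shows that loop for an allow-list. The Constraint trait, AllowList struct, and helper functions are hypothetical stand-ins; the crate's actual TokenConstraint trait and AllowListConstraint may be shaped differently.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the crate's `TokenConstraint` trait.
pub trait Constraint {
    /// Token ids permitted next, given the tokens generated so far.
    fn allowed_next(&self, generated: &[u32]) -> BTreeSet<u32>;
}

/// Output must be exactly one of these token-id sequences.
pub struct AllowList {
    pub sequences: Vec<Vec<u32>>,
}

impl Constraint for AllowList {
    fn allowed_next(&self, generated: &[u32]) -> BTreeSet<u32> {
        self.sequences
            .iter()
            // Keep sequences that extend the current prefix by at least one token.
            .filter(|seq| seq.len() > generated.len() && seq.starts_with(generated))
            .map(|seq| seq[generated.len()])
            .collect()
    }
}

/// Sets every disallowed logit to -inf so sampling can never pick it.
pub fn mask_logits(logits: &mut [f32], allowed: &BTreeSet<u32>) {
    for (id, logit) in logits.iter_mut().enumerate() {
        if !allowed.contains(&(id as u32)) {
            *logit = f32::NEG_INFINITY;
        }
    }
}

/// Greedy pick over the surviving logits; None if everything is masked.
pub fn greedy(logits: &[f32]) -> Option<u32> {
    logits
        .iter()
        .enumerate()
        .filter(|(_, l)| l.is_finite())
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i as u32)
}
```

A grammar constraint fits the same shape: instead of comparing prefixes against a fixed list, it would consult the recognizer state to decide which tokens keep the output inside the grammar.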
Every source file in the workspace stays well under the 2000-line-per-file policy with substantial headroom, so the controllers, kernels, and grammar engine remain easy to read and audit.
Getting Started
Install the CLI, which provides the oxibonsai binary:
cargo install oxibonsai-cli # installs the `oxibonsai` binary
Wire up the self-tuning controllers in a few lines:
use oxibonsai_runtime::{KvCachePolicy, AdaptiveLookahead, AdaptiveLookaheadConfig};
// KV cache policy: FP16 ↔ Q8 ↔ Q4 driven by EWMA pressure with hysteresis.
let kv = KvCachePolicy::default();
let level = kv.observe(0.92); // → escalates to Q8 once smoothed pressure crosses 0.80
// Speculative-decoding draft length: continuously updated from acceptance EWMA.
let mut k = AdaptiveLookahead::new(AdaptiveLookaheadConfig::default());
k.observe_step(5, 4); // proposed=5, accepted=4 → k drifts toward 5
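To show why observing proposed=5, accepted=4 would pull the draft length toward 5, here is a self-contained sketch of one plausible update rule: keep an EWMA of accepted tokens per step and draft one token more than that average, clamped to [min, max]. The Lookahead type, the "accepted + 1" target, the smoothing factor, and the starting value are all assumptions; the crate's AdaptiveLookahead may use a different rule.

```rust
/// Illustrative draft-length controller (assumed rule, not the crate's code).
pub struct Lookahead {
    alpha: f32, // EWMA smoothing factor
    ewma: f32,  // running average of accepted tokens per step
    min_k: usize,
    max_k: usize,
}

impl Lookahead {
    pub fn new(min_k: usize, max_k: usize) -> Self {
        assert!(min_k >= 1 && min_k <= max_k);
        Self { alpha: 0.5, ewma: 1.0, min_k, max_k }
    }

    /// Records one speculative step and returns the refreshed draft length.
    pub fn observe_step(&mut self, proposed: usize, accepted: usize) -> usize {
        // A step can never accept more tokens than it proposed.
        let accepted = accepted.min(proposed) as f32;
        self.ewma = self.alpha * accepted + (1.0 - self.alpha) * self.ewma;
        self.k()
    }

    /// Draft one token beyond the typical accepted run, inside [min_k, max_k].
    pub fn k(&self) -> usize {
        (self.ewma.round() as usize + 1).clamp(self.min_k, self.max_k)
    }
}
```

Under this rule a run of steps that accept 4 of 5 proposals settles at k = 5, a run of rejections falls back to the configured minimum, and a run of full acceptances saturates at the maximum.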
Run the full controllers example:
cargo run --example runtime_controllers
And start the production serving entry point:
oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf
What’s New in 0.1.4
- Self-tuning runtime. Adaptive KV-cache compression (FP16 → Q8 → Q4 with EWMA pressure and hysteresis) and adaptive speculative-decoding draft length react to live load instead of fixed knobs.
- Full observability. New Prometheus gauges for tokens/sec, inter-token latency p50/p95, queue-wait, and KV-cache compression level, plus per-request tracing via X-Request-ID and a GET /admin/workload-stats JSON snapshot.
- More model formats. Standard FP8 (F8_E4M3 / F8_E5M2) and K-quant (Q2_K … Q8_K) GGUF models now load and run alongside the sub-2-bit Bonsai formats, with CUDA and Metal batch kernels.
- Guaranteed-valid output. Grammar-constrained decoding with an Earley recognizer and BNF parser, plus allow-list, sequence, and length constraints, can force structurally valid output such as always-valid JSON.
- Policy-clean codebase. Every source file in the workspace stays under the 2000-line-per-file policy with substantial headroom.
Tips
- Scrape the new Prometheus gauges to watch behavior under load: oxibonsai_request_tokens_per_second, oxibonsai_inter_token_latency_p50_seconds, oxibonsai_inter_token_latency_p95_seconds, oxibonsai_queue_wait_seconds, and oxibonsai_kv_cache_compression_level.
- Hit GET /admin/workload-stats for a live JSON snapshot of the RequestRateAggregator (TBT p50/p95, EWMA tokens/sec, queue-wait, completed requests) and KvCachePolicy state (tier, smoothed pressure, transition counters).
- Set the X-Request-ID header (UUID 8-4-4-4-12 or 32-char hex) on requests for end-to-end tracing; it is echoed back verbatim on both streaming and non-streaming responses, and auto-generated when absent or malformed.
- Guarantee valid JSON by attaching a GrammarConstraint built from the minimal-JSON pre-canned grammar, or write your own BNF (alternation, recursion, comments) and let the Earley recognizer enforce it; use next_byte_set() to inspect allowed next bytes.
- Run standard FP8 (F8_E4M3 / F8_E5M2) and K-quant (Q2_K … Q8_K) GGUF models directly; just remember K-quant batch GEMM needs hidden_size % 256 == 0.
- Let SpeculativeDecoder::with_adaptive(...) and AdaptiveLookahead tune draft length automatically, and call generate_tracked or generate_with_request_id to get a RequestRateTracker back for client-side telemetry.
This is the foundation
OxiBonsai is built entirely on the COOLJAPAN ecosystem — SciRS2, OxiBLAS, OxiFFT, OxiARC, and OxiONNX — to run PrismML’s Bonsai sub-2-bit models with zero FFI and zero C/C++/Fortran runtime. 0.1.4 takes that sovereign stack from fast to production-grade: self-tuning, observable, format-rich, and structurally safe, with no foreign code anywhere in the default build.
Repository: https://github.com/cool-japan/oxibonsai
Star the repo if you want production-grade sovereign serving — an inference engine that tunes itself, reports everything, and answers to no foreign runtime.
Pure Rust sovereign inference, ready for real traffic, is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ May 16, 2026