OxiBonsai 0.1.4 Released — Production-Grade Sovereign Serving: Self-Tuning Runtime, Prometheus + X-Request-ID Observability, FP8 & K-Quant, and Grammar-Constrained Output

The engine that serves itself: a sub-2-bit inference runtime that tunes its own KV-cache and draft length under load, reports every request to Prometheus, and can guarantee the JSON it emits is valid — all in Pure Rust.

Today we released OxiBonsai 0.1.4 — a production-grade serving upgrade that turns the fast KV-cache reuse of 0.1.3 into a self-tuning, fully observable runtime with new FP8 and K-quant model support and grammar-constrained decoding.

No llama.cpp. No BLAS. No C, no C++, no Fortran. OxiBonsai 0.1.4 remains the first Pure Rust, zero-FFI inference engine for PrismML’s sub-2-bit Bonsai model family — the 1-bit line (Q1_0_g128) and the ternary line (TQ2_0_g128) — running on CPU SIMD (AVX2/AVX-512/NEON/WASM), Apple Silicon (Metal), and NVIDIA (CUDA NVRTC), built end-to-end on the COOLJAPAN ecosystem.

Why OxiBonsai 0.1.4 matters

Version 0.1.3 made KV-cache reuse fast with the PrefixCachedEngine, tokenizer auto-detect, and GPU weight caching. But fast is not the same as production-ready. Serving real traffic means the engine has to behave when load spikes, when caches fill, and when downstream systems need to know exactly what happened to each request.

0.1.4 closes that gap. The engine now self-tunes under load — KV-cache precision and speculative-decoding draft length adapt automatically instead of being fixed knobs. It is fully observable — every request flows through Prometheus gauges and a per-request X-Request-ID tracing span. It expands the supported formats — standard FP8 and K-quant GGUF models load and run alongside the sub-2-bit Bonsai formats. And it can guarantee structurally valid output — a grammar-constrained sampler that can force always-valid JSON at the token level. That is the difference between a fast research engine and one you can put behind a load balancer.

Technical Deep Dive

Theme 1 — Production runtime controllers and observability

The new controllers live in oxibonsai-runtime and are designed to react to live workload signals rather than static configuration.

KvCacheCompressionPolicy adapts KV-cache precision across FP16 → Q8 → Q4 based on cache-pressure thresholds. Pressure is tracked with an EWMA and gated by explicit hysteresis so the tier does not thrash near a boundary. The README and API also expose it under the short name KvCachePolicy, whose .observe(pressure) method returns a KvCacheLevel.
AdaptiveLookahead is a speculative-decoding draft-length controller. It updates the lookahead k from a running EWMA of accepted-tokens-per-step, clamped to a configurable [min, max] window. SpeculativeDecoder gained a with_adaptive(...) constructor that refreshes draft k after each step, fed by .observe_step(proposed, accepted).
RequestRateTracker records per-request EMA tokens/sec, p50/p95 inter-token latency, and queue-wait time, all surfaced through InferenceMetrics. RequestRateAggregator rolls per-request snapshots into the workload gauges.
RequestId, from the oxibonsai_runtime::request_id module, is a UUIDv4-style 128-bit identifier, rendered as hex and produced by a deterministic xorshift64-based generator. It provides as_bytes()/from_bytes() (big-endian) and as_uuid()/from_uuid() for tracing-span correlation.
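To make the EWMA-plus-hysteresis idea concrete, here is a minimal sketch of a tiering policy. It is not the KvCachePolicy implementation: the TierPolicy and Tier names, the thresholds, and the one-tier-per-observation rule are all assumptions chosen for illustration.

```rust
/// Hypothetical stand-in for the tiers named in the text (FP16 -> Q8 -> Q4).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Fp16,
    Q8,
    Q4,
}

/// Illustrative policy: EWMA-smoothed pressure plus a hysteresis band.
pub struct TierPolicy {
    alpha: f64,      // EWMA weight of the newest sample
    up_q8: f64,      // escalate FP16 -> Q8 at or above this smoothed pressure
    up_q4: f64,      // escalate Q8 -> Q4 at or above this smoothed pressure
    hysteresis: f64, // step back down only after falling this far below a threshold
    smoothed: f64,
    tier: Tier,
}

impl TierPolicy {
    pub fn new(alpha: f64, up_q8: f64, up_q4: f64, hysteresis: f64) -> Self {
        Self { alpha, up_q8, up_q4, hysteresis, smoothed: 0.0, tier: Tier::Fp16 }
    }

    pub fn smoothed(&self) -> f64 {
        self.smoothed
    }

    /// Feed one raw pressure sample in [0, 1]; returns the tier after the update.
    /// The tier moves by at most one step per observation.
    pub fn observe(&mut self, pressure: f64) -> Tier {
        let p = pressure.clamp(0.0, 1.0);
        self.smoothed = self.alpha * p + (1.0 - self.alpha) * self.smoothed;
        let s = self.smoothed;
        self.tier = match self.tier {
            Tier::Fp16 if s >= self.up_q8 => Tier::Q8,
            Tier::Q8 if s >= self.up_q4 => Tier::Q4,
            Tier::Q8 if s < self.up_q8 - self.hysteresis => Tier::Fp16,
            Tier::Q4 if s < self.up_q4 - self.hysteresis => Tier::Q8,
            unchanged => unchanged,
        };
        self.tier
    }
}
```

With a threshold of 0.80 and a hysteresis of 0.10, a smoothed pressure that dips to 0.79 keeps the cache at Q8; it only returns to FP16 once the smoothed value falls below 0.70. That gap is what prevents thrashing near a boundary.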

New Prometheus gauges in InferenceMetrics expose the live picture:

oxibonsai_request_tokens_per_second
oxibonsai_inter_token_latency_p50_seconds
oxibonsai_inter_token_latency_p95_seconds
oxibonsai_queue_wait_seconds
oxibonsai_kv_cache_compression_level
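As a rough illustration of what sits behind the latency gauges, the sketch below computes nearest-rank percentiles over a set of inter-token latencies and renders gauges in the Prometheus text exposition format. The function names are hypothetical, and the percentile method is an assumption; the release does not say how RequestRateTracker computes p50/p95.

```rust
use std::fmt::Write;

/// Nearest-rank percentile over a slice of samples (`q` in 0..=100).
/// Returns None for an empty slice.
pub fn percentile(samples: &[f64], q: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let rank = ((q / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Render (name, value) pairs as gauges in the Prometheus text format:
/// a `# TYPE <name> gauge` line followed by `<name> <value>`.
pub fn render_gauges(gauges: &[(&str, f64)]) -> String {
    let mut out = String::new();
    for (name, value) in gauges {
        // Writing into a String cannot fail, so the results are ignored.
        let _ = writeln!(out, "# TYPE {name} gauge");
        let _ = writeln!(out, "{name} {value}");
    }
    out
}
```

For five latencies of 10, 12, 11, 50, and 13 milliseconds, nearest-rank p50 is 12 ms and p95 is 50 ms, which is why a single slow token shows up in the p95 gauge long before it moves the median.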

The engine surface grew to match:

InferenceEngine::generate_tracked(&[u32], usize, &mut RequestRateTracker)
InferenceEngine::generate_with_request_id(RequestId, &[u32], usize) -> (Vec<u32>, RequestRateTracker)
InferenceEngine::set_rate_aggregator(Arc<RequestRateAggregator>)

On the HTTP side, GET /admin/workload-stats returns a JSON snapshot of two things:

the RequestRateAggregator (TBT p50/p95, EWMA tokens/sec, queue-wait, completed requests)
the KvCachePolicy state (current tier, smoothed pressure, transition counters)

The OpenAI server now honors an X-Request-ID header. Client-supplied ids in UUID 8-4-4-4-12 form or as 32-char hex are echoed back verbatim, while absent or malformed headers trigger an auto-generated RequestId. Both streaming and non-streaming responses carry the header, and server tracing spans record request_id for end-to-end correlation. The header constant is REQUEST_ID_HEADER, with helpers resolve_request_id(&HeaderMap) and request_id_header_map(RequestId). See examples/runtime_controllers.rs and the criterion microbenchmarks in benches/controllers_bench.rs.
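The header-resolution rule described above can be sketched with the standard library alone. This is not the code behind resolve_request_id: parse_request_id, XorShift64, and resolve are hypothetical names, the id is held as a plain u128, and the generator skips the UUIDv4 version and variant bits. A real server would also keep the client's original string so it can be echoed back verbatim.

```rust
/// Parse a client-supplied id given as 32 hex chars or in UUID 8-4-4-4-12 form.
pub fn parse_request_id(raw: &str) -> Option<u128> {
    let s = raw.trim();
    let hex: String = match s.len() {
        32 => s.to_owned(),
        36 => {
            let b = s.as_bytes();
            // Hyphens must sit exactly at the 8-4-4-4-12 group boundaries.
            if [8, 13, 18, 23].iter().any(|&i| b[i] != b'-') {
                return None;
            }
            s.chars().filter(|&c| c != '-').collect()
        }
        _ => return None,
    };
    if hex.len() != 32 || !hex.bytes().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u128::from_str_radix(&hex, 16).ok()
}

/// Deterministic xorshift64 generator (shift triple 13, 7, 17).
pub struct XorShift64(u64);

impl XorShift64 {
    pub fn new(seed: u64) -> Self {
        // The all-zero state is a fixed point of xorshift, so replace it.
        Self(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    pub fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Two 64-bit draws concatenated into one 128-bit id.
    pub fn next_id(&mut self) -> u128 {
        ((self.next_u64() as u128) << 64) | self.next_u64() as u128
    }
}

/// Use the client's id when it is well formed; otherwise generate one.
pub fn resolve(header: Option<&str>, rng: &mut XorShift64) -> u128 {
    header.and_then(parse_request_id).unwrap_or_else(|| rng.next_id())
}
```

Treating a malformed header the same way as a missing one keeps the request flowing while still giving the tracing span a usable id.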

Theme 2 — New quantization families

OxiBonsai now loads and runs standard FP8 and K-quant GGUF models in addition to the sub-2-bit Bonsai formats, with kernels spanning oxibonsai-kernels, oxibonsai-model, and oxibonsai-core.

FP8 family. The BlockFP8E4M3 and BlockFP8E5M2 block types pack 32 weights plus an FP16 scale into 34-byte blocks, with bit-exact IEEE-754-style encode/decode using RNE rounding.

They map to PrismML’s FP8 extension GGUF type IDs 43 (F8_E4M3) and 44 (F8_E5M2) at 8.5 bits per weight (34 bytes × 8 bits ÷ 32 weights).
Reference scalar kernels behind the Fp8Kernel trait are joined by Metal GEMV and Metal batch-prefill kernels (AoS 34-byte blocks, one simdgroup per output row).
CUDA batch-prefill arrives via try_cuda_prefill_fp8.
FP8 models route through batch GEMM on CUDA/Metal with a sequential GEMV fallback.
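For readers new to FP8, the following sketch decodes one E4M3 byte and dequantizes a 32-weight block. It assumes the common "FN" convention for E4M3 (exponent bias 7, no infinities, a single NaN pattern per sign); the release does not state which convention BlockFP8E4M3 uses. The scale is passed as an f32 to sidestep FP16 decoding, and the function names are hypothetical.

```rust
pub const WEIGHTS_PER_BLOCK: usize = 32;
/// 32 one-byte FP8 codes plus a two-byte FP16 scale.
pub const BLOCK_BYTES: usize = WEIGHTS_PER_BLOCK + 2;

/// Storage cost per weight, in bits.
pub fn bits_per_weight() -> f64 {
    (BLOCK_BYTES * 8) as f64 / WEIGHTS_PER_BLOCK as f64
}

/// Decode one E4M3 byte: 1 sign bit, 4 exponent bits (bias 7), 3 mantissa bits.
/// Assumption: "FN" convention, where only exponent 15 with mantissa 7 is NaN.
pub fn decode_e4m3(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0_f32 } else { 1.0 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let man = (byte & 0x07) as f32;
    if exp == 0x0F && (byte & 0x07) == 0x07 {
        return f32::NAN;
    }
    if exp == 0 {
        // Subnormal: no implicit leading one, fixed exponent of 2^-6.
        return sign * (man / 8.0) * 2.0_f32.powi(-6);
    }
    sign * (1.0 + man / 8.0) * 2.0_f32.powi(exp - 7)
}

/// Dequantize one block: every decoded code is multiplied by the block scale.
pub fn dequantize_block(codes: &[u8; WEIGHTS_PER_BLOCK], scale: f32) -> [f32; WEIGHTS_PER_BLOCK] {
    let mut out = [0.0_f32; WEIGHTS_PER_BLOCK];
    for (o, &c) in out.iter_mut().zip(codes.iter()) {
        *o = scale * decode_e4m3(c);
    }
    out
}
```

Under this convention the largest finite E4M3 value is 448 (byte 0x7E), so the per-block scale is what lets a block represent weights outside that range.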

K-quant family. Standard K-quant GGUF formats now run with dedicated GPU kernels.

Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, and Q8_K gain CUDA GEMV kernels and CUDA batch-prefill (try_cuda_prefill_k_quant; 18 NVRTC kernels, 3 for each of the 6 formats).
Q4_0/Q8_0 also get CUDA full-forward dispatch plus batch prefill (try_cuda_prefill_q_std).
K-quant batch GEMM requires hidden_size % 256 == 0, which matches the 256-weight super-block used by the standard K-quant formats.

All batch kernels use a “cap-of-8” outer-loop pattern, so arbitrary batch sizes are processed correctly with no silent truncation past eight columns.
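The "cap-of-8" pattern can be pictured as an outer loop that feeds the batch to an inner kernel in chunks of at most eight columns. The CPU sketch below is only an analogy for the GPU kernels: kernel_up_to_8 and batched_matvec are hypothetical names, and the matrices are plain Vec<f32> rows rather than quantized blocks.

```rust
const MAX_COLS: usize = 8;

/// Inner "kernel": multiplies the weight matrix `w` by up to MAX_COLS input
/// columns, writing one output vector per column.
fn kernel_up_to_8(w: &[Vec<f32>], cols: &[Vec<f32>], out: &mut [Vec<f32>]) {
    assert!(cols.len() <= MAX_COLS && cols.len() == out.len());
    for (x, y) in cols.iter().zip(out.iter_mut()) {
        *y = w
            .iter()
            .map(|row| row.iter().zip(x).map(|(a, b)| a * b).sum::<f32>())
            .collect();
    }
}

/// Outer loop: walk the batch in chunks of MAX_COLS so every column is
/// processed, whatever the batch size.
pub fn batched_matvec(w: &[Vec<f32>], batch: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let mut out = vec![Vec::new(); batch.len()];
    for (xs, ys) in batch.chunks(MAX_COLS).zip(out.chunks_mut(MAX_COLS)) {
        kernel_up_to_8(w, xs, ys);
    }
    out
}
```

A batch of 19 columns is handled as chunks of 8, 8, and 3. Without the outer loop, a kernel written for eight columns would silently drop columns 9 through 19.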

Theme 3 — Constrained and structured decoding

0.1.4 lets you constrain the sampler at the token level so the output is structurally valid by construction.

AllowListConstraint restricts output to a finite set of token-id sequences — ideal for multiple-choice forced answers.
SequenceConstraint forces output to follow a specific token-id sequence exactly.
LengthConstraint enforces hard [min_len, max_len] output-length bounds with an optional stop_token.
A new BNF grammar engine in the grammar/ module provides full context-free-grammar support. Its components:
- an Earley recognizer that handles arbitrary CFGs — including left-recursive and ambiguous grammars — through set-based memoization;
- a hand-rolled BNF text parser covering alternation, recursion, comments, line continuation, and escape sequences;
- GrammarConstraint, which implements the TokenConstraint trait;
- pre-computed FIRST sets that give O(1) next-byte lookahead, exposed directly through the next_byte_set() API;
- pre-canned grammars for arithmetic, aⁿbⁿ, CSV rows, and minimal JSON.
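To show why an Earley recognizer copes with left-recursive and nullable rules, here is a compact byte-level sketch. It is not the grammar/ module's implementation: Grammar, Sym, and recognizes are hypothetical names, item sets are deduplicated Vecs rather than hash sets, and each set is simply reprocessed until it stops growing so that empty rules are handled correctly.

```rust
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum Sym {
    T(u8),    // terminal byte
    N(usize), // nonterminal id
}

pub struct Grammar {
    pub rules: Vec<(usize, Vec<Sym>)>, // (left-hand nonterminal, right-hand side)
    pub start: usize,
}

type Item = (usize, usize, usize); // (rule index, dot position, origin set)

fn add(set: &mut Vec<Item>, item: Item) {
    if !set.contains(&item) {
        set.push(item);
    }
}

pub fn recognizes(g: &Grammar, input: &[u8]) -> bool {
    let mut chart: Vec<Vec<Item>> = vec![Vec::new(); input.len() + 1];
    for (r, (lhs, _)) in g.rules.iter().enumerate() {
        if *lhs == g.start {
            add(&mut chart[0], (r, 0, 0));
        }
    }
    for pos in 0..=input.len() {
        // Reprocess the set until it stops growing (covers nullable rules).
        loop {
            let before = chart[pos].len();
            let mut i = 0;
            while i < chart[pos].len() {
                let (r, dot, origin) = chart[pos][i];
                i += 1;
                let (lhs, rhs) = &g.rules[r];
                match rhs.get(dot) {
                    // Predict: expand the nonterminal after the dot.
                    Some(Sym::N(n)) => {
                        for (r2, (lhs2, _)) in g.rules.iter().enumerate() {
                            if lhs2 == n {
                                add(&mut chart[pos], (r2, 0, pos));
                            }
                        }
                    }
                    // Scan: consume the next input byte if it matches.
                    Some(Sym::T(t)) => {
                        if input.get(pos) == Some(t) {
                            add(&mut chart[pos + 1], (r, dot + 1, origin));
                        }
                    }
                    // Complete: advance every item that was waiting on `lhs`.
                    None => {
                        let parents: Vec<Item> = chart[origin].clone();
                        for (pr, pdot, porigin) in parents {
                            if g.rules[pr].1.get(pdot) == Some(&Sym::N(*lhs)) {
                                add(&mut chart[pos], (pr, pdot + 1, porigin));
                            }
                        }
                    }
                }
            }
            if chart[pos].len() == before {
                break;
            }
        }
    }
    chart[input.len()].iter().any(|&(r, dot, origin)| {
        let (lhs, rhs) = &g.rules[r];
        *lhs == g.start && dot == rhs.len() && origin == 0
    })
}
```

Encoding aⁿbⁿ as S → 'a' S 'b' | ε accepts "aabb" and the empty string but rejects "aab". A left-recursive rule such as E → E '+' 'n' | 'n' terminates because a predicted item that is already in the set is never added twice.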

Every source file in the workspace stays well under the 2000-line-per-file policy, so the controllers, kernels, and grammar engine remain easy to read and audit.

Getting Started

Install the CLI, which provides the oxibonsai binary:

cargo install oxibonsai-cli   # installs the `oxibonsai` binary

Wire up the self-tuning controllers in a few lines:

use oxibonsai_runtime::{KvCachePolicy, AdaptiveLookahead, AdaptiveLookaheadConfig};

// KV cache policy: FP16 ↔ Q8 ↔ Q4 driven by EWMA pressure with hysteresis.
let mut kv = KvCachePolicy::default();  // `mut` because observe() presumably updates the smoothed pressure
let level = kv.observe(0.92);  // → escalates to Q8 once smoothed pressure crosses 0.80

// Speculative-decoding draft length: continuously updated from acceptance EWMA.
let mut k = AdaptiveLookahead::new(AdaptiveLookaheadConfig::default());
k.observe_step(5, 4);  // proposed=5, accepted=4 → k drifts toward 5
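
For intuition about how a draft length can follow the acceptance rate, here is a self-contained sketch. It is not the AdaptiveLookahead implementation: DraftLen is a hypothetical type, and the rule "draft one token more than the smoothed accepted count" is an assumption chosen so that proposed=5, accepted=4 settles at k = 5.

```rust
/// Hypothetical draft-length controller driven by an acceptance EWMA.
pub struct DraftLen {
    alpha: f64, // EWMA weight of the newest step
    min_k: usize,
    max_k: usize,
    ewma_accepted: f64,
    k: usize,
}

impl DraftLen {
    /// Assumes min_k <= max_k.
    pub fn new(alpha: f64, min_k: usize, max_k: usize, initial_k: usize) -> Self {
        let k = initial_k.clamp(min_k, max_k);
        Self { alpha, min_k, max_k, ewma_accepted: k as f64, k }
    }

    pub fn k(&self) -> usize {
        self.k
    }

    /// Record one speculative step and return the draft length for the next one.
    pub fn observe_step(&mut self, proposed: usize, accepted: usize) -> usize {
        let a = accepted.min(proposed) as f64;
        self.ewma_accepted = self.alpha * a + (1.0 - self.alpha) * self.ewma_accepted;
        // Assumed rule: leave one token of headroom above the smoothed
        // accepted count, then clamp into the [min_k, max_k] window.
        let target = self.ewma_accepted.round() as usize + 1;
        self.k = target.clamp(self.min_k, self.max_k);
        self.k
    }
}
```

When the draft model is mostly right, k climbs to the upper clamp; when drafts keep getting rejected, k falls back to the lower clamp so fewer wasted tokens are proposed.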

Run the full controllers example:

cargo run --example runtime_controllers

And start the production serving entry point:

oxibonsai serve --model models/Ternary-Bonsai-1.7B.gguf

What’s New in 0.1.4

Self-tuning runtime. Adaptive KV-cache compression (FP16 → Q8 → Q4 with EWMA pressure and hysteresis) and adaptive speculative-decoding draft length react to live load instead of fixed knobs.
Full observability. New Prometheus gauges for tokens/sec, inter-token latency p50/p95, queue-wait, and KV-cache compression level, plus per-request tracing via X-Request-ID and a GET /admin/workload-stats JSON snapshot.
More model formats. Standard FP8 (F8_E4M3 / F8_E5M2) and K-quant (Q2_K…Q8_K) GGUF models now load and run alongside the sub-2-bit Bonsai formats, with CUDA and Metal batch kernels.
Guaranteed-valid output. Grammar-constrained decoding with an Earley recognizer and BNF parser, plus allow-list, sequence, and length constraints, can force structurally valid output such as always-valid JSON.
Policy-clean codebase. Every source file in the workspace stays under the 2000-line-per-file policy with substantial headroom.

Tips

Scrape the new Prometheus gauges to watch behavior under load: oxibonsai_request_tokens_per_second, oxibonsai_inter_token_latency_p50_seconds, oxibonsai_inter_token_latency_p95_seconds, oxibonsai_queue_wait_seconds, and oxibonsai_kv_cache_compression_level.
Hit GET /admin/workload-stats for a live JSON snapshot of the RequestRateAggregator (TBT p50/p95, EWMA tokens/sec, queue-wait, completed requests) and KvCachePolicy state (tier, smoothed pressure, transition counters).
Set the X-Request-ID header (UUID 8-4-4-4-12 or 32-char hex) on requests for end-to-end tracing — it is echoed back verbatim on both streaming and non-streaming responses, and auto-generated when absent or malformed.
Guarantee valid JSON by attaching a GrammarConstraint built from the minimal-JSON pre-canned grammar, or write your own BNF (alternation, recursion, comments) and let the Earley recognizer enforce it; use next_byte_set() to inspect allowed next bytes.
Run standard FP8 (F8_E4M3 / F8_E5M2) and K-quant (Q2_K…Q8_K) GGUF models directly — just remember K-quant batch GEMM needs hidden_size % 256 == 0.
Let SpeculativeDecoder::with_adaptive(...) and AdaptiveLookahead tune draft length automatically, and call generate_tracked or generate_with_request_id to get a RequestRateTracker back for client-side telemetry.

This is the foundation

OxiBonsai is built entirely on the COOLJAPAN ecosystem — SciRS2, OxiBLAS, OxiFFT, OxiARC, and OxiONNX — to run PrismML’s Bonsai sub-2-bit models with zero FFI and zero C/C++/Fortran runtime. 0.1.4 takes that sovereign stack from fast to production-grade: self-tuning, observable, format-rich, and structurally safe, with no foreign code anywhere in the default build.

Repository: https://github.com/cool-japan/oxibonsai

Star the repo if you want production-grade sovereign serving — an inference engine that tunes itself, reports everything, and answers to no foreign runtime.

Pure Rust inference, ready for real traffic, is here: fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ, May 16, 2026