#quantization | COOLJAPAN Blog

Jun 8, 2026 · 8 min

OxiBonsai 0.2.2 Released — An Interactive Image REPL with Inline Terminal Rendering

OxiBonsai 0.2.2 adds `oxibonsai repl`: a resident ImageSession that loads the DiT, VAE, and text encoder once and iterates on prompts without re-paying the load/dequant cost — with images shown inline in Ghostty via a pure-Rust kitty graphics protocol, a `:fast`/`:hq` preview→finalize loop, and documented per-platform GPU flags. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem.

release · oxibonsai · llm

Jun 6, 2026 · 7 min

OxiBonsai 0.2.1 Released — Minutes-Long Numeric Tests, Now Fast (and a VAE File Fix)

A quality-of-life and correctness release for OxiBonsai: optimized test/dev compile profiles turn minutes-long numeric tests fast while keeping float parity bit-stable, a VAE precheck fix that finally accepts a .safetensors file, and corrected HuggingFace asset paths. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem.

release · oxibonsai · llm

Jun 3, 2026 · 7 min

OxiBonsai 0.2.0 Released — Concurrent /serve, Byte-Identical CPU↔Metal, and Reproducible Images

OxiBonsai 0.2.0 opens the 0.2 series: a concurrent engine pool that shares one 1.16 GB embedding table across replicas, a CPU↔Metal byte-identical parity guard, a parity-first CUDA imagen backend (~3.2× faster, down to ~31.7s on an A4000), byte-exact reproducible images via --seed, and a stable-toolchain build. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem.

release · oxibonsai · llm

Jun 2, 2026 · 10 min

OxiBonsai 0.1.5 Released — OxiBonsai Goes Multimodal: a Pure-Rust FLUX.2-Klein Text-to-Image Pipeline

OxiBonsai 0.1.5 adds the oxibonsai-image crate — the first pure-Rust, zero-FFI, C/C++/Fortran-free FLUX.2-Klein text-to-image pipeline (DiT + VAE + Qwen3-4B text encoder), parity-validated at cos≥0.999, with Metal flash-attention and ~52–62s end-to-end on an M3. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem, now spanning text and image.

release · oxibonsai · llm

May 16, 2026 · 9 min

OxiBonsai 0.1.4 Released — Production-Grade Sovereign Serving: Self-Tuning Runtime, Prometheus + X-Request-ID Observability, FP8 & K-Quant, and Grammar-Constrained Output

OxiBonsai 0.1.4 makes Pure Rust sub-2-bit inference production-grade for serving: adaptive KV-cache compression and adaptive speculative decoding that self-tune under load, full Prometheus observability with per-request X-Request-ID tracing, new FP8 and K-quant GGUF model support, and grammar-constrained decoding for guaranteed-valid JSON — sovereign AI inference for the COOLJAPAN ecosystem.

release · oxibonsai · llm

May 3, 2026 · 8 min

OxiBonsai 0.1.3 Released — Prefix-Cache-Aware Serving with Byte-Identical Warm Paths

OxiBonsai 0.1.3 makes sub-2-bit serving smarter: a prefix-cache-aware engine that reuses KV-cache across requests with byte-identical cold/warm parity, runtime tokenizer auto-detection, and a GPU weight cache that uploads once. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem.

release · oxibonsai · llm

Apr 24, 2026 · 8 min

OxiLLaMa 0.1.1 Released — FlashAttention, True Continuous Batching, and 5 New Architectures in Pure Rust

OxiLLaMa is a Pure Rust LLM inference engine — the sovereign alternative to llama.cpp. Version 0.1.1 ships a tiled FlashAttention CPU kernel, true continuous batching with zero padding waste, fused dequant+GEMM (~12% Q4_K_M decode gain), 5 new architectures (DBRX, Grok-1, Mamba-2, DeepSeek-V3, and more), and GPU coverage extended to 10 quantization types.

release · oxillama · llm-inference

Apr 19, 2026 · 5 min

OxiBonsai 0.1.2 Released — Import onnx-community Ternary ONNX to GGUF in One Command, No Python

OxiBonsai 0.1.2 adds ONNX ingestion: pull an onnx-community Ternary ONNX release (MatMulNBits, bits=2) and repack it straight to OxiBonsai's GGUF TQ2_0_g128 with a single command — driven by the pure-Rust oxionnx-proto reader, no Python and no onnxruntime. Sub-2-bit sovereign AI inference for the COOLJAPAN ecosystem.

release · oxibonsai · llm

Apr 18, 2026 · 7 min

OxiBonsai 0.1.1 Released — Sub-2-Bit Inference Goes GPU, and the Ternary Line Lands

Five days after its 1-bit debut, OxiBonsai grows GPUs: a native CUDA NVRTC backend (~21.9 tok/s on Ternary-Bonsai-1.7B, RTX 3060) and a fused Metal full-forward path (~50 tok/s, ~13x speedup) — plus the new ternary TQ2_0_g128 quant family, with NEON/AVX2/AVX-512 GEMV so it flies on CPU too. Sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem, still with no llama.cpp, no BLAS, no C/Fortran.

release · oxibonsai · llm

Apr 15, 2026 · 3 min

OxiLLaMa 0.1.0 Released — Pure Rust LLM Inference Engine, Sovereign Alternative to llama.cpp

Complete GGUF loading + 25 quantized formats + OpenAI-compatible API server — all in pure Rust. 56.2k SLoC, 11 crates, no C/C++/Fortran, built on SciRS2/OxiBLAS/OxiFFT. ≥80% of llama.cpp throughput, WASM/GPU/Python bindings, LLaMA/Mistral/Gemma/Phi/LLaVA support. The sovereign LLM inference layer for SciRS2 and the entire COOLJAPAN ecosystem (now 21M+ SLoC total).

release · oxillama · llm-inference

Apr 13, 2026 · 8 min

OxiBonsai 0.1.0 Released — The World's First Pure Rust 1-Bit LLM Inference Engine

An 8B-parameter language model at roughly 1 bit per weight, running from a single static Rust binary with no llama.cpp, no BLAS, no C/C++/Fortran. OxiBonsai 0.1.0 debuts sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem — SIMD-accelerated, Rayon-parallel, and OpenAI-compatible out of the box.

release · oxibonsai · llm