#gguf | COOLJAPAN Blog

May 5, 2026 · 11 min

OxiLLaMa 0.1.3 Released — BLOOM + Phi-3.5-MoE, a 5-Stage Advanced Sampler Suite, and /v1/responses with Zero-Copy Torch Interop

OxiLLaMa is the Pure Rust LLM inference engine and sovereign alternative to llama.cpp. Version 0.1.3 adds the BLOOM and Phi-3.5-MoE architectures (now 27 total), a 5-stage advanced sampler suite (DRY/XTC/TypicalP/TopA/Eta) that is byte-identical at defaults, embedding pooling modes, a drop-in /v1/responses API with per-API-key rate limiting, AVX-512 IQ kernels at ~2x per-iteration throughput, GPU-resident sampling kernels, and zero-copy DLPack PyTorch interop. 2,461 tests pass.

release · oxillama · llm-inference

Apr 25, 2026 · 7 min

OxiLLaMa 0.1.2 Released — HuggingFace Hub Pulls, Full-Screen TUI Chat, and Conversation Save/Resume in Pure Rust

OxiLLaMa 0.1.2 is the Pure Rust LLM inference engine and sovereign alternative to llama.cpp. This release adds `oxillama hub pull/list/rm` (hf-hub, no Python), a full-screen TUI chat with live streaming (ratatui), conversation save/resume serialized via oxicode with SHA-256 integrity, and real weight loading for DBRX, Grok-1, and Mamba-2.

release · oxillama · llm-inference

Apr 24, 2026 · 8 min

OxiLLaMa 0.1.1 Released — FlashAttention, True Continuous Batching, and 5 New Architectures in Pure Rust

OxiLLaMa is a Pure Rust LLM inference engine — the sovereign alternative to llama.cpp. Version 0.1.1 ships a tiled FlashAttention CPU kernel, true continuous batching with zero padding waste, fused dequant+GEMM (~12% Q4_K_M decode gain), 5 new architectures (DBRX, Grok-1, Mamba-2, DeepSeek-V3, and more), and GPU coverage extended to 10 quantization types.

release · oxillama · llm-inference

Apr 19, 2026 · 5 min

OxiBonsai 0.1.2 Released — Import onnx-community Ternary ONNX to GGUF in One Command, No Python

OxiBonsai 0.1.2 adds ONNX ingestion: pull an onnx-community Ternary ONNX release (MatMulNBits, bits=2) and repack it straight to OxiBonsai's GGUF TQ2_0_g128 with a single command — driven by the pure-Rust oxionnx-proto reader, no Python and no onnxruntime. Sub-2-bit sovereign AI inference for the COOLJAPAN ecosystem.

release · oxibonsai · llm

Apr 15, 2026 · 3 min

OxiLLaMa 0.1.0 Released — Pure Rust LLM Inference Engine, Sovereign Alternative to llama.cpp

Complete GGUF loading + 25 quantized formats + OpenAI-compatible API server — all in pure Rust. 56.2k SLoC, 11 crates, no C/C++/Fortran, built on SciRS2/OxiBLAS/OxiFFT. ≥80% of llama.cpp throughput, WASM/GPU/Python bindings, LLaMA/Mistral/Gemma/Phi/LLaVA support. The sovereign LLM inference layer for SciRS2 and the entire COOLJAPAN ecosystem (now 21M+ SLoC total).

release · oxillama · llm-inference

Apr 13, 2026 · 8 min

OxiBonsai 0.1.0 Released — The World's First Pure Rust 1-Bit LLM Inference Engine

An 8B-parameter language model at roughly 1 bit per weight, running from a single static Rust binary with no llama.cpp, no BLAS, no C/C++/Fortran. OxiBonsai 0.1.0 debuts sub-2-bit Pure Rust sovereign AI inference for the COOLJAPAN ecosystem — SIMD-accelerated, Rayon-parallel, and OpenAI-compatible out of the box.

release · oxibonsai · llm