Complete GGUF loading + 25 quantized formats + OpenAI-compatible API server — all in pure Rust. 56.2k SLoC, 11 crates, no C/C++/Fortran, built on SciRS2/OxiBLAS/OxiFFT. ≥80% of llama.cpp throughput, WASM/GPU/Python bindings, LLaMA/Mistral/Gemma/Phi/LLaVA support. The sovereign LLM inference layer for SciRS2 and the entire COOLJAPAN ecosystem (now 21M+ SLoC total).
The LLM inference foundation of the COOLJAPAN ecosystem just became fully sovereign.
Today we released OxiLLaMa 0.1.0 — a complete, production-grade pure Rust LLM inference engine that is the clean-room, memory-safe alternative to llama.cpp.
No C. No C++. No Fortran. No FFI. No system libraries.
No unsafe code in hot paths. No build hell.
Just clean, memory-safe, high-performance LLM inference that compiles to a single static binary (or WASM) and runs everywhere — from laptops to browsers to edge devices to cloud GPUs.
For years, fast local LLM inference has meant depending on the excellent but C++-based llama.cpp.
That tool is powerful, but it comes with a complex build system, the memory-safety risks inherent to C++, and limited portability.
OxiLLaMa 0.1.0 ends all of that.
It targets ≥80% of llama.cpp throughput (measured on an 8-core AVX2 machine) while being 100% memory-safe and fully auditable.
The architecture is trait-based and built directly on the COOLJAPAN stack:
GGUF Engine (oxillama-gguf)
Full GGUF v3 parser and tensor loader.
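To make the parsing step concrete, here is a minimal sketch of reading a GGUF v3 header from a byte buffer. The field layout (magic `GGUF`, little-endian `u32` version, `u64` tensor count, `u64` metadata KV count) follows the public GGUF specification; the `GgufHeader` struct and `parse_header` function are illustrative names, not OxiLLaMa's actual API.

```rust
// Minimal GGUF v3 header reader (sketch; names are illustrative).
#[derive(Debug, PartialEq)]
struct GgufHeader {
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
}

fn parse_header(buf: &[u8]) -> Option<GgufHeader> {
    // Magic is the 4 bytes "GGUF"; the header needs 24 bytes in total.
    if buf.len() < 24 || &buf[0..4] != b"GGUF" {
        return None; // wrong magic or truncated file
    }
    let u32le = |b: &[u8]| u32::from_le_bytes(b.try_into().unwrap());
    let u64le = |b: &[u8]| u64::from_le_bytes(b.try_into().unwrap());
    Some(GgufHeader {
        version: u32le(&buf[4..8]),
        tensor_count: u64le(&buf[8..16]),
        metadata_kv_count: u64le(&buf[16..24]),
    })
}

fn main() {
    // Hand-built header: version 3, 2 tensors, 5 metadata pairs.
    let mut buf = Vec::new();
    buf.extend_from_slice(b"GGUF");
    buf.extend_from_slice(&3u32.to_le_bytes());
    buf.extend_from_slice(&2u64.to_le_bytes());
    buf.extend_from_slice(&5u64.to_le_bytes());
    let h = parse_header(&buf).unwrap();
    println!("v{} tensors={} kv={}", h.version, h.tensor_count, h.metadata_kv_count);
}
```

A real loader would go on to read the metadata KV pairs and tensor infos that follow the header, but the fixed 24-byte prefix above is where every GGUF file starts.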
Quantization Layer (oxillama-quant)
25 formats with SIMD kernels: Q4_0–Q8_1, all K-Quants (Q2_K–Q6_K), I-Quants (IQ1–IQ4), Q1_0_G128 (OxiBonsai), FP16/BF16/FP32.
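As a taste of what one of these kernels does, here is a sketch of Q4_0 dequantization: 32 weights per block, one shared scale, two 4-bit values packed per byte, each nibble offset by 8 to recover a signed value in [-8, 7]. The real GGUF block stores the scale as f16; this sketch uses f32 to stay dependency-free, and the names are illustrative rather than OxiLLaMa's API.

```rust
// Q4_0 dequantization sketch: one block = 32 weights sharing one scale.
const QK4_0: usize = 32;

fn dequant_q4_0(scale: f32, qs: &[u8; QK4_0 / 2]) -> [f32; QK4_0] {
    let mut out = [0.0f32; QK4_0];
    for (i, &byte) in qs.iter().enumerate() {
        // Low nibble -> element i, high nibble -> element i + 16;
        // subtracting 8 maps the unsigned nibble to [-8, 7].
        out[i] = ((byte & 0x0F) as i32 - 8) as f32 * scale;
        out[i + QK4_0 / 2] = ((byte >> 4) as i32 - 8) as f32 * scale;
    }
    out
}

fn main() {
    // Every nibble is 9, so each weight dequantizes to (9 - 8) * 0.5 = 0.5.
    let qs = [0x99u8; 16];
    let w = dequant_q4_0(0.5, &qs);
    assert!(w.iter().all(|&x| (x - 0.5).abs() < 1e-6));
    println!("first weight = {}", w[0]);
}
```

The SIMD kernels in oxillama-quant presumably vectorize this inner loop; the scalar version above is just the reference semantics of the format.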
Model Architectures (oxillama-arch)
8 families: LLaMA (3.x/4.x + Mixtral MoE), Qwen3/Bonsai, Mistral, Gemma 2/3, Phi-3/4, Command-R, StarCoder, LLaVA-1.5 (multimodal).
Runtime & Server
KV cache, sampling engine, OpenAI-compatible HTTP API server (oxillama-server).
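To illustrate the kind of work a sampling engine does per token, here is a sketch of nucleus (top-p) candidate filtering: temperature-scaled softmax, then keeping the smallest probability-sorted prefix whose cumulative mass reaches p. The function names are illustrative, not OxiLLaMa's actual API.

```rust
// Top-p (nucleus) candidate filtering sketch (names are illustrative).
fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtract the max logit for numerical stability before exponentiating.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temperature).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}

/// Token ids kept by top-p: the smallest probability-sorted prefix
/// whose cumulative probability mass reaches `p`.
fn top_p_candidates(logits: &[f32], temperature: f32, p: f32) -> Vec<usize> {
    let probs = softmax(logits, temperature);
    let mut idx: Vec<usize> = (0..probs.len()).collect();
    idx.sort_by(|&a, &b| probs[b].partial_cmp(&probs[a]).unwrap());
    let mut kept = Vec::new();
    let mut cum = 0.0;
    for &i in &idx {
        kept.push(i);
        cum += probs[i];
        if cum >= p {
            break;
        }
    }
    kept
}

fn main() {
    // One dominant logit: with p = 0.9 only that token survives the filter.
    let logits = [5.0f32, 1.0, 0.5, 0.1];
    let kept = top_p_candidates(&logits, 1.0, 0.9);
    println!("kept = {:?}", kept);
}
```

A full sampler would then draw randomly from the surviving candidates (renormalized), possibly after applying repetition penalties and top-k as well.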
Hardware & Bindings
Optional wgpu GPU backend, WASM target, PyO3 Python bindings, CLI (oxillama run/serve/info).
Key Rust advantages:
- Memory safety throughout, with no unsafe code in hot paths
- A single static binary (or WASM module) with no FFI or system-library dependencies
- A plain cargo build instead of a complex C++ build system
- A fully auditable inference stack, end to end
OxiLLaMa is now the official LLM inference backend for the entire COOLJAPAN stack (total ecosystem: 21M+ SLoC Rust, 597 crates, 40+ production-grade libraries).
Repository: https://github.com/cool-japan/oxillama
Star the repo if you want sovereign, memory-safe LLM inference without llama.cpp’s C++ dependencies.
The era of “just compile llama.cpp” is over.
Pure Rust LLM inference is here — fast, safe, auditable, and sovereign.
— KitaSan at COOLJAPAN OÜ, April 15, 2026