COOLJAPAN
2026-04-15

OxiLLaMa 0.1.0 Released — Pure Rust LLM Inference Engine, Sovereign Alternative to llama.cpp

Complete GGUF loading + 25 quantized formats + OpenAI-compatible API server — all in pure Rust. 56.2k SLoC, 11 crates, no C/C++/Fortran, built on SciRS2/OxiBLAS/OxiFFT. ≥80% of llama.cpp throughput, WASM/GPU/Python bindings, LLaMA/Mistral/Gemma/Phi/LLaVA support. The sovereign LLM inference layer for SciRS2 and the entire COOLJAPAN ecosystem (now 21M+ SLoC total).

The LLM inference foundation of the COOLJAPAN ecosystem just became fully sovereign.

Today we released OxiLLaMa 0.1.0 — a complete, production-grade pure Rust LLM inference engine that is the clean-room, memory-safe alternative to llama.cpp.

No C. No C++. No Fortran. No FFI. No system libraries.
No unsafe code in hot paths. No build hell.
Just clean, memory-safe, high-performance LLM inference that compiles to a single static binary (or WASM) and runs everywhere — from laptops to browsers to edge devices to cloud GPUs.

Why OxiLLaMa 0.1.0 is a game changer

For years, fast local LLM inference meant depending on the excellent but C++-based llama.cpp.

It is powerful, but it comes with:

- a complex, fragile C/C++ build system
- the memory-safety risks inherent to C and C++
- limited portability across targets such as WASM and the browser

OxiLLaMa 0.1.0 ends all of that.

It targets ≥80% of llama.cpp throughput on an 8-core AVX2 machine while being 100% memory-safe and fully auditable.

Technical Deep Dive: How We Built a Production-Grade LLM Engine in Pure Rust

The architecture is trait-based and built directly on the COOLJAPAN stack:

  1. GGUF Engine (oxillama-gguf)
    Full GGUF v3 parser and tensor loader.

  2. Quantization Layer (oxillama-quant)
    25 formats with SIMD kernels: Q4_0–Q8_1, all K-Quants (Q2_K–Q6_K), I-Quants (IQ1–IQ4), Q1_0_G128 (OxiBonsai), FP16/BF16/FP32.

  3. Model Architectures (oxillama-arch)
    8 families: LLaMA (3.x/4.x + Mixtral MoE), Qwen3/Bonsai, Mistral, Gemma 2/3, Phi-3/4, Command-R, StarCoder, LLaVA-1.5 (multimodal).

  4. Runtime & Server
    KV cache, sampling engine, OpenAI-compatible HTTP API server (oxillama-server).

  5. Hardware & Bindings
    Optional wgpu GPU backend, WASM target, PyO3 Python bindings, CLI (oxillama run/serve/info).

Key Rust advantages:

- No unsafe code in hot paths: the engine is memory-safe and fully auditable
- No FFI or system libraries: everything compiles to a single static binary or to WASM
- One `cargo build` instead of a fragile C/C++ toolchain
- One codebase serving every target: CPU, wgpu GPU, browsers, edge devices, and Python via PyO3

What’s inside 0.1.0 (released April 15)

In short: full GGUF v3 loading, 25 quantized formats with SIMD kernels, 8 model families (including LLaVA multimodal), an OpenAI-compatible API server, an optional wgpu GPU backend, WASM and Python bindings, and a CLI — 56.2k SLoC across 11 crates, with no C, C++, or Fortran anywhere in the dependency tree.

This is the foundation

OxiLLaMa is now the official LLM inference backend for the entire COOLJAPAN stack (total ecosystem: 21M+ SLoC Rust, 597 crates, 40+ production-grade libraries).

Repository: https://github.com/cool-japan/oxillama

Star the repo if you want sovereign, memory-safe LLM inference without llama.cpp’s C++ dependencies.

The era of “just compile llama.cpp” is over.

Pure Rust LLM inference is here — fast, safe, auditable, and sovereign.

KitaSan at COOLJAPAN OÜ, April 15, 2026