COOLJAPAN

OxiLLaMa 0.1.1 Released — FlashAttention, True Continuous Batching, and 5 New Architectures in Pure Rust

OxiLLaMa is a Pure Rust LLM inference engine — the sovereign alternative to llama.cpp. Version 0.1.1 ships a tiled FlashAttention CPU kernel, true continuous batching with zero padding waste, fused dequant+GEMM (~12% Q4_K_M decode gain), 5 new architectures (DBRX, Grok-1, Mamba-2, DeepSeek-V3, and more), and GPU coverage extended to 10 quantization types.

release oxillama llm-inference gguf llama.cpp pure-rust flash-attention continuous-batching quantization scirs2

Fast, safe LLM inference with no C, no C++, and no FFI — now with FlashAttention, true continuous batching, and five new model architectures.

Today we released OxiLLaMa 0.1.1 — an incremental update to our Pure Rust LLM inference engine that lands a tiled FlashAttention CPU kernel, true continuous batching, fused dequantization, and five new model architectures, all without touching a line of C.

No C. No C++. No Fortran. No FFI. No system libraries. Where llama.cpp leans on a sprawling C++ toolchain and platform-specific build steps, OxiLLaMa is written entirely in Rust and compiles to a single static binary (or to WebAssembly) that runs everywhere — native servers, browsers, and embedded targets — from one codebase. It is built on the COOLJAPAN stack: SciRS2 for tensor primitives and neural ops, OxiBLAS for Pure Rust GEMM/GEMV, OxiFFT for Pure Rust FFT (used to accelerate RoPE), and MeCrab for Japanese tokenization. As of this release, OxiLLaMa is ~87,400 lines of Pure Rust across 11 crates, with 1,898 tests passing.

Why OxiLLaMa 0.1.1 matters

llama.cpp is remarkable engineering, but its foundations show their age the moment you push past a happy-path desktop build. It is C and C++, which means manual memory management, the ever-present risk of segfaults and undefined behavior, and a build that drags in heavy native dependencies. WebAssembly and embedded support are afterthoughts, and integrating it cleanly into a Rust application means wrapping an FFI boundary you have to babysit.

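OxiLLaMa starts from the opposite premise: memory safety by construction, one toolchain, and a binary that drops into a Rust program as an ordinary crate. Version 0.1.1 turns that foundation into measurable performance, with a tiled FlashAttention CPU kernel, true continuous batching with zero padding waste, fused dequant+GEMM (~12% Q4_K_M decode gain), and GPU coverage extended to 10 quantization types.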

Technical Deep Dive: the inference pipeline, crate by crate

OxiLLaMa’s 11 crates form a clean pipeline from model file to streamed tokens. Here is where 0.1.1 added muscle.

oxillama-gguf — loading. The GGUF loader gained serious resilience. GgufModel::resume() reads an adjacent .oxiresume sidecar checkpoint, validates the last-valid byte offset, and exposes a ResumeHandle::finish() path so an interrupted HuggingFace pull continues instead of restarting. ShardedGgufModel::load_sharded() auto-discovers every HuggingFace-named sibling shard (<base>-NNNNN-of-MMMMM.gguf) from a single shard path and presents them as one unified logical model. An optional quantize-on-the-fly pass dequantizes and re-quantizes tensors to a target format during load.
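
To make the shard naming convention concrete, here is a minimal standalone sketch of how sibling shard names can be derived from a single shard file name. It is not the code behind ShardedGgufModel::load_sharded(); the function name sibling_shard_names is hypothetical, and the sketch assumes shards are numbered from 1 with five-digit, zero-padded fields.

```rust
/// Hypothetical helper (not part of OxiLLaMa): given one shard file name such
/// as "model-00002-of-00005.gguf", return the names of all sibling shards in
/// order. Returns None if the name does not match `<base>-NNNNN-of-MMMMM.gguf`.
fn sibling_shard_names(file_name: &str) -> Option<Vec<String>> {
    let stem = file_name.strip_suffix(".gguf")?;

    // Split from the right so that hyphens inside <base> are preserved:
    // "<base>-NNNNN-of-MMMMM" yields MMMMM, "of", NNNNN, <base>.
    let mut parts = stem.rsplitn(4, '-');
    let total_str = parts.next()?;
    let of = parts.next()?;
    let index_str = parts.next()?;
    let base = parts.next()?;

    let is_field = |s: &str| s.len() == 5 && s.bytes().all(|b| b.is_ascii_digit());
    if of != "of" || !is_field(total_str) || !is_field(index_str) {
        return None;
    }

    let total: u32 = total_str.parse().ok()?;
    let index: u32 = index_str.parse().ok()?;
    // Assumption: shard indices start at 1 and never exceed the total.
    if total == 0 || index == 0 || index > total {
        return None;
    }

    Some(
        (1..=total)
            .map(|i| format!("{base}-{i:05}-of-{total:05}.gguf"))
            .collect(),
    )
}
```

A loader built this way would still need to check that every derived file exists before presenting the shards as one logical model.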

oxillama-quant — dequantization. AVX2 kernels for Q4_1 / Q5_0 / Q5_1 / Q8_1 complete full AVX2 coverage of the legacy quant types. On Apple Silicon, NEON now accelerates Q4_1 / Q5_0 / Q5_1 / Q8_1 / Q2_K / Q3_K plus all 11 IQ types. AVX-512 picks up TQ1_0 / TQ2_0 / Q5_0 / Q8_K, extending AVX-512 to 10 types.
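
For readers new to these formats, a scalar reference shows what the SIMD kernels have to compute. The sketch below assumes the usual GGML Q4_1 layout: 32 weights per block, one scale and one minimum per block, and 16 bytes of packed 4-bit values where the low nibbles fill the first half of the block and the high nibbles the second. It stores the scale and minimum as f32 for simplicity, whereas the on-disk format uses f16. It is an illustration, not OxiLLaMa's kernel.

```rust
/// Weights per block in the legacy GGML quant formats.
const BLOCK: usize = 32;

/// Simplified Q4_1 block for illustration. Assumption: real GGUF files store
/// `scale` and `min` as f16; f32 is used here to stay within std.
struct BlockQ4_1 {
    scale: f32,
    min: f32,
    quants: [u8; BLOCK / 2], // two 4-bit values per byte
}

/// Scalar dequantization: weight = quant * scale + min.
fn dequantize_q4_1(block: &BlockQ4_1) -> [f32; BLOCK] {
    let mut out = [0.0f32; BLOCK];
    for (j, &byte) in block.quants.iter().enumerate() {
        let lo = (byte & 0x0F) as f32; // element j
        let hi = (byte >> 4) as f32; // element j + 16
        out[j] = lo * block.scale + block.min;
        out[j + BLOCK / 2] = hi * block.scale + block.min;
    }
    out
}
```

An AVX2 or NEON kernel performs the same nibble unpacking and multiply-add, only on many bytes per instruction.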

oxillama-arch — model graphs. Five new architectures arrived: DBRX (16-expert MoE, top-4), Grok-1 (8-expert MoE, top-2), DeepSeek-V3’s sigmoid-with-bias MoE scoring, Mamba-2 (selective scan with a learned Δ), plus OLMo2, Yi, Granite, MiniCPM, and InternLM3. A new SequenceState trait generalises the KV-cache slot interface to state-space models, and a MlaLayer primitive brings Multi-head Latent Attention with decoupled RoPE. A full DeepSeekV2Model combines MLA attention with DeepSeekMoE sparse routing (N shared experts plus top-K routed experts) and 3-bit/8-bit quantized expert dispatch.
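
Sigmoid-with-bias scoring is easier to see in code than in prose. The sketch below follows one common reading of DeepSeek-V3 routing: each expert gets a sigmoid affinity score, a per-expert bias is added only when choosing the top-K experts, and the gate weights are normalised from the unbiased scores. Whether OxiLLaMa's implementation matches this exactly is an assumption; the function name route_top_k is hypothetical, and group-limited routing and shared experts are left out.

```rust
fn sigmoid(x: f32) -> f32 {
    1.0 / (1.0 + (-x).exp())
}

/// Hypothetical router sketch: returns (expert index, gate weight) pairs for
/// the k selected experts, best first. Gate weights sum to 1.
fn route_top_k(logits: &[f32], bias: &[f32], k: usize) -> Vec<(usize, f32)> {
    assert_eq!(logits.len(), bias.len(), "one bias per expert");
    assert!(k > 0 && k <= logits.len(), "k must be in 1..=num_experts");

    // Affinity scores are independent per expert (sigmoid, not softmax).
    let scores: Vec<f32> = logits.iter().map(|&l| sigmoid(l)).collect();

    // Selection ranks experts by score + bias, in descending order.
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| (scores[b] + bias[b]).total_cmp(&(scores[a] + bias[a])));
    order.truncate(k);

    // Gate weights use the unbiased scores, normalised over the selection.
    let sum: f32 = order.iter().map(|&i| scores[i]).sum();
    order.into_iter().map(|i| (i, scores[i] / sum)).collect()
}
```

Keeping the bias out of the gate weights means it can steer load across experts without changing how much each selected expert contributes to the output.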

oxillama-runtime — execution and sessions. The runtime added EngineSnapshot (snapshot.rs): InferenceEngine::snapshot() captures the full KV cache and sampler RNG state into a byte blob, and InferenceEngine::resume() validates the model fingerprint and restores it — session persistence across process restarts. The blob is serialized with oxicode, the COOLJAPAN Pure Rust codec, in line with our workspace serialization policy.
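
The validate-then-restore pattern can be sketched without any of the real types. The example below is not EngineSnapshot and does not use oxicode: the Snapshot struct, its fields, and the hand-rolled little-endian layout are all hypothetical, chosen only to show why the fingerprint is checked before any state is restored.

```rust
/// Hypothetical, simplified snapshot: a model fingerprint, the sampler RNG
/// state, and a flat KV-cache payload.
#[derive(Debug, PartialEq)]
struct Snapshot {
    fingerprint: u64,
    rng_state: u64,
    kv: Vec<f32>,
}

/// Illustrative layout: fingerprint | rng_state | kv length | kv values (LE).
fn encode(s: &Snapshot) -> Vec<u8> {
    let mut out = Vec::with_capacity(24 + s.kv.len() * 4);
    out.extend_from_slice(&s.fingerprint.to_le_bytes());
    out.extend_from_slice(&s.rng_state.to_le_bytes());
    out.extend_from_slice(&(s.kv.len() as u64).to_le_bytes());
    for v in &s.kv {
        out.extend_from_slice(&v.to_le_bytes());
    }
    out
}

/// Refuses to restore state that was captured from a different model.
fn decode(blob: &[u8], expected_fingerprint: u64) -> Result<Snapshot, String> {
    let read_u64 = |at: usize| -> Result<u64, String> {
        let bytes = blob.get(at..at + 8).ok_or("snapshot header is truncated")?;
        let mut buf = [0u8; 8];
        buf.copy_from_slice(bytes);
        Ok(u64::from_le_bytes(buf))
    };

    let fingerprint = read_u64(0)?;
    if fingerprint != expected_fingerprint {
        return Err("model fingerprint mismatch".to_string());
    }
    let rng_state = read_u64(8)?;
    let len = read_u64(16)? as usize;

    // The header reads above succeeded, so the blob has at least 24 bytes.
    let body = &blob[24..];
    if len.checked_mul(4) != Some(body.len()) {
        return Err("KV payload length mismatch".to_string());
    }
    let kv = body
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    Ok(Snapshot { fingerprint, rng_state, kv })
}
```

Restoring the RNG state alongside the KV cache is what lets a resumed session continue sampling as if the process had never stopped.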

oxillama-gpu — acceleration. Beyond the four new GEMV kernels, 0.1.1 adds Q2_K / Q3_K / Q8_K / IQ4_XS GEMV WGSL shaders, a tiled GEMM WGSL shader (TILE_M/N=32, TILE_K=16, shared-memory cooperative load) for prefill, and a fused attention WGSL kernel that does QK + softmax + AV in a single GPU dispatch. An async WebGPU bridge (gpu_bridge.rs) exposes initWebGpuDevice(), webgpuDequantQ4_0Async(), and webgpuGemvAsync() via wasm_bindgen_futures::JsFuture for real GPU dispatch in WebGPU-capable browsers.
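
Fusing QK, softmax, and AV into one pass relies on computing softmax incrementally, the same online-softmax idea that FlashAttention tiles around. The CPU sketch below illustrates that idea for a single query vector with no masking. It is not the WGSL kernel or OxiLLaMa's CPU kernel; the function names and the Vec-of-rows layout are assumptions made for readability.

```rust
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Single-query attention computed tile by tile with an online softmax.
/// The full score vector is never materialised; only one tile of scores,
/// a running maximum, a running denominator, and the output accumulator.
fn attention_online(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], tile: usize) -> Vec<f32> {
    assert!(tile > 0, "tile size must be positive");
    assert!(!keys.is_empty() && keys.len() == values.len());

    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];

    for (k_tile, v_tile) in keys.chunks(tile).zip(values.chunks(tile)) {
        // QK for this tile only.
        let scores: Vec<f32> = k_tile.iter().map(|k| scale * dot(q, k)).collect();
        let tile_max = scores.iter().copied().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(tile_max);

        // Rescale what was accumulated under the old maximum.
        // On the first tile this is exp(-inf) = 0, and the state is all zeros.
        let correction = (running_max - new_max).exp();
        denom *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }

        // Softmax numerators and AV accumulation in the same loop.
        for (s, v) in scores.iter().zip(v_tile) {
            let p = (s - new_max).exp();
            denom += p;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += p * x;
            }
        }
        running_max = new_max;
    }

    acc.iter().map(|a| a / denom).collect()
}
```

Up to floating-point rounding, the result does not depend on the tile size, which is what allows a kernel to pick tiles that fit its cache or shared memory.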

oxillama-server — serving. The server exposes an OpenAI-compatible HTTP API (POST /v1/chat/completions, /v1/completions, /v1/embeddings) with SSE streaming and a [DONE] sentinel, plus llama.cpp CLI flag aliases (-n/--n-predict, --temperature, -c/--n-ctx, --seed, --repeat-penalty, --min-p) so existing tooling feels at home.
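
On the client side, consuming the stream comes down to reading data: lines until the [DONE] sentinel appears. The sketch below shows that loop over an already-received body using only the standard library. The function name sse_payloads is hypothetical, and a real client would read incrementally from the socket and parse each payload as JSON.

```rust
/// Hypothetical client-side helper: collect the `data:` payloads of an SSE
/// body in order, stopping at the `[DONE]` sentinel.
fn sse_payloads(body: &str) -> Vec<&str> {
    let mut out = Vec::new();
    for line in body.lines() {
        // Blank lines separate events; other fields (e.g. comments) are skipped.
        if let Some(rest) = line.strip_prefix("data:") {
            let payload = rest.trim();
            if payload == "[DONE]" {
                break;
            }
            out.push(payload);
        }
    }
    out
}
```

Each returned payload would be one JSON chunk of the completion; anything after the sentinel is ignored.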

On x86-64 (8 cores, AVX2), OxiLLaMa targets at least 80% of llama.cpp throughput. The per-model targets are LLaMA-3-8B Q4_K_M at ~25 tokens/s against llama.cpp’s ~30 tokens/s (about 83%), Mistral-7B Q4_K_M at 27 tokens/s or more against ~32 tokens/s (about 84%), and OxiBonsai’s Bonsai-8B Q1_0_G128 1-bit quant at 22 tokens/s or more against ~25 tokens/s (88%).

Getting Started

Add the library to your project:

cargo add oxillama

The fastest way to try it is the CLI. Run a prompt directly:

oxillama run --model path/to/model.gguf \
  --prompt "Explain quantum computing in simple terms" \
  --max-tokens 256 --temp 0.7

Start an OpenAI-compatible server:

oxillama serve --model path/to/model.gguf --host 0.0.0.0 --port 8080

Or inspect a model file:

oxillama info --model path/to/model.gguf

Prefer to embed it? The examples/load_and_generate.rs example loads a GGUF model, builds an engine, configures the sampler, and prints each token to stdout as it arrives.

What’s New in 0.1.1

Tips

This is the foundation

OxiLLaMa fits squarely into the COOLJAPAN Pure Rust stack as it stands today. Tensor primitives and neural ops come from SciRS2 (scirs2-core / linalg / neural 0.4.2); matrix math runs on OxiBLAS 0.2.1; RoPE acceleration uses OxiFFT 0.2.0; Japanese text is tokenized with MeCrab. The Bonsai-8B Q1_0_G128 1-bit quant path comes from OxiBonsai, and session snapshots are serialized with oxicode 0.2.1. Every layer is Pure Rust, every dependency compiles without a C toolchain, and the whole thing collapses to a single binary or a WASM module.

Repository: https://github.com/cool-japan/oxillama

Star the repo if you want fast, memory-safe LLM inference without the C++ build chain. Pure Rust LLM inference is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, April 24, 2026
