OxiLLaMa 0.1.1 Released — FlashAttention, True Continuous Batching, and 5 New Architectures in Pure Rust

Fast, safe LLM inference with no C, no C++, and no FFI — now with FlashAttention, true continuous batching, and five new model architectures.

Today we released OxiLLaMa 0.1.1 — an incremental update to our Pure Rust LLM inference engine that lands a tiled FlashAttention CPU kernel, true continuous batching, fused dequantization, and five new model architectures, all without touching a line of C.

No C. No C++. No Fortran. No FFI. No system libraries. Where llama.cpp leans on a sprawling C++ toolchain and platform-specific build steps, OxiLLaMa is written entirely in Rust and compiles to a single static binary (or to WebAssembly) that runs everywhere — native servers, browsers, and embedded targets — from one codebase. It is built on the COOLJAPAN stack: SciRS2 for tensor primitives and neural ops, OxiBLAS for Pure Rust GEMM/GEMV, OxiFFT for Pure Rust FFT (used to accelerate RoPE), and MeCrab for Japanese tokenization. As of this release, OxiLLaMa is ~87,400 lines of Pure Rust across 11 crates, with 1,898 tests passing.

OxiLLaMa is “Pure Rust LLM Inference Engine — The Sovereign Alternative to llama.cpp.” Zero C/C++/Fortran, zero FFI, zero system libraries.

Why OxiLLaMa 0.1.1 matters

llama.cpp is remarkable engineering, but its foundations show their age the moment you push past a happy-path desktop build. It is C and C++, which means manual memory management, the ever-present risk of segfaults and undefined behavior, and a build that drags in heavy native dependencies. WebAssembly and embedded support are afterthoughts, and integrating it cleanly into a Rust application means wrapping an FFI boundary you have to babysit.

OxiLLaMa starts from the opposite premise: memory safety by construction, one toolchain, and a binary that drops into a Rust program as an ordinary crate. Version 0.1.1 turns that foundation into measurable performance:

FlashAttention removes the N×N allocation. The new tiled CPU kernel uses BQ=BK=64 blocking with online softmax, so the full attention matrix is never materialized — memory stops scaling quadratically with sequence length.
~12% faster Q4_K_M decode. Fused dequant+GEMM kernels skip the scratch buffer entirely on the AVX2 and NEON Q4_0 / Q4_K paths.
True continuous batching with zero padding waste. Per-request KV slot allocation lets heterogeneous request lengths share the engine without padding to a common length.
MLA cuts KV-cache memory by up to 93%. Multi-head Latent Attention compresses the KV cache via a low-rank projection with decoupled RoPE.
GPU now covers 10 quantization types. Four new GEMV kernels (IQ2_XXS, IQ2_S, IQ3_XXS, IQ3_S) join the existing wgpu shaders.
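
To make the FlashAttention point concrete, here is a minimal single-query sketch of tiled attention with an online softmax. It is an illustration of the general technique, not OxiLLaMa's kernel: the function names, the toy data helper, and the plain Vec-based layout are all assumptions made for this example. Only one tile of scores is alive at any time, and the result is checked against a naive version that materializes every score.

```rust
/// Deterministic toy data so the example needs no external crates (hypothetical helper).
fn toy_matrix(rows: usize, cols: usize, seed: usize) -> Vec<Vec<f32>> {
    (0..rows)
        .map(|r| {
            (0..cols)
                .map(|c| ((r * 7 + c * 3 + seed) % 11) as f32 * 0.1 - 0.5)
                .collect()
        })
        .collect()
}

/// Scaled dot product between the query and one key.
fn score(q: &[f32], k: &[f32]) -> f32 {
    let dot: f32 = q.iter().zip(k).map(|(a, b)| a * b).sum();
    dot / (q.len() as f32).sqrt()
}

/// Reference: computes every score first, then a standard softmax.
fn attention_naive(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>]) -> Vec<f32> {
    let scores: Vec<f32> = keys.iter().map(|k| score(q, k)).collect();
    let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut out = vec![0.0f32; values[0].len()];
    for (p, v) in exps.iter().zip(values) {
        for (o, x) in out.iter_mut().zip(v) {
            *o += p / sum * x;
        }
    }
    out
}

/// Tiled version: walks the keys in tiles of `block` and keeps a running
/// maximum, a running softmax denominator, and a running weighted sum.
fn attention_tiled(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], block: usize) -> Vec<f32> {
    let mut running_max = f32::NEG_INFINITY;
    let mut running_sum = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];

    for (k_tile, v_tile) in keys.chunks(block).zip(values.chunks(block)) {
        // Scores for this tile only: at most `block` floats are alive at once.
        let scores: Vec<f32> = k_tile.iter().map(|k| score(q, k)).collect();
        let tile_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(tile_max);

        // Rescale everything accumulated under the previous maximum.
        let correction = (running_max - new_max).exp();
        running_sum *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }

        for (s, v) in scores.iter().zip(v_tile) {
            let p = (s - new_max).exp();
            running_sum += p;
            for (a, x) in acc.iter_mut().zip(v) {
                *a += p * x;
            }
        }
        running_max = new_max;
    }
    acc.iter().map(|a| a / running_sum).collect()
}

fn main() {
    let q = vec![0.3f32, -0.2, 0.5, 0.1];
    let keys = toy_matrix(200, 4, 1);
    let values = toy_matrix(200, 4, 5);
    let tiled = attention_tiled(&q, &keys, &values, 64);
    let naive = attention_naive(&q, &keys, &values);
    for (t, n) in tiled.iter().zip(&naive) {
        assert!((t - n).abs() < 1e-4);
    }
    println!("tiled and naive attention agree: {:?}", tiled);
}
```

The tile size of 64 mirrors the BK value quoted above. A production kernel would also tile over queries (BQ) and parallelize over heads, which this sketch leaves out.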

Technical Deep Dive: the inference pipeline, crate by crate

OxiLLaMa’s 11 crates form a clean pipeline from model file to streamed tokens. Here is where 0.1.1 added muscle.

oxillama-gguf — loading. The GGUF loader gained serious resilience. GgufModel::resume() reads an adjacent .oxiresume sidecar checkpoint, validates the last-valid byte offset, and exposes a ResumeHandle::finish() path so an interrupted HuggingFace pull continues instead of restarting. ShardedGgufModel::load_sharded() auto-discovers every HuggingFace-named sibling shard (<base>-NNNNN-of-MMMMM.gguf) from a single shard path and presents them as one unified logical model. An optional quantize-on-the-fly pass dequantizes and re-quantizes tensors to a target format during load.
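
The shard naming pattern is easy to reason about in isolation. The sketch below shows one way sibling file names could be derived from a single shard name. It is not the ShardedGgufModel implementation: the function name is hypothetical, and it assumes five-digit, 1-based indices in the <base>-NNNNN-of-MMMMM.gguf pattern.

```rust
/// True when `s` is exactly five ASCII digits.
fn is_five_digits(s: &str) -> bool {
    s.len() == 5 && s.bytes().all(|b| b.is_ascii_digit())
}

/// Given one shard file name, return the names of every shard in the set,
/// or `None` if the name does not follow the pattern.
/// Hypothetical helper; assumes 1-based, five-digit indices.
fn sibling_shards(file_name: &str) -> Option<Vec<String>> {
    let stem = file_name.strip_suffix(".gguf")?;
    let (rest, total_str) = stem.rsplit_once("-of-")?;
    let (base, index_str) = rest.rsplit_once('-')?;
    if base.is_empty() || !is_five_digits(index_str) || !is_five_digits(total_str) {
        return None;
    }
    let index: u32 = index_str.parse().ok()?;
    let total: u32 = total_str.parse().ok()?;
    if index == 0 || index > total {
        return None;
    }
    Some(
        (1..=total)
            .map(|i| format!("{}-{:05}-of-{:05}.gguf", base, i, total))
            .collect(),
    )
}

fn main() {
    match sibling_shards("llama-70b-00002-of-00003.gguf") {
        Some(names) => {
            for name in names {
                println!("{}", name);
            }
        }
        None => println!("not a sharded GGUF file name"),
    }
}
```

A real loader would additionally confirm that each sibling exists on disk and that the shards agree on their metadata. This sketch only covers the naming step.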

oxillama-quant — dequantization. New AVX2 kernels for Q4_1 / Q5_0 / Q5_1 / Q8_1 complete AVX2 coverage of the legacy quant types. On Apple Silicon, NEON now accelerates Q4_1 / Q5_0 / Q5_1 / Q8_1 / Q2_K / Q3_K plus all 11 IQ types. AVX-512 picks up TQ1_0 / TQ2_0 / Q5_0 / Q8_K, bringing AVX-512 coverage to 10 types.
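
For readers new to block quantization, here is a scalar sketch of what a Q4_0 dequantization kernel computes, and of how a fused dequant-plus-dot-product avoids the scratch buffer. It follows the ggml Q4_0 layout (32 weights per block, one scale, 16 bytes of packed 4-bit values offset by 8), with one simplification that is an assumption of this example: the scale is held as f32 rather than f16 so the code needs no extra crates. The type and function names are hypothetical, and nothing here is SIMD.

```rust
/// One Q4_0-style block: 32 weights packed into 16 bytes plus a scale.
/// Assumption: the scale is stored as f32 here; the on-disk format uses f16.
struct QuantBlock {
    scale: f32,
    quants: [u8; 16],
}

/// Two-step path, step one: expand a block into 32 floats (the scratch buffer).
/// Low nibbles hold weights 0..16, high nibbles hold weights 16..32.
fn dequantize(block: &QuantBlock) -> [f32; 32] {
    let mut out = [0.0f32; 32];
    for (j, &byte) in block.quants.iter().enumerate() {
        let lo = (byte & 0x0F) as i32 - 8;
        let hi = (byte >> 4) as i32 - 8;
        out[j] = lo as f32 * block.scale;
        out[j + 16] = hi as f32 * block.scale;
    }
    out
}

/// Two-step path, step two: dot product against the expanded weights.
fn dot_two_pass(blocks: &[QuantBlock], x: &[f32]) -> f32 {
    blocks
        .iter()
        .zip(x.chunks_exact(32))
        .map(|(block, xs)| {
            let weights = dequantize(block);
            weights.iter().zip(xs).map(|(w, v)| w * v).sum::<f32>()
        })
        .sum()
}

/// Fused path: unpack each nibble and multiply immediately, so no
/// 32-float buffer is written. The scale is applied once per block.
fn dot_fused(blocks: &[QuantBlock], x: &[f32]) -> f32 {
    let mut total = 0.0f32;
    for (block, xs) in blocks.iter().zip(x.chunks_exact(32)) {
        let mut partial = 0.0f32;
        for (j, &byte) in block.quants.iter().enumerate() {
            let lo = (byte & 0x0F) as i32 - 8;
            let hi = (byte >> 4) as i32 - 8;
            partial += lo as f32 * xs[j] + hi as f32 * xs[j + 16];
        }
        total += partial * block.scale;
    }
    total
}

fn main() {
    let blocks = vec![
        QuantBlock { scale: 0.5, quants: [0x9F; 16] },
        QuantBlock { scale: 0.25, quants: [0x08; 16] },
    ];
    let x = vec![1.0f32; 64];
    let fused = dot_fused(&blocks, &x);
    let two_pass = dot_two_pass(&blocks, &x);
    assert!((fused - two_pass).abs() < 1e-5);
    println!("fused = {}, two-pass = {}", fused, two_pass);
}
```

The two paths produce the same value up to floating-point rounding. The fused one simply never writes the intermediate weights to memory, which is the idea behind the fused dequant+GEMM kernels in this release.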

oxillama-arch — model graphs. Five new architectures arrived: DBRX (16-expert MoE, top-4), Grok-1 (8-expert MoE, top-2), DeepSeek-V3 (sigmoid-with-bias MoE scoring), Mamba-2 (selective scan with a learned Δ), plus OLMo2, Yi, Granite, MiniCPM, and InternLM3. A new SequenceState trait generalises the KV-cache slot interface to state-space models, and an MlaLayer primitive brings Multi-head Latent Attention with decoupled RoPE. A full DeepSeekV2Model combines MLA attention with DeepSeekMoE sparse routing (N shared experts plus top-K routed experts) and 3-bit/8-bit quantized expert dispatch.
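
The "top-4 of 16" and "top-2 of 8" figures refer to sparse expert routing. As a rough sketch of the idea, the function below picks the k highest-scoring experts for one token and normalizes their weights with a softmax over the selected scores. The function name is hypothetical, and the softmax-over-selected convention is an assumption for illustration: each model defines its own scoring, and DeepSeek-V3 in particular uses the sigmoid-with-bias variant mentioned above.

```rust
/// Select the `k` experts with the highest router logits and return
/// `(expert_index, weight)` pairs whose weights sum to 1.
/// Generic sketch; not any specific model's router.
fn route_top_k(logits: &[f32], k: usize) -> Vec<(usize, f32)> {
    if logits.is_empty() || k == 0 {
        return Vec::new();
    }
    // Sort expert indices by descending logit. The sort is stable, so ties
    // keep the lower expert index first.
    let mut order: Vec<usize> = (0..logits.len()).collect();
    order.sort_by(|&a, &b| logits[b].total_cmp(&logits[a]));
    order.truncate(k);

    // Softmax over the selected logits only, shifted by the max for stability.
    let max = logits[order[0]];
    let exps: Vec<f32> = order.iter().map(|&i| (logits[i] - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    order
        .into_iter()
        .zip(exps)
        .map(|(i, e)| (i, e / sum))
        .collect()
}

fn main() {
    // Eight experts, top-2 routing for a single token.
    let logits = [0.1f32, 2.0, -1.0, 1.5, 0.0, 0.3, -0.7, 0.9];
    for (expert, weight) in route_top_k(&logits, 2) {
        println!("expert {} gets weight {:.3}", expert, weight);
    }
}
```

Only the selected experts run for that token, which is why a 16-expert model with top-4 routing pays the compute cost of four experts per token rather than sixteen.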

oxillama-runtime — execution and sessions. The runtime added EngineSnapshot (snapshot.rs): InferenceEngine::snapshot() captures the full KV cache and sampler RNG state into a byte blob, and InferenceEngine::resume() validates the model fingerprint and restores it — session persistence across process restarts. The blob is serialized with oxicode, the COOLJAPAN Pure Rust codec, in line with our workspace serialization policy.
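
The fingerprint check is the important part of that flow: a KV cache is only meaningful for the exact model that produced it. The sketch below shows the general shape of a fingerprint-guarded snapshot. It is not the EngineSnapshot format: the struct, its fields, and the hand-rolled little-endian layout are assumptions for this example, standing in for the oxicode encoding the runtime actually uses.

```rust
/// Hypothetical snapshot: a model fingerprint, sampler RNG state, and KV data.
#[derive(Debug, PartialEq)]
struct Snapshot {
    fingerprint: u64,
    rng_state: u64,
    kv_cache: Vec<f32>,
}

/// Read a little-endian u64 starting at byte offset `at`.
fn read_u64(bytes: &[u8], at: usize) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(&bytes[at..at + 8]);
    u64::from_le_bytes(buf)
}

impl Snapshot {
    /// Layout (an assumption of this sketch): fingerprint, rng_state,
    /// element count, then the f32 values, all little-endian.
    fn to_bytes(&self) -> Vec<u8> {
        let mut out = Vec::with_capacity(24 + 4 * self.kv_cache.len());
        out.extend_from_slice(&self.fingerprint.to_le_bytes());
        out.extend_from_slice(&self.rng_state.to_le_bytes());
        out.extend_from_slice(&(self.kv_cache.len() as u64).to_le_bytes());
        for v in &self.kv_cache {
            out.extend_from_slice(&v.to_le_bytes());
        }
        out
    }

    /// Refuses to restore a blob that was captured from a different model.
    fn from_bytes(bytes: &[u8], expected_fingerprint: u64) -> Result<Snapshot, String> {
        if bytes.len() < 24 {
            return Err("snapshot header is truncated".to_string());
        }
        let fingerprint = read_u64(bytes, 0);
        if fingerprint != expected_fingerprint {
            return Err(format!(
                "model fingerprint mismatch: blob has {:#x}, engine has {:#x}",
                fingerprint, expected_fingerprint
            ));
        }
        let rng_state = read_u64(bytes, 8);
        let len = read_u64(bytes, 16) as usize;
        let body = &bytes[24..];
        if len.checked_mul(4) != Some(body.len()) {
            return Err("snapshot body length does not match header".to_string());
        }
        let kv_cache = body
            .chunks_exact(4)
            .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Ok(Snapshot { fingerprint, rng_state, kv_cache })
    }
}

fn main() {
    let snap = Snapshot { fingerprint: 0xABCD, rng_state: 42, kv_cache: vec![0.5, -1.25, 3.0] };
    let blob = snap.to_bytes();
    assert_eq!(Snapshot::from_bytes(&blob, 0xABCD), Ok(snap));
    assert!(Snapshot::from_bytes(&blob, 0x1234).is_err());
    println!("snapshot round-trip ok ({} bytes)", blob.len());
}
```

Saving the sampler RNG state alongside the cache is what lets a restored session continue with the same random sequence it would have produced without the restart.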

oxillama-gpu — acceleration. Beyond the four new GEMV kernels, 0.1.1 adds Q2_K / Q3_K / Q8_K / IQ4_XS GEMV WGSL shaders, a tiled GEMM WGSL shader (TILE_M/N=32, TILE_K=16, shared-memory cooperative load) for prefill, and a fused attention WGSL kernel that does QK + softmax + AV in a single GPU dispatch. An async WebGPU bridge (gpu_bridge.rs) exposes initWebGpuDevice(), webgpuDequantQ4_0Async(), and webgpuGemvAsync() via wasm_bindgen_futures::JsFuture for real GPU dispatch in WebGPU-capable browsers.

oxillama-server — serving. The server exposes an OpenAI-compatible HTTP API (POST /v1/chat/completions, /v1/completions, /v1/embeddings) with SSE streaming and a [DONE] sentinel, plus llama.cpp CLI flag aliases (-n/--n-predict, --temperature, -c/--n-ctx, --seed, --repeat-penalty, --min-p) so existing tooling feels at home.
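
On the client side, an SSE stream with a [DONE] sentinel is simple to consume. The sketch below works on an already-received text body rather than a live socket, and the JSON payloads in it are made-up placeholders, not actual server output. The function name is hypothetical.

```rust
/// Collect the `data:` payloads of a Server-Sent Events body, stopping at
/// the `[DONE]` sentinel. Lines that are not `data:` fields are skipped.
fn collect_sse_data(stream: &str) -> Vec<String> {
    let mut events = Vec::new();
    for line in stream.lines() {
        if let Some(rest) = line.strip_prefix("data:") {
            // The SSE format drops a single leading space after the colon.
            let payload = rest.strip_prefix(' ').unwrap_or(rest);
            if payload == "[DONE]" {
                break;
            }
            events.push(payload.to_string());
        }
    }
    events
}

fn main() {
    // Placeholder payloads for illustration only.
    let body = "data: {\"delta\":\"Hel\"}\n\ndata: {\"delta\":\"lo\"}\n\ndata: [DONE]\n\n";
    for payload in collect_sse_data(body) {
        println!("chunk: {}", payload);
    }
}
```

In a real client the same loop would run over lines as they arrive from the HTTP response, with each payload handed to a JSON parser.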

On x86-64 (8 cores, AVX2), OxiLLaMa targets at least 80% of llama.cpp throughput in tokens per second (t/s): LLaMA-3-8B Q4_K_M at ~25 t/s against llama.cpp’s ~30 t/s, Mistral-7B Q4_K_M at >= 27 t/s against ~32 t/s, and OxiBonsai’s Bonsai-8B Q1_0_G128 1-bit quant at >= 22 t/s against ~25 t/s.
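
As a quick sanity check on those figures, the snippet below computes each target as a fraction of the quoted llama.cpp number. It only restates the arithmetic from the paragraph above, treating the approximate and lower-bound figures as exact values for the purpose of the calculation.

```rust
/// Target throughput as a fraction of the baseline.
fn ratio(target_tps: f64, baseline_tps: f64) -> f64 {
    target_tps / baseline_tps
}

fn main() {
    // (model, OxiLLaMa target t/s, llama.cpp t/s) as quoted in the text.
    let rows = [
        ("LLaMA-3-8B Q4_K_M", 25.0, 30.0),
        ("Mistral-7B Q4_K_M", 27.0, 32.0),
        ("Bonsai-8B Q1_0_G128", 22.0, 25.0),
    ];
    for &(model, target, baseline) in rows.iter() {
        let r = ratio(target, baseline);
        println!("{}: {:.1}% of llama.cpp", model, r * 100.0);
        assert!(r >= 0.8);
    }
}
```

All three ratios land between 83% and 88%, consistent with the stated 80% floor.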

Getting Started

Add the library to your project:

cargo add oxillama

The fastest way to try it is the CLI. Run a prompt directly:

oxillama run --model path/to/model.gguf \
  --prompt "Explain quantum computing in simple terms" \
  --max-tokens 256 --temp 0.7

Start an OpenAI-compatible server:

oxillama serve --model path/to/model.gguf --host 0.0.0.0 --port 8080

Or inspect a model file:

oxillama info --model path/to/model.gguf

Prefer to embed it? The examples/load_and_generate.rs example loads a GGUF model, builds an engine, configures the sampler, and prints each token to stdout as it arrives.

What’s New in 0.1.1

FlashAttention tiled CPU kernel — BQ=BK=64 blocking with online softmax and rayon per-head parallelism; no more full N×N attention-matrix allocation.
True continuous batching — per-request KV slot allocation through a BatchedKvView trait, so requests of different lengths batch together with zero padding waste.
Fused dequant+GEMM — Q4_0 and Q4_K AVX2 + NEON paths skip the scratch buffer, for a measured ~12% throughput gain on Q4_K_M decode.
OxiBLAS float GEMM fallback — F16 / BF16 / F32 tensor paths now route through OxiBLAS GEMM instead of naive loops.
Tiled GEMM WGSL shader — TILE_M/N=32, TILE_K=16, shared-memory cooperative load, replacing naive GPU matmul for prefill.
Fused attention WGSL kernel — QK + softmax + AV in a single GPU dispatch, eliminating intermediate buffer round-trips.
4 new GPU GEMV kernels — IQ2_XXS, IQ2_S, IQ3_XXS, IQ3_S; GPU now covers 10 quantization types.
5 new architectures — DBRX, Grok-1, DeepSeek-V3, Mamba-2, plus OLMo2, Yi, Granite, MiniCPM, and InternLM3.
SequenceState trait — an arch-internal SSM abstraction that generalises the KV-cache slot interface to state-space models.
Partial-download resume — GgufModel::resume() survives interrupted HuggingFace pulls via an .oxiresume sidecar.
Snapshot / resume — InferenceEngine::snapshot() and resume() persist KV cache and sampler RNG across restarts, serialized with oxicode.
MLA and DeepSeek-V2 — Multi-head Latent Attention (MlaLayer) and a full DeepSeekV2Model with sparse MoE routing.
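
To see why per-request KV slot allocation removes padding waste, consider the toy slot pool below. It is not the BatchedKvView trait, whose interface is not shown in this post: the KvSlotPool type, its methods, and the padded_slots helper are all hypothetical names made up for the illustration.

```rust
use std::collections::HashMap;

/// Hypothetical pool of KV-cache slots, one slot per token position.
struct KvSlotPool {
    free: Vec<usize>,
    owned: HashMap<u64, Vec<usize>>,
}

impl KvSlotPool {
    fn new(total_slots: usize) -> Self {
        KvSlotPool { free: (0..total_slots).collect(), owned: HashMap::new() }
    }

    /// Reserve exactly `tokens` slots for a request. Returns false when the
    /// pool cannot fit it or the id is already active.
    fn admit(&mut self, request_id: u64, tokens: usize) -> bool {
        if tokens > self.free.len() || self.owned.contains_key(&request_id) {
            return false;
        }
        let at = self.free.len() - tokens;
        let slots = self.free.split_off(at);
        self.owned.insert(request_id, slots);
        true
    }

    /// Return a finished request's slots to the pool right away.
    fn release(&mut self, request_id: u64) {
        if let Some(slots) = self.owned.remove(&request_id) {
            self.free.extend(slots);
        }
    }

    fn free_slots(&self) -> usize {
        self.free.len()
    }
}

/// Slots a padded batch would need: every request grows to the longest one.
fn padded_slots(lengths: &[usize]) -> usize {
    lengths.len() * lengths.iter().copied().max().unwrap_or(0)
}

fn main() {
    let lengths = [10usize, 50, 30];
    let mut pool = KvSlotPool::new(100);
    for (id, &len) in lengths.iter().enumerate() {
        assert!(pool.admit(id as u64, len));
    }
    println!("padded batch would need {} slots", padded_slots(&lengths));
    println!("per-request allocation used {} slots", 100 - pool.free_slots());

    // When one request finishes, a new one can join without waiting for the rest.
    pool.release(1);
    assert!(pool.admit(3, 55));
    println!("{} slots free after swapping in a new request", pool.free_slots());
}
```

With lengths of 10, 50, and 30 tokens, padding to a common length would need 150 slots, while per-request allocation needs 90. The same three requests fit in a 100-slot pool only in the second scheme.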

Tips

Survive a flaky download. If a HuggingFace pull drops, don’t start over — GgufModel::resume() reads the adjacent .oxiresume checkpoint, validates the last-valid offset, and continues where it left off; call ResumeHandle::finish() when the bytes are complete.
Load big models from one shard path. Point ShardedGgufModel::load_sharded() at any single <base>-NNNNN-of-MMMMM.gguf shard and it discovers the siblings automatically, presenting one logical model.
Persist a chat session across restarts. Capture state with InferenceEngine::snapshot() and restore later with InferenceEngine::resume() — the byte blob is oxicode-serialized and fingerprint-checked against the model, so a process restart doesn’t lose the conversation.
Turn on the GPU backend. Enable the GPU feature to pick up the new fused-attention WGSL path and tiled GEMM prefill shader; in the browser, the async WebGPU bridge dispatches real GPU work.
Hot-swap LoRA adapters at runtime. See examples/lora_apply.rs for swapping two LoRA adapters on a running engine with no model reload.
Install the CLI standalone. Build and install it on its own with cargo install oxillama-cli, and read RECIPES.md for the 8-recipe cookbook covering generation, serving, LoRA, speculative decoding, snapshot/resume, WASM browser chat, partial-download resume, and sharded loading.

This is the foundation

OxiLLaMa fits squarely into the COOLJAPAN Pure Rust stack as it stands today. Tensor primitives and neural ops come from SciRS2 (scirs2-core / linalg / neural 0.4.2); matrix math runs on OxiBLAS 0.2.1; RoPE acceleration uses OxiFFT 0.2.0; Japanese text is tokenized with MeCrab. The Bonsai-8B Q1_0_G128 1-bit quant path comes from OxiBonsai, and session snapshots are serialized with oxicode 0.2.1. Every layer is Pure Rust, every dependency compiles without a C toolchain, and the whole thing collapses to a single binary or a WASM module.

Repository: https://github.com/cool-japan/oxillama

Star the repo if you want fast, memory-safe LLM inference without the C++ build chain. Pure Rust LLM inference is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, April 24, 2026