Pure Rust LLM inference just grew up: 27 architectures, a real sampler stack, and a drop-in /v1/responses API — with no C, no FFI, and no compromises.
Today we released OxiLLaMa 0.1.3 — a feature-rich step forward for the Pure Rust LLM inference engine, adding two new model architectures (BLOOM and Phi-3.5-MoE), a five-stage advanced sampler suite, embedding pooling modes, a drop-in /v1/responses server API with per-API-key rate limiting, AVX-512 IQ kernels, GPU-resident sampling kernels, and zero-copy DLPack interop with PyTorch.
No C. No C++. No Fortran. No FFI. No system libraries. Where llama.cpp drags along a C++ toolchain, fragile build dependencies, and the segfault class of bugs that comes with manual memory management, OxiLLaMa compiles to a single static native binary — or to WebAssembly, or to embedded targets — from one codebase. The tagline still holds: Pure Rust LLM Inference Engine — The Sovereign Alternative to llama.cpp. It is built on the COOLJAPAN stack: SciRS2 for tensor primitives and neural ops, OxiBLAS for Pure Rust BLAS, OxiFFT for the FFT behind RoPE, and MeCrab for Japanese tokenization.
Why OxiLLaMa 0.1.3 is a game changer
llama.cpp is a remarkable project, but it carries the costs of its C/C++ heritage: memory-unsafety and the segfaults that follow, a heavy build that wants a full native toolchain and system libraries, awkward WebAssembly and embedded stories, and tooling that accretes ad hoc over time. OxiLLaMa takes the opposite path — memory safety by construction, one cargo build, the same binary everywhere — and 0.1.3 turns that foundation into something you can ship serious workloads on.
Concretely, this release brings:
- 27 architectures, now including BLOOM (with an ALiBi slope-per-head bias) and Phi-3.5-MoE (16-expert, top-2 sparse MoE). The count moved 25 → 27 this cycle.
- A 5-stage advanced sampler suite — DRY, XTC, TypicalP, TopA, Eta — wired into the sampler chain and byte-identical when every new field is left at its default. You opt in; you never get surprised.
- A drop-in /v1/responses API with SSE streaming and conversation chaining, plus per-API-key rate limiting backed by lazy token buckets.
- AVX-512 IQ kernels delivering roughly 2x per-iteration throughput over the AVX2 path, gated at runtime so they light up only where the CPU supports them.
- GPU-resident sampling kernels — softmax, top-k, and categorical sampling that run on the device, with a clean CPU fallback when no GPU is present.
- Zero-copy DLPack interop with PyTorch: pull logits or embeddings straight into a torch.Tensor with no Rust-level torch dependency.
As of 0.1.3 that is roughly 165,000 lines of Pure Rust across 11 crates, with 2,461 tests passing.
Technical Deep Dive: how the new pieces fit the crate layout
OxiLLaMa is organized as 11 focused crates — oxillama (meta), oxillama-gguf, oxillama-quant, oxillama-arch, oxillama-runtime, oxillama-server, oxillama-bench, oxillama-gpu, oxillama-py, oxillama-wasm, and oxillama-cli. Here is where 0.1.3 landed.
oxillama-arch — two new architectures (25 → 27). BLOOM arrives as a full decoder stack: an AlibiBias that computes a slope per head as m_h = 2^(-8*(h+1)/num_heads), ALiBi positional bias instead of RoPE, pre-LayerNorm, a GELU FFN, and bias terms on all projections. Phi-3.5-MoE reuses the Phi-3 merged-QKV and partial-RoPE path, swapping in a sparse MoE FFN that routes through 16 experts and activates the top 2. This window also brought Mixtral (sparse top-2-of-8 MoE), StableLM (parallel attention + FFN, partial RoPE), and GPT-NeoX (parallel residual, learned-bias LayerNorm).
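The slope formula quoted above is easy to check by hand. The sketch below is illustrative only and is not taken from the OxiLLaMa source: the function names `alibi_slopes` and `alibi_bias` are hypothetical, heads are assumed to be zero-indexed, and the bias is assumed to follow the usual ALiBi form of a per-head slope times the query-to-key distance.

```rust
/// Per-head ALiBi slope, m_h = 2^(-8 * (h + 1) / num_heads), with h zero-based.
/// Hypothetical helper; the real AlibiBias type may be shaped differently.
fn alibi_slopes(num_heads: usize) -> Vec<f32> {
    (0..num_heads)
        .map(|h| 2f32.powf(-8.0 * (h as f32 + 1.0) / num_heads as f32))
        .collect()
}

/// Linear penalty added to the attention score of query position `q`
/// against an earlier (or equal) key position `k`.
fn alibi_bias(slope: f32, q: usize, k: usize) -> f32 {
    -slope * (q as f32 - k as f32)
}

fn main() {
    let slopes = alibi_slopes(8);
    for (h, m) in slopes.iter().enumerate() {
        println!("head {h}: slope {m}");
    }
    // Head 0, query at position 5 attending to key at position 2.
    println!("bias = {}", alibi_bias(slopes[0], 5, 2));
}
```

With 8 heads the slopes halve from head to head, so early heads penalise distance sharply while later heads attend almost uniformly over the context.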
oxillama-runtime — the sampler chain and embedding pooling. Five new stages register into the chain in a deliberate order — DRY → XTC → TypicalP → TopA → Eta. DRY is a “Don’t Repeat Yourself” n-gram penalty with configurable sequence_breakers; XTC excludes top choices; TypicalP does locally-typical sampling via Shannon entropy; TopA uses an adaptive threshold of a * max_prob^2; Eta applies an entropy-scaled cutoff. They surface as 9 new #[serde(default)] fields on SamplerConfig, so existing configs produce byte-identical output. On the embedding side, a PoolingMode { Last, Mean, Max, Cls } enum plus embed_with(text, mode) and embed_batch_with(texts, mode) join InferenceEngine; the existing embed() / embed_batch() delegate to PoolingMode::Last for zero behaviour change. The runtime also gained logit-bias and banned-tokens fields (applied first in the chain), a JsonSchemaCompiler that turns a JSON-Schema subset into a GBNF grammar for constrained decoding, and a numerically stable beam_generate().
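To make two of these stages concrete, here is a minimal sketch of TopA and TypicalP operating on an already-normalised probability vector. It is an assumption-laden illustration rather than the runtime's code: the function names, the renormalisation step, and the choice to work on probabilities instead of logits are all mine.

```rust
/// Rescale so the surviving probabilities sum to 1 (no-op if all are zero).
fn normalise(p: &mut [f32]) {
    let sum: f32 = p.iter().sum();
    if sum > 0.0 {
        for x in p.iter_mut() {
            *x /= sum;
        }
    }
}

/// Top-A: drop every token whose probability is below a * max_prob^2.
fn top_a(probs: &[f32], a: f32) -> Vec<f32> {
    let max_p = probs.iter().copied().fold(0.0f32, f32::max);
    let threshold = a * max_p * max_p;
    let mut kept: Vec<f32> = probs
        .iter()
        .map(|&p| if p >= threshold { p } else { 0.0 })
        .collect();
    normalise(&mut kept);
    kept
}

/// Locally typical sampling: rank tokens by how close their surprisal
/// (-ln p) is to the Shannon entropy of the distribution, then keep the
/// closest ones until their cumulative probability reaches `mass`.
fn typical_p(probs: &[f32], mass: f32) -> Vec<f32> {
    let entropy: f32 = probs
        .iter()
        .filter(|&&p| p > 0.0)
        .map(|&p| -p * p.ln())
        .sum();
    let mut order: Vec<usize> = (0..probs.len()).filter(|&i| probs[i] > 0.0).collect();
    order.sort_by(|&i, &j| {
        let di = (-probs[i].ln() - entropy).abs();
        let dj = (-probs[j].ln() - entropy).abs();
        di.total_cmp(&dj)
    });
    let mut kept = vec![0.0f32; probs.len()];
    let mut cumulative = 0.0f32;
    for i in order {
        kept[i] = probs[i];
        cumulative += probs[i];
        if cumulative >= mass {
            break;
        }
    }
    normalise(&mut kept);
    kept
}

fn main() {
    let probs = [0.6, 0.3, 0.08, 0.02];
    println!("top_a     = {:?}", top_a(&probs, 0.5));
    println!("typical_p = {:?}", typical_p(&probs, 0.85));
}
```

Note that TopA tightens as the model grows more confident (the threshold scales with the square of the top probability), whereas TypicalP can discard the single most likely token if its surprisal sits far from the entropy.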
oxillama-server — /v1/responses and per-key limits. The new /v1/responses endpoint supports POST, GET, and GET-by-ID, with SSE streaming that emits response.created, response.output_text.delta, response.completed, and [DONE] events, plus previous_response_id chaining for multi-turn conversations. Rate limiting is handled by a PerKeyRateLimiter holding a lazy per-key TokenBucket map: it takes a read lock on repeat hits and a write lock only when it first sees a key, reads either Authorization: Bearer or X-Api-Key, and lets anonymous requests fall through to the global limiter. The server also picked up an OpenAI Assistants API subset (thread persistence, a run worker, run-step events, SSE run streaming) and a Files store, a multi-LoRA registry with per-request adapter selection, and server-side prefix-KV cache wiring to skip re-prefilling shared system prompts.
oxillama-quant + oxillama-gpu — SIMD and on-device sampling. New Iq2XxsAvx512, Iq2XsAvx512, Iq3SAvx512, and Iq4XsAvx512 kernels mirror the AVX2 templates with __m512i, using _mm512_permutexvar_epi8 for grid lookup via AVX-512BW for roughly 2x per-iteration throughput, all runtime-guarded by is_x86_feature_detected!("avx512bw"). A fused matvec_q8 does single-pass dequant-plus-dot in registers (no scratch f32) for Q5_0 / Q5_1 / Q8_1. AVX-512 K-quant kernels (Q2_K, Q3_K) and legacy completeness (Q4_1/Q5_1/Q8_1) mean all 11 legacy quant types now have a full four-tier SIMD ladder: AVX-512 → AVX2 → NEON → scalar. On the GPU side, a sampling.wgsl shader implements softmax_logits (a 256-thread shared-memory two-pass reduction with temperature scaling and a temp=0 argmax path), topk_partition (workgroup cooperative selection for k ≤ 256), and sample_categorical (an LCG RNG plus CDF walk), exposed through a SamplingKernel API (softmax, top_k, sample) with a CPU-reference fallback and a graceful GpuError::NoAdapter when no device is found.
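For readers who want to see what the sampling path computes, here is a CPU sketch of the same three ideas: a two-pass temperature softmax with an argmax shortcut at temp = 0, and categorical sampling by walking the CDF with an LCG. It is written from the description above, not from sampling.wgsl; the LCG constants (the widely used Numerical Recipes pair) and all names are assumptions.

```rust
/// Temperature softmax. Pass 1 finds the max logit, pass 2 exponentiates and
/// sums. A temperature of exactly 0 collapses to a one-hot argmax.
fn softmax_with_temp(logits: &[f32], temp: f32) -> Vec<f32> {
    if logits.is_empty() {
        return Vec::new();
    }
    let (argmax, max) = logits
        .iter()
        .copied()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(&b.1))
        .unwrap();
    if temp == 0.0 {
        let mut one_hot = vec![0.0; logits.len()];
        one_hot[argmax] = 1.0;
        return one_hot;
    }
    // Subtracting the max keeps exp() from overflowing.
    let exps: Vec<f32> = logits.iter().map(|&l| ((l - max) / temp).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Minimal 32-bit linear congruential generator (assumed constants).
struct Lcg(u32);

impl Lcg {
    /// Uniform float in [0, 1) built from the top 24 bits of the state.
    fn next_f32(&mut self) -> f32 {
        self.0 = self.0.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        (self.0 >> 8) as f32 / (1u32 << 24) as f32
    }
}

/// Walk the cumulative distribution until it exceeds a uniform draw.
fn sample_categorical(probs: &[f32], rng: &mut Lcg) -> usize {
    let u = rng.next_f32();
    let mut cumulative = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return i;
        }
    }
    // Rounding can leave the total a hair under 1; fall back to the last index.
    probs.len().saturating_sub(1)
}

fn main() {
    let logits = [1.0, 3.0, 0.5, 2.0];
    let mut rng = Lcg(42);
    let probs = softmax_with_temp(&logits, 0.8);
    println!("probs  = {probs:?}");
    println!("sample = {}", sample_categorical(&probs, &mut rng));
    println!("greedy = {:?}", softmax_with_temp(&logits, 0.0));
}
```

A reference like this is also a convenient oracle for testing a device kernel: feed both the same logits and compare the resulting distributions within a tolerance.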
oxillama-py — DLPack and Torch. A full DLPack v0.8 producer and consumer bridges Vec<f32> ↔ PyCapsule (vec_to_dlpack() / dlpack_to_vec()), with PyEngine::logits_dlpack() and embeddings_dlpack(). A bundled torch_helper.py adds Engine.logits_torch(text) -> torch.Tensor and Engine.embeddings_torch(text) via a lazy import torch and torch.from_dlpack(...) — with no Rust-level torch dependency, so you get a clean ImportError if torch is absent rather than a build-time coupling.
Getting Started
Add the library to your project:
cargo add oxillama
Or grab the CLI:
cargo install oxillama-cli
OxiLLaMa targets Rust 1.86+. A minimal run that exercises a couple of the new sampler fields and an embedding pooling mode:
use oxillama::{InferenceEngine, SamplerConfig, PoolingMode};

fn main() -> anyhow::Result<()> {
    let mut engine = InferenceEngine::from_gguf("models/llama-3-8b-q4_k_m.gguf")?;

    // Opt into the new samplers. Leaving these at default keeps output
    // byte-identical to 0.1.2; setting them turns the new stages on.
    let sampler = SamplerConfig {
        dry_multiplier: 0.8,
        dry_base: 1.75,
        dry_allowed_length: 2,
        typical_p: 0.95,
        ..Default::default()
    };
    let text = engine.generate("Explain Pure Rust inference in one sentence.", &sampler)?;
    println!("{text}");

    // Mean-pool an embedding instead of last-token pooling.
    let vec = engine.embed_with("sovereign inference", PoolingMode::Mean)?;
    println!("dim = {}", vec.len());
    Ok(())
}
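If the difference between the pooling modes is unclear, the following standalone sketch shows one plausible reading of each. It redefines a local `PoolingMode` enum with the same variant names so it runs on its own; the `pool` function, the per-token `Vec<f32>` layout, and the interpretation of Cls as "first token" are assumptions, not the engine's API.

```rust
/// Local stand-in mirroring the variant names from the release notes.
#[derive(Clone, Copy, Debug)]
enum PoolingMode {
    Last,
    Mean,
    Max,
    Cls,
}

/// Collapse per-token hidden states (one row per token) into one vector.
/// Returns None for an empty sequence.
fn pool(hidden: &[Vec<f32>], mode: PoolingMode) -> Option<Vec<f32>> {
    let first = hidden.first()?;
    let pooled = match mode {
        // Assumed: the classification token is the first position.
        PoolingMode::Cls => first.clone(),
        PoolingMode::Last => hidden.last()?.clone(),
        PoolingMode::Mean => {
            let mut out = vec![0.0f32; first.len()];
            for row in hidden {
                for (o, &x) in out.iter_mut().zip(row) {
                    *o += x;
                }
            }
            let n = hidden.len() as f32;
            out.iter_mut().for_each(|o| *o /= n);
            out
        }
        PoolingMode::Max => {
            let mut out = first.clone();
            for row in &hidden[1..] {
                for (o, &x) in out.iter_mut().zip(row) {
                    *o = (*o).max(x);
                }
            }
            out
        }
    };
    Some(pooled)
}

fn main() {
    // Two tokens, two hidden dimensions.
    let hidden = vec![vec![1.0, 4.0], vec![3.0, 2.0]];
    for mode in [PoolingMode::Last, PoolingMode::Mean, PoolingMode::Max, PoolingMode::Cls] {
        println!("{mode:?}: {:?}", pool(&hidden, mode));
    }
}
```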
Prefer the server? Start it and hit the new endpoint with curl:
oxillama serve --model models/llama-3-8b-q4_k_m.gguf --port 8080
curl http://localhost:8080/v1/responses \
-H "Authorization: Bearer sk-local" \
-H "Content-Type: application/json" \
-d '{"model":"llama-3-8b","input":"Hello from Pure Rust"}'
What’s New in 0.1.3
- BLOOM + Phi-3.5-MoE architectures: a BLOOM decoder stack (ALiBi slope-per-head bias, pre-LayerNorm, GELU FFN, bias on all projections) and Phi-3.5-MoE (16 experts, top-2) reusing the Phi-3 merged-QKV + partial-RoPE path. Architecture count: 25 → 27.
- Advanced sampler suite: 5 new stages (DRY → XTC → TypicalP → TopA → Eta) and 9 new #[serde(default)] SamplerConfig fields; byte-identical output at defaults.
- Embedding pooling: PoolingMode { Last, Mean, Max, Cls } with embed_with / embed_batch_with; the existing embed / embed_batch delegate to Last for zero behaviour change.
- Responses API + per-API-key rate limiting: /v1/responses (POST + GET + GET-by-ID) with SSE streaming and previous_response_id chaining, plus a PerKeyRateLimiter with lazy per-key token buckets.
- AVX-512 IQ kernels: Iq2XxsAvx512, Iq2XsAvx512, Iq3SAvx512, Iq4XsAvx512 mirroring the AVX2 templates (~2x per-iteration throughput), runtime-guarded; plus fused matvec_q8 for Q5_0 / Q5_1 / Q8_1.
- GPU sampling kernels: a sampling.wgsl shader (softmax_logits, topk_partition, sample_categorical) and a SamplingKernel API with CPU fallback.
- Speculative decoding bench: SpeculativeBenchTable + run_acceptance_sweep() producing a Markdown 2-D speedup grid over draft sizes [1,2,4,8] × accept thresholds [0.5,0.7,0.85,0.95].
- Python Torch interop: DLPack v0.8 producer/consumer, logits_dlpack() / embeddings_dlpack(), and a torch_helper.py with logits_torch / embeddings_torch and no Rust torch dependency.
- More architectures this window: Mixtral (sparse top-2-of-8 MoE), StableLM (parallel attention + FFN, partial RoPE), GPT-NeoX (parallel residual, learned-bias LayerNorm).
- New CLI subcommands: oxillama quantize, oxillama convert, oxillama verify, and oxillama tokenize / oxillama detokenize.
- AVX-512 K-quant kernels (Q2_K, Q3_K) complete the four-tier SIMD ladder (AVX-512 → AVX2 → NEON → scalar) for all 11 legacy quant types.
- Test suite: 2,020 → 2,461 tests.
Per-crate test breakdown for 0.1.3: gguf=278, quant=401, arch=451, runtime=420, server=195, cli=42, bench=129, gpu=201, wasm=213, py=131.
On x86-64 with 8 cores and AVX2, OxiLLaMa continues to target at least 80% of llama.cpp’s throughput: LLaMA-3-8B Q4_K_M lands around 30 t/s (target ≥ 25 t/s), Mistral-7B Q4_K_M around 32 t/s (target ≥ 27 t/s), and Bonsai-8B Q1_0_G128 around 25 t/s (target ≥ 22 t/s).
Tips
- Curb repetition with DRY without changing anything else. Set dry_multiplier, dry_base, and dry_allowed_length on SamplerConfig; the defaults leave output byte-identical, so you only pay for the n-gram penalty when you ask for it.
  let sampler = SamplerConfig { dry_multiplier: 0.8, dry_base: 1.75, dry_allowed_length: 2, ..Default::default() };
- Pick the right embedding pooling for the job. Use embed_with(text, PoolingMode::Mean) for retrieval-style embeddings, or PoolingMode::Cls / PoolingMode::Max where the model expects it. embed() still defaults to Last.
- Chain conversation turns server-side. Pass previous_response_id on /v1/responses so the server tracks history for you instead of re-sending the whole transcript each turn.
- Throttle per client, not just globally. Configure ServerConfig.per_key_rate_limits and the PerKeyRateLimiter will mint a token bucket per API key on first use; anonymous callers fall back to the global limiter automatically.
- Convert HuggingFace weights without leaving the CLI. Run oxillama convert in.safetensors out.gguf to synthesise a GGUF v3 from a .safetensors checkpoint, then oxillama verify out.gguf --sha256 <hex> to confirm integrity.
- Pull logits straight into PyTorch, zero-copy. With torch installed, Engine.logits_torch(text) hands you a torch.Tensor over the same memory via DLPack, with no Rust torch dependency and no serialization round-trip.
- Let AVX-512 do the heavy lifting. The new IQ kernels activate automatically when avx512bw is detected, so on a capable CPU you simply get the ~2x per-iteration speedup with no flags to flip.
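The "no flags to flip" behaviour in the last tip rests on runtime CPU feature detection. As a rough sketch of how a four-tier ladder can be selected with the standard library's detection macros (the `select_tier` function and its string labels are hypothetical, and the actual kernel dispatch in oxillama-quant may be organised differently):

```rust
/// Report which SIMD tier a dispatcher would pick on the current CPU,
/// checking from the widest tier down to the scalar fallback.
fn select_tier() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("avx512bw") {
            return "avx512";
        }
        if std::arch::is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            return "neon";
        }
    }
    // Always-available fallback for every other CPU or architecture.
    "scalar"
}

fn main() {
    println!("selected SIMD tier: {}", select_tier());
}
```

Because the check happens at run time, a single binary can be shipped to machines with different instruction sets and still pick the fastest path each one supports.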
This is the foundation
OxiLLaMa fits the COOLJAPAN Pure Rust stack as it stands today. Tensor primitives and neural ops come from SciRS2 (scirs2-core / scirs2-linalg / scirs2-neural 0.4.3); the dense math leans on OxiBLAS (0.2.1); RoPE rides on OxiFFT (0.3); Japanese tokenization is handled by MeCrab; serialization uses oxicode (0.2.2); and the aggressive 1-bit Q1_0_G128 quantization that powers Bonsai-8B comes from OxiBonsai. Remote GGUF loading and the GPU path round it out with ureq 3.3 and wgpu 29.0.3. Every layer is Pure Rust, all the way down.
Repository: https://github.com/cool-japan/oxillama
Star the repo if you believe LLM inference should be memory-safe, dependency-light, and yours to deploy anywhere. Pure Rust LLM inference is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ May 5, 2026