OxiLLaMa 0.1.3 Released — BLOOM + Phi-3.5-MoE, a 5-Stage Advanced Sampler Suite, and /v1/responses with Zero-Copy Torch Interop

OxiLLaMa is a Pure Rust LLM inference engine and a sovereign alternative to llama.cpp. Release 0.1.3 adds the BLOOM and Phi-3.5-MoE architectures (27 in total), a 5-stage advanced sampler suite (DRY/XTC/TypicalP/TopA/Eta) whose output stays byte-identical when the new fields are left at their defaults, embedding pooling modes, a drop-in /v1/responses API with per-API-key rate limiting, AVX-512 IQ kernels at roughly 2x per-iteration throughput, GPU-resident sampling kernels, and zero-copy DLPack interop with PyTorch, with 2,461 tests passing.

release oxillama llm-inference gguf llama.cpp pure-rust sampling avx-512 scirs2

Pure Rust LLM inference just grew up: 27 architectures, a real sampler stack, and a drop-in /v1/responses API — with no C, no FFI, and no compromises.

Today we released OxiLLaMa 0.1.3 — a feature-rich step forward for the Pure Rust LLM inference engine, adding two new model architectures (BLOOM and Phi-3.5-MoE), a five-stage advanced sampler suite, embedding pooling modes, a drop-in /v1/responses server API with per-API-key rate limiting, AVX-512 IQ kernels, GPU-resident sampling kernels, and zero-copy DLPack interop with PyTorch.

No C. No C++. No Fortran. No FFI. No system libraries. Where llama.cpp drags along a C++ toolchain, fragile build dependencies, and the segfault class of bugs that comes with manual memory management, OxiLLaMa compiles to a single static native binary — or to WebAssembly, or to embedded targets — from one codebase. The tagline still holds: Pure Rust LLM Inference Engine — The Sovereign Alternative to llama.cpp. It is built on the COOLJAPAN stack: SciRS2 for tensor primitives and neural ops, OxiBLAS for Pure Rust BLAS, OxiFFT for the FFT behind RoPE, and MeCrab for Japanese tokenization.

Why OxiLLaMa 0.1.3 is a game changer

llama.cpp is a remarkable project, but it carries the costs of its C/C++ heritage: memory-unsafety and the segfaults that follow, a heavy build that wants a full native toolchain and system libraries, awkward WebAssembly and embedded stories, and tooling that accretes ad hoc over time. OxiLLaMa takes the opposite path — memory safety by construction, one cargo build, the same binary everywhere — and 0.1.3 turns that foundation into something you can ship serious workloads on.

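Concretely, this release brings two new architectures (BLOOM and Phi-3.5-MoE), the five-stage sampler suite, embedding pooling modes, the /v1/responses API with per-API-key rate limiting, AVX-512 IQ kernels, GPU-resident sampling kernels, and zero-copy DLPack interop with PyTorch.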

As of 0.1.3, the codebase is roughly 165,000 lines of Pure Rust across 11 crates, with 2,461 tests passing.

Technical Deep Dive: how the new pieces fit the crate layout

OxiLLaMa is organized as 11 focused crates — oxillama (meta), oxillama-gguf, oxillama-quant, oxillama-arch, oxillama-runtime, oxillama-server, oxillama-bench, oxillama-gpu, oxillama-py, oxillama-wasm, and oxillama-cli. Here is where 0.1.3 landed.

oxillama-arch — two new architectures (25 → 27). BLOOM arrives as a full decoder stack: an AlibiBias that computes a slope per head as m_h = 2^(-8*(h+1)/num_heads), ALiBi positional bias instead of RoPE, pre-LayerNorm, a GELU FFN, and bias terms on all projections. Phi-3.5-MoE reuses the Phi-3 merged-QKV and partial-RoPE path, swapping in a sparse MoE FFN that routes through 16 experts and activates the top 2. This window also brought Mixtral (sparse top-2-of-8 MoE), StableLM (parallel attention + FFN, partial RoPE), and GPT-NeoX (parallel residual, learned-bias LayerNorm).
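To make the ALiBi slope formula concrete, here is a minimal sketch of the per-head slope and the bias it adds to a causal attention score. The function names alibi_slope and alibi_bias are illustrative, not the AlibiBias API from oxillama-arch, and the sketch assumes 0-based head indices and applies only the formula quoted above (it does not handle any special casing for head counts that are not powers of two).

```rust
/// ALiBi slope for head `h` (0-based) out of `num_heads`, following
/// m_h = 2^(-8 * (h + 1) / num_heads).
fn alibi_slope(h: usize, num_heads: usize) -> f32 {
    let exponent = -8.0 * (h as f32 + 1.0) / num_heads as f32;
    exponent.exp2()
}

/// Additive bias for the attention score between query position `q` and
/// key position `k`, assuming causal attention (k <= q): the slope times
/// the negative distance, so distant keys are penalized linearly.
fn alibi_bias(slope: f32, q: usize, k: usize) -> f32 {
    -slope * (q - k) as f32
}

fn main() {
    let num_heads = 8;
    let slopes: Vec<f32> = (0..num_heads)
        .map(|h| alibi_slope(h, num_heads))
        .collect();
    // With 8 heads the slopes halve from head to head, starting at 1/2.
    println!("slopes = {slopes:?}");
    // Head 0, query at position 5 attending to key at position 2.
    println!("bias = {}", alibi_bias(slopes[0], 5, 2));
}
```

Because the bias depends only on the distance q - k, no positional embedding has to be added to the inputs, which is why BLOOM can skip RoPE entirely.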

oxillama-runtime — the sampler chain and embedding pooling. Five new stages register into the chain in a deliberate order — DRY → XTC → TypicalP → TopA → Eta. DRY is a “Don’t Repeat Yourself” n-gram penalty with configurable sequence_breakers; XTC excludes top choices; TypicalP does locally-typical sampling via Shannon entropy; TopA uses an adaptive threshold of a * max_prob^2; Eta applies an entropy-scaled cutoff. They surface as 9 new #[serde(default)] fields on SamplerConfig, so existing configs produce byte-identical output. On the embedding side, a PoolingMode { Last, Mean, Max, Cls } enum plus embed_with(text, mode) and embed_batch_with(texts, mode) join InferenceEngine; the existing embed() / embed_batch() delegate to PoolingMode::Last for zero behaviour change. The runtime also gained logit-bias and banned-tokens fields (applied first in the chain), a JsonSchemaCompiler that turns a JSON-Schema subset into a GBNF grammar for constrained decoding, and a numerically stable beam_generate().
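As an illustration of one of the five stages, the sketch below applies the TopA rule described above: tokens whose probability falls below a * max_prob^2 are dropped and the rest are renormalized. It is a standalone approximation, not the oxillama-runtime code; the function name top_a and the choice to operate on an already-softmaxed probability slice are assumptions.

```rust
/// Top-A filter (illustrative): zero out every token whose probability is
/// below `a * max_prob^2`, then renormalize the survivors to sum to 1.
/// `probs` is assumed to be a valid probability distribution.
fn top_a(probs: &[f32], a: f32) -> Vec<f32> {
    let max_prob = probs.iter().copied().fold(0.0_f32, f32::max);
    let threshold = a * max_prob * max_prob;

    let kept: Vec<f32> = probs
        .iter()
        .map(|&p| if p >= threshold { p } else { 0.0 })
        .collect();

    let total: f32 = kept.iter().sum();
    if total > 0.0 {
        kept.iter().map(|p| p / total).collect()
    } else {
        kept
    }
}

fn main() {
    let probs: [f32; 4] = [0.6, 0.3, 0.08, 0.02];
    // threshold = 0.5 * 0.6^2 = 0.18, so only the first two tokens survive.
    let filtered = top_a(&probs, 0.5);
    println!("{filtered:?}");
}
```

The threshold adapts to the model's confidence: a peaked distribution (large max_prob) prunes aggressively, while a flat one keeps most candidates. For a <= 1 the most likely token always survives, since a * max_prob^2 cannot exceed max_prob.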

oxillama-server — /v1/responses and per-key limits. The new /v1/responses endpoint supports POST, GET, and GET-by-ID, with SSE streaming that emits response.created, response.output_text.delta, response.completed, and [DONE] events, plus previous_response_id chaining for multi-turn conversations. Rate limiting is handled by a PerKeyRateLimiter holding a lazy per-key TokenBucket map: it takes a read lock on repeat hits and a write lock only when it first sees a key, reads either Authorization: Bearer or X-Api-Key, and lets anonymous requests fall through to the global limiter. The server also picked up an OpenAI Assistants API subset (thread persistence, a run worker, run-step events, SSE run streaming) and a Files store, a multi-LoRA registry with per-request adapter selection, and server-side prefix-KV cache wiring to skip re-prefilling shared system prompts.
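The read-lock-then-write-lock pattern behind per-key limiting can be sketched with the standard library alone. This is not the PerKeyRateLimiter from oxillama-server: the type PerKeyLimiter, its allow method, the api_key helper, the capacity and refill parameters, the per-bucket Mutex, and the precedence of Authorization over X-Api-Key are all assumptions made for illustration. Time is passed in explicitly so the behaviour is deterministic.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex, RwLock};
use std::time::{Duration, Instant};

/// Classic token bucket: refills continuously, one token per request.
struct TokenBucket {
    capacity: f64,
    refill_per_sec: f64,
    tokens: f64,
    last: Instant,
}

impl TokenBucket {
    fn try_take(&mut self, now: Instant) -> bool {
        let elapsed = now.saturating_duration_since(self.last).as_secs_f64();
        self.tokens = (self.tokens + elapsed * self.refill_per_sec).min(self.capacity);
        self.last = now;
        if self.tokens >= 1.0 {
            self.tokens -= 1.0;
            true
        } else {
            false
        }
    }
}

/// Hypothetical per-key limiter: buckets are created lazily on first use.
struct PerKeyLimiter {
    capacity: f64,
    refill_per_sec: f64,
    buckets: RwLock<HashMap<String, Arc<Mutex<TokenBucket>>>>,
}

impl PerKeyLimiter {
    fn new(capacity: f64, refill_per_sec: f64) -> Self {
        Self { capacity, refill_per_sec, buckets: RwLock::new(HashMap::new()) }
    }

    fn allow(&self, key: &str, now: Instant) -> bool {
        // Fast path: a known key only needs the shared read lock.
        let existing = self.buckets.read().unwrap().get(key).cloned();
        let bucket = match existing {
            Some(bucket) => bucket,
            None => {
                // Slow path: first sighting of this key takes the write lock.
                // `entry` covers the race where another thread inserted it first.
                let mut map = self.buckets.write().unwrap();
                map.entry(key.to_string())
                    .or_insert_with(|| Arc::new(Mutex::new(TokenBucket {
                        capacity: self.capacity,
                        refill_per_sec: self.refill_per_sec,
                        tokens: self.capacity,
                        last: now,
                    })))
                    .clone()
            }
        };
        let mut guard = bucket.lock().unwrap();
        guard.try_take(now)
    }
}

/// Pull the key from `Authorization: Bearer <key>` or `X-Api-Key: <key>`.
/// Returning None models an anonymous request (left to a global limiter).
fn api_key<'a>(authorization: Option<&'a str>, x_api_key: Option<&'a str>) -> Option<&'a str> {
    authorization.and_then(|v| v.strip_prefix("Bearer ")).or(x_api_key)
}

fn main() {
    let limiter = PerKeyLimiter::new(2.0, 1.0); // burst of 2, 1 token/second
    let t0 = Instant::now();
    let key = api_key(Some("Bearer sk-local"), None).unwrap();
    println!("{}", limiter.allow(key, t0)); // first request in the burst
    println!("{}", limiter.allow(key, t0)); // second request in the burst
    println!("{}", limiter.allow(key, t0)); // bucket is empty at this instant
    println!("{}", limiter.allow(key, t0 + Duration::from_secs(1))); // refilled
}
```

Keeping the write lock off the hot path means steady-state traffic from known keys contends only on each key's own bucket, not on the map as a whole.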

oxillama-quant + oxillama-gpu — SIMD and on-device sampling. New Iq2XxsAvx512, Iq2XsAvx512, Iq3SAvx512, and Iq4XsAvx512 kernels mirror the AVX2 templates with __m512i, using _mm512_permutexvar_epi8 for grid lookup via AVX-512BW for roughly 2x per-iteration throughput, all runtime-guarded by is_x86_feature_detected!("avx512bw"). A fused matvec_q8 does single-pass dequant-plus-dot in registers (no scratch f32) for Q5_0 / Q5_1 / Q8_1. AVX-512 K-quant kernels (Q2_K, Q3_K) and legacy completeness (Q4_1/Q5_1/Q8_1) mean all 11 legacy quant types now have a full four-tier SIMD ladder: AVX-512 → AVX2 → NEON → scalar. On the GPU side, a sampling.wgsl shader implements softmax_logits (a 256-thread shared-memory two-pass reduction with temperature scaling and a temp=0 argmax path), topk_partition (workgroup cooperative selection for k ≤ 256), and sample_categorical (an LCG RNG plus CDF walk), exposed through a SamplingKernel API (softmax, top_k, sample) with a CPU-reference fallback and a graceful GpuError::NoAdapter when no device is found.
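For readers who want to see the sampling math without the shader, here is a CPU sketch of the same three ideas: a two-pass softmax with temperature scaling, an argmax path for temperature 0, and categorical sampling via an LCG plus a CDF walk. It is not the SamplingKernel CPU-reference fallback itself; the function names, the Lcg type, and the LCG constants (the widely used Numerical Recipes pair) are assumptions.

```rust
/// Softmax with temperature. Pass 1 finds the max (for numerical stability),
/// pass 2 exponentiates and normalizes. A temperature of exactly 0 returns
/// a one-hot vector at the argmax, i.e. greedy decoding.
fn softmax_with_temperature(logits: &[f32], temperature: f32) -> Vec<f32> {
    assert!(!logits.is_empty(), "logits must not be empty");

    let (argmax, max) = logits
        .iter()
        .copied()
        .enumerate()
        .fold((0usize, f32::NEG_INFINITY), |best, (i, x)| {
            if x > best.1 { (i, x) } else { best }
        });

    if temperature == 0.0 {
        let mut one_hot = vec![0.0; logits.len()];
        one_hot[argmax] = 1.0;
        return one_hot;
    }

    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Minimal 32-bit linear congruential generator (assumed constants).
struct Lcg(u32);

impl Lcg {
    /// Advance the state and return a float in [0, 1) from the top 24 bits.
    fn next_f32(&mut self) -> f32 {
        self.0 = self.0.wrapping_mul(1_664_525).wrapping_add(1_013_904_223);
        (self.0 >> 8) as f32 / (1u32 << 24) as f32
    }
}

/// Draw an index by walking the cumulative distribution until it exceeds
/// the uniform sample. Falls back to the last index on rounding shortfall.
fn sample_categorical(probs: &[f32], rng: &mut Lcg) -> usize {
    let u = rng.next_f32();
    let mut cdf = 0.0;
    for (i, &p) in probs.iter().enumerate() {
        cdf += p;
        if u < cdf {
            return i;
        }
    }
    probs.len().saturating_sub(1)
}

fn main() {
    let logits = [1.0, 2.0, 3.0];
    let probs = softmax_with_temperature(&logits, 0.8);
    let mut rng = Lcg(42);
    let token = sample_categorical(&probs, &mut rng);
    println!("probs = {probs:?}, sampled index = {token}");
}
```

An LCG is a weak generator by general standards, but it needs only one multiply and one add per draw and a single word of state, which is what makes it convenient inside a GPU workgroup.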

oxillama-py — DLPack and Torch. A full DLPack v0.8 producer and consumer bridges Vec<f32> ↔ PyCapsule (vec_to_dlpack() / dlpack_to_vec()), with PyEngine::logits_dlpack() and embeddings_dlpack(). A bundled torch_helper.py adds Engine.logits_torch(text) -> torch.Tensor and Engine.embeddings_torch(text) via a lazy import torch and torch.from_dlpack(...) — with no Rust-level torch dependency, so you get a clean ImportError if torch is absent rather than a build-time coupling.

Getting Started

Add the library to your project:

cargo add oxillama

Or grab the CLI:

cargo install oxillama-cli

OxiLLaMa targets Rust 1.86+. A minimal run that exercises a couple of the new sampler fields and an embedding pooling mode:

use oxillama::{InferenceEngine, SamplerConfig, PoolingMode};

fn main() -> anyhow::Result<()> {
    let mut engine = InferenceEngine::from_gguf("models/llama-3-8b-q4_k_m.gguf")?;

    // Opt into the new samplers. Leaving these at default keeps output
    // byte-identical to 0.1.2; setting them turns the new stages on.
    let sampler = SamplerConfig {
        dry_multiplier: 0.8,
        dry_base: 1.75,
        dry_allowed_length: 2,
        typical_p: 0.95,
        ..Default::default()
    };

    let text = engine.generate("Explain Pure Rust inference in one sentence.", &sampler)?;
    println!("{text}");

    // Mean-pool an embedding instead of last-token pooling.
    let vec = engine.embed_with("sovereign inference", PoolingMode::Mean)?;
    println!("dim = {}", vec.len());
    Ok(())
}
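
To see what the pooling mode changes, the sketch below reduces a matrix of per-token hidden states to a single vector in each of the four ways. It is a standalone illustration rather than the engine's implementation: the local PoolingMode enum merely mirrors the variant names, the pool function is hypothetical, and treating Cls as "first token" is an assumption based on the usual convention.

```rust
/// Local stand-in mirroring the variant names of the engine's enum.
#[derive(Clone, Copy, Debug)]
enum PoolingMode {
    Last,
    Mean,
    Max,
    Cls,
}

/// Reduce per-token hidden states (one row per token, all rows the same
/// width) to a single embedding vector.
fn pool(hidden: &[Vec<f32>], mode: PoolingMode) -> Vec<f32> {
    assert!(!hidden.is_empty(), "need at least one token");
    let dim = hidden[0].len();

    match mode {
        // Final token's state: the usual choice for decoder-only models.
        PoolingMode::Last => hidden[hidden.len() - 1].clone(),
        // First token's state (assumed meaning of Cls).
        PoolingMode::Cls => hidden[0].clone(),
        // Element-wise average over all tokens.
        PoolingMode::Mean => {
            let mut out = vec![0.0; dim];
            for row in hidden {
                for (o, &x) in out.iter_mut().zip(row) {
                    *o += x;
                }
            }
            let n = hidden.len() as f32;
            out.iter_mut().for_each(|o| *o /= n);
            out
        }
        // Element-wise maximum over all tokens.
        PoolingMode::Max => {
            let mut out = vec![f32::NEG_INFINITY; dim];
            for row in hidden {
                for (o, &x) in out.iter_mut().zip(row) {
                    *o = o.max(x);
                }
            }
            out
        }
    }
}

fn main() {
    // Three tokens, two hidden dimensions.
    let hidden = vec![vec![1.0, 4.0], vec![3.0, 2.0], vec![5.0, 0.0]];
    for mode in [PoolingMode::Last, PoolingMode::Mean, PoolingMode::Max, PoolingMode::Cls] {
        println!("{mode:?} -> {:?}", pool(&hidden, mode));
    }
}
```

Every mode returns a vector of the same width, so switching modes leaves vec.len() unchanged; only the values differ.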

Prefer the server? Start it and hit the new endpoint with curl:

oxillama serve --model models/llama-3-8b-q4_k_m.gguf --port 8080

curl http://localhost:8080/v1/responses \
  -H "Authorization: Bearer sk-local" \
  -H "Content-Type: application/json" \
  -d '{"model":"llama-3-8b","input":"Hello from Pure Rust"}'

What’s New in 0.1.3

Per-crate test breakdown for 0.1.3: gguf=278, quant=401, arch=451, runtime=420, server=195, cli=42, bench=129, gpu=201, wasm=213, py=131.

On x86-64 with 8 cores and AVX2, OxiLLaMa continues to target at least 80% of llama.cpp’s throughput: LLaMA-3-8B Q4_K_M lands around 30 t/s (target ≥ 25 t/s), Mistral-7B Q4_K_M around 32 t/s (target ≥ 27 t/s), and Bonsai-8B Q1_0_G128 around 25 t/s (target ≥ 22 t/s).

Tips

This is the foundation

OxiLLaMa fits the COOLJAPAN Pure Rust stack as it stands today. Tensor primitives and neural ops come from SciRS2 (scirs2-core / scirs2-linalg / scirs2-neural 0.4.3); the dense math leans on OxiBLAS (0.2.1); RoPE rides on OxiFFT (0.3); Japanese tokenization is handled by MeCrab; serialization uses oxicode (0.2.2); and the aggressive 1-bit Q1_0_G128 quantization that powers Bonsai-8B comes from OxiBonsai. Remote GGUF loading and the GPU path round it out with ureq 3.3 and wgpu 29.0.3. Every layer is Pure Rust, all the way down.

Repository: https://github.com/cool-japan/oxillama

Star the repo if you believe LLM inference should be memory-safe, dependency-light, and yours to deploy anywhere. Pure Rust LLM inference is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, May 5, 2026
