COOLJAPAN

Posts tagged #continuous-batching

1 posts

Apr 24, 2026 · 8 min

OxiLLaMa 0.1.1 Released — FlashAttention, True Continuous Batching, and 5 New Architectures in Pure Rust

OxiLLaMa is a Pure Rust LLM inference engine — the sovereign alternative to llama.cpp. Version 0.1.1 ships a tiled FlashAttention CPU kernel, true continuous batching with zero padding waste, fused dequant+GEMM (~12% Q4_K_M decode gain), 5 new architectures (DBRX, Grok-1, Mamba-2, DeepSeek-V3, and more), and GPU coverage extended to 10 quantization types.

releaseoxillamallm-inference