Posts tagged #sampling

1 post

May 5, 2026 · 11 min

OxiLLaMa 0.1.3 Released — BLOOM + Phi-3.5-MoE, a 5-Stage Advanced Sampler Suite, and /v1/responses with Zero-Copy Torch Interop

OxiLLaMa is a pure-Rust LLM inference engine and sovereign alternative to llama.cpp. Release 0.1.3 adds the BLOOM and Phi-3.5-MoE architectures (27 supported in total), a 5-stage advanced sampler suite (DRY, XTC, TypicalP, TopA, Eta) that produces byte-identical output at default settings, embedding pooling modes, a drop-in /v1/responses API with per-API-key rate limiting, AVX-512 IQ kernels at ~2x per-iteration throughput, GPU-resident sampling kernels, and zero-copy DLPack PyTorch interop, with 2,461 tests passing.

#release #oxillama #llm-inference