COOLJAPAN

Posts tagged #llama.cpp

4 posts

May 5, 2026 · 11 min

OxiLLaMa 0.1.3 Released — BLOOM + Phi-3.5-MoE, a 5-Stage Advanced Sampler Suite, and /v1/responses with Zero-Copy Torch Interop

OxiLLaMa 0.1.3 is the Pure Rust LLM inference engine and sovereign alternative to llama.cpp. This release adds BLOOM + Phi-3.5-MoE architectures (now 27 total), a 5-stage advanced sampler suite (DRY/XTC/TypicalP/TopA/Eta) that is byte-identical at defaults, embedding pooling modes, a drop-in /v1/responses API with per-API-key rate limiting, AVX-512 IQ kernels at ~2x per-iteration throughput, GPU-resident sampling kernels, and zero-copy DLPack PyTorch interop — 2,461 tests passing.

release oxillama llm-inference
Apr 25, 2026 · 7 min

OxiLLaMa 0.1.2 Released — HuggingFace Hub Pulls, Full-Screen TUI Chat, and Conversation Save/Resume in Pure Rust

OxiLLaMa 0.1.2 is the Pure Rust LLM inference engine and sovereign alternative to llama.cpp. This release adds `oxillama hub pull/list/rm` (hf-hub, no Python), a full-screen TUI chat with live streaming (ratatui), conversation save/resume serialized via oxicode with SHA-256 integrity, and real weight loading for DBRX, Grok-1, and Mamba-2.

release oxillama llm-inference
Apr 24, 2026 · 8 min

OxiLLaMa 0.1.1 Released — FlashAttention, True Continuous Batching, and 5 New Architectures in Pure Rust

OxiLLaMa is a Pure Rust LLM inference engine — the sovereign alternative to llama.cpp. Version 0.1.1 ships a tiled FlashAttention CPU kernel, true continuous batching with zero padding waste, fused dequant+GEMM (~12% Q4_K_M decode gain), 5 new architectures (DBRX, Grok-1, Mamba-2, DeepSeek-V3, and more), and GPU coverage extended to 10 quantization types.

release oxillama llm-inference
Apr 15, 2026 · 3 min

OxiLLaMa 0.1.0 Released — Pure Rust LLM Inference Engine, Sovereign Alternative to llama.cpp

Complete GGUF loading + 25 quantized formats + OpenAI-compatible API server — all in pure Rust. 56.2k SLoC, 11 crates, no C/C++/Fortran, built on SciRS2/OxiBLAS/OxiFFT. ≥80% of llama.cpp throughput, WASM/GPU/Python bindings, LLaMA/Mistral/Gemma/Phi/LLaVA support. The sovereign LLM inference layer for SciRS2 and the entire COOLJAPAN ecosystem (now 21M+ SLoC total).

release oxillama llm-inference