
OxiLLaMa 0.1.2 Released — HuggingFace Hub Pulls, Full-Screen TUI Chat, and Conversation Save/Resume in Pure Rust

OxiLLaMa 0.1.2 is the Pure Rust LLM inference engine and sovereign alternative to llama.cpp. This release adds `oxillama hub pull/list/rm` (hf-hub, no Python), a full-screen TUI chat with live streaming (ratatui), conversation save/resume serialized via oxicode with SHA-256 integrity, and real weight loading for DBRX, Grok-1, and Mamba-2.

release oxillama llm-inference gguf llama.cpp pure-rust huggingface tui scirs2

Manage models, chat, and resume conversations — all from one Pure Rust binary, with no Python and no C in sight.

Today we released OxiLLaMa 0.1.2 — a focused developer-experience and model-coverage point release that adds direct HuggingFace Hub model management, a full-screen streaming TUI chat, durable conversation save/resume, and real weight loading for three more architectures.

No C. No C++. No Fortran. No FFI. No system libraries. Where llama.cpp drags a C++ build chain and platform-specific toolchains behind it, OxiLLaMa compiles to a single static binary — or to WebAssembly — from one codebase. The whole engine is ~107,000 lines of Pure Rust across 11 crates, with 2,020 tests passing.

Why OxiLLaMa 0.1.2 matters

llama.cpp is impressive engineering, but it is C and C++ all the way down. That means manual memory management (and the segfaults that come with it), a heavyweight build with platform-specific dependencies, awkward WebAssembly and embedded stories, and — for anything beyond raw inference — a detour through Python tooling just to fetch and manage models.

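OxiLLaMa 0.1.2 closes several of those gaps at once: models are fetched and managed from the HuggingFace Hub without Python, chat runs in a full-screen streaming TUI, conversations can be saved and resumed with integrity checks, and DBRX, Grok-1, and Mamba-2 now load real weights.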

All of it is memory-safe Rust, and all of it ships in the same binary you already build.

Technical Deep Dive: Hub, TUI, sessions, and architectures

oxillama-cli — model management and the TUI. The new hub subcommand exposes hub pull, hub list, and hub rm, built on hf-hub 0.5. Downloads go over ureq with rustls for Pure Rust TLS, and the directories crate resolves platform-appropriate cache paths so models land in the right place on Linux, macOS, and Windows alike. The TUI chat mode (chat --tui) is built on ratatui 0.30 and crossterm 0.29: it renders a scrollable chat history and an input line, and drives live streaming through spawn_blocking plus an mpsc channel so token delivery never blocks the event loop. Six unit tests cover layout, input handling, and message rendering.
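To make the streaming pattern concrete, here is a minimal sketch of a producer that generates tokens off the UI thread and an event loop that drains them without blocking. It is an illustration under assumptions, not the oxillama-cli code: std::thread::spawn and std::sync::mpsc stand in for the runtime's spawn_blocking and whichever mpsc channel the project uses, and generate_tokens, drain_ready, and stream_reply are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a blocking token generator.
fn generate_tokens(prompt: &str, tx: mpsc::Sender<String>) {
    for word in prompt.split_whitespace() {
        // A real engine would run a forward pass here.
        // A send error means the UI side has hung up, so stop early.
        if tx.send(word.to_string()).is_err() {
            break;
        }
    }
    // `tx` is dropped when this function returns, which closes the channel.
}

/// Appends every token that is already queued, without blocking.
/// Returns true once the producer has finished and the queue is empty.
fn drain_ready(rx: &mpsc::Receiver<String>, transcript: &mut String) -> bool {
    loop {
        match rx.try_recv() {
            Ok(token) => {
                if !transcript.is_empty() {
                    transcript.push(' ');
                }
                transcript.push_str(&token);
            }
            Err(mpsc::TryRecvError::Empty) => return false,
            Err(mpsc::TryRecvError::Disconnected) => return true,
        }
    }
}

/// Runs generation on a worker thread while the caller's loop stays responsive.
fn stream_reply(prompt: &str) -> String {
    let (tx, rx) = mpsc::channel();
    let prompt = prompt.to_string();
    let worker = thread::spawn(move || generate_tokens(&prompt, tx));

    let mut transcript = String::new();
    // In a TUI, each pass of this loop would also poll input and redraw.
    while !drain_ready(&rx, &mut transcript) {
        thread::yield_now();
    }
    worker.join().expect("token worker panicked");
    transcript
}

fn main() {
    println!("{}", stream_reply("tokens arrive one at a time"));
}
```

The key design point is that the event loop only ever calls try_recv, so a slow forward pass delays new tokens but never freezes scrolling or input handling.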

oxillama-runtime — session persistence. Conversation save/resume lives in session.rs. Session::save() and Session::load() serialize the full conversation history via oxicode, the COOLJAPAN Pure Rust codec. A SHA-256 KV sidecar validates integrity on load, and a schema-version guard rejects incompatible session files outright rather than misinterpreting them. The result is conversations that survive process restarts without silent corruption.
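The two load-time guards can be sketched as follows. Everything here is an assumption made for illustration: the file layout (a 4-byte little-endian version header followed by the payload), the error type, the order of the checks, and the function names are hypothetical, the payload is plain UTF-8 rather than oxicode, and FNV-1a is a non-cryptographic stand-in for SHA-256 so the sketch stays std-only.

```rust
/// Hypothetical schema version written at the start of every session file.
const SCHEMA_VERSION: u32 = 1;

#[derive(Debug, PartialEq)]
enum LoadError {
    Truncated,
    SchemaMismatch { found: u32 },
    DigestMismatch,
}

/// FNV-1a 64-bit. A STAND-IN for SHA-256, used only to avoid external crates.
fn digest(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Returns the session file bytes and the digest that would go in the sidecar.
fn save(history: &str) -> (Vec<u8>, u64) {
    let mut file = SCHEMA_VERSION.to_le_bytes().to_vec();
    file.extend_from_slice(history.as_bytes());
    let sidecar = digest(&file);
    (file, sidecar)
}

/// Rejects unknown schema versions first, then verifies the sidecar digest.
fn load(file: &[u8], sidecar: u64) -> Result<String, LoadError> {
    if file.len() < 4 {
        return Err(LoadError::Truncated);
    }
    let found = u32::from_le_bytes([file[0], file[1], file[2], file[3]]);
    if found != SCHEMA_VERSION {
        return Err(LoadError::SchemaMismatch { found });
    }
    if digest(file) != sidecar {
        return Err(LoadError::DigestMismatch);
    }
    Ok(String::from_utf8_lossy(&file[4..]).into_owned())
}

fn main() {
    let (file, sidecar) = save("user: hello\nassistant: hi");
    println!("{:?}", load(&file, sidecar));
}
```

Failing with a typed error at either guard is what turns a damaged or outdated file into a clear refusal instead of a conversation that silently resumes with the wrong contents.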

oxillama-arch — KV cache and architectures. The KvCacheAccess trait gains kv_dim(), for_each_key(), and for_each_value(), and PagedKvCache fully implements multi-page support for all three. BatchedKvView and KvSlot were promoted to oxillama-arch/traits.rs for cross-crate reuse, and ForwardPass::forward_batched now has a default trait implementation alongside a concrete optimized LLaMA specialisation. On top of that, the previously stubbed DBRX (mixture-of-experts), Grok-1, and Mamba-2 loaders now perform full weight loading from GGUF — Mamba-2 even gets an embed() override for its token-embedding logic — and all three pass end-to-end forward-pass tests.
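As a rough picture of what a paged cache has to do to satisfy those three methods, consider the toy below. Only the names KvCacheAccess, kv_dim, for_each_key, and for_each_value come from the release; the signatures, the callback shape, and ToyPagedKvCache itself are assumptions made for this sketch and may differ from the real trait.

```rust
/// Hypothetical shape for the trait; the real signatures may differ.
trait KvCacheAccess {
    /// Width of one key (or value) row.
    fn kv_dim(&self) -> usize;
    /// Calls `f(position, row)` for every cached key, in token order.
    fn for_each_key(&self, f: &mut dyn FnMut(usize, &[f32]));
    /// Calls `f(position, row)` for every cached value, in token order.
    fn for_each_value(&self, f: &mut dyn FnMut(usize, &[f32]));
}

/// Toy cache that stores rows in fixed-size pages instead of one buffer.
struct ToyPagedKvCache {
    kv_dim: usize,
    page_tokens: usize,
    len: usize,
    key_pages: Vec<Vec<f32>>,
    value_pages: Vec<Vec<f32>>,
}

impl ToyPagedKvCache {
    fn new(kv_dim: usize, page_tokens: usize) -> Self {
        assert!(kv_dim > 0 && page_tokens > 0);
        Self { kv_dim, page_tokens, len: 0, key_pages: Vec::new(), value_pages: Vec::new() }
    }

    fn push(&mut self, key: &[f32], value: &[f32]) {
        assert_eq!(key.len(), self.kv_dim);
        assert_eq!(value.len(), self.kv_dim);
        // Allocate a fresh page when there is none yet or the last one is full.
        if self.len % self.page_tokens == 0 {
            let capacity = self.kv_dim * self.page_tokens;
            self.key_pages.push(Vec::with_capacity(capacity));
            self.value_pages.push(Vec::with_capacity(capacity));
        }
        self.key_pages.last_mut().unwrap().extend_from_slice(key);
        self.value_pages.last_mut().unwrap().extend_from_slice(value);
        self.len += 1;
    }
}

/// Visits rows page by page while reporting one continuous token position.
fn walk(pages: &[Vec<f32>], kv_dim: usize, f: &mut dyn FnMut(usize, &[f32])) {
    let mut pos = 0;
    for page in pages {
        for row in page.chunks_exact(kv_dim) {
            f(pos, row);
            pos += 1;
        }
    }
}

impl KvCacheAccess for ToyPagedKvCache {
    fn kv_dim(&self) -> usize {
        self.kv_dim
    }
    fn for_each_key(&self, f: &mut dyn FnMut(usize, &[f32])) {
        walk(&self.key_pages, self.kv_dim, f);
    }
    fn for_each_value(&self, f: &mut dyn FnMut(usize, &[f32])) {
        walk(&self.value_pages, self.kv_dim, f);
    }
}

fn main() {
    let mut cache = ToyPagedKvCache::new(2, 2);
    for t in 0..3 {
        let x = t as f32;
        cache.push(&[x, x + 0.5], &[-x, -x - 0.5]);
    }
    cache.for_each_key(&mut |pos: usize, row: &[f32]| println!("key[{pos}] = {row:?}"));
}
```

Callers see one continuous sequence of positions even though the third row lives on a second page, which is the property a visitor-style interface buys over exposing a single contiguous slice.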

oxillama-quant — quantization coverage. Complete IQ3_S and IQ3_XXS codebook tables were added, so both formats now dequantize correctly across all kernel paths.

Getting Started

Add the library to your project:

cargo add oxillama

Then grab a model and start chatting — no Python required:

# Pull a GGUF model directly from HuggingFace Hub
oxillama hub pull some-org/some-model-gguf

# Launch the full-screen TUI chat against the downloaded weights
oxillama chat --tui --model some-model.Q4_K_M.gguf

Prefer a one-shot generation or an HTTP endpoint? Both work from the same binary:

# Single prompt
oxillama run --model some-model.Q4_K_M.gguf --prompt "Explain RoPE in one paragraph." --max-tokens 256 --temp 0.7

# OpenAI-compatible server
oxillama serve --model some-model.Q4_K_M.gguf --host 0.0.0.0 --port 8080

The server speaks the OpenAI-compatible HTTP API — POST /v1/chat/completions, /v1/completions, and /v1/embeddings — with SSE streaming terminated by [DONE].
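On the client side, consuming such a stream comes down to reading data: lines until the [DONE] sentinel. The sketch below shows only that framing step over an already-received body. The function name sse_payloads is hypothetical, the sample chunks follow the usual OpenAI streaming shape rather than captured OxiLLaMa output, and a real client would read incrementally from the socket and parse each payload as JSON.

```rust
/// Collects the payload of each `data:` line, stopping at the `[DONE]` sentinel.
fn sse_payloads(body: &str) -> Vec<&str> {
    let mut payloads = Vec::new();
    for line in body.lines() {
        // Blank separator lines and ":" comment lines carry no data and are skipped.
        if let Some(rest) = line.strip_prefix("data:") {
            // Server-sent events allow one optional space after the colon.
            let payload = rest.strip_prefix(' ').unwrap_or(rest);
            if payload == "[DONE]" {
                break;
            }
            payloads.push(payload);
        }
    }
    payloads
}

fn main() {
    // Illustrative body only; not output captured from a running server.
    let body = concat!(
        "data: {\"choices\":[{\"delta\":{\"content\":\"Hel\"}}]}\n\n",
        "data: {\"choices\":[{\"delta\":{\"content\":\"lo\"}}]}\n\n",
        "data: [DONE]\n\n",
    );
    for payload in sse_payloads(body) {
        println!("{payload}");
    }
}
```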

On x86-64 with 8 cores and AVX2, OxiLLaMa targets at least 80% of llama.cpp throughput: LLaMA-3-8B Q4_K_M lands around 30 t/s (target >= 25 t/s), Mistral-7B Q4_K_M around 32 t/s (target >= 27 t/s), and OxiBonsai’s Bonsai-8B Q1_0_G128 1-bit quant around 25 t/s (target >= 22 t/s).

What’s New in 0.1.2

Tips

This is the foundation

OxiLLaMa 0.1.2 fits cleanly into the COOLJAPAN Pure Rust stack as it stands today. Tensor primitives and neural ops come from SciRS2 (scirs2-core/linalg/neural 0.4.2), dense linear algebra from OxiBLAS 0.2.1, and RoPE’s FFT from OxiFFT 0.2.0 — with MeCrab handling Japanese tokenization. Session persistence is serialized through oxicode 0.2.1, model downloads ride hf-hub 0.5 over ureq/rustls for fully Pure Rust networking, and OxiBonsai’s Q1_0_G128 1-bit quantization keeps even 8B-class models lean. Every layer is C/C++/Fortran-free, so the whole thing compiles to native binaries, WebAssembly, and embedded targets from a single codebase.

Repository: https://github.com/cool-japan/oxillama

Star the repo if you want LLM inference you can read, audit, and ship anywhere — without a C++ toolchain or a Python detour.

Pure Rust LLM inference is here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, April 25, 2026
