OxiLLaMa 0.1.2 Released — HuggingFace Hub Pulls, Full-Screen TUI Chat, and Conversation Save/Resume in Pure Rust

Manage models, chat, and resume conversations — all from one Pure Rust binary, with no Python and no C in sight.

Today we released OxiLLaMa 0.1.2 — a focused developer-experience and model-coverage point release that adds direct HuggingFace Hub model management, a full-screen streaming TUI chat, durable conversation save/resume, and real weight loading for three more architectures.

No C. No C++. No Fortran. No FFI. No system libraries. Where llama.cpp drags a C++ build chain and platform-specific toolchains behind it, OxiLLaMa compiles to a single static binary — or to WebAssembly — from one codebase. The whole engine is ~107,000 lines of Pure Rust across 11 crates, with 2,020 tests passing.

Why OxiLLaMa 0.1.2 matters

llama.cpp is impressive engineering, but it is C and C++ all the way down. That means manual memory management (and the segfaults that come with it), a heavyweight build with platform-specific dependencies, awkward WebAssembly and embedded stories, and — for anything beyond raw inference — a detour through Python tooling just to fetch and manage models.

OxiLLaMa 0.1.2 closes several of those gaps at once:

Download models straight from HuggingFace Hub — oxillama hub pull <org/repo> fetches GGUF weights with zero Python in the loop, using hf-hub over Pure Rust TLS.
Chat in a full-screen terminal UI — oxillama chat --tui gives you a scrollable history and an input line, with live token streaming that does not block the UI event loop.
Save and resume conversations across restarts — the chat REPL’s /save and /load persist full history, with a SHA-256 sidecar that verifies state on load.
Three more architectures load real weights — Databricks DBRX, xAI Grok-1, and the Mamba-2 state-space model are no longer stubs; they load and run end to end.

All of it is memory-safe Rust, and all of it ships in the same binary you already build.

Technical Deep Dive: Hub, TUI, sessions, and architectures

oxillama-cli — model management and the TUI. The new hub subcommand exposes hub pull, hub list, and hub rm, built on hf-hub 0.5. Downloads go over ureq with rustls for Pure Rust TLS, and the directories crate resolves platform-appropriate cache paths so models land in the right place on Linux, macOS, and Windows alike. The TUI chat mode (chat --tui) is built on ratatui 0.30 and crossterm 0.29: it renders a scrollable chat history and an input line, and drives live streaming through spawn_blocking plus an mpsc channel so token delivery never blocks the event loop. Six unit tests cover layout, input handling, and message rendering.
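
The streaming design described above (blocking generation on one side, a channel feeding the UI on the other) can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the oxillama-cli code: std::thread::spawn stands in for an async runtime's spawn_blocking, std::sync::mpsc plays the role of the channel, and spawn_generation and drain_ready are hypothetical names.

```rust
use std::sync::mpsc::{self, Receiver, TryRecvError};
use std::thread;

/// Runs a (hypothetical) blocking token generator on a worker thread and
/// returns the receiving end of a channel that the UI loop can poll.
pub fn spawn_generation(tokens: Vec<String>) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for token in tokens {
            // Stop early if the UI dropped the receiver (for example, the user quit).
            if tx.send(token).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which signals end-of-stream to the receiver.
    });
    rx
}

/// One tick of the UI loop: append whatever tokens are ready, without waiting.
/// Returns `false` once the generator has finished and the channel is drained.
pub fn drain_ready(rx: &Receiver<String>, reply: &mut String) -> bool {
    loop {
        match rx.try_recv() {
            Ok(token) => reply.push_str(&token),
            Err(TryRecvError::Empty) => return true,
            Err(TryRecvError::Disconnected) => return false,
        }
    }
}
```

Because drain_ready uses try_recv rather than recv, a tick with no pending tokens returns immediately, so the loop stays free to handle key presses and redraws between tokens.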

oxillama-runtime — session persistence. Conversation save/resume lives in session.rs. Session::save() and Session::load() serialize the full conversation history via oxicode, the COOLJAPAN Pure Rust codec. A SHA-256 KV sidecar validates integrity on load, and a schema-version guard rejects incompatible session files outright rather than misinterpreting them. The result is conversations that survive process restarts without silent corruption.
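
The two guards described above, an integrity digest and a schema-version check, can be illustrated with a small std-only sketch. Everything here is an assumption made for illustration: the byte layout, the SCHEMA_VERSION value, and the function names are hypothetical, the encoding is a toy length-prefixed format rather than oxicode, and FNV-1a is used only as a stand-in for SHA-256 so the example needs no external crates.

```rust
/// Hypothetical schema version; not OxiLLaMa's real value.
pub const SCHEMA_VERSION: u32 = 1;

#[derive(Debug, PartialEq)]
pub enum LoadError {
    Malformed,
    SchemaMismatch { found: u32 },
    ChecksumMismatch,
}

/// FNV-1a 64-bit hash, a std-only stand-in for the SHA-256 digest.
pub fn checksum(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x0000_0100_0000_01b3);
    }
    hash
}

/// Toy layout: [version: u32 LE] then, per message, [len: u32 LE][UTF-8 bytes].
/// Returns the file bytes and the digest that would go into the sidecar.
pub fn save(history: &[String]) -> (Vec<u8>, u64) {
    let mut file = SCHEMA_VERSION.to_le_bytes().to_vec();
    for msg in history {
        file.extend_from_slice(&(msg.len() as u32).to_le_bytes());
        file.extend_from_slice(msg.as_bytes());
    }
    let digest = checksum(&file);
    (file, digest)
}

fn read_u32(bytes: &[u8], at: usize) -> Option<u32> {
    let c = bytes.get(at..at + 4)?;
    Some(u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
}

pub fn load(file: &[u8], sidecar_digest: u64) -> Result<Vec<String>, LoadError> {
    // 1. Integrity: the bytes must match the digest recorded at save time.
    if checksum(file) != sidecar_digest {
        return Err(LoadError::ChecksumMismatch);
    }
    // 2. Compatibility: refuse files written under a different schema.
    let found = read_u32(file, 0).ok_or(LoadError::Malformed)?;
    if found != SCHEMA_VERSION {
        return Err(LoadError::SchemaMismatch { found });
    }
    // 3. Decode the length-prefixed messages.
    let mut history = Vec::new();
    let mut at = 4;
    while at < file.len() {
        let len = read_u32(file, at).ok_or(LoadError::Malformed)? as usize;
        let end = at + 4 + len;
        let bytes = file.get(at + 4..end).ok_or(LoadError::Malformed)?;
        let msg = String::from_utf8(bytes.to_vec()).map_err(|_| LoadError::Malformed)?;
        history.push(msg);
        at = end;
    }
    Ok(history)
}
```

Checking the digest and the version before decoding anything means a damaged or incompatible file fails with a specific error instead of being partially interpreted.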

oxillama-arch — KV cache and architectures. The KvCacheAccess trait gains kv_dim(), for_each_key(), and for_each_value(), and PagedKvCache fully implements multi-page support for all three. BatchedKvView and KvSlot were promoted to oxillama-arch/traits.rs for cross-crate reuse, and ForwardPass::forward_batched now has a default trait implementation alongside a concrete optimized LLaMA specialisation. On top of that, the previously stubbed DBRX (mixture-of-experts), Grok-1, and Mamba-2 loaders now perform full weight loading from GGUF — Mamba-2 even gets an embed() override for its token-embedding logic — and all three pass end-to-end forward-pass tests.
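
To make the new trait methods concrete, here is a rough sketch of how such an interface could look over a paged store. Only the method names kv_dim(), for_each_key(), and for_each_value() come from the release notes; every signature, the callback shape, and the ToyPagedKvCache type are assumptions for illustration and may differ from the real oxillama-arch definitions.

```rust
/// Hypothetical trait shape: the method names match the release notes,
/// but all signatures below are assumed.
pub trait KvCacheAccess {
    /// Number of f32 values in one cached key (or value) vector.
    fn kv_dim(&self) -> usize;
    /// Calls `f(position, key)` for every cached key in `layer`, in order.
    fn for_each_key(&self, layer: usize, f: &mut dyn FnMut(usize, &[f32]));
    /// Calls `f(position, value)` for every cached value in `layer`, in order.
    fn for_each_value(&self, layer: usize, f: &mut dyn FnMut(usize, &[f32]));
}

struct Page {
    keys: Vec<f32>,
    values: Vec<f32>,
}

/// Toy paged cache: each layer owns a list of fixed-capacity pages.
pub struct ToyPagedKvCache {
    kv_dim: usize,
    page_size: usize, // tokens per page
    layers: Vec<Vec<Page>>,
}

impl ToyPagedKvCache {
    /// `kv_dim` and `page_size` are assumed to be non-zero.
    pub fn new(n_layers: usize, kv_dim: usize, page_size: usize) -> Self {
        let layers = (0..n_layers).map(|_| Vec::new()).collect();
        ToyPagedKvCache { kv_dim, page_size, layers }
    }

    pub fn append(&mut self, layer: usize, key: &[f32], value: &[f32]) {
        assert_eq!(key.len(), self.kv_dim);
        assert_eq!(value.len(), self.kv_dim);
        let page_capacity = self.page_size * self.kv_dim;
        let pages = &mut self.layers[layer];
        // Open a fresh page when there is none yet or the last one is full.
        if pages.last().map_or(true, |p| p.keys.len() == page_capacity) {
            pages.push(Page { keys: Vec::new(), values: Vec::new() });
        }
        let page = pages.last_mut().expect("a page was just ensured");
        page.keys.extend_from_slice(key);
        page.values.extend_from_slice(value);
    }
}

impl KvCacheAccess for ToyPagedKvCache {
    fn kv_dim(&self) -> usize {
        self.kv_dim
    }

    fn for_each_key(&self, layer: usize, f: &mut dyn FnMut(usize, &[f32])) {
        // Walk pages in order so positions stay contiguous across page boundaries.
        let rows = self.layers[layer].iter().flat_map(|p| p.keys.chunks(self.kv_dim));
        for (pos, key) in rows.enumerate() {
            f(pos, key);
        }
    }

    fn for_each_value(&self, layer: usize, f: &mut dyn FnMut(usize, &[f32])) {
        let rows = self.layers[layer].iter().flat_map(|p| p.values.chunks(self.kv_dim));
        for (pos, value) in rows.enumerate() {
            f(pos, value);
        }
    }
}
```

The point of a callback-style iterator in a design like this is that callers see one contiguous sequence of positions per layer and never need to know where the page boundaries fall.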

oxillama-quant — quantization coverage. Complete IQ3_S and IQ3_XXS codebook tables were added, so both formats now dequantize correctly across all kernel paths.
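
For readers unfamiliar with codebook quantization, the sketch below shows the general idea: stored indices select rows of a fixed lookup table, which are then signed and scaled. It is a deliberately simplified toy, not the IQ3_S or IQ3_XXS layout. The 4-entry TOY_CODEBOOK, the ToyBlock fields, and the sign encoding are all invented for illustration; the real tables are much larger and the real block formats pack their bits differently.

```rust
/// Toy 4-entry codebook; each entry expands to 4 unsigned magnitudes.
/// The real IQ3 tables are far larger and are not reproduced here.
pub const TOY_CODEBOOK: [[u8; 4]; 4] = [
    [1, 1, 1, 1],
    [1, 3, 1, 3],
    [3, 1, 5, 1],
    [5, 5, 3, 7],
];

/// Hypothetical block layout: one scale shared by all groups in the block.
pub struct ToyBlock {
    pub scale: f32,
    /// One codebook index per group of 4 weights.
    pub indices: Vec<u8>,
    /// Per group, low 4 bits: bit i set means weight i is negative.
    pub signs: Vec<u8>,
}

/// Expands a block to f32 weights. Returns `None` if an index falls outside
/// the table, which is the kind of failure an incomplete codebook would cause.
pub fn dequantize(block: &ToyBlock, codebook: &[[u8; 4]]) -> Option<Vec<f32>> {
    if block.indices.len() != block.signs.len() {
        return None;
    }
    let mut out = Vec::with_capacity(block.indices.len() * 4);
    for (&index, &signs) in block.indices.iter().zip(&block.signs) {
        let entry = codebook.get(usize::from(index))?;
        for (i, &magnitude) in entry.iter().enumerate() {
            let sign: f32 = if (signs >> i) & 1 == 1 { -1.0 } else { 1.0 };
            out.push(sign * block.scale * f32::from(magnitude));
        }
    }
    Some(out)
}
```

With scale 0.5, index 1, and sign bits 0b0010, the group expands to 0.5, -1.5, 0.5, 1.5: the row [1, 3, 1, 3] scaled by 0.5 with the second weight negated.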

Getting Started

Add the library to your project:

cargo add oxillama

Then grab a model and start chatting — no Python required:

# Pull a GGUF model directly from HuggingFace Hub
oxillama hub pull some-org/some-model-gguf

# Launch the full-screen TUI chat against the downloaded weights
oxillama chat --tui --model some-model.Q4_K_M.gguf

Prefer a one-shot generation or an HTTP endpoint? Both work from the same binary:

# Single prompt
oxillama run --model some-model.Q4_K_M.gguf --prompt "Explain RoPE in one paragraph." --max-tokens 256 --temp 0.7

# OpenAI-compatible server
oxillama serve --model some-model.Q4_K_M.gguf --host 0.0.0.0 --port 8080

The server speaks the OpenAI-compatible HTTP API — POST /v1/chat/completions, /v1/completions, and /v1/embeddings — with SSE streaming terminated by [DONE].
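
On the client side, a streamed response arrives as server-sent events: a series of data: lines ending with the [DONE] sentinel. The following std-only sketch shows one way to pull the payloads out of an already-received body. The function name read_sse_payloads is hypothetical, and it assumes each event carries a single data: line, as OpenAI-style streams conventionally do; parsing the JSON inside each payload is left out.

```rust
/// Extracts the payload of every `data:` line from an SSE body, stopping at
/// the `[DONE]` sentinel. Returns the payloads and whether the sentinel was seen.
pub fn read_sse_payloads(body: &str) -> (Vec<String>, bool) {
    let mut payloads = Vec::new();
    for line in body.lines() {
        // Blank separator lines, comments (lines starting with `:`), and other
        // fields such as `event:` are skipped in this sketch.
        let rest = match line.strip_prefix("data:") {
            Some(rest) => rest,
            None => continue,
        };
        // SSE strips at most one space after the colon.
        let data = rest.strip_prefix(' ').unwrap_or(rest);
        if data == "[DONE]" {
            return (payloads, true);
        }
        payloads.push(data.to_string());
    }
    // The stream ended without the sentinel, e.g. the connection was cut.
    (payloads, false)
}
```

Returning the sentinel flag separately lets a caller distinguish a completed generation from a stream that was truncated mid-response.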

On x86-64 with 8 cores and AVX2, OxiLLaMa targets at least 80% of llama.cpp throughput: LLaMA-3-8B Q4_K_M lands around 30 t/s (target >= 25 t/s), Mistral-7B Q4_K_M around 32 t/s (target >= 27 t/s), and OxiBonsai’s Bonsai-8B Q1_0_G128 1-bit quant around 25 t/s (target >= 22 t/s).

What’s New in 0.1.2

Conversation save/resume — interactive /save and /load slash commands in the chat REPL; Session::save() / Session::load() serialize the full conversation history via oxicode, a SHA-256 KV sidecar validates integrity on load, and a schema-version guard rejects incompatible session files. Conversations now persist across process restarts.
oxillama hub subcommand — hub pull, hub list, and hub rm download models directly from HuggingFace Hub (built on hf-hub 0.5), list cached models, and remove cached entries. Uses ureq with rustls for Pure Rust TLS and the directories crate for cache paths. No Python required.
TUI chat mode — oxillama chat --tui launches a full-screen terminal UI (ratatui 0.30 + crossterm 0.29) with scrollable history, an input line, and live streaming via spawn_blocking + an mpsc channel so token delivery never blocks the event loop. 6 unit tests cover layout, input handling, and message rendering.
DBRX, Grok-1, Mamba-2 real weight loading — the previously stubbed loaders now perform full weight loading from GGUF: Databricks DBRX mixture-of-experts, xAI Grok-1, and the Mamba-2 state-space model (with an embed() override for Mamba-2-specific token embedding). All three pass end-to-end forward-pass tests.
KvCacheAccess trait extensions — new kv_dim(), for_each_key(), and for_each_value() methods; PagedKvCache fully implements multi-page support. BatchedKvView + KvSlot promoted to oxillama-arch/traits.rs; ForwardPass::forward_batched gains a default implementation plus a concrete optimized LLaMA specialisation.
IQ3_S and IQ3_XXS codebook tables — complete codebook data added; both formats now dequantize correctly across all kernel paths.

Tips

Manage your cache from the CLI. Pull what you need with oxillama hub pull <org/repo>, see what’s local with oxillama hub list, and reclaim disk space with oxillama hub rm <org/repo> — no manual cache spelunking required.
Persist a long conversation. Inside the chat REPL, type /save my-thread to snapshot the full history (oxicode-serialized, SHA-256-validated), then later /load my-thread to pick up exactly where you left off, even after a restart.
Use the TUI for long sessions. oxillama chat --tui gives you a scrollable, streaming UI that stays responsive while tokens arrive — much nicer than a flat REPL for extended back-and-forth.
Iterate the KV cache per layer. The new for_each_key() and for_each_value() methods on KvCacheAccess let you walk keys and values layer by layer, with kv_dim() reporting the per-entry dimension — handy for custom cache inspection or batched views.
Try the new architectures. DBRX, Grok-1, and Mamba-2 GGUFs now load real weights end to end — point oxillama run or oxillama chat at one to try them.
Install the CLI globally. cargo install oxillama-cli puts oxillama on your PATH so hub, chat, run, serve, and info are always a command away.

This is the foundation

OxiLLaMa 0.1.2 fits cleanly into the COOLJAPAN Pure Rust stack as it stands today. Tensor primitives and neural ops come from SciRS2 (scirs2-core/linalg/neural 0.4.2), dense linear algebra from OxiBLAS 0.2.1, and RoPE’s FFT from OxiFFT 0.2.0 — with MeCrab handling Japanese tokenization. Session persistence is serialized through oxicode 0.2.1, model downloads ride hf-hub 0.5 over ureq/rustls for fully Pure Rust networking, and OxiBonsai’s Q1_0_G128 1-bit quantization keeps even 8B-class models lean. Every layer is C/C++/Fortran-free, so the whole thing compiles to native binaries, WebAssembly, and embedded targets from a single codebase.

Repository: https://github.com/cool-japan/oxillama

Star the repo if you want LLM inference you can read, audit, and ship anywhere — without a C++ toolchain or a Python detour.

Pure Rust LLM inference is here — fast, safe, and sovereign.

— KitaSan at COOLJAPAN OÜ, April 25, 2026