Manage models, chat, and resume conversations — all from one Pure Rust binary, with no Python and no C in sight.
Today we released OxiLLaMa 0.1.2 — a focused developer-experience and model-coverage point release that adds direct HuggingFace Hub model management, a full-screen streaming TUI chat, durable conversation save/resume, and real weight loading for three more architectures.
No C. No C++. No Fortran. No FFI. No system libraries. Where llama.cpp drags a C++ build chain and platform-specific toolchains behind it, OxiLLaMa compiles to a single static binary — or to WebAssembly — from one codebase. The whole engine is ~107,000 lines of Pure Rust across 11 crates, with 2,020 tests passing.
Why OxiLLaMa 0.1.2 matters
llama.cpp is impressive engineering, but it is C and C++ all the way down. That means manual memory management (and the segfaults that come with it), a heavyweight build with platform-specific dependencies, awkward WebAssembly and embedded stories, and — for anything beyond raw inference — a detour through Python tooling just to fetch and manage models.
OxiLLaMa 0.1.2 closes several of those gaps at once:
- Download models straight from HuggingFace Hub: oxillama hub pull <org/repo> fetches GGUF weights with zero Python in the loop, using hf-hub over Pure Rust TLS.
- Chat in a full-screen terminal UI: oxillama chat --tui gives you a scrollable history and an input line with live token streaming that never stutters.
- Save and resume conversations across restarts: the chat REPL’s /save and /load persist full history, with a SHA-256 sidecar that verifies state on load.
- Three more architectures load real weights: Databricks DBRX, xAI Grok-1, and the Mamba-2 state-space model are no longer stubs; they load and run end to end.
All of it is memory-safe Rust, and all of it ships in the same binary you already build.
Technical Deep Dive: Hub, TUI, sessions, and architectures
oxillama-cli — model management and the TUI. The new hub subcommand exposes hub pull, hub list, and hub rm, built on hf-hub 0.5. Downloads go over ureq with rustls for Pure Rust TLS, and the directories crate resolves platform-appropriate cache paths so models land in the right place on Linux, macOS, and Windows alike. The TUI chat mode (chat --tui) is built on ratatui 0.30 and crossterm 0.29: it renders a scrollable chat history and an input line, and drives live streaming through spawn_blocking plus an mpsc channel so token delivery never blocks the event loop. Six unit tests cover layout, input handling, and message rendering.
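The streaming pattern described above (a blocking producer feeding a channel that the UI drains without waiting) can be sketched with the standard library alone. This is only an illustration under stated assumptions: std::thread::spawn stands in for spawn_blocking, and generate_tokens, spawn_generation, and drain_ready are hypothetical names, not functions from oxillama-cli.

```rust
use std::sync::mpsc::{self, Receiver};
use std::thread;

/// Hypothetical stand-in for a blocking decoder: yields one "token" per word.
fn generate_tokens(prompt: &str) -> Vec<String> {
    prompt.split_whitespace().map(|w| format!("{} ", w)).collect()
}

/// Runs generation on a worker thread and hands back the receiving end.
/// (std::thread::spawn is used here in place of an async runtime's
/// spawn_blocking; that substitution is an assumption of this sketch.)
fn spawn_generation(prompt: String) -> Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for token in generate_tokens(&prompt) {
            // Stop early if the UI side has dropped the receiver.
            if tx.send(token).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the channel.
    });
    rx
}

/// One pass of a UI loop: append every token that is already available,
/// never blocking. Returns true once the producer has finished and the
/// channel has been fully drained.
fn drain_ready(rx: &Receiver<String>, transcript: &mut String) -> bool {
    loop {
        match rx.try_recv() {
            Ok(token) => transcript.push_str(&token),
            Err(mpsc::TryRecvError::Empty) => return false,
            Err(mpsc::TryRecvError::Disconnected) => return true,
        }
    }
}
```

An event loop would call drain_ready once per frame, redraw, and then go back to polling input, so a slow token never delays a keypress.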
oxillama-runtime — session persistence. Conversation save/resume lives in session.rs. Session::save() and Session::load() serialize the full conversation history via oxicode, the COOLJAPAN Pure Rust codec. A SHA-256 KV sidecar validates integrity on load, and a schema-version guard rejects incompatible session files outright rather than misinterpreting them. The result is conversations that survive process restarts without silent corruption.
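To make the two load-time checks concrete, here is a minimal, std-only sketch of a versioned session format with a separately stored digest. Everything in it is an assumption for illustration: the byte layout, the version number, and the save/load functions are hypothetical, oxicode is not used, and FNV-1a is swapped in for SHA-256 purely to avoid external crates.

```rust
const SCHEMA_VERSION: u32 = 1; // hypothetical version number

#[derive(Debug, PartialEq)]
enum LoadError {
    Truncated,
    UnsupportedVersion(u32),
    DigestMismatch,
    InvalidUtf8,
}

/// FNV-1a 64-bit. A std-only stand-in for SHA-256, NOT cryptographically secure.
fn digest(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        hash ^= b as u64;
        hash = hash.wrapping_mul(0x100000001b3);
    }
    hash
}

/// Layout (assumed): version as u32 LE, then per message a u32 LE length
/// followed by UTF-8 bytes. Returns the session bytes plus the digest that
/// would be written to the sidecar.
fn save(messages: &[&str]) -> (Vec<u8>, u64) {
    let mut out = SCHEMA_VERSION.to_le_bytes().to_vec();
    for m in messages {
        out.extend_from_slice(&(m.len() as u32).to_le_bytes());
        out.extend_from_slice(m.as_bytes());
    }
    let d = digest(&out);
    (out, d)
}

/// Reads a little-endian u32 at `pos` and advances the cursor.
fn read_u32(bytes: &[u8], pos: &mut usize) -> Result<u32, LoadError> {
    let end = (*pos).checked_add(4).ok_or(LoadError::Truncated)?;
    let chunk = bytes.get(*pos..end).ok_or(LoadError::Truncated)?;
    *pos = end;
    Ok(u32::from_le_bytes([chunk[0], chunk[1], chunk[2], chunk[3]]))
}

fn load(bytes: &[u8], sidecar: u64) -> Result<Vec<String>, LoadError> {
    // Integrity first: refuse to parse bytes that do not match the sidecar.
    if digest(bytes) != sidecar {
        return Err(LoadError::DigestMismatch);
    }
    let mut pos = 0;
    // Version guard: reject unknown schemas instead of guessing at the layout.
    let version = read_u32(bytes, &mut pos)?;
    if version != SCHEMA_VERSION {
        return Err(LoadError::UnsupportedVersion(version));
    }
    let mut messages = Vec::new();
    while pos < bytes.len() {
        let len = read_u32(bytes, &mut pos)? as usize;
        let end = pos.checked_add(len).ok_or(LoadError::Truncated)?;
        let raw = bytes.get(pos..end).ok_or(LoadError::Truncated)?;
        let text = String::from_utf8(raw.to_vec()).map_err(|_| LoadError::InvalidUtf8)?;
        messages.push(text);
        pos = end;
    }
    Ok(messages)
}
```

Checking the digest before parsing means a corrupted file is rejected as a whole, and checking the version before reading any message means an incompatible file produces an explicit error rather than a misread history.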
oxillama-arch — KV cache and architectures. The KvCacheAccess trait gains kv_dim(), for_each_key(), and for_each_value(), and PagedKvCache fully implements multi-page support for all three. BatchedKvView and KvSlot were promoted to oxillama-arch/traits.rs for cross-crate reuse, and ForwardPass::forward_batched now has a default trait implementation alongside a concrete optimized LLaMA specialisation. On top of that, the previously stubbed DBRX (mixture-of-experts), Grok-1, and Mamba-2 loaders now perform full weight loading from GGUF — Mamba-2 even gets an embed() override for its token-embedding logic — and all three pass end-to-end forward-pass tests.
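As a rough picture of what iterating a paged cache through such a trait could look like, the sketch below defines a trait with the three method names mentioned above and a toy multi-page implementation. The signatures are guesses made for illustration (the real ones in oxillama-arch may differ), and ToyPagedCache is a hypothetical type, not PagedKvCache.

```rust
/// Assumed shape of the trait; only the method names come from the release notes.
trait KvCacheAccess {
    /// Width of one key (or value) vector.
    fn kv_dim(&self) -> usize;
    /// Visits every cached key vector of `layer` in position order.
    fn for_each_key(&self, layer: usize, f: &mut dyn FnMut(usize, &[f32]));
    /// Visits every cached value vector of `layer` in position order.
    fn for_each_value(&self, layer: usize, f: &mut dyn FnMut(usize, &[f32]));
}

/// Toy paged cache: each page is a flat buffer of up to `page_len` vectors.
struct ToyPagedCache {
    kv_dim: usize,
    page_len: usize,
    keys: Vec<Vec<Vec<f32>>>,   // keys[layer][page]
    values: Vec<Vec<Vec<f32>>>, // values[layer][page]
}

impl ToyPagedCache {
    fn new(layers: usize, kv_dim: usize, page_len: usize) -> Self {
        assert!(kv_dim > 0 && page_len > 0);
        Self {
            kv_dim,
            page_len,
            keys: vec![Vec::new(); layers],
            values: vec![Vec::new(); layers],
        }
    }

    /// Appends one position, opening a new page when the last one is full.
    fn push(&mut self, layer: usize, key: &[f32], value: &[f32]) {
        assert_eq!(key.len(), self.kv_dim);
        assert_eq!(value.len(), self.kv_dim);
        let full = self.page_len * self.kv_dim;
        let need_page = self.keys[layer].last().map_or(true, |p| p.len() == full);
        if need_page {
            self.keys[layer].push(Vec::with_capacity(full));
            self.values[layer].push(Vec::with_capacity(full));
        }
        self.keys[layer].last_mut().unwrap().extend_from_slice(key);
        self.values[layer].last_mut().unwrap().extend_from_slice(value);
    }

    /// Walks pages in order so callers see one contiguous position index.
    fn walk(&self, pages: &[Vec<f32>], f: &mut dyn FnMut(usize, &[f32])) {
        let mut pos = 0;
        for page in pages {
            for row in page.chunks_exact(self.kv_dim) {
                f(pos, row);
                pos += 1;
            }
        }
    }
}

impl KvCacheAccess for ToyPagedCache {
    fn kv_dim(&self) -> usize {
        self.kv_dim
    }
    fn for_each_key(&self, layer: usize, f: &mut dyn FnMut(usize, &[f32])) {
        self.walk(&self.keys[layer], f)
    }
    fn for_each_value(&self, layer: usize, f: &mut dyn FnMut(usize, &[f32])) {
        self.walk(&self.values[layer], f)
    }
}
```

The point of the callback style is that callers never see page boundaries: a cache with one page and a cache with many present the same sequence of (position, vector) pairs.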
oxillama-quant — quantization coverage. Complete IQ3_S and IQ3_XXS codebook tables were added, so both formats now dequantize correctly across all kernel paths.
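For readers unfamiliar with codebook quantization, the general idea is that stored indices select short vectors of magnitudes from a fixed table, which are then signed and scaled. The sketch below is a toy instance of that idea only: the 4-entry table, the 2-bit index packing, and the block layout are invented for illustration and are not the IQ3_S or IQ3_XXS formats or OxiLLaMa's kernels.

```rust
/// Toy codebook: 4 rows of 4 magnitudes each (hypothetical values).
const CODEBOOK: [[i8; 4]; 4] = [
    [1, 1, 1, 1],
    [1, 3, 1, 3],
    [3, 1, 5, 1],
    [7, 5, 3, 1],
];

/// Dequantizes one 16-weight block.
/// - `packed_indices`: four 2-bit codebook indices, lowest bits first.
/// - `signs`: one bit per weight; a set bit negates that weight.
/// - `scale`: per-block scale applied to every magnitude.
fn dequantize_block(packed_indices: u8, signs: u16, scale: f32) -> [f32; 16] {
    let mut out = [0.0f32; 16];
    for group in 0..4 {
        // Each group of four weights shares one codebook row.
        let idx = ((packed_indices >> (2 * group)) & 0b11) as usize;
        for (lane, &mag) in CODEBOOK[idx].iter().enumerate() {
            let i = group * 4 + lane;
            let sign = if (signs >> i) & 1 == 1 { -1.0 } else { 1.0 };
            out[i] = sign * scale * mag as f32;
        }
    }
    out
}
```

An incomplete table in a scheme like this would make some indices decode to wrong magnitudes, which is why complete codebook data is a prerequisite for correct dequantization.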
Getting Started
Add the library to your project:
cargo add oxillama
Then grab a model and start chatting — no Python required:
# Pull a GGUF model directly from HuggingFace Hub
oxillama hub pull some-org/some-model-gguf
# Launch the full-screen TUI chat against the downloaded weights
oxillama chat --tui --model some-model.Q4_K_M.gguf
Prefer a one-shot generation or an HTTP endpoint? Both work from the same binary:
# Single prompt
oxillama run --model some-model.Q4_K_M.gguf --prompt "Explain RoPE in one paragraph." --max-tokens 256 --temp 0.7
# OpenAI-compatible server
oxillama serve --model some-model.Q4_K_M.gguf --host 0.0.0.0 --port 8080
The server speaks the OpenAI-compatible HTTP API — POST /v1/chat/completions, /v1/completions, and /v1/embeddings — with SSE streaming terminated by [DONE].
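On the client side, consuming such a stream mostly comes down to reading data: lines until the [DONE] sentinel arrives. The following std-only sketch shows that loop over an already received body; collect_sse_payloads is a hypothetical helper, it assumes one data: line per event (as OpenAI-style streams are commonly emitted), and it leaves JSON decoding of each payload to the caller.

```rust
/// Extracts `data:` payloads from an SSE body, stopping at the `[DONE]`
/// sentinel. Returns the payloads seen so far and whether the sentinel
/// was reached.
fn collect_sse_payloads(body: &str) -> (Vec<String>, bool) {
    let mut payloads = Vec::new();
    for line in body.lines() {
        // Blank lines separate events; lines without a `data:` field
        // (comments, `event:`, `id:`) are skipped in this sketch.
        if let Some(rest) = line.strip_prefix("data:") {
            // SSE allows a single optional space after the colon.
            let data = rest.strip_prefix(' ').unwrap_or(rest);
            if data == "[DONE]" {
                return (payloads, true);
            }
            payloads.push(data.to_string());
        }
    }
    (payloads, false)
}
```

A false second value signals that the stream ended without the sentinel, which a client can treat as a dropped connection rather than a completed response.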
On x86-64 with 8 cores and AVX2, OxiLLaMa targets at least 80% of llama.cpp throughput: LLaMA-3-8B Q4_K_M lands around 30 t/s (target >= 25 t/s), Mistral-7B Q4_K_M around 32 t/s (target >= 27 t/s), and OxiBonsai’s Bonsai-8B Q1_0_G128 1-bit quant around 25 t/s (target >= 22 t/s).
What’s New in 0.1.2
- Conversation save/resume: interactive /save and /load slash commands in the chat REPL; Session::save() / Session::load() serialize the full conversation history via oxicode, a SHA-256 KV sidecar validates integrity on load, and a schema-version guard rejects incompatible session files. Conversations now persist across process restarts.
- oxillama hub subcommand: hub pull, hub list, and hub rm download models directly from HuggingFace Hub (built on hf-hub 0.5), list cached models, and remove cached entries. Uses ureq with rustls for Pure Rust TLS and the directories crate for cache paths. No Python required.
- TUI chat mode: oxillama chat --tui launches a full-screen terminal UI (ratatui 0.30 + crossterm 0.29) with scrollable history, an input line, and live streaming via spawn_blocking + an mpsc channel so token delivery never blocks the event loop. 6 unit tests cover layout, input handling, and message rendering.
- DBRX, Grok-1, Mamba-2 real weight loading: the previously stubbed loaders now perform full weight loading from GGUF: Databricks DBRX mixture-of-experts, xAI Grok-1, and the Mamba-2 state-space model (with an embed() override for Mamba-2-specific token embedding). All three pass end-to-end forward-pass tests.
- KvCacheAccess trait extensions: new kv_dim(), for_each_key(), and for_each_value() methods; PagedKvCache fully implements multi-page support. BatchedKvView + KvSlot promoted to oxillama-arch/traits.rs; ForwardPass::forward_batched gains a default implementation plus a concrete optimized LLaMA specialisation.
- IQ3_S and IQ3_XXS codebook tables: complete codebook data added; both formats now dequantize correctly across all kernel paths.
Tips
- Manage your cache from the CLI. Pull what you need with oxillama hub pull <org/repo>, see what’s local with oxillama hub list, and reclaim disk space with oxillama hub rm <org/repo>; no manual cache spelunking required.
- Persist a long conversation. Inside the chat REPL, type /save my-thread to snapshot the full history (oxicode-serialized, SHA-256-validated), then later /load my-thread to pick up exactly where you left off, even after a restart.
- Use the TUI for long sessions. oxillama chat --tui gives you a scrollable, streaming UI that stays responsive while tokens arrive, which is much nicer than a flat REPL for extended back-and-forth.
- Iterate the KV cache per layer. The new for_each_key() and for_each_value() methods on KvCacheAccess let you walk keys and values layer by layer, with kv_dim() reporting the per-entry dimension. This is handy for custom cache inspection or batched views.
- Try the new architectures. DBRX, Grok-1, and Mamba-2 GGUFs now load real weights end to end: point oxillama run or chat at one and it just works.
- Install the CLI globally. cargo install oxillama-cli puts oxillama on your PATH so hub, chat, run, serve, and info are always a command away.
This is the foundation
OxiLLaMa 0.1.2 fits cleanly into the COOLJAPAN Pure Rust stack as it stands today. Tensor primitives and neural ops come from SciRS2 (scirs2-core/linalg/neural 0.4.2), dense linear algebra from OxiBLAS 0.2.1, and RoPE’s FFT from OxiFFT 0.2.0 — with MeCrab handling Japanese tokenization. Session persistence is serialized through oxicode 0.2.1, model downloads ride hf-hub 0.5 over ureq/rustls for fully Pure Rust networking, and OxiBonsai’s Q1_0_G128 1-bit quantization keeps even 8B-class models lean. Every layer is C/C++/Fortran-free, so the whole thing compiles to native binaries, WebAssembly, and embedded targets from a single codebase.
Repository: https://github.com/cool-japan/oxillama
Star the repo if you want LLM inference you can read, audit, and ship anywhere — without a C++ toolchain or a Python detour.
Pure Rust LLM inference is here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ April 25, 2026