Transformers without a Python runtime — load a HuggingFace model, tokenize, and run inference from a single Rust binary.
Today we released TrustformeRS 0.1.0 — the first stable release of a high-performance, memory-safe Pure Rust implementation of Hugging Face Transformers, with model loading, tokenizers, and inference that run from edge to cloud.
No PyTorch. No Python. No CUDA-C. No transformers pip install, no torch wheel, no glibc-pinned container that breaks the moment you change a base image. TrustformeRS replaces the HuggingFace Transformers + PyTorch + Python ML stack with one Rust crate that loads SafeTensors weights directly and compiles to a single static binary — or to WASM for the browser.
Why TrustformeRS 0.1.0 matters
The Python transformer stack is extraordinary, and extraordinarily heavy. A “hello world” inference job drags in a multi-gigabyte PyTorch runtime, a CUDA-C toolchain that must match your driver, and a glibc lineage your container has to honor exactly. The GIL serializes what should be parallel. A version skew between torch, transformers, and a native extension turns into a segfault with no stack trace. Shipping that to an edge device, a mobile app, or a locked-down server is an exercise in fragility.
TrustformeRS takes the other path. It is 100% Pure Rust in its default features — no C or Fortran — so deployment is a copied binary, not a reconstructed environment. The numbers from 0.1.0 are concrete: 17x CPU acceleration for matrix ops via direct cblas_sgemm, 2.88x overall improvement on macOS through Metal/MPS, Flash Attention with batched matmul, 5,007 tests passing (5,010+ across the workspace) at a 100% pass rate, zero clippy warnings under -D warnings, and an MSRV of 1.75. Memory safety is enforced by the compiler, not hoped for at runtime.
Technical Deep Dive
trustformers-core — the tensor and trait layer. At the bottom sits a tensor abstraction and the core traits every model builds on, layered on SciRS2 and OxiBLAS. SciRS2 supplies the SIMD-optimized primitives (ndarray, rand, and rayon are accessed through scirs2-core per the SciRS2 Integration Policy), while OxiBLAS provides the Pure Rust BLAS/LAPACK engine — this is the machinery behind the 17x line. GPU-resident tensor ops, NUMA-aware topology detection on Linux and macOS, and SIMD kernels all live here.
trustformers-models — 27+ architectures. The model crate ships builder-pattern config APIs and 21+ concrete implementations including BERT, GPT-2, T5, LLaMA, Mistral, Falcon, MPT, BLOOM, OPT, Phi, Gemma, Qwen, StableLM, RWKV, Mamba, Flamingo, and CLIP. Weights load directly from HuggingFace SafeTensors format — zero-copy, memory-mapped — including CLIP’s combined text and vision encoders. Conv2D forward is implemented the classic way (im2col + matmul) with full support for groups, dilation, stride, and padding.
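The im2col + matmul approach mentioned above can be sketched in a few dozen lines. This is a minimal illustration, not the TrustformeRS implementation: it assumes a single image (no batch dimension), a single group, and row-major f32 buffers. The names Conv2dParams, im2col, matmul, and conv2d are hypothetical helpers introduced only for this example.

```rust
/// Shape and hyperparameters for a single-group 2D convolution.
/// Hypothetical helper type, not part of any TrustformeRS API.
#[derive(Clone, Copy)]
pub struct Conv2dParams {
    pub in_c: usize,
    pub in_h: usize,
    pub in_w: usize,
    pub out_c: usize,
    pub k_h: usize,
    pub k_w: usize,
    pub stride: usize,
    pub pad: usize,
    pub dilation: usize,
}

impl Conv2dParams {
    /// Output height and width. Assumes the padded input is at least as
    /// large as the dilated kernel.
    pub fn out_hw(&self) -> (usize, usize) {
        let eff_kh = self.dilation * (self.k_h - 1) + 1;
        let eff_kw = self.dilation * (self.k_w - 1) + 1;
        let oh = (self.in_h + 2 * self.pad - eff_kh) / self.stride + 1;
        let ow = (self.in_w + 2 * self.pad - eff_kw) / self.stride + 1;
        (oh, ow)
    }
}

/// Unfold an input of shape [in_c, in_h, in_w] into a column matrix of
/// shape [in_c * k_h * k_w, out_h * out_w]. Padded positions stay at zero.
pub fn im2col(input: &[f32], p: &Conv2dParams) -> Vec<f32> {
    let (oh, ow) = p.out_hw();
    let cols = oh * ow;
    let mut out = vec![0.0f32; p.in_c * p.k_h * p.k_w * cols];
    for c in 0..p.in_c {
        for ky in 0..p.k_h {
            for kx in 0..p.k_w {
                let row = (c * p.k_h + ky) * p.k_w + kx;
                for oy in 0..oh {
                    for ox in 0..ow {
                        let iy = (oy * p.stride + ky * p.dilation) as isize - p.pad as isize;
                        let ix = (ox * p.stride + kx * p.dilation) as isize - p.pad as isize;
                        let inside = iy >= 0
                            && ix >= 0
                            && (iy as usize) < p.in_h
                            && (ix as usize) < p.in_w;
                        if inside {
                            let src = (c * p.in_h + iy as usize) * p.in_w + ix as usize;
                            out[row * cols + oy * ow + ox] = input[src];
                        }
                    }
                }
            }
        }
    }
    out
}

/// Naive row-major matmul: a is [m, k], b is [k, n], result is [m, n].
pub fn matmul(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for l in 0..k {
            let av = a[i * k + l];
            for j in 0..n {
                c[i * n + j] += av * b[l * n + j];
            }
        }
    }
    c
}

/// Convolution as one matrix product: weight is [out_c, in_c * k_h * k_w],
/// the unfolded input is [in_c * k_h * k_w, out_h * out_w].
pub fn conv2d(input: &[f32], weight: &[f32], p: &Conv2dParams) -> Vec<f32> {
    let (oh, ow) = p.out_hw();
    let cols = im2col(input, p);
    matmul(weight, &cols, p.out_c, p.in_c * p.k_h * p.k_w, oh * ow)
}
```

The point of the unfolding step is that the whole convolution collapses into a single matrix product, so whatever optimized GEMM the platform offers does the heavy lifting. Grouped convolution would repeat the same product once per group over the matching channel slices.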
trustformers-tokenizers — the text front door. BPE, WordPiece, and SentencePiece are all here, with configurable vocabularies and special tokens, so the tokenizer that produced a model’s training data is the tokenizer you run at inference.
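To make one of these schemes concrete, here is a minimal sketch of WordPiece-style greedy longest-match-first segmentation for a single word. It is an illustration under stated assumptions, not the trustformers-tokenizers API: the function name wordpiece is hypothetical, the "##" continuation prefix follows the BERT convention, and pre-tokenization, normalization, and maximum word length are left out.

```rust
use std::collections::HashSet;

/// Split one word into sub-word pieces by repeatedly taking the longest
/// vocabulary entry that matches at the current position. Pieces after
/// the first carry the "##" continuation prefix (BERT convention).
/// If any position has no match, the whole word maps to `unk`.
/// Hypothetical helper for illustration only.
pub fn wordpiece(word: &str, vocab: &HashSet<&str>, unk: &str) -> Vec<String> {
    let chars: Vec<char> = word.chars().collect();
    let mut pieces = Vec::new();
    let mut start = 0;

    while start < chars.len() {
        let mut end = chars.len();
        let mut found: Option<String> = None;

        // Shrink the candidate from the right until it is in the vocabulary.
        while end > start {
            let mut candidate: String = chars[start..end].iter().collect();
            if start > 0 {
                candidate = format!("##{candidate}");
            }
            if vocab.contains(candidate.as_str()) {
                found = Some(candidate);
                break;
            }
            end -= 1;
        }

        match found {
            Some(piece) => {
                pieces.push(piece);
                start = end;
            }
            None => return vec![unk.to_string()],
        }
    }
    pieces
}
```

With a toy vocabulary containing "un", "##aff", and "##able", the word "unaffable" splits into those three pieces, while a word with no matching prefix falls back to the unknown token.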
Quantization and the deployment surface. TrustformeRS supports GGML and GGUF formats, AWQ (Activation-aware Weight Quantization), and GPTQ, plus quantization-aware training infrastructure. Deployment fans out across trustformers-wasm (browser + WebGPU), trustformers-serve (gRPC + REST), and trustformers-mobile (iOS/Android), with all compression handled by OxiARC (deflate, zstd, lz4) — Pure Rust, no flate2, no zstd-sys.
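For readers new to weight quantization, the basic idea can be shown with block-wise absmax 8-bit quantization: each block of weights stores one floating-point scale plus small integers. This sketch is a generic illustration only. It is not the GGUF on-disk layout, and it is not the AWQ or GPTQ algorithm, both of which choose scales and rounding far more carefully. The block size of 32 and the names QBlock, quantize, and dequantize are assumptions made for the example.

```rust
/// Number of weights that share one scale. An assumption for this sketch.
pub const BLOCK: usize = 32;

/// One quantized block: a shared scale and up to BLOCK signed 8-bit values.
/// Hypothetical type for illustration only.
pub struct QBlock {
    pub scale: f32,
    pub q: [i8; BLOCK],
}

/// Quantize a slice block by block. Each block's scale maps its largest
/// absolute value to 127; a trailing partial block is zero-padded.
pub fn quantize(x: &[f32]) -> Vec<QBlock> {
    x.chunks(BLOCK)
        .map(|chunk| {
            let amax = chunk.iter().fold(0.0f32, |m, v| m.max(v.abs()));
            // Avoid dividing by zero for an all-zero block.
            let scale = if amax > 0.0 { amax / 127.0 } else { 1.0 };
            let mut q = [0i8; BLOCK];
            for (dst, v) in q.iter_mut().zip(chunk) {
                *dst = (v / scale).round().clamp(-127.0, 127.0) as i8;
            }
            QBlock { scale, q }
        })
        .collect()
}

/// Reconstruct approximate f32 values; `len` trims the zero padding
/// of the last block.
pub fn dequantize(blocks: &[QBlock], len: usize) -> Vec<f32> {
    blocks
        .iter()
        .flat_map(|b| b.q.iter().map(move |&qi| qi as f32 * b.scale))
        .take(len)
        .collect()
}
```

Each f32 weight shrinks from four bytes to one plus a small per-block overhead for the scale, and the round-trip error per value is bounded by half of that block's scale.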
Getting Started
Add the high-level integration crate — the workspace meta-crate that exposes AutoModel, AutoTokenizer, and pipeline:
    cargo add trustformers
Load a model and run a forward pass:
    use trustformers::prelude::*;
    use trustformers::{AutoModel, AutoTokenizer};

    fn main() -> Result<(), Box<dyn std::error::Error>> {
        // Load the tokenizer and the model weights by model name.
        let tokenizer = AutoTokenizer::from_pretrained("bert-base-uncased")?;
        let model = AutoModel::from_pretrained("bert-base-uncased")?;

        // Tokenize the input text, then run a single forward pass.
        let inputs = tokenizer.encode("Hello, Rust world!", None)?;
        let outputs = model.forward(&inputs)?;

        println!("Hidden states shape: {:?}", outputs.last_hidden_state.shape());
        Ok(())
    }
Or skip the boilerplate entirely with the task pipeline API:
    use trustformers::pipeline;
    fn main() -> Result<(), Box<dyn std::error::Error>> {
        let classifier = pipeline("sentiment-analysis")?;
        let result = classifier("I love writing Rust code!")?; // -> label POSITIVE with a score
        Ok(())
    }
Other pipelines cover text-generation, token-classification (NER), question-answering, fill-mask, summarization, and translation.
What’s inside
Architectures — 27+ transformer architectures (21+ concrete implementations: BERT, GPT-2, T5, LLaMA, Mistral, Falcon, MPT, BLOOM, OPT, Phi, Gemma, Qwen, StableLM, RWKV, Mamba, Flamingo, CLIP), builder-pattern configs, CLIP text+vision weight loading, Conv2D via im2col.
Performance & hardware — 17x CPU BLAS acceleration via cblas_sgemm; Metal/MPS on macOS (2.88x overall); GPU-resident tensor ops; Flash Attention with batched matmul; CUDA and ROCm backends with automatic CPU fallback; WebGPU compute-shader backend for the browser; SIMD-optimized tensor ops; NUMA-aware topology detection on Linux/macOS.
Quantization — GGML and GGUF format support, AWQ, GPTQ, and quantization-aware training infrastructure.
Tokenizers — BPE, WordPiece, and SentencePiece with configurable vocab and special tokens.
Training — distributed training with model and data parallelism, DPO and KTO loss functions, 20+ optimization algorithms, hyperparameter/auto-tuning, gradient checkpointing, and mixed-precision training.
Deployment — WASM (browser + WebGPU), Python bindings (trustformers-py), C FFI (trustformers-c), Mobile (trustformers-mobile), and a server (gRPC + REST via trustformers-serve).
Safety — content-safety filters (toxicity scoring, harm-pattern detection), model versioning + A/B testing, inference caching (LRU + custom eviction), memory profiling + leak detection, and structured error codes.
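As one concrete example from the Training entry above, the DPO (Direct Preference Optimization) loss for a single preference pair can be written as a small function. This is a sketch of the standard formula, not the trustformers-training implementation. The function names and argument order are hypothetical, and the inputs are assumed to be summed sequence log-probabilities from the policy and from a frozen reference model.

```rust
/// Numerically stable log(sigmoid(x)):
/// log sigmoid(x) = min(x, 0) - ln(1 + exp(-|x|)).
pub fn log_sigmoid(x: f32) -> f32 {
    x.min(0.0) - (-x.abs()).exp().ln_1p()
}

/// DPO loss for one (chosen, rejected) pair:
/// -log sigmoid(beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))),
/// where every argument except `beta` is a sequence log-probability.
/// Hypothetical helper for illustration only.
pub fn dpo_loss(
    policy_chosen_logp: f32,
    policy_rejected_logp: f32,
    ref_chosen_logp: f32,
    ref_rejected_logp: f32,
    beta: f32,
) -> f32 {
    // Implicit rewards: how much more likely the policy makes each
    // response compared with the reference model.
    let chosen_reward = policy_chosen_logp - ref_chosen_logp;
    let rejected_reward = policy_rejected_logp - ref_rejected_logp;
    -log_sigmoid(beta * (chosen_reward - rejected_reward))
}
```

When the policy matches the reference, the margin is zero and the loss equals ln 2. The loss falls as the policy shifts probability toward the chosen response relative to the reference, and rises when it shifts toward the rejected one.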
Tips
- Reach for pipeline(...) first. For a working task in three lines (sentiment analysis, NER, summarization), the pipeline API is the fastest path. Drop down to AutoModel / AutoTokenizer only when you need the hidden states or custom heads.
- Let SafeTensors do the loading. Weights are memory-mapped (zero-copy), so from_pretrained is cheap on startup and gentle on RAM. Prefer SafeTensors checkpoints when you have a choice.
- Turn on the right backend for your box. On macOS, enable Metal for the 2.88x MPS path; on Linux, enable CUDA or ROCm. Both fall back to CPU automatically, so the same binary runs everywhere.
- Quantize for the edge. For mobile or embedded targets, pick a quantization scheme (GGUF, AWQ, or GPTQ) to shrink the model and the memory footprint before you ship.
- Ship inference to the browser. Build trustformers-wasm with WebGPU to run models client-side, with no server round-trip and no inference backend to host.
- Trade compute for memory when training. In trustformers-training, combine mixed precision with gradient checkpointing to fit larger models on smaller GPUs.
This is the foundation
TrustformeRS does not stand alone. It is built on SciRS2 (the SIMD tensor primitives) and OxiBLAS (the Pure Rust BLAS/LAPACK behind the 17x line), with Oxicode replacing bincode for serialization and OxiARC handling every byte of compression. The workspace spans over 900,000 SLoC across 10 crates: trustformers-core (Stable), trustformers-models, trustformers-tokenizers (Stable), trustformers-optim (Stable), trustformers-training (Stable), trustformers-serve (Stable), trustformers-wasm (Stable), trustformers-mobile, trustformers-debug, and the high-level trustformers integration crate.
In the COOLJAPAN ML stack it sits alongside ToRSh (the PyTorch-equivalent deep-learning framework) and SkleaRS / TenfloweRS — Pure Rust, sovereign infrastructure for inference, training, and classical ML alike.
Repository: https://github.com/cool-japan/trustformers
Star the repo if you want transformers you can deploy as a single binary — and tell us which architecture you load first.
Pure Rust transformers are here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ March 21, 2026