Every previous CUDA build trusted cudarc’s FFI layer to talk to the driver correctly. As of 0.1.4, TrustformeRS proves its own GPU math instead — on real silicon.
On July 2 we released TrustformeRS 0.1.4 — a release that migrates the CUDA backend from cudarc to the Pure-Rust oxicuda stack, moves Metal compute onto oxicuda-metal, and verifies both against real hardware with golden-parity tests instead of CPU-only approximations.
No C. No Fortran. No cudarc FFI surface sitting between your model and the GPU driver, and no scirs2-core MPS dependency behind Metal anymore either. TrustformeRS’s GPU path now runs on oxicuda-blas, oxicuda-dnn, oxicuda-memory, and oxicuda-driver for CUDA, and oxicuda-metal/oxicuda-backend for Apple Silicon — both Pure Rust, both checked against real devices. TrustformeRS compiles to a single static binary — or to WASM, or onto mobile — and runs anywhere Rust runs.
Why TrustformeRS 0.1.4 is a game changer
The incumbent path to GPU-accelerated transformers looks like this:
- A PyTorch/C++ stack that needs nvcc, a matching CUDA SDK, and a libtorch/cuDNN shared object chased across every machine you deploy to.
- cudarc-style bindings that wrap the CUDA driver API behind an FFI boundary: every call is unsafe, and you’re trusting someone else’s translation of the driver contract.
- Metal compute borrowed through scirs2-core’s MPS integration: an extra dependency hop, with the GPU code living outside the crate that actually uses it.
- Newer architectures (RWKV, Mamba) whose Python bindings had drifted behind an old PyO3 version, so AutoModel would silently hand you a BERT model instead of the one you asked for.
TrustformeRS 0.1.4 ends all of that:
- The CUDA backend is now 100% oxicuda (oxicuda-blas, oxicuda-dnn, oxicuda-memory, oxicuda-driver), and cudarc is gone from the workspace entirely (it survives only inside the out-of-workspace legacy trustformers-c FFI crate).
- 12/12 CPU↔CUDA golden-parity tests pass on real NVIDIA hardware: GEMM, GELU, LayerNorm, causal softmax, RoPE, and cached-weight GEMM, covering both host and GPU-resident paths, runtime-verified on an RTX A4000 (CUDA 12.0).
- matmul_gpu_to_gpu is confirmed genuinely zero-copy: a cached DeviceBuffer, no host round-trip.
- A real GPU-resident CUDA transformer layer: LayerNorm → QKV → bias → RoPE → causal-softmax attention → proj → residual, chained entirely through cached device buffers, replacing the previous CPU-fallback placeholder.
- Metal moved to oxicuda-metal, dropping the scirs2-core MPS dependency outright; GPU-resident matmul is zero-copy and verified on Apple Silicon, and GPT-2’s feed-forward now runs as one fused matmul+bias+GELU kernel instead of three separate dispatches.
- Production code is now unwrap()/expect()-free, workspace-wide: replaced with real error propagation, lock-poison recovery, and documented infallible invariants, with no public API changes.
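To make "golden parity" concrete, here is a minimal sketch of what such a check looks like. It is not the TrustformeRS test suite: gemm_reference, gemm_candidate, within_tolerance, and max_abs_diff are hypothetical names, and the candidate is a CPU stand-in that merely accumulates in a different order, where a real test would read the result back from the device.

```rust
/// CPU reference: naive row-major GEMM, C (m x n) = A (m x k) * B (k x n).
fn gemm_reference(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Hypothetical stand-in for the device path: same math, but the inner
/// products are accumulated in reverse order, so rounding may differ.
fn gemm_candidate(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in (0..k).rev() {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Largest element-wise absolute difference, useful for failure messages.
fn max_abs_diff(reference: &[f32], candidate: &[f32]) -> f32 {
    reference
        .iter()
        .zip(candidate)
        .map(|(r, c)| (r - c).abs())
        .fold(0.0f32, f32::max)
}

/// Parity check: every element must satisfy |r - c| <= atol + rtol * |r|.
fn within_tolerance(reference: &[f32], candidate: &[f32], atol: f32, rtol: f32) -> bool {
    reference.len() == candidate.len()
        && reference
            .iter()
            .zip(candidate)
            .all(|(r, c)| (r - c).abs() <= atol + rtol * r.abs())
}
```

The tolerance values are a per-kernel judgment call; a combined absolute and relative bound is a common choice because GPU reductions rarely match CPU results bit for bit.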
Technical Deep Dive
1. trustformers-core — the GPU backend layer. The cuda feature now pulls oxicuda-blas/-dnn/-memory/-driver; cuda-oxicuda remains only as a deprecated alias. The metal feature pulls oxicuda-metal/oxicuda-backend alongside objc2. Both are optional, both are Pure Rust, and the CPU reference kernel they’re checked against — kernels/rope.rs, GPT-NeoX half-split convention — is now the only RoPE implementation in the tree: 0.1.4 also deleted an orphaned, never-mounted ~1,693-line rope/mod.rs that used an inconsistent convention.
2. trustformers-models — 49+ architectures, real GPU wiring on two of them. GPU-resident forward passes are wired end-to-end for GPT-2 and RetNet today; the rest still run CPU f32 while broader coverage lands. Enabling cuda or metal on trustformers-core now propagates through trustformers-models and the trustformers umbrella crate.
3. trustformers-wasm — WebGPU gets a real device. The WebAssembly compute backend now performs real navigator.gpu → adapter → device initialization, falling back to CPU when no adapter is present, instead of stopping short of an actual device handle.
4. trustformers-serve and trustformers-py — the edges. gRPC serving is back (tonic 0.14’s split tonic-build/tonic-prost-build API), and trustformers-serve remains the one crate keeping non-Pure-Rust TLS (rustls/aws-lc-rs) as an accepted exception — its lambda and swagger-ui adapters are opt-in features, off by default. On the Python side, bindings were re-modernized onto PyO3 0.28, and PyRwkvModel/PyMambaModel are real classes now, not stand-ins.
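For readers unfamiliar with the GPT-NeoX half-split convention mentioned for kernels/rope.rs, the sketch below shows the idea on a single head vector: element i is rotated together with element i + d/2, rather than with its immediate neighbour as in the interleaved convention. This is an illustrative CPU version, not the code in the tree; rope_half_split and its signature are assumptions.

```rust
/// Apply rotary position embedding in place to one head vector,
/// using the half-split pairing (i, i + d/2).
/// `base` is the frequency base (10000.0 is the usual choice).
fn rope_half_split(x: &mut [f32], position: usize, base: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head dimension must be even");
    let half = d / 2;
    for i in 0..half {
        // Pair i rotates at frequency base^(-2i/d).
        let inv_freq = base.powf(-2.0 * i as f32 / d as f32);
        let angle = position as f32 * inv_freq;
        let (sin, cos) = angle.sin_cos();
        let (a, b) = (x[i], x[i + half]);
        // Standard 2D rotation of the pair (a, b) by `angle`.
        x[i] = a * cos - b * sin;
        x[i + half] = a * sin + b * cos;
    }
}
```

Two properties make this easy to parity-test: position 0 is the identity, and the rotation preserves the vector norm. Mixing the half-split and interleaved conventions between a kernel and its reference produces outputs that look plausible but do not match, which is why a single convention in the tree matters.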
Getting Started
cargo add trustformers
use trustformers::prelude::*;
use trustformers::{AutoModel, AutoTokenizer, Tensor};
fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load model and tokenizer
    let tokenizer = AutoTokenizer::from_pretrained("bert-base-uncased")?;
    let model = AutoModel::from_pretrained("bert-base-uncased")?;

    // Tokenize input
    let tokenized = tokenizer.encode("Hello, Rust world!")?;

    // AutoModel's `Model` impl is Tensor-in/Tensor-out, so wrap the token IDs
    // as a Tensor before running inference.
    let ids: Vec<f32> = tokenized.input_ids.iter().map(|&id| id as f32).collect();
    let len = ids.len();
    let inputs = Tensor::from_vec(ids, &[len])?;

    let outputs = model.forward(inputs)?;
    println!("Output shape: {:?}", outputs.shape());
    Ok(())
}
To run that same forward pass on a GPU today, reach for a model with real device wiring (GPT-2 or RetNet) and move its weights over explicitly:
// Cargo.toml: trustformers-core = { version = "0.1", features = ["metal"] } // or "cuda"
use trustformers_core::Device;
use trustformers_models::gpt2::{Gpt2Config, Gpt2Model};
let device = Device::Metal(0); // or Device::CUDA(0)
let mut model = Gpt2Model::new_with_device(Gpt2Config::default(), device)?;
model.weights_to_gpu(&device)?; // Metal (use `weights_to_gpu_cuda` on CUDA)
let outputs = model.forward(inputs)?; // attention + linear run on-device, oxicuda-backed
What’s New in 0.1.4
Added
- Real PyRwkvModel/PyMambaModel Python classes: AutoModel now loads RWKV/Mamba checkpoints correctly instead of silently falling back to BERT (PyO3 bindings modernized to 0.28).
- WebGPU device/queue initialization in the WebAssembly compute backend, with CPU fallback when no adapter is available.
- 12 CPU↔CUDA golden-parity tests (GEMM, GELU, LayerNorm, causal softmax, RoPE, cached-weight GEMM), runtime-verified 12/12 passing on a real RTX A4000 (CUDA 12.0).
Changed
- CUDA backend migrated from cudarc to Pure-Rust oxicuda end to end; cuda-oxicuda kept only as a deprecated alias for cuda.
- The CUDA transformer layer now runs as a real GPU-resident pre-norm causal self-attention layer instead of a CPU-fallback placeholder.
- Metal GPU compute migrated from scirs2-core MPS to oxicuda-metal; GPU-resident matmul is zero-copy, verified on Apple Silicon.
- All production unwrap()/expect() calls eliminated workspace-wide; no public API changes.
- Default feature trees are now Pure-Rust for every crate except trustformers-serve; HuggingFace Hub networking, remote leaderboard storage, debug visualization, and serve’s Lambda/Swagger-UI adapters moved behind hub, remote-leaderboard, visual, lambda, and swagger-ui features respectively. The tokenizer regex backend switched to Pure-Rust fancy-regex.
- GPT-2’s feed-forward on Apple Silicon now uses a single fused matmul+bias+GELU Metal kernel.
- Workspace dependency bumps across the board (sha2 0.11, nalgebra 0.35, tokenizers 0.23, candle 0.11, tonic 0.14, wgpu 30, and more).
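As an illustration of what replacing unwrap() with error propagation and lock-poison recovery can look like, here is a minimal sketch built only on the standard library. CacheError, lock_recovering, and lookup are hypothetical names for this example, not TrustformeRS APIs.

```rust
use std::collections::HashMap;
use std::fmt;
use std::sync::{Mutex, MutexGuard};

/// Hypothetical error type standing in for a crate-level error enum.
#[derive(Debug)]
enum CacheError {
    Missing(String),
}

impl fmt::Display for CacheError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CacheError::Missing(key) => write!(f, "no cache entry for `{key}`"),
        }
    }
}

impl std::error::Error for CacheError {}

/// Lock a mutex without `unwrap()`: if another thread panicked while
/// holding the lock, take the guard out of the `PoisonError` and continue.
/// This is only sound when the protected data stays valid after a panic.
fn lock_recovering<T>(mutex: &Mutex<T>) -> MutexGuard<'_, T> {
    mutex.lock().unwrap_or_else(|poisoned| poisoned.into_inner())
}

/// Look up a value, returning an error instead of panicking on a miss.
fn lookup(cache: &Mutex<HashMap<String, usize>>, key: &str) -> Result<usize, CacheError> {
    let guard = lock_recovering(cache);
    guard
        .get(key)
        .copied()
        .ok_or_else(|| CacheError::Missing(key.to_string()))
}
```

Whether recovering from poison is appropriate depends on the invariant the lock protects; for a read-mostly cache like this one it is usually safe, while a half-updated structure may need to be rebuilt instead.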
Removed
- The cudarc dependency and its entire legacy CUDA backend in trustformers-core, superseded by oxicuda.
- An orphaned, never-mounted rope/mod.rs module (~1,693 lines) using a RoPE convention inconsistent with the live, parity-tested kernel.
Fixed
- gRPC proto compilation and serving restored, migrated to tonic 0.14’s split tonic-build/tonic-prost-build API.
Tips
- Switch cuda-oxicuda to plain cuda. The old feature name is now a deprecated alias: trustformers-core = { features = ["cuda"] } gets you the same oxicuda-blas/-dnn/-memory/-driver backend with one less thing to explain to your team.
- On Apple Silicon, turn on metal. It now runs through oxicuda-metal with zero-copy GPU-resident matmul and a fused GPT-2 feed-forward kernel, with no scirs2-core MPS dependency to pull in anymore.
- Keep default builds Pure Rust on purpose. hub (HuggingFace downloads), remote-leaderboard, visual (plotters/ratatui debug UI), and serve’s lambda/swagger-ui are all opt-in now; enable only the ones you actually need instead of inheriting their non-Pure-Rust dependency chains by default.
- If you’re on the Python bindings, re-check RWKV/Mamba loading. AutoModel.from_pretrained(...) on a RWKV or Mamba checkpoint now returns a real PyRwkvModel/PyMambaModel; if your code had a workaround for the old silent-BERT-fallback bug, it’s safe to remove.
- Have an NVIDIA GPU? The 12 new CPU↔CUDA parity tests are the fastest way to confirm your driver/CUDA combination behaves like the RTX A4000/CUDA 12.0 configuration this release was verified against.
- Building with wgpu_backend? wgpu moved to 30.0 and now sets apply_limit_buckets: false on adapter requests (intentional, since this is a trusted native compute backend rather than untrusted web content), and the four BufferSlice::get_mapped_range() call sites were updated for its new Result return type.
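If "fused matmul+bias+GELU" sounds abstract, the CPU sketch below shows the difference in shape between three separate passes and one fused pass. It says nothing about how the Metal kernel is actually written; ffn_unfused, ffn_fused, and gelu_tanh are hypothetical names, and the tanh approximation of GELU is assumed here because it is the variant GPT-2 conventionally uses.

```rust
/// Tanh approximation of GELU (assumed variant for this sketch).
fn gelu_tanh(x: f32) -> f32 {
    let sqrt_2_over_pi = (2.0 / std::f32::consts::PI).sqrt();
    0.5 * x * (1.0 + (sqrt_2_over_pi * (x + 0.044715 * x * x * x)).tanh())
}

/// Three passes over the output: matmul, then bias add, then activation.
/// On a GPU each pass would be its own dispatch with a full read and write.
fn ffn_unfused(x: &[f32], w: &[f32], bias: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut y = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += x[i * k + p] * w[p * n + j];
            }
            y[i * n + j] = acc;
        }
    }
    for i in 0..m {
        for j in 0..n {
            y[i * n + j] += bias[j];
        }
    }
    for v in y.iter_mut() {
        *v = gelu_tanh(*v);
    }
    y
}

/// One pass: bias and activation are applied while the accumulator is
/// still in a local variable, so each output element is written once.
fn ffn_fused(x: &[f32], w: &[f32], bias: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut y = vec![0.0f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += x[i * k + p] * w[p * n + j];
            }
            y[i * n + j] = gelu_tanh(acc + bias[j]);
        }
    }
    y
}
```

Both functions compute the same values; the point of fusing is to avoid writing the intermediate matrix to memory and reading it back twice, which is where separate dispatches spend much of their time.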
This is the foundation
TrustformeRS 0.1.4 landed a day after OxiCUDA 0.4.0 and SciRS2 0.6.0 — and picks up both immediately: oxicuda-blas/-dnn/-memory/-driver/-metal/-backend for GPU compute, scirs2-core/scirs2-linalg 0.6.0 for numerics and linear algebra. Underneath that, OxiBLAS provides Pure-Rust BLAS/LAPACK, OxiCode handles serialization, OxiARC (oxiarc-archive/-deflate/-lz4/-zstd) backs compression, and oxisql-sqlite-compat provides the Pure-Rust SQLite-compatible export backend. It sits beside ToRSh, SkleaRS, and the rest of the COOLJAPAN model-training and serving stack.
Repository: https://github.com/cool-japan/trustformers
Star the repo if you want GPU-accelerated transformers whose CUDA and Metal paths you can actually read, all the way down to the kernel.
The era of trusting an opaque CUDA FFI wrapper is over. Pure Rust GPU-accelerated transformers are here — fast, safe, and sovereign.
— KitaSan at COOLJAPAN OÜ July 2, 2026