TrustformeRS 0.1.4 Released — Pure-Rust CUDA Replaces cudarc, Verified on Real NVIDIA Hardware

Every previous CUDA build trusted cudarc’s FFI layer to talk to the driver correctly. As of 0.1.4, TrustformeRS proves its own GPU math instead — on real silicon.

On July 2 we released TrustformeRS 0.1.4 — a release that migrates the CUDA backend from cudarc to the Pure-Rust oxicuda stack, moves Metal compute onto oxicuda-metal, and verifies both against real hardware with golden-parity tests instead of CPU-only approximations.

No C. No Fortran. No cudarc FFI surface sitting between your model and the GPU driver, and no scirs2-core MPS dependency behind Metal anymore either. TrustformeRS’s GPU path now runs on oxicuda-blas, oxicuda-dnn, oxicuda-memory, and oxicuda-driver for CUDA, and oxicuda-metal/oxicuda-backend for Apple Silicon — both Pure Rust, both checked against real devices. TrustformeRS compiles to a single static binary — or to WASM, or onto mobile — and runs anywhere Rust runs.

Why TrustformeRS 0.1.4 is a game changer

The incumbent path to GPU-accelerated transformers looks like this:

A PyTorch/C++ stack that needs nvcc, a matching CUDA SDK, and a libtorch/cuDNN shared object chased across every machine you deploy to.
cudarc-style bindings that wrap the CUDA driver API behind an FFI boundary — every call is unsafe, and you’re trusting someone else’s translation of the driver contract.
Metal compute borrowed through scirs2-core’s MPS integration — an extra dependency hop, with the GPU code living outside the crate that actually uses it.
Newer architectures (RWKV, Mamba) whose Python bindings had drifted behind an old PyO3 version, so AutoModel would silently hand you a BERT model instead of the one you asked for.

TrustformeRS 0.1.4 ends all of that:

The CUDA backend is now 100% oxicuda — oxicuda-blas, oxicuda-dnn, oxicuda-memory, oxicuda-driver — and cudarc is gone from the workspace entirely (it survives only inside the out-of-workspace legacy trustformers-c FFI crate).
12/12 CPU↔CUDA golden-parity tests pass on real NVIDIA hardware — GEMM, GELU, LayerNorm, causal softmax, RoPE, and cached-weight GEMM, covering both host and GPU-resident paths, runtime-verified on an RTX A4000 (CUDA 12.0).
matmul_gpu_to_gpu is confirmed genuinely zero-copy — a cached DeviceBuffer, no host round-trip.
A real GPU-resident CUDA transformer layer: LayerNorm → QKV → bias → RoPE → causal-softmax attention → proj → residual, chained entirely through cached device buffers, replacing the previous CPU-fallback placeholder.
Metal moved to oxicuda-metal, dropping the scirs2-core MPS dependency outright; GPU-resident matmul is zero-copy and verified on Apple Silicon, and GPT-2’s feed-forward now runs as one fused matmul+bias+GELU kernel instead of three separate dispatches.
Production code is now unwrap()/expect()-free, workspace-wide — replaced with real error propagation, lock-poison recovery, and documented infallible invariants, with no public API changes.
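
As a rough illustration of what an unwrap()-free style can look like, the sketch below uses only the standard library to show the two patterns named above: recovering from a poisoned lock and propagating an error instead of panicking. The WeightCache type, the CacheError enum, and their methods are hypothetical names invented for this example; they are not TrustformeRS APIs.

```rust
use std::collections::HashMap;
use std::fmt;
use std::sync::{Mutex, MutexGuard, PoisonError};

/// Hypothetical error type for this sketch; not part of TrustformeRS.
#[derive(Debug, PartialEq)]
pub enum CacheError {
    Missing(String),
}

impl fmt::Display for CacheError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            CacheError::Missing(name) => write!(f, "no cached weight named `{name}`"),
        }
    }
}

impl std::error::Error for CacheError {}

/// Hypothetical shared cache guarded by a mutex.
#[derive(Default)]
pub struct WeightCache {
    inner: Mutex<HashMap<String, Vec<f32>>>,
}

impl WeightCache {
    /// Lock-poison recovery: if another thread panicked while holding the
    /// lock, take the guard out of the error instead of panicking again.
    fn guard(&self) -> MutexGuard<'_, HashMap<String, Vec<f32>>> {
        self.inner.lock().unwrap_or_else(PoisonError::into_inner)
    }

    pub fn insert(&self, name: &str, data: Vec<f32>) {
        self.guard().insert(name.to_string(), data);
    }

    /// Error propagation: a missing entry becomes an `Err`, not a panic.
    pub fn len_of(&self, name: &str) -> Result<usize, CacheError> {
        self.guard()
            .get(name)
            .map(Vec::len)
            .ok_or_else(|| CacheError::Missing(name.to_string()))
    }
}
```

A caller can then write cache.len_of("wte")? and let the error travel up to whoever can report it. Whether recovering a poisoned guard is safe depends on the invariants of the protected data, so treat that part as a design choice and not a default.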

Technical Deep Dive

1. trustformers-core — the GPU backend layer. The cuda feature now pulls oxicuda-blas/-dnn/-memory/-driver; cuda-oxicuda remains only as a deprecated alias. The metal feature pulls oxicuda-metal/oxicuda-backend alongside objc2. Both are optional, both are Pure Rust, and the CPU reference kernel they’re checked against — kernels/rope.rs, GPT-NeoX half-split convention — is now the only RoPE implementation in the tree: 0.1.4 also deleted an orphaned, never-mounted ~1,693-line rope/mod.rs that used an inconsistent convention.
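
To make the convention point concrete, here is a minimal CPU sketch of rotary position embedding in the GPT-NeoX half-split layout, next to the interleaved layout for contrast. This is not the code in kernels/rope.rs, and the function names are made up for illustration. The source does not say which convention the deleted module used; the interleaved variant is shown only as a common alternative that produces different outputs from the same input.

```rust
/// Rotate the pair (x[i], x[j]) by `angle` radians.
fn rotate_pair(x: &mut [f32], i: usize, j: usize, angle: f32) {
    let (sin, cos) = angle.sin_cos();
    let (a, b) = (x[i], x[j]);
    x[i] = a * cos - b * sin;
    x[j] = a * sin + b * cos;
}

/// Angle for pair index `i` in a head of dimension `d`:
/// position * base^(-2i/d).
fn rope_angle(position: usize, i: usize, d: usize, base: f32) -> f32 {
    position as f32 * base.powf(-2.0 * (i as f32) / (d as f32))
}

/// GPT-NeoX half-split layout: element i is paired with element i + d/2.
pub fn rope_half_split(x: &mut [f32], position: usize, base: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head dimension must be even");
    for i in 0..d / 2 {
        rotate_pair(x, i, i + d / 2, rope_angle(position, i, d, base));
    }
}

/// Interleaved layout, for contrast only: element 2i is paired with 2i + 1.
pub fn rope_interleaved(x: &mut [f32], position: usize, base: f32) {
    let d = x.len();
    assert!(d % 2 == 0, "head dimension must be even");
    for i in 0..d / 2 {
        rotate_pair(x, 2 * i, 2 * i + 1, rope_angle(position, i, d, base));
    }
}
```

Both variants are pure rotations, so they preserve the vector norm, but they pair different elements. A checkpoint trained with one layout gives wrong attention scores under the other, which is why keeping a single parity-tested implementation in the tree matters.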

2. trustformers-models — 49+ architectures, real GPU wiring on two of them. GPU-resident forward passes are wired end-to-end for GPT-2 and RetNet today; the rest still run CPU f32 while broader coverage lands. Enabling cuda or metal on trustformers-core now propagates through trustformers-models and the trustformers umbrella crate.

3. trustformers-wasm — WebGPU gets a real device. The WebAssembly compute backend now performs real navigator.gpu → adapter → device initialization, falling back to CPU when no adapter is present, instead of stopping short of an actual device handle.

4. trustformers-serve and trustformers-py — the edges. gRPC serving is back (tonic 0.14’s split tonic-build/tonic-prost-build API), and trustformers-serve remains the one crate keeping non-Pure-Rust TLS (rustls/aws-lc-rs) as an accepted exception — its lambda and swagger-ui adapters are opt-in features, off by default. On the Python side, bindings were re-modernized onto PyO3 0.28, and PyRwkvModel/PyMambaModel are real classes now, not stand-ins.

Getting Started

cargo add trustformers

use trustformers::prelude::*;
use trustformers::{AutoModel, AutoTokenizer, Tensor};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Load model and tokenizer
    let tokenizer = AutoTokenizer::from_pretrained("bert-base-uncased")?;
    let model = AutoModel::from_pretrained("bert-base-uncased")?;

    // Tokenize input
    let tokenized = tokenizer.encode("Hello, Rust world!")?;

    // AutoModel's `Model` impl is Tensor-in/Tensor-out, so wrap the token IDs
    // as a Tensor before running inference.
    let ids: Vec<f32> = tokenized.input_ids.iter().map(|&id| id as f32).collect();
    let len = ids.len();
    let inputs = Tensor::from_vec(ids, &[len])?;

    let outputs = model.forward(inputs)?;
    println!("Output shape: {:?}", outputs.shape());
    Ok(())
}

To run that same forward pass on a GPU today, reach for a model with real device wiring (GPT-2 or RetNet) and move its weights over explicitly:

// Cargo.toml: trustformers-core = { version = "0.1", features = ["metal"] }  // or "cuda"
use trustformers_core::Device;
use trustformers_models::gpt2::{Gpt2Config, Gpt2Model};

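// This snippet belongs inside a function that returns Result (like main above);
// `inputs` is a token-ID Tensor built as in the previous example.
let device = Device::Metal(0); // or Device::CUDA(0)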
let mut model = Gpt2Model::new_with_device(Gpt2Config::default(), device)?;
model.weights_to_gpu(&device)?;        // Metal (use `weights_to_gpu_cuda` on CUDA)
let outputs = model.forward(inputs)?;  // attention + linear run on-device, oxicuda-backed

What’s New in 0.1.4

Added

Real PyRwkvModel / PyMambaModel Python classes — AutoModel now loads RWKV/Mamba checkpoints correctly instead of silently falling back to BERT (PyO3 bindings modernized to 0.28).
WebGPU device/queue initialization in the WebAssembly compute backend, with CPU fallback when no adapter is available.
12 CPU↔CUDA golden-parity tests (GEMM, GELU, LayerNorm, causal softmax, RoPE, cached-weight GEMM), runtime-verified 12/12 passing on a real RTX A4000 (CUDA 12.0).
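
For readers unfamiliar with the idea, a golden-parity test runs the same operation on a trusted CPU reference and on the device, then checks that the two outputs agree within a tolerance. The sketch below shows that shape for GEMM. It is not the project's test suite: no GPU is touched, gemm_candidate is a stand-in for a device result, and every function name and the tolerance value are assumptions made for illustration.

```rust
/// CPU reference GEMM: C = A * B with A (m x k) and B (k x n), row-major.
pub fn gemm_reference(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0_f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0_f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
    c
}

/// Stand-in for a device result (assumption: a real test would read this
/// back from the GPU). It accumulates in f64 and rounds once, so it may
/// differ from the reference in the last bits, much as a GPU kernel might.
pub fn gemm_candidate(a: &[f32], b: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut c = vec![0.0_f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let acc: f64 = (0..k)
                .map(|p| f64::from(a[i * k + p]) * f64::from(b[p * n + j]))
                .sum();
            c[i * n + j] = acc as f32;
        }
    }
    c
}

/// Compare two outputs elementwise. The error is scaled by max(1, |expected|)
/// so that large and small values can share one tolerance.
pub fn check_parity(expected: &[f32], actual: &[f32], tol: f32) -> Result<(), String> {
    if expected.len() != actual.len() {
        return Err(format!("length mismatch: {} vs {}", expected.len(), actual.len()));
    }
    for (i, (e, a)) in expected.iter().zip(actual).enumerate() {
        let err = (e - a).abs() / e.abs().max(1.0);
        // `!(err <= tol)` also rejects NaN, which `err > tol` would let through.
        if !(err <= tol) {
            return Err(format!("index {i}: expected {e}, got {a} (error {err})"));
        }
    }
    Ok(())
}
```

Exact equality is usually the wrong bar here, because a GPU kernel may sum in a different order than the CPU loop. The tolerance encodes how much rounding drift is acceptable, and rejecting NaN explicitly keeps a silently broken kernel from passing.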

Changed

CUDA backend migrated from cudarc to Pure-Rust oxicuda end to end; cuda-oxicuda kept only as a deprecated alias for cuda.
The CUDA transformer layer now runs as a real GPU-resident pre-norm causal self-attention layer instead of a CPU-fallback placeholder.
Metal GPU compute migrated from scirs2-core MPS to oxicuda-metal; GPU-resident matmul is zero-copy, verified on Apple Silicon.
All production unwrap()/expect() calls eliminated workspace-wide; no public API changes.
Default feature trees are now Pure-Rust for every crate except trustformers-serve; HuggingFace Hub networking, remote leaderboard storage, debug visualization, and serve’s Lambda/Swagger-UI adapters moved behind hub, remote-leaderboard, visual, lambda, and swagger-ui features respectively. The tokenizer regex backend switched to Pure-Rust fancy-regex.
GPT-2’s feed-forward on Apple Silicon now uses a single fused matmul+bias+GELU Metal kernel.
Workspace dependency bumps across the board (sha2 0.11, nalgebra 0.35, tokenizers 0.23, candle 0.11, tonic 0.14, wgpu 30, and more).
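
The fused feed-forward kernel mentioned in this list is easier to picture with a CPU analogue. The sketch below computes matmul, bias add, and GELU first as three separate passes and then as one fused loop. It is not the Metal kernel and the function names are hypothetical; it assumes the tanh-approximation GELU that GPT-2 uses and only illustrates why fusing saves passes over the intermediate buffer.

```rust
/// tanh-approximation GELU, the form used by GPT-2.
pub fn gelu(x: f32) -> f32 {
    const SQRT_2_OVER_PI: f32 = 0.797_884_56;
    0.5 * x * (1.0 + (SQRT_2_OVER_PI * (x + 0.044_715 * x * x * x)).tanh())
}

/// Unfused: three passes over the (m x n) output, one per operation.
/// `x` is (m x k), `w` is (k x n), `bias` has n entries, all row-major.
pub fn ffn_unfused(x: &[f32], w: &[f32], bias: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    // Pass 1: matmul into an intermediate buffer.
    let mut out = vec![0.0_f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0_f32;
            for p in 0..k {
                acc += x[i * k + p] * w[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
    // Pass 2: re-read every element to add the bias.
    for i in 0..m {
        for j in 0..n {
            out[i * n + j] += bias[j];
        }
    }
    // Pass 3: re-read every element again to apply the activation.
    for v in out.iter_mut() {
        *v = gelu(*v);
    }
    out
}

/// Fused: bias and GELU are applied while the accumulator is still in a
/// register, so each output element is written exactly once.
pub fn ffn_fused(x: &[f32], w: &[f32], bias: &[f32], m: usize, k: usize, n: usize) -> Vec<f32> {
    let mut out = vec![0.0_f32; m * n];
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0_f32;
            for p in 0..k {
                acc += x[i * k + p] * w[p * n + j];
            }
            out[i * n + j] = gelu(acc + bias[j]);
        }
    }
    out
}
```

Both functions produce the same values. On a GPU the unfused form also costs three kernel dispatches and two extra round trips through device memory, which is the overhead a fused kernel removes.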

Removed

The cudarc dependency and its entire legacy CUDA backend in trustformers-core — superseded by oxicuda.
An orphaned, never-mounted rope/mod.rs module (~1,693 lines) using a RoPE convention inconsistent with the live, parity-tested kernel.

Fixed

gRPC proto compilation and serving restored, migrated to tonic 0.14’s split tonic-build/tonic-prost-build API.

Tips

Switch cuda-oxicuda to plain cuda. The old feature name is now a deprecated alias — trustformers-core = { features = ["cuda"] } gets you the same oxicuda-blas/-dnn/-memory/-driver backend with one less thing to explain to your team.
On Apple Silicon, turn on metal. It now runs through oxicuda-metal with zero-copy GPU-resident matmul and a fused GPT-2 feed-forward kernel — no scirs2-core MPS dependency to pull in anymore.
Keep default builds Pure Rust on purpose. hub (HuggingFace downloads), remote-leaderboard, visual (plotters/ratatui debug UI), and serve’s lambda/swagger-ui are all opt-in now — enable only the ones you actually need instead of inheriting their non-Pure-Rust dependency chains by default.
If you’re on the Python bindings, re-check RWKV/Mamba loading. AutoModel.from_pretrained(...) on a RWKV or Mamba checkpoint now returns a real PyRwkvModel/PyMambaModel — if your code had a workaround for the old silent-BERT-fallback bug, it’s safe to remove.
Have an NVIDIA GPU? The 12 new CPU↔CUDA parity tests are the fastest way to confirm your driver/CUDA combination behaves like the RTX A4000/CUDA 12.0 configuration this release was verified against.
Building with wgpu_backend? wgpu moved to 30.0. The backend now sets apply_limit_buckets: false on adapter requests (intentional, since this is a trusted native compute backend rather than untrusted web content), and the four BufferSlice::get_mapped_range() call sites were updated for that method's new Result return type.

This is the foundation

TrustformeRS 0.1.4 landed a day after OxiCUDA 0.4.0 and SciRS2 0.6.0 and picked up both immediately: oxicuda-blas/-dnn/-memory/-driver/-metal/-backend for GPU compute, and scirs2-core/scirs2-linalg 0.6.0 for numerics and linear algebra. Underneath that, OxiBLAS provides Pure-Rust BLAS/LAPACK, OxiCode handles serialization, OxiARC (oxiarc-archive/-deflate/-lz4/-zstd) backs compression, and oxisql-sqlite-compat provides the Pure-Rust SQLite-compatible export backend. It sits beside ToRSh, SkleaRS, and the rest of the COOLJAPAN model-training and serving stack.

Repository: https://github.com/cool-japan/trustformers

Star the repo if you want GPU-accelerated transformers whose CUDA and Metal paths you can actually read, all the way down to the kernel.

The era of trusting an opaque CUDA FFI wrapper is over. Pure Rust GPU-accelerated transformers are here — fast, safe, and sovereign.

KitaSan at COOLJAPAN OÜ, July 2, 2026