Deep learning without a C++ compiler, a CUDA toolkit, or a Python interpreter — just one Rust binary.
Today we released TenfloweRS 0.1.0 — a research-grade, pure-Rust deep-learning framework that gives you tensors, computation graphs, autodiff, training, and inference with the ergonomics of TensorFlow and none of its build pain.
No C. No C++. No CUDA-C. No Python. TensorFlow’s power has always come bundled with the libtensorflow + CUDA toolchain — a multi-gigabyte stack of compiled C++, vendor drivers, and a Python runtime you have to keep alive at inference time. TenfloweRS throws all of that out. The tensor engine, the graph runtime, the autodiff tape, the neural layers, and the GPU backend are all written in safe Rust. The result compiles to a single static binary (or WASM) you can ship anywhere.
Why TenfloweRS 0.1.0 matters
Anyone who has stood up a TensorFlow project knows the failure modes: a C++/CUDA build that fights your compiler, version-locked driver hell, a Python runtime that has to ride along into production, and the occasional native segfault that gives you a core dump instead of a stack trace. TenfloweRS is a clean break.
- Dual execution modes. Eager execution (PyTorch-style, run-as-you-go) and static computation graphs (TensorFlow 1.x-style). Prototype eagerly, then capture a graph for deployment — same framework, your choice.
- Pure-Rust safety. No unwrap() in the codebase, zero clippy warnings, full formatting compliance, and 0 reported security vulnerabilities. Memory safety is the default, not a code-review checklist.
- Portable GPU. Cross-platform GPU acceleration through WGPU — Metal, Vulkan, and DirectX — so the same code runs on a Mac, a Linux workstation, or a Windows box without a vendor-specific CUDA build.
- Built on a real ecosystem. TenfloweRS stands on NumRS2 and the SciRS2 numerical stack, so its linear algebra, autograd, and array primitives are shared, tested foundations rather than one-off reimplementations.
- Proven by tests. 12,949 tests pass across the workspace — 0 failures, 0 warnings — covering tensor ops, gradients, layers, and the research modules.
Technical Deep Dive: the six-crate workspace
TenfloweRS 0.1.0 is six workspace crates, layered so each one has a single clear job. The familiar TensorFlow concepts map onto Rust types you can reason about statically.
- tenflowers-core: the foundation. Core tensor operations, the GPU abstraction, the operation/kernel registry, shape inference, kernel fusion, autocast, sparse tensors, and fused ops. Here tf.Tensor becomes Tensor<T> (statically typed), tf.Operation becomes the Op trait with registered kernels, and tf.device becomes a Device enum for explicit placement control.
- tenflowers-autograd: reverse-mode automatic differentiation, gradient accumulation, checkpointing, in-place ops, forward-mode gradients, Jacobian checks, and interpretability utilities. TensorFlow’s tf.GradientTape becomes GradientTape.
- tenflowers-dataset: data loading and preprocessing, distributed streaming, and cache telemetry. tf.data.Dataset becomes an iterator-based Dataset trait.
- tenflowers-neural: comprehensive NN layers, training utilities, and 300+ research-grade algorithm modules. tf.keras.Layer becomes the Layer trait with a builder pattern.
- tenflowers-ffi: C FFI plus Python bindings via PyO3 (48 tests; publish = false, since it requires a Python environment) for teams that need to bridge into existing toolchains.
- tenflowers: the meta-crate. The unified public API, the prelude, and the user-facing macros and re-exports. This is the crate you add.
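To make the "registered kernels plus explicit placement" idea concrete, here is a minimal, standard-library-only sketch. It is not the TenfloweRS API: KernelRegistry, Kernel, add_cpu, and the two Device variants are hypothetical names chosen for illustration, and operations are reduced to plain name strings rather than an Op trait.

```rust
use std::collections::HashMap;

/// Hypothetical device enum: placement is an explicit value, not ambient state.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum Device {
    Cpu,
    Gpu(usize), // index of a GPU adapter
}

/// Hypothetical kernel signature: an element-wise binary function over f32 slices.
type Kernel = fn(&[f32], &[f32]) -> Vec<f32>;

/// CPU kernel for element-wise addition.
fn add_cpu(a: &[f32], b: &[f32]) -> Vec<f32> {
    a.iter().zip(b).map(|(x, y)| x + y).collect()
}

/// Registry keyed by (operation name, device).
#[derive(Default)]
struct KernelRegistry {
    kernels: HashMap<(&'static str, Device), Kernel>,
}

impl KernelRegistry {
    fn register(&mut self, op: &'static str, device: Device, kernel: Kernel) {
        self.kernels.insert((op, device), kernel);
    }

    /// Look up and run a kernel. A missing (op, device) pair is reported
    /// as an error instead of silently falling back to another device.
    fn dispatch(
        &self,
        op: &'static str,
        device: Device,
        a: &[f32],
        b: &[f32],
    ) -> Result<Vec<f32>, String> {
        let kernel = self
            .kernels
            .get(&(op, device))
            .ok_or_else(|| format!("no kernel for {op} on {device:?}"))?;
        Ok(kernel(a, b))
    }
}

fn main() {
    let mut registry = KernelRegistry::default();
    registry.register("add", Device::Cpu, add_cpu);

    // Registered pair: runs the CPU kernel.
    println!("{:?}", registry.dispatch("add", Device::Cpu, &[1.0, 2.0], &[3.0, 4.0]));
    // Unregistered pair: returns an Err describing the missing kernel.
    println!("{:?}", registry.dispatch("add", Device::Gpu(0), &[1.0], &[2.0]));
}
```

Keying the registry on the device makes placement a lookup that can fail loudly, which is the property an explicit Device value is meant to give you.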
Graphs and sessions get the same treatment: tf.Graph becomes a Graph struct with clear ownership semantics, and tf.Session becomes a Session trait — so a graph is a value you own, not a global handle you hope is still alive.
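The difference between eager execution and an owned graph run by a session can be sketched in a few lines of plain Rust. This is a toy illustration under stated assumptions, not the TenfloweRS types: Node, Graph, Session, CpuSession, and build_example are hypothetical, and values are scalars rather than tensors.

```rust
/// Node in a tiny expression graph; indices refer to earlier nodes.
#[derive(Debug, Clone, Copy)]
enum Node {
    Input(usize), // position in the feed slice
    Const(f32),
    Add(usize, usize),
    Mul(usize, usize),
}

/// Hypothetical graph type: a plain value that the caller owns.
#[derive(Debug, Default)]
struct Graph {
    nodes: Vec<Node>,
}

impl Graph {
    /// Append a node and return its index.
    fn push(&mut self, node: Node) -> usize {
        self.nodes.push(node);
        self.nodes.len() - 1
    }
}

/// Hypothetical session trait: anything that can execute a borrowed graph.
trait Session {
    fn run(&self, graph: &Graph, feeds: &[f32], fetch: usize) -> Option<f32>;
}

struct CpuSession;

impl Session for CpuSession {
    fn run(&self, graph: &Graph, feeds: &[f32], fetch: usize) -> Option<f32> {
        // Nodes only reference earlier indices, so one forward sweep suffices.
        let mut values: Vec<f32> = Vec::with_capacity(graph.nodes.len());
        for node in &graph.nodes {
            let v = match *node {
                Node::Input(i) => *feeds.get(i)?,
                Node::Const(c) => c,
                Node::Add(a, b) => values.get(a)? + values.get(b)?,
                Node::Mul(a, b) => values.get(a)? * values.get(b)?,
            };
            values.push(v);
        }
        values.get(fetch).copied()
    }
}

/// Capture (x + 3) * 4 as a graph; returns the graph and the output node index.
fn build_example() -> (Graph, usize) {
    let mut g = Graph::default();
    let x = g.push(Node::Input(0));
    let three = g.push(Node::Const(3.0));
    let sum = g.push(Node::Add(x, three));
    let four = g.push(Node::Const(4.0));
    let out = g.push(Node::Mul(sum, four));
    (g, out)
}

fn main() {
    // Eager: the arithmetic runs immediately, once.
    let eager = (2.0_f32 + 3.0) * 4.0;

    // Graph: describe the computation once, then run it with different feeds.
    let (graph, out) = build_example();
    let session = CpuSession;
    let first = session.run(&graph, &[2.0], out);
    let second = session.run(&graph, &[10.0], out);
    println!("eager: {eager}, graph: {first:?} then {second:?}");
}
```

Because the session only borrows the graph, the borrow checker guarantees the graph outlives every run, which is the "value you own" point made above.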
Getting Started
Add the meta-crate:
cargo add tenflowers
A minimal end-to-end example — eager tensor math, a small feedforward classifier, and an optimizer ready for the training loop:
use tenflowers::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Eager tensor math
    let a = Tensor::<f32>::ones(&[2, 3]);
    let b = Tensor::<f32>::ones(&[3, 4]);
    let c = ops::matmul(&a, &b)?;
    println!("matmul shape: {:?}", c.shape());

    // A small feedforward classifier
    let model = Sequential::<f32>::new(vec![])
        .add(Box::new(Dense::new(784, 128, true).with_activation("relu".to_string())))
        .add(Box::new(Dense::new(128, 10, true).with_activation("softmax".to_string())));
    let input = Tensor::<f32>::zeros(&[32, 784]);
    let logits = model.forward(&input)?;
    println!("logits shape: {:?}", logits.shape());

    // Optimizer for the training loop
    let _optimizer = SGD::<f32>::new(0.01);
    Ok(())
}
For custom training, reach for GradientTape — the direct analogue of tf.GradientTape. You record the forward pass on the tape, then ask it for gradients with respect to your parameters, and feed those into the optimizer. It is the same mental model TensorFlow users already have, expressed in Rust types.
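The record-then-differentiate model can be shown with a toy scalar tape. To be clear, this is not the TenfloweRS GradientTape API: Tape, Var, loss_and_grad, and fit are hypothetical names, and the sketch only assumes the general reverse-mode recipe of recording local derivatives on the forward pass and sweeping them in reverse.

```rust
/// Toy scalar tape (hypothetical; not the TenfloweRS GradientTape API).
#[derive(Default)]
struct Tape {
    values: Vec<f32>,
    /// For each recorded value: (parent index, local derivative) pairs.
    parents: Vec<Vec<(usize, f32)>>,
}

/// Handle to a value recorded on the tape.
#[derive(Clone, Copy)]
struct Var(usize);

impl Tape {
    fn record(&mut self, value: f32, parents: Vec<(usize, f32)>) -> Var {
        self.values.push(value);
        self.parents.push(parents);
        Var(self.values.len() - 1)
    }

    /// A leaf has no parents: a parameter or a data point.
    fn leaf(&mut self, value: f32) -> Var {
        self.record(value, Vec::new())
    }

    fn value(&self, v: Var) -> f32 {
        self.values[v.0]
    }

    fn sub(&mut self, a: Var, b: Var) -> Var {
        let v = self.value(a) - self.value(b);
        self.record(v, vec![(a.0, 1.0), (b.0, -1.0)])
    }

    fn mul(&mut self, a: Var, b: Var) -> Var {
        let (va, vb) = (self.value(a), self.value(b));
        self.record(va * vb, vec![(a.0, vb), (b.0, va)])
    }

    /// Reverse sweep: derivative of `output` with respect to every recorded value.
    fn gradient(&self, output: Var) -> Vec<f32> {
        let mut grads = vec![0.0; self.values.len()];
        grads[output.0] = 1.0;
        // Parents always have smaller indices, so reverse index order is a
        // valid reverse topological order.
        for i in (0..=output.0).rev() {
            for &(p, d) in &self.parents[i] {
                grads[p] += d * grads[i];
            }
        }
        grads
    }
}

/// Squared error of the model y = w * x on one sample, plus d(loss)/d(w).
fn loss_and_grad(w: f32, x: f32, y: f32) -> (f32, f32) {
    let mut tape = Tape::default();
    let w_var = tape.leaf(w);
    let x_var = tape.leaf(x);
    let y_var = tape.leaf(y);
    let pred = tape.mul(w_var, x_var); // forward pass is recorded
    let err = tape.sub(pred, y_var);
    let loss = tape.mul(err, err);
    let grads = tape.gradient(loss); // backward pass
    (tape.value(loss), grads[w_var.0])
}

/// Plain SGD on the single sample (x = 2, y = 6); the minimizer is w = 3.
fn fit(steps: usize, lr: f32) -> f32 {
    let mut w = 0.0;
    for _ in 0..steps {
        let (_, grad) = loss_and_grad(w, 2.0, 6.0);
        w -= lr * grad;
    }
    w
}

fn main() {
    let (loss, grad) = loss_and_grad(0.0, 2.0, 6.0);
    println!("initial loss = {loss}, d(loss)/d(w) = {grad}");
    println!("w after 50 SGD steps = {}", fit(50, 0.1));
}
```

A real tape records tensor operations and handles shapes, broadcasting, and memory reuse, but the control flow is the same three steps: record, sweep backward, update.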
What’s inside
- Six focused crates: tenflowers-core, -autograd, -dataset, -neural, -ffi, and the tenflowers meta-crate.
- Dual execution: eager mode for fast iteration, static graphs for capture and deployment.
- Reverse-mode autodiff: with gradient accumulation, checkpointing, forward-mode gradients, and Jacobian checks.
- 150+ research domains, 300+ modules: a deep neural library spanning Flash Attention, ALiBi, RoPE, and transformer decoders; optimizers like LAMB, Lion, and Muon; diffusion (DDPM), VAEs, and normalizing flows; graph and geometric deep learning (GNNs, graph transformers, EGNN, SE(3)-Transformer, AlphaFold2-style IPA); quantum ML (QAOA, quantum kernels); federated learning (Krum, Bulyan); operator learning (FNO, WNO, PINO); efficient transformers (RetNet, Mamba-2, GQA, KV-cache); and 3D Gaussian splatting.
- ONNX import/export: for cross-framework compatibility.
- WGPU GPU acceleration: Metal, Vulkan, DirectX, behind feature flags (gpu, cuda, cudnn, opencl, metal, rocm, nccl, simd).
- Python FFI: C FFI plus PyO3 bindings for bridging into existing pipelines.
- Note on TensorBoard: TensorBoard integration is intentionally excluded from this release because of an upstream advisory (RUSTSEC-2024-0437, protobuf 2.x); it will be restored once the upstream issue is fixed.
Tips
- Turn on the GPU. Enable a backend feature for your platform (metal on a Mac, or the generic gpu/vulkan path elsewhere) to move tensor work off the CPU:
  cargo add tenflowers --features metal
- Pick the right execution mode. Iterate in eager mode while you are exploring; switch to a static Graph when you want a stable, capturable computation for deployment.
- Use GradientTape for custom training loops. When Sequential and a built-in optimizer are not enough, record your forward pass on a GradientTape and drive the backward pass yourself.
- Mine tenflowers-neural before you reimplement. Need Flash Attention, Mamba-2, a graph transformer, or an FNO? They are already in the neural crate, so import them rather than writing them from scratch.
- Export to ONNX for interop. When you need to hand a model to another framework or runtime, use the ONNX export path.
- Start from the prelude. use tenflowers::prelude::*; pulls in Tensor, ops, the layers, and the optimizers; the meta-crate is the only dependency you need to add.
This is the foundation
TenfloweRS does not live alone. It is built on NumRS2 for arrays and the SciRS2 stack (scirs2-core, -autograd, -neural, -linalg, -numpy) for numerical primitives, on OptiRS for optimization, and it serializes through Oxicode. That shared base is why a first release can already ship 12,949 passing tests and roughly 641K lines of Rust across 1,453 source files — most of the hard numerical groundwork was already laid and battle-tested by its siblings. It joins ToRSh, our other pure-Rust deep-learning framework, in the COOLJAPAN family — different design centers, same goal of native, sovereign ML.
Repository: https://github.com/cool-japan/tenflowers
Star the repo if you want deep learning that builds in seconds, ships as one binary, and never asks for a CUDA toolkit. Pure Rust deep learning is here — fast, safe, and sovereign.
KitaSan at COOLJAPAN OÜ, March 20, 2026