A DataFrame that returns a number is easy. A DataFrame that returns the correct number is the whole job.
Today we released PandRS 0.4.0, the release where PandRS’s analytics and machine learning genuinely compute correct answers. This is the biggest PandRS release to date: a deep, correctness-focused minor release that rips out fabricated and stubbed results across the ML and statistics layers and replaces them with real, tested algorithms, while hard-wiring PandRS to the Pure Rust scientific stack through a now-mandatory SciRS2-Core dependency.
No C. No Cython. No pandas or scikit-learn/SciPy C-extensions. No Python GIL.
No conda environments, no platform-specific wheels, no native segfaults hiding under a friendly API.
Just a memory-safe, high-performance DataFrame that compiles to a single static binary (or WASM) and computes the same answer everywhere — laptop, server, edge, browser.
Why PandRS 0.4.0 is a game changer
There is a quiet trap in the analytics world: libraries that return numbers without actually computing them. A clustering routine that hands back an array of zeros. A logistic regression whose evaluate reports a hardcoded accuracy = 0.85. A ROC-AUC that is always 0.75. A silhouette score frozen at 0.75. A PCA that emits zero components. These pass a smoke test, they look plausible in a demo, and they are completely, silently wrong.
PandRS had its share of these stubs. 0.4.0 is the release that pays the debt. Every one of those placeholders has been replaced with a genuine implementation, validated by correctness assertions:
- Real dimensionality reduction — PCA now performs an honest Jacobi-rotation eigendecomposition; t-SNE runs a real perplexity-tuned embedding with gradient descent.
- Real clustering — DBSCAN does density-based region growing; agglomerative clustering does true bottom-up linkage; silhouette scores are actually computed.
- Real supervised learning — logistic regression is fit by Iteratively Reweighted Least Squares with L2 regularization; metrics come from a real confusion matrix.
- Real anomaly detection — isolation forests, Local Outlier Factor, and One-Class SVM now score and label genuinely.
- Real statistical inference — chi-square, t, and F p-values are computed from real distribution tails rather than binary thresholds.
And underneath all of it, the data path is now Pure Rust end to end: DataFrame → statistics → linear algebra runs on SciRS2, with scirs2-core promoted to a non-optional core dependency. The whole chain is genuine, and it is sovereign.
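To make "Jacobi-rotation eigendecomposition" from the list above concrete, here is a minimal, std-only sketch of the cyclic Jacobi method on a small symmetric matrix, plus the explained-variance ratios that follow from its eigenvalues. It illustrates the technique and is not PandRS's internal code: jacobi_eigen and explained_variance_ratio are hypothetical names, and the input is assumed to be an already-computed covariance matrix with non-zero total variance.

```rust
/// Cyclic Jacobi iteration for a small symmetric matrix.
/// Returns (eigenvalues, V) where the columns of V are the eigenvectors (unsorted).
fn jacobi_eigen(mut a: Vec<Vec<f64>>, max_sweeps: usize, tol: f64) -> (Vec<f64>, Vec<Vec<f64>>) {
    let n = a.len();
    // V starts as the identity and accumulates every rotation.
    let mut v = vec![vec![0.0_f64; n]; n];
    for i in 0..n {
        v[i][i] = 1.0;
    }

    for _ in 0..max_sweeps {
        // The sum of squared off-diagonal entries measures distance from diagonal form.
        let mut off = 0.0_f64;
        for p in 0..n {
            for q in (p + 1)..n {
                off += a[p][q] * a[p][q];
            }
        }
        if off < tol {
            break;
        }

        for p in 0..n {
            for q in (p + 1)..n {
                if a[p][q] == 0.0 {
                    continue;
                }
                // Pick the rotation (c, s) that zeroes a[p][q].
                let theta = (a[q][q] - a[p][p]) / (2.0 * a[p][q]);
                let t = theta.signum() / (theta.abs() + (theta * theta + 1.0).sqrt());
                let c = 1.0 / (t * t + 1.0).sqrt();
                let s = t * c;

                // A <- A * J (updates columns p and q).
                for k in 0..n {
                    let (akp, akq) = (a[k][p], a[k][q]);
                    a[k][p] = c * akp - s * akq;
                    a[k][q] = s * akp + c * akq;
                }
                // A <- J^T * A (updates rows p and q).
                for k in 0..n {
                    let (apk, aqk) = (a[p][k], a[q][k]);
                    a[p][k] = c * apk - s * aqk;
                    a[q][k] = s * apk + c * aqk;
                }
                // V <- V * J.
                for k in 0..n {
                    let (vkp, vkq) = (v[k][p], v[k][q]);
                    v[k][p] = c * vkp - s * vkq;
                    v[k][q] = s * vkp + c * vkq;
                }
            }
        }
    }

    let eigenvalues = (0..n).map(|i| a[i][i]).collect();
    (eigenvalues, v)
}

/// Share of the total variance carried by each component, largest first.
fn explained_variance_ratio(eigenvalues: &[f64]) -> Vec<f64> {
    let mut sorted = eigenvalues.to_vec();
    sorted.sort_by(|x, y| y.total_cmp(x));
    let total: f64 = sorted.iter().sum();
    sorted.iter().map(|l| l / total).collect()
}
```

For the covariance matrix [[2, 1], [1, 2]] the eigenvalues are 3 and 1, so the first component explains 75% of the variance. A zero-component stub cannot reproduce that kind of data-dependent result.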
Technical Deep Dive: Four Layers That Now Tell the Truth
1. The real ML engine (unsupervised + supervised).
The pandrs::ml layer was rebuilt around algorithms, not stand-ins. PCA computes the covariance matrix, decomposes it via Jacobi rotations, stores real principal components and explained-variance ratios, projects with transform, and reports a real reconstruction MSE from evaluate. t-SNE builds perplexity-tuned high-dimensional affinities, symmetrizes the P matrix, applies a Student-t Q kernel, and descends with momentum and early exaggeration. DBSCAN::fit runs ε-neighborhood queries, core-point detection, BFS region growing, and noise labelling (-1) under Euclidean, Manhattan, or Cosine metrics. AgglomerativeClustering::fit performs bottom-up hierarchical clustering across all four linkages (Single / Complete / Average / Ward) over a precomputed pairwise distance matrix. compute_silhouette returns a real mean silhouette coefficient (intra-cluster a vs. nearest-cluster b). On the supervised side, LogisticRegression is fit by real IRLS honoring max_iter / tol / c / fit_intercept, with a real sigmoid predict/predict_proba and confusion-matrix evaluate (accuracy/precision/recall/F1), and cross_validate runs real k-fold. LinearRegression::cross_validate is likewise real k-fold. Anomaly detection got the same treatment: IsolationForest builds real isolation trees with random feature/value splits and 2^(-E[h]/c(n)) scores; LocalOutlierFactor::fit computes k-distances, reachability distances, and LRD ratios; OneClassSVM does real SVDD kernel-distance scoring with an RBF kernel. All three now expose a public labels field (-1/1). Scorer::RocAuc computes a real AUC via the Mann-Whitney U rank formula (ties handled; perfect = 1.0, random = 0.5).
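The Mann-Whitney U route to ROC-AUC mentioned above is compact enough to sketch in plain Rust. This is a std-only illustration under stated assumptions, not the body of Scorer::RocAuc: the function name roc_auc, the bool label encoding, and the Option return type are all hypothetical, and NaN scores are not handled.

```rust
/// ROC AUC via the Mann-Whitney U rank formula, using average ranks for ties.
/// `labels[i]` is true for the positive class. Returns None if either class is empty.
fn roc_auc(scores: &[f64], labels: &[bool]) -> Option<f64> {
    assert_eq!(scores.len(), labels.len());
    let n = scores.len();
    let n_pos = labels.iter().filter(|&&l| l).count();
    let n_neg = n - n_pos;
    if n_pos == 0 || n_neg == 0 {
        return None;
    }

    // Sort sample indices by ascending score.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&i, &j| scores[i].total_cmp(&scores[j]));

    // Walk runs of equal scores; each run shares the mean of its 1-based ranks.
    let mut rank_sum_pos = 0.0_f64;
    let mut start = 0;
    while start < n {
        let mut end = start;
        while end + 1 < n && scores[order[end + 1]] == scores[order[start]] {
            end += 1;
        }
        let avg_rank = (start + end) as f64 / 2.0 + 1.0;
        for &idx in &order[start..=end] {
            if labels[idx] {
                rank_sum_pos += avg_rank;
            }
        }
        start = end + 1;
    }

    // U = R_pos - n_pos (n_pos + 1) / 2, and AUC = U / (n_pos * n_neg).
    let u = rank_sum_pos - (n_pos * (n_pos + 1)) as f64 / 2.0;
    Some(u / (n_pos as f64 * n_neg as f64))
}
```

A perfectly separating scorer yields 1.0 and a scorer that gives every sample the same score yields 0.5, matching the "perfect = 1.0, random = 0.5" behaviour described above.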
2. The corrected statistical inference core.
P-values are now real. PandRS ships native chi2_sf (regularized incomplete gamma via Lanczos plus a Lentz continued fraction) and normal_sf (erfc approximation), and uses them to replace binary-threshold p-values across Ljung-Box, Box-Pierce, Breusch-Godfrey, Friedman, and Kruskal-Wallis with genuine χ²-distributed p-values. Kruskal-Wallis now computes its own H statistic; Shapiro-Wilk is improved via the Royston log-transform; and the erroneous ×0.95 scaling in Phillips-Perron is gone. Feature selection became data-dependent too: SelectKBest now computes real χ² contingency statistics and histogram-based mutual information, and select_features implements all four strategies for real (iterative OLS RFE, L1 coefficient ranking, tree-based variance×correlation, histogram MI). On top of that, a tier of descriptive statistics is available without any feature flag: skewness, kurtosis_excess, variance(data, ddof), std_dev(data, ddof), quantile(data, q) with NumPy-compatible linear interpolation, and anova_by_group(values, groups) for the one-way ANOVA F-statistic.
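The two parameters that most often cause disagreements between libraries are ddof and the quantile interpolation rule, so a small reference sketch may help. The functions below are std-only re-implementations written for illustration. The names variance_ddof and quantile_linear and the Option return types are assumptions, not the PandRS signatures. The quantile follows NumPy's default "linear" method.

```rust
/// Variance with `ddof` delta degrees of freedom (0 = population, 1 = sample).
/// Returns None when there are not more than `ddof` observations.
fn variance_ddof(data: &[f64], ddof: usize) -> Option<f64> {
    let n = data.len();
    if n <= ddof {
        return None;
    }
    let mean = data.iter().sum::<f64>() / n as f64;
    let sum_sq: f64 = data.iter().map(|x| (x - mean).powi(2)).sum();
    Some(sum_sq / (n - ddof) as f64)
}

/// Quantile with linear interpolation between the two nearest order statistics.
/// Returns None for empty input or q outside [0, 1].
fn quantile_linear(data: &[f64], q: f64) -> Option<f64> {
    if data.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Fractional position in the sorted sample, in units of index.
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + frac * (sorted[hi] - sorted[lo]))
}
```

For the sample [1, 2, 3, 4] this gives a median of 2.5, a first quartile of 1.75, a population variance of 1.25, and a sample variance of 5/3.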
3. The deepened SciRS2 tier.
This is the Pure Rust alignment milestone. scirs2-core is now a non-optional core dependency (always-on) with the random feature; every direct rand and ndarray usage in production — including the GPU code in src/gpu/ and src/temporal/gpu.rs — was migrated to scirs2_core::random / scirs2_core::ndarray. The rand_compat.rs shim is retired, and rand + ndarray are no longer direct dependencies (they come in transitively via SciRS2-Core). The optional scirs2 feature, now on the SciRS2 0.5.0 line for stats and linalg, gained a substantial surface: Spearman rank correlation, sample covariance, paired t-test, chi-square goodness-of-fit and independence, Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, Shapiro-Wilk, and KS two-sample tests (with new Chi2TestResult / NormalityTestResult types); plus a full linalg set — QR, Cholesky, LU, least-squares lstsq, pseudoinverse pinv, matrix_norm, matrix_rank, and condition_number (with QrResult / LuResult / LstsqDataFrameResult). The SciRS2Ext DataFrame trait now exposes scirs2_spearman_corr, scirs2_cov, scirs2_qr, scirs2_lstsq, scirs2_matrix_rank, and scirs2_condition_number directly on your frames.
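For readers who have not met Spearman rank correlation before, the sketch below shows what the statistic computes: the Pearson correlation of average ranks. It is a std-only illustration with hypothetical helper names (average_ranks, pearson, spearman). It does not call SciRS2 and makes no claim about how scirs2_spearman_corr is implemented or what its signature looks like.

```rust
/// 1-based ranks; tied values share the mean of the ranks they occupy.
fn average_ranks(x: &[f64]) -> Vec<f64> {
    let n = x.len();
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&i, &j| x[i].total_cmp(&x[j]));
    let mut ranks = vec![0.0_f64; n];
    let mut start = 0;
    while start < n {
        let mut end = start;
        while end + 1 < n && x[order[end + 1]] == x[order[start]] {
            end += 1;
        }
        let avg = (start + end) as f64 / 2.0 + 1.0;
        for &idx in &order[start..=end] {
            ranks[idx] = avg;
        }
        start = end + 1;
    }
    ranks
}

/// Pearson correlation; None when lengths differ, n < 2, or one side is constant.
fn pearson(x: &[f64], y: &[f64]) -> Option<f64> {
    if x.len() != y.len() || x.len() < 2 {
        return None;
    }
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let (mut sxy, mut sxx, mut syy) = (0.0_f64, 0.0_f64, 0.0_f64);
    for (a, b) in x.iter().zip(y) {
        sxy += (a - mx) * (b - my);
        sxx += (a - mx).powi(2);
        syy += (b - my).powi(2);
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}

/// Spearman's rho: Pearson correlation computed on the average ranks.
fn spearman(x: &[f64], y: &[f64]) -> Option<f64> {
    pearson(&average_ranks(x), &average_ranks(y))
}
```

Because only ranks matter, any strictly increasing relationship (for example y = x²  on positive x) scores exactly 1, which is the property that distinguishes Spearman from Pearson.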
4. The platform & I/O maturation.
The JIT layer’s execute_expression_tree is now a full recursive interpreter over every ExpressionNode variant (Constant / Variable / BinaryOp / UnaryOp / Reduction / Conditional / FunctionCall / ArrayAccess) with scalar-broadcast semantics. AutoML’s create_estimator instantiates real SupervisedAdapter<M> wrappers for LinearRegression / DecisionTree / RandomForest / GradientBoosting via a new SupervisedAdapter that bridges SupervisedModel to SklearnPredictor. Model serving grew up: InMemoryModelRegistry::load_model returns Arc<dyn ModelServing> (storage moved Box → Arc for reference-counted sharing) and update_metadata is implemented. Under the distributed feature, ArrowConverter now handles a wide span of Arrow types (Int8/16/32, UInt8/16/32/64, Float32, LargeUtf8, Binary, Date32/64, Timestamp, Duration, List, LargeList, FixedSizeList, Struct, Dictionary), and the LocalConnector implements Parquet read/write powering read_parquet/write_parquet. MultiIndex gained pandas-style missing-value support: it accepts code == -1 as an NA sentinel, get_level_values() returns Vec<Option<T>>, and a new get_tuple_opt() gives per-level Option access.
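A recursive interpreter with scalar-broadcast semantics is easier to picture with a toy version. The sketch below is not PandRS's ExpressionNode or execute_expression_tree: Expr, BinOp, Value, and eval are hypothetical, cover only a handful of variants (constant, variable, binary op, and a sum reduction), and assume columns are plain Vec<f64> values looked up by name.

```rust
use std::collections::HashMap;

/// Result of evaluating a node: either one number or one number per row.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Scalar(f64),
    Array(Vec<f64>),
}

#[derive(Clone, Copy)]
enum BinOp {
    Add,
    Mul,
}

enum Expr {
    Constant(f64),
    Variable(String),
    Binary(BinOp, Box<Expr>, Box<Expr>),
    Sum(Box<Expr>),
}

fn apply(op: BinOp, a: f64, b: f64) -> f64 {
    match op {
        BinOp::Add => a + b,
        BinOp::Mul => a * b,
    }
}

/// Evaluate the tree bottom-up; a scalar operand is broadcast across an array operand.
fn eval(expr: &Expr, columns: &HashMap<String, Vec<f64>>) -> Result<Value, String> {
    match expr {
        Expr::Constant(c) => Ok(Value::Scalar(*c)),
        Expr::Variable(name) => columns
            .get(name)
            .cloned()
            .map(Value::Array)
            .ok_or_else(|| format!("unknown column: {}", name)),
        Expr::Binary(op, lhs, rhs) => match (eval(lhs, columns)?, eval(rhs, columns)?) {
            (Value::Scalar(a), Value::Scalar(b)) => Ok(Value::Scalar(apply(*op, a, b))),
            (Value::Scalar(a), Value::Array(v)) => {
                Ok(Value::Array(v.iter().map(|&b| apply(*op, a, b)).collect()))
            }
            (Value::Array(v), Value::Scalar(b)) => {
                Ok(Value::Array(v.iter().map(|&a| apply(*op, a, b)).collect()))
            }
            (Value::Array(u), Value::Array(v)) => {
                if u.len() != v.len() {
                    return Err(format!("length mismatch: {} vs {}", u.len(), v.len()));
                }
                Ok(Value::Array(u.iter().zip(&v).map(|(&a, &b)| apply(*op, a, b)).collect()))
            }
        },
        Expr::Sum(arg) => match eval(arg, columns)? {
            Value::Scalar(s) => Ok(Value::Scalar(s)),
            Value::Array(v) => Ok(Value::Scalar(v.iter().sum())),
        },
    }
}
```

Evaluating (x + 1) * 2 over a column x = [1, 2, 3] returns the array [4, 6, 8], and wrapping the same tree in Sum returns the scalar 18. The design point is that every variant returns a Value, so broadcasting is decided in one place, the binary-op arm, rather than in each caller.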
Getting Started
cargo add pandrs
use pandrs::{DataFrame, Series};
use pandrs::ml::dimensionality::PCA; // real Jacobi-rotation PCA in 0.4.0

fn main() -> pandrs::error::Result<()> {
    let mut df = DataFrame::new();
    df.add_column(
        "x".to_string(),
        Series::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0], Some("x")),
    )?;
    df.add_column(
        "y".to_string(),
        Series::from_vec(vec![2.1, 3.9, 6.2, 8.1, 9.8], Some("y")),
    )?;

    // PCA now performs a real eigendecomposition (was a zero-component stub before 0.4.0)
    let mut pca = PCA::new(1);
    let projected = pca.fit_transform(&df)?;

    println!(
        "explained variance: {:?}",
        pca.explained_variance_ratio()
    );
    println!("{} components projected", projected.shape().1);
    Ok(())
}
Want to see the whole new engine at once? The release ships examples/ml_real_algorithms_example.rs — an end-to-end PCA / DBSCAN / LogisticRegression / IsolationForest / LOF / AgglomerativeClustering walkthrough with correctness assertions baked in.
What’s New in 0.4.0
Real ML algorithms
- PCA via real Jacobi-rotation eigendecomposition (covariance, real components, explained-variance ratios, reconstruction MSE) — was a zero-component stub.
- t-SNE with perplexity-tuned affinities, symmetrized P, Student-t Q kernel, momentum + early exaggeration — was an all-zeros embedding.
- DBSCAN::fit: real density-based clustering (ε-neighborhoods, core points, BFS growth, noise labels; Euclidean/Manhattan/Cosine).
- AgglomerativeClustering::fit: real hierarchical clustering across Single/Complete/Average/Ward linkages.
- LogisticRegression: real IRLS fit, real sigmoid predict/predict_proba, confusion-matrix metrics, real k-fold cross_validate.
- LinearRegression::cross_validate: real k-fold CV.
- IsolationForest, LocalOutlierFactor, OneClassSVM rebuilt with real algorithms; each gains a public labels field (-1/1).
- compute_silhouette: real mean silhouette; Scorer::RocAuc: real Mann-Whitney U AUC.
- GridSearchCV / RandomizedSearchCV: real k-fold scoring; HyperparameterGrid: real Cartesian product; learning_curve / validation_curve: real per-size/per-param CV.
Correct statistical inference
- Native chi2_sf (Lanczos + Lentz) and normal_sf (erfc) power real χ²-distributed p-values across Ljung-Box, Box-Pierce, Breusch-Godfrey, Friedman, Kruskal-Wallis.
- Kruskal-Wallis computes its own H statistic; Shapiro-Wilk via Royston log-transform; removed the erroneous ×0.95 in Phillips-Perron.
- SelectKBest: real χ² and histogram mutual-information scores; select_features: real for all four strategies.
- create_scaler actually returns RobustScaler (median/IQR), QuantileTransformer (rank-based), and PowerTransformer (Yeo-Johnson, optimal λ); they no longer silently fall back to StandardScaler.
- New always-available descriptive stats: skewness, kurtosis_excess, variance(ddof), std_dev(ddof), quantile, anova_by_group.
Deeper SciRS2 integration
- scirs2-core is now a non-optional core dependency (random feature); rand + ndarray are no longer direct deps and are accessed via scirs2_core::random / scirs2_core::ndarray (GPU code included).
- scirs2 feature on the SciRS2 0.5.0 line adds Spearman correlation, sample covariance, and hypothesis tests (paired t-test, χ² goodness-of-fit, χ² independence, Mann-Whitney U, Wilcoxon, Kruskal-Wallis, Shapiro-Wilk, KS two-sample).
- SciRS2 linalg surface: QR, Cholesky, LU, lstsq, pinv, matrix_norm, matrix_rank, condition_number.
- SciRS2Ext DataFrame trait extended with scirs2_spearman_corr, scirs2_cov, scirs2_qr, scirs2_lstsq, scirs2_matrix_rank, scirs2_condition_number.
Platform & I/O
- JIT execute_expression_tree: full recursive interpreter over all ExpressionNode variants.
- AutoML create_estimator returns real SupervisedAdapter<M> wrappers (Linear/DecisionTree/RandomForest/GradientBoosting).
- Model serving: load_model returns Arc<dyn ModelServing>; update_metadata implemented.
- distributed: expanded Arrow type coverage; Parquet read/write via LocalConnector (read_parquet / write_parquet).
- MultiIndex NA sentinel support (code == -1, get_level_values() -> Vec<Option<T>>, get_tuple_opt()).
- oxiarc-archive updated to 0.3.2; dead src/index_impl/ removed; metric-locking deadlock/race fixed in real-time analytics and streaming.
Tips
- Trust the new results. PCA, t-SNE, DBSCAN, and agglomerative clustering now compute real outputs — they are safe to use in analysis and downstream pipelines, not just demos.
- Read the new labels fields. IsolationForest, LocalOutlierFactor, and OneClassSVM each expose a public labels field (-1 = anomaly, 1 = inlier), the simplest way to get a classification straight out of fit.
- Enable scirs2 for heavy math. Turn on the scirs2 feature when you need QR / Cholesky / LU / lstsq / pinv / condition_number or the extra hypothesis tests (Mann-Whitney U, Wilcoxon, KS two-sample, Shapiro-Wilk); they hang directly off the SciRS2Ext DataFrame trait.
- Descriptive stats need no feature flag. skewness, kurtosis_excess, quantile, variance / std_dev (with ddof), and anova_by_group work out of the box on a Vec<f64>; no scirs2 required.
- There is no core flag to toggle anymore. scirs2-core is always-on in 0.4.0, so drop any old scirs2-core feature switches, and reach for rand / ndarray through scirs2_core::{random, ndarray} rather than adding them yourself.
- Use -1 for missing index levels. A MultiIndex code of -1 is the pandas-style NA sentinel; pair it with get_level_values() (now Vec<Option<T>>) or get_tuple_opt() to handle gaps cleanly.
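The -1 sentinel convention from the last tip can be sketched in a few lines. This is a std-only illustration of the idea, assuming a level is stored as integer codes into a list of category values. decode_level is a hypothetical helper, not the MultiIndex API, and the i64 code type is an assumption.

```rust
/// Decode one index level: each code points into `categories`, and -1 means "missing".
/// Any other negative or out-of-range code is reported as an error.
fn decode_level<T: Clone>(codes: &[i64], categories: &[T]) -> Result<Vec<Option<T>>, String> {
    codes
        .iter()
        .map(|&code| {
            if code == -1 {
                // NA sentinel: no value at this level for this row.
                Ok(None)
            } else if code < 0 {
                Err(format!("invalid negative code: {}", code))
            } else {
                categories
                    .get(code as usize)
                    .cloned()
                    .map(Some)
                    .ok_or_else(|| format!("code {} out of range", code))
            }
        })
        .collect()
}
```

Decoding the codes [0, -1, 1] against the categories ["a", "b"] yields [Some("a"), None, Some("b")], which is the same Vec<Option<T>> shape the tip describes for get_level_values().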
This is the foundation
PandRS is the DataFrame layer of the mature COOLJAPAN scientific stack, and 0.4.0 is where that role becomes load-bearing:
- NumRS2: the NumPy-class N-dimensional array core.
- SciRS2 / SkleaRS: SciPy- and scikit-learn-class scientific computing and ML, now wired in hard. PandRS depends on SciRS2 0.5.0 for stats and linalg, with scirs2-core always-on.
- OptiRS: optimizers for the training and tuning loops feeding off your frames.
- OxiARC: Pure Rust compression (oxiarc-archive 0.3.2) underneath the I/O layer, in step with the COOLJAPAN no-C compression policy.
Together they form a broad, mature, C-free analytics stack: load a DataFrame, run real PCA or clustering, fit a real model, compute a real p-value, and serialize through Pure Rust compression — all in one static binary, with no Python interpreter and no native dependency chain to fight.
Repository: https://github.com/cool-japan/pandrs
Star the repo if you want a DataFrame whose analytics actually compute the answers they report — and do it without pandas, scikit-learn C-extensions, or the GIL.
The era of trusting a stub because it returned a plausible-looking number is over.
Pure Rust data analytics is here — fast, safe, correct, and sovereign.
— KitaSan at COOLJAPAN OÜ June 5, 2026