PandRS 0.4.0 Released — Where the DataFrame Finally Computes the Right Answer

A DataFrame that returns a number is easy. A DataFrame that returns the correct number is the whole job.

Today we released PandRS 0.4.0, the release where PandRS’s analytics and machine learning genuinely compute correct answers. This is the biggest PandRS release to date: a deep, correctness-focused minor release that rips out fabricated and stubbed results across the ML and statistics layers and replaces them with real, tested algorithms, while hard-wiring PandRS to the Pure Rust scientific stack through a now-mandatory SciRS2-Core dependency.

No C. No Cython. No pandas or scikit-learn/SciPy C-extensions. No Python GIL.
No conda environments, no platform-specific wheels, no native segfaults hiding under a friendly API.
Just a memory-safe, high-performance DataFrame that compiles to a single static binary (or WASM) and computes the same answer everywhere — laptop, server, edge, browser.

Why PandRS 0.4.0 is a game changer

There is a quiet trap in the analytics world: libraries that return numbers without actually computing them. A clustering routine that hands back an array of zeros. A logistic regression whose evaluate reports a hardcoded accuracy = 0.85. A ROC-AUC that is always 0.75. A silhouette score frozen at 0.75. A PCA that emits zero components. These pass a smoke test, they look plausible in a demo, and they are completely, silently wrong.

PandRS had its share of these stubs. 0.4.0 is the release that pays the debt. Every one of those placeholders has been replaced with a genuine implementation, validated by correctness assertions:

Real dimensionality reduction — PCA now performs an honest Jacobi-rotation eigendecomposition; t-SNE runs a real perplexity-tuned embedding with gradient descent.
Real clustering — DBSCAN does density-based region growing; agglomerative clustering does true bottom-up linkage; silhouette scores are actually computed.
Real supervised learning — logistic regression is fit by Iteratively Reweighted Least Squares with L2 regularization; metrics come from a real confusion matrix.
Real anomaly detection — isolation forests, Local Outlier Factor, and One-Class SVM now score and label genuinely.
Real statistical inference — chi-square, t, and F p-values are computed from real distribution tails rather than binary thresholds.

And underneath all of it, the data path is now Pure Rust end to end: DataFrame → statistics → linear algebra runs on SciRS2, with scirs2-core promoted to a non-optional core dependency. The whole chain is genuine, and it is sovereign.

Technical Deep Dive: Four Layers That Now Tell the Truth

1. The real ML engine (unsupervised + supervised).
The pandrs::ml layer was rebuilt around algorithms, not stand-ins. PCA computes the covariance matrix, decomposes it via Jacobi rotations, stores real principal components and explained-variance ratios, projects with transform, and reports a real reconstruction MSE from evaluate. t-SNE builds perplexity-tuned high-dimensional affinities, symmetrizes the P matrix, applies a Student-t Q kernel, and descends with momentum and early exaggeration. DBSCAN::fit runs ε-neighborhood queries, core-point detection, BFS region growing, and noise labelling (-1) under Euclidean, Manhattan, or Cosine metrics. AgglomerativeClustering::fit performs bottom-up hierarchical clustering across all four linkages (Single / Complete / Average / Ward) over a precomputed pairwise distance matrix. compute_silhouette returns a real mean silhouette coefficient (intra-cluster a vs. nearest-cluster b).
On the supervised side, LogisticRegression is fit by real IRLS honoring max_iter / tol / c / fit_intercept, with a real sigmoid predict/predict_proba and confusion-matrix evaluate (accuracy/precision/recall/F1), and cross_validate runs real k-fold. LinearRegression::cross_validate is likewise real k-fold.
Anomaly detection got the same treatment: IsolationForest builds real isolation trees with random feature/value splits and 2^(-E[h]/c(n)) scores; LocalOutlierFactor::fit computes k-distances, reachability distances, and LRD ratios; OneClassSVM does real SVDD kernel-distance scoring with an RBF kernel. All three now expose a public labels field (-1/1). Scorer::RocAuc computes a real AUC via the Mann-Whitney U rank formula (ties handled; perfect = 1.0, random = 0.5).
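
To make the Mann-Whitney U formulation of AUC concrete, here is a minimal standalone sketch in plain Rust. It is not the PandRS source: the function name roc_auc and its signature are assumptions chosen for illustration, and it only shows the rank formula with average ranks for ties.

```rust
/// ROC AUC via the Mann-Whitney U rank formula, with average ranks for ties.
/// `labels[i]` is true for the positive class. Returns None if the inputs
/// differ in length or if either class is empty.
/// (Hypothetical helper for illustration, not the PandRS API.)
fn roc_auc(scores: &[f64], labels: &[bool]) -> Option<f64> {
    let n = scores.len();
    if n != labels.len() {
        return None;
    }
    let n_pos = labels.iter().filter(|&&is_pos| is_pos).count();
    let n_neg = n - n_pos;
    if n_pos == 0 || n_neg == 0 {
        return None;
    }

    // Indices sorted by ascending score.
    let mut order: Vec<usize> = (0..n).collect();
    order.sort_by(|&a, &b| scores[a].total_cmp(&scores[b]));

    // 1-based ranks; a run of tied scores shares the average of its ranks.
    let mut ranks: Vec<f64> = vec![0.0; n];
    let mut start = 0;
    while start < n {
        let mut end = start;
        while end + 1 < n && scores[order[end + 1]] == scores[order[start]] {
            end += 1;
        }
        let avg_rank = (start + end) as f64 / 2.0 + 1.0;
        for &idx in &order[start..=end] {
            ranks[idx] = avg_rank;
        }
        start = end + 1;
    }

    // U = (rank sum of positives) - n_pos * (n_pos + 1) / 2, AUC = U / (n_pos * n_neg).
    let rank_sum_pos: f64 = ranks
        .iter()
        .zip(labels)
        .filter(|&(_, &is_pos)| is_pos)
        .map(|(&r, _)| r)
        .sum();
    let u = rank_sum_pos - (n_pos * (n_pos + 1)) as f64 / 2.0;
    Some(u / (n_pos * n_neg) as f64)
}

fn main() {
    let scores = [0.1, 0.4, 0.35, 0.8];
    let labels = [false, false, true, true];
    println!("AUC = {:?}", roc_auc(&scores, &labels));
}
```

A perfectly separating score vector yields 1.0 and an all-tied one yields 0.5, which matches the "perfect = 1.0, random = 0.5" behavior described above.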

2. The corrected statistical inference core.
P-values are now real. PandRS ships native chi2_sf (regularized incomplete gamma via Lanczos plus a Lentz continued fraction) and normal_sf (erfc approximation), and uses them to replace binary-threshold p-values across Ljung-Box, Box-Pierce, Breusch-Godfrey, Friedman, and Kruskal-Wallis with genuine χ²-distributed p-values. Kruskal-Wallis now computes its own H statistic; Shapiro-Wilk is improved via the Royston log-transform; and the erroneous ×0.95 scaling in Phillips-Perron is gone. Feature selection became data-dependent too: SelectKBest now computes real χ² contingency statistics and histogram-based mutual information, and select_features implements all four strategies for real (iterative OLS RFE, L1 coefficient ranking, tree-based variance×correlation, histogram MI). On top of that, a tier of descriptive statistics is available without any feature flag: skewness, kurtosis_excess, variance(data, ddof), std_dev(data, ddof), quantile(data, q) (NumPy-compatible linear interpolation), and anova_by_group(values, groups) one-way ANOVA F-statistic.
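
For readers who want to see what "NumPy-compatible linear interpolation" and the ddof parameter mean in practice, the following is a small sketch under stated assumptions. The names quantile_linear and variance_ddof are hypothetical stand-ins, not the PandRS functions, and the real signatures and error types may differ.

```rust
/// Quantile with linear interpolation between the two nearest order statistics
/// (the rule NumPy uses by default): position = q * (n - 1) in the sorted data.
/// (Hypothetical helper for illustration, not the PandRS API.)
fn quantile_linear(data: &[f64], q: f64) -> Option<f64> {
    if data.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    Some(sorted[lo] + frac * (sorted[hi] - sorted[lo]))
}

/// Variance with `ddof` delta degrees of freedom: the sum of squared deviations
/// divided by (n - ddof). ddof = 0 is the population variance, ddof = 1 the
/// sample variance.
fn variance_ddof(data: &[f64], ddof: usize) -> Option<f64> {
    let n = data.len();
    if n <= ddof {
        return None;
    }
    let mean = data.iter().sum::<f64>() / n as f64;
    let ss: f64 = data.iter().map(|x| (x - mean).powi(2)).sum();
    Some(ss / (n - ddof) as f64)
}

fn main() {
    let data = [4.0, 1.0, 3.0, 2.0];
    println!("median   = {:?}", quantile_linear(&data, 0.5));
    println!("q(0.25)  = {:?}", quantile_linear(&data, 0.25));
    println!("var pop  = {:?}", variance_ddof(&data, 0));
    println!("var samp = {:?}", variance_ddof(&data, 1));
}
```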

3. The deepened SciRS2 tier.
This is the Pure Rust alignment milestone. scirs2-core is now a non-optional core dependency (always-on) with the random feature; every direct rand and ndarray usage in production — including the GPU code in src/gpu/ and src/temporal/gpu.rs — was migrated to scirs2_core::random / scirs2_core::ndarray. The rand_compat.rs shim is retired, and rand + ndarray are no longer direct dependencies (they come in transitively via SciRS2-Core). The optional scirs2 feature, now on the SciRS2 0.5.0 line for stats and linalg, gained a substantial surface: Spearman rank correlation, sample covariance, paired t-test, chi-square goodness-of-fit and independence, Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, Shapiro-Wilk, and KS two-sample tests (with new Chi2TestResult / NormalityTestResult types); plus a full linalg set — QR, Cholesky, LU, least-squares lstsq, pseudoinverse pinv, matrix_norm, matrix_rank, and condition_number (with QrResult / LuResult / LstsqDataFrameResult). The SciRS2Ext DataFrame trait now exposes scirs2_spearman_corr, scirs2_cov, scirs2_qr, scirs2_lstsq, scirs2_matrix_rank, and scirs2_condition_number directly on your frames.
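
As a rough illustration of what one of the listed linalg routines computes, here is a dependency-free Cholesky factorization sketch. It is deliberately not the SciRS2 or PandRS implementation: the function name cholesky and the Vec<Vec<f64>> representation are assumptions made so the example compiles with the standard library alone.

```rust
/// Cholesky factorization A = L * L^T for a symmetric positive-definite matrix.
/// Returns the lower-triangular factor L, or None if A is not square or not
/// positive definite. (Illustrative only; not the SciRS2/PandRS routine.)
fn cholesky(a: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
    let n = a.len();
    if a.iter().any(|row| row.len() != n) {
        return None;
    }
    let mut l: Vec<Vec<f64>> = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..=i {
            // Dot product of the already-computed parts of rows i and j of L.
            let s: f64 = (0..j).map(|k| l[i][k] * l[j][k]).sum();
            if i == j {
                let d = a[i][i] - s;
                if d <= 0.0 {
                    return None; // not positive definite
                }
                l[i][j] = d.sqrt();
            } else {
                l[i][j] = (a[i][j] - s) / l[j][j];
            }
        }
    }
    Some(l)
}

fn main() {
    let a = vec![vec![4.0, 2.0], vec![2.0, 3.0]];
    match cholesky(&a) {
        Some(l) => println!("L = {:?}", l),
        None => println!("matrix is not positive definite"),
    }
}
```

In a real pipeline you would reach for the scirs2 feature rather than hand-rolling this; the sketch only shows the arithmetic behind the name.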

4. The platform & I/O maturation.
The JIT layer’s execute_expression_tree is now a full recursive interpreter over every ExpressionNode variant (Constant / Variable / BinaryOp / UnaryOp / Reduction / Conditional / FunctionCall / ArrayAccess) with scalar-broadcast semantics. AutoML’s create_estimator instantiates real wrappers for LinearRegression / DecisionTree / RandomForest / GradientBoosting through the new SupervisedAdapter<M>, which bridges SupervisedModel to SklearnPredictor. Model serving matured: InMemoryModelRegistry::load_model returns Arc<dyn ModelServing> (storage moved from Box to Arc for reference-counted sharing) and update_metadata is implemented. Under the distributed feature, ArrowConverter now handles a wide span of Arrow types (Int8/16/32, UInt8/16/32/64, Float32, LargeUtf8, Binary, Date32/64, Timestamp, Duration, List, LargeList, FixedSizeList, Struct, Dictionary), and the LocalConnector implements Parquet read/write powering read_parquet/write_parquet. MultiIndex gained pandas-style missing-value support: it accepts code == -1 as an NA sentinel, get_level_values() returns Vec<Option<T>>, and a new get_tuple_opt() gives per-level Option access.
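
The idea of a recursive interpreter with scalar-broadcast semantics can be sketched in a few dozen lines. The Expr and Value types below are simplified, hypothetical stand-ins (they are not the real ExpressionNode, and they cover only a subset of the variants listed above); the point is only to show how a scalar operand is stretched across an array operand during evaluation.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Scalar(f64),
    Array(Vec<f64>),
}

// Hypothetical, reduced expression tree (not the PandRS ExpressionNode).
enum Expr {
    Constant(f64),
    Variable(String),
    Neg(Box<Expr>),            // a unary op
    Add(Box<Expr>, Box<Expr>), // binary ops
    Mul(Box<Expr>, Box<Expr>),
    Sum(Box<Expr>),            // a reduction
}

/// Apply a binary op elementwise, broadcasting a scalar against an array.
fn broadcast(a: Value, b: Value, op: fn(f64, f64) -> f64) -> Result<Value, String> {
    match (a, b) {
        (Value::Scalar(x), Value::Scalar(y)) => Ok(Value::Scalar(op(x, y))),
        (Value::Scalar(x), Value::Array(ys)) => {
            Ok(Value::Array(ys.into_iter().map(|y| op(x, y)).collect()))
        }
        (Value::Array(xs), Value::Scalar(y)) => {
            Ok(Value::Array(xs.into_iter().map(|x| op(x, y)).collect()))
        }
        (Value::Array(xs), Value::Array(ys)) => {
            if xs.len() != ys.len() {
                return Err(format!("length mismatch: {} vs {}", xs.len(), ys.len()));
            }
            Ok(Value::Array(xs.into_iter().zip(ys).map(|(x, y)| op(x, y)).collect()))
        }
    }
}

/// Recursively evaluate an expression against named columns.
fn eval(expr: &Expr, vars: &HashMap<String, Vec<f64>>) -> Result<Value, String> {
    match expr {
        Expr::Constant(c) => Ok(Value::Scalar(*c)),
        Expr::Variable(name) => vars
            .get(name)
            .cloned()
            .map(Value::Array)
            .ok_or_else(|| format!("unknown variable {}", name)),
        Expr::Neg(inner) => Ok(match eval(inner, vars)? {
            Value::Scalar(x) => Value::Scalar(-x),
            Value::Array(xs) => Value::Array(xs.into_iter().map(|x| -x).collect()),
        }),
        Expr::Add(l, r) => broadcast(eval(l, vars)?, eval(r, vars)?, |x, y| x + y),
        Expr::Mul(l, r) => broadcast(eval(l, vars)?, eval(r, vars)?, |x, y| x * y),
        Expr::Sum(inner) => Ok(match eval(inner, vars)? {
            Value::Scalar(x) => Value::Scalar(x),
            Value::Array(xs) => Value::Scalar(xs.iter().sum()),
        }),
    }
}

fn main() {
    let mut vars = HashMap::new();
    vars.insert("x".to_string(), vec![1.0, 2.0, 3.0]);
    vars.insert("y".to_string(), vec![10.0, 20.0, 30.0]);
    // sum(x * 2 + (-y)): the constant 2 is broadcast across column x.
    let expr = Expr::Sum(Box::new(Expr::Add(
        Box::new(Expr::Mul(
            Box::new(Expr::Variable("x".to_string())),
            Box::new(Expr::Constant(2.0)),
        )),
        Box::new(Expr::Neg(Box::new(Expr::Variable("y".to_string())))),
    )));
    println!("{:?}", eval(&expr, &vars));
}
```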

Getting Started

cargo add pandrs

use pandrs::{DataFrame, Series};
use pandrs::ml::dimensionality::PCA; // real Jacobi-rotation PCA in 0.4.0

fn main() -> pandrs::error::Result<()> {
    let mut df = DataFrame::new();
    df.add_column(
        "x".to_string(),
        Series::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0], Some("x")),
    )?;
    df.add_column(
        "y".to_string(),
        Series::from_vec(vec![2.1, 3.9, 6.2, 8.1, 9.8], Some("y")),
    )?;

    // PCA now performs a real eigendecomposition (was a zero-component stub before 0.4.0)
    let mut pca = PCA::new(1);
    let projected = pca.fit_transform(&df)?;
    println!(
        "explained variance: {:?}",
        pca.explained_variance_ratio()
    );
    println!("{} components projected", projected.shape().1);
    Ok(())
}

Want to see the whole new engine at once? The release ships examples/ml_real_algorithms_example.rs — an end-to-end PCA / DBSCAN / LogisticRegression / IsolationForest / LOF / AgglomerativeClustering walkthrough with correctness assertions baked in.

What’s New in 0.4.0

Real ML algorithms

PCA via real Jacobi-rotation eigendecomposition (covariance, real components, explained-variance ratios, reconstruction MSE) — was a zero-component stub.
t-SNE with perplexity-tuned affinities, symmetrized P, Student-t Q kernel, momentum + early exaggeration — was an all-zeros embedding.
DBSCAN::fit real density-based clustering (ε-neighborhoods, core points, BFS growth, noise labels; Euclidean/Manhattan/Cosine).
AgglomerativeClustering::fit real hierarchical clustering across Single/Complete/Average/Ward linkages.
LogisticRegression real IRLS fit, real sigmoid predict/predict_proba, confusion-matrix metrics, real k-fold cross_validate.
LinearRegression::cross_validate real k-fold CV.
IsolationForest, LocalOutlierFactor, OneClassSVM rebuilt with real algorithms — each gains a public labels field (-1/1).
compute_silhouette real mean silhouette; Scorer::RocAuc real Mann-Whitney U AUC.
GridSearchCV / RandomizedSearchCV real k-fold scoring; HyperparameterGrid real Cartesian product; learning_curve / validation_curve real per-size/per-param CV.
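
To show what "real mean silhouette" means, the sketch below computes the coefficient directly from its definition: for each point, a is the mean distance to the rest of its own cluster, b is the smallest mean distance to any other cluster, and the score is (b - a) / max(a, b). The function name mean_silhouette, the usize label encoding, and the Euclidean metric are assumptions for illustration; this is not the compute_silhouette source.

```rust
fn euclidean(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Mean silhouette coefficient. `labels[i]` is the cluster id (0..k) of `points[i]`.
/// Returns None for mismatched inputs or fewer than two non-empty clusters.
/// (Hypothetical helper for illustration, not the PandRS API.)
fn mean_silhouette(points: &[Vec<f64>], labels: &[usize]) -> Option<f64> {
    let n = points.len();
    if n == 0 || n != labels.len() {
        return None;
    }
    let k = *labels.iter().max()? + 1;
    let mut sizes = vec![0usize; k];
    for &label in labels {
        sizes[label] += 1;
    }
    if sizes.iter().filter(|&&s| s > 0).count() < 2 {
        return None;
    }

    let mut total = 0.0_f64;
    for i in 0..n {
        // Sum of distances from point i to the members of each cluster.
        let mut dist_sum = vec![0.0_f64; k];
        for j in 0..n {
            if i != j {
                dist_sum[labels[j]] += euclidean(&points[i], &points[j]);
            }
        }
        let own = labels[i];
        if sizes[own] == 1 {
            continue; // singleton cluster: s(i) = 0 by convention
        }
        // a: mean distance to the rest of the point's own cluster.
        let a = dist_sum[own] / (sizes[own] - 1) as f64;
        // b: smallest mean distance to any other non-empty cluster.
        let b = (0..k)
            .filter(|&c| c != own && sizes[c] > 0)
            .map(|c| dist_sum[c] / sizes[c] as f64)
            .fold(f64::INFINITY, f64::min);
        let denom = a.max(b);
        if denom > 0.0 {
            total += (b - a) / denom;
        }
    }
    Some(total / n as f64)
}

fn main() {
    let points = vec![vec![0.0], vec![1.0], vec![10.0], vec![11.0]];
    println!("good labels: {:?}", mean_silhouette(&points, &[0, 0, 1, 1]));
    println!("bad labels:  {:?}", mean_silhouette(&points, &[0, 1, 0, 1]));
}
```

A data-dependent score like this moves with the labeling: well-separated clusters score close to 1, while a labeling that splits natural groups goes negative, which a frozen constant can never do.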

Correct statistical inference

Native chi2_sf (Lanczos + Lentz) and normal_sf (erfc) power real χ²-distributed p-values across Ljung-Box, Box-Pierce, Breusch-Godfrey, Friedman, Kruskal-Wallis.
Kruskal-Wallis computes its own H statistic; Shapiro-Wilk via Royston log-transform; removed the erroneous ×0.95 in Phillips-Perron.
SelectKBest real χ² and histogram mutual-information scores; select_features real for all four strategies.
create_scaler actually returns RobustScaler (median/IQR), QuantileTransformer (rank-based), and PowerTransformer (Yeo-Johnson, optimal λ) — they no longer silently fall back to StandardScaler.
New always-available descriptive stats: skewness, kurtosis_excess, variance(ddof), std_dev(ddof), quantile, anova_by_group.
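
The one-way ANOVA F-statistic mentioned in the last item is the ratio of between-group to within-group mean squares. Here is a minimal sketch of that arithmetic, assuming string group labels; the name one_way_anova_f and its signature are hypothetical and may not match anova_by_group.

```rust
use std::collections::BTreeMap;

/// One-way ANOVA F-statistic: (SS_between / (k - 1)) / (SS_within / (n - k)),
/// where k is the number of groups and n the number of observations.
/// Returns None for mismatched inputs, fewer than two groups, too few
/// observations, or zero within-group variation.
/// (Hypothetical helper for illustration, not the PandRS API.)
fn one_way_anova_f(values: &[f64], groups: &[&str]) -> Option<f64> {
    if values.len() != groups.len() {
        return None;
    }
    // Bucket the observations by group label.
    let mut by_group: BTreeMap<&str, Vec<f64>> = BTreeMap::new();
    for (&v, &g) in values.iter().zip(groups) {
        by_group.entry(g).or_default().push(v);
    }
    let n = values.len();
    let k = by_group.len();
    if k < 2 || n <= k {
        return None;
    }

    let grand_mean = values.iter().sum::<f64>() / n as f64;
    let mut ss_between = 0.0_f64;
    let mut ss_within = 0.0_f64;
    for members in by_group.values() {
        let mean = members.iter().sum::<f64>() / members.len() as f64;
        ss_between += members.len() as f64 * (mean - grand_mean).powi(2);
        ss_within += members.iter().map(|x| (x - mean).powi(2)).sum::<f64>();
    }
    if ss_within == 0.0 {
        return None;
    }
    Some((ss_between / (k - 1) as f64) / (ss_within / (n - k) as f64))
}

fn main() {
    let values = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0];
    let groups = ["a", "a", "a", "b", "b", "b", "c", "c", "c"];
    println!("F = {:?}", one_way_anova_f(&values, &groups));
}
```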

Deeper SciRS2 integration

scirs2-core is now a non-optional core dependency (random feature); rand + ndarray are no longer direct deps — accessed via scirs2_core::random / scirs2_core::ndarray (GPU code included).
scirs2 feature on the SciRS2 0.5.0 line adds Spearman correlation, sample covariance, and hypothesis tests (paired t-test, χ² goodness-of-fit, χ² independence, Mann-Whitney U, Wilcoxon, Kruskal-Wallis, Shapiro-Wilk, KS two-sample).
SciRS2 linalg surface: QR, Cholesky, LU, lstsq, pinv, matrix_norm, matrix_rank, condition_number.
SciRS2Ext DataFrame trait extended with scirs2_spearman_corr, scirs2_cov, scirs2_qr, scirs2_lstsq, scirs2_matrix_rank, scirs2_condition_number.

Platform & I/O

JIT execute_expression_tree full recursive interpreter over all ExpressionNode variants.
AutoML create_estimator returns real SupervisedAdapter<M> wrappers (Linear/DecisionTree/RandomForest/GradientBoosting).
Model serving load_model returns Arc<dyn ModelServing>; update_metadata implemented.
distributed: expanded Arrow type coverage; Parquet read/write via LocalConnector (read_parquet/write_parquet).
MultiIndex NA sentinel support (code == -1, get_level_values() -> Vec<Option<T>>, get_tuple_opt()).
oxiarc-archive updated to 0.3.2; dead src/index_impl/ removed; metric-locking deadlock/race fixed in real-time analytics and streaming.

Tips

Trust the new results. PCA, t-SNE, DBSCAN, and agglomerative clustering now compute real outputs — they are safe to use in analysis and downstream pipelines, not just demos.
Read the new labels fields. IsolationForest, LocalOutlierFactor, and OneClassSVM each expose a public labels field (-1 = anomaly, 1 = inlier) — the simplest way to get a classification straight out of fit.
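Enable scirs2 for heavy math. Turn on the scirs2 feature when you need QR / Cholesky / LU / lstsq / pinv / condition_number or the extra hypothesis tests (Mann-Whitney U, Wilcoxon, KS two-sample, Shapiro-Wilk). Several of these (scirs2_spearman_corr, scirs2_cov, scirs2_qr, scirs2_lstsq, scirs2_matrix_rank, scirs2_condition_number) are also exposed directly on your frames through the SciRS2Ext DataFrame trait.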
Descriptive stats need no feature flag. skewness, kurtosis_excess, quantile, variance/std_dev (with ddof), and anova_by_group work out of the box on a Vec<f64> — no scirs2 required.
There is no core flag to toggle anymore. scirs2-core is always-on in 0.4.0, so drop any old scirs2-core feature switches — and reach for rand/ndarray through scirs2_core::{random, ndarray} rather than adding them yourself.
Use -1 for missing index levels. A MultiIndex code of -1 is the pandas-style NA sentinel; pair it with get_level_values() (now Vec<Option<T>>) or get_tuple_opt() to handle gaps cleanly.
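
The last tip is easiest to picture as a decoding step: each level stores integer codes into a list of categories, and -1 decodes to None instead of a value. The sketch below assumes that layout; decode_level is a hypothetical helper written for this post, not a PandRS function, and the real MultiIndex internals may differ.

```rust
/// Decode one index level. Each code is a position in `categories`;
/// -1 is the NA sentinel and decodes to None. Any other out-of-range
/// code is reported as an error.
/// (Hypothetical helper for illustration, not the PandRS API.)
fn decode_level<T: Clone>(codes: &[i64], categories: &[T]) -> Result<Vec<Option<T>>, String> {
    codes
        .iter()
        .map(|&code| match code {
            -1 => Ok(None),
            c if c >= 0 && (c as usize) < categories.len() => {
                Ok(Some(categories[c as usize].clone()))
            }
            c => Err(format!("code {} is out of range", c)),
        })
        .collect()
}

fn main() {
    let categories = ["tokyo", "osaka"];
    let codes = [0, -1, 1, 0];
    // The second entry has no value at this level, so it decodes to None.
    println!("{:?}", decode_level(&codes, &categories));
}
```

Returning Option per entry keeps missing levels explicit in the type, so downstream code has to decide how to handle a gap rather than silently reading a placeholder value.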

This is the foundation

PandRS is the DataFrame layer of the mature COOLJAPAN scientific stack, and 0.4.0 is where that role becomes load-bearing:

NumRS2: the NumPy-class N-dimensional array core.
SciRS2 / SkleaRS: SciPy- and scikit-learn-class scientific computing and ML, now wired in hard. scirs2-core is an always-on dependency, and the optional scirs2 feature tracks the SciRS2 0.5.0 line for stats and linalg.
OptiRS: optimizers for the training and tuning loops feeding off your frames.
OxiARC: Pure Rust compression (oxiarc-archive 0.3.2) underneath the I/O layer, in step with the COOLJAPAN no-C compression policy.

Together they form a broad, mature, C-free analytics stack: load a DataFrame, run real PCA or clustering, fit a real model, compute a real p-value, and serialize through Pure Rust compression — all in one static binary, with no Python interpreter and no native dependency chain to fight.

Repository: https://github.com/cool-japan/pandrs

Star the repo if you want a DataFrame whose analytics actually compute the answers they report — and do it without pandas, scikit-learn C-extensions, or the GIL.

The era of trusting a stub because it returned a plausible-looking number is over.

Pure Rust data analytics is here — fast, safe, correct, and sovereign.

— KitaSan at COOLJAPAN OÜ June 5, 2026