
PandRS 0.4.0 Released — Where the DataFrame Finally Computes the Right Answer

The high-performance Pure Rust DataFrame library reaches 0.4.0, a correctness landmark. PandRS replaces a large swath of fabricated and stubbed ML and statistical results with real algorithms: Jacobi-rotation PCA, t-SNE, DBSCAN, agglomerative clustering, IRLS logistic regression, isolation-forest/LOF/OneClassSVM anomaly detection, real silhouette and ROC-AUC, and genuine chi-square/t/F p-values. SciRS2-Core becomes a non-optional core dependency, and the SciRS2 stats/linalg integration deepens and moves onto the SciRS2 0.5 line. PandRS is the DataFrame layer of the COOLJAPAN scientific stack: no pandas, no scikit-learn C-extensions, no GIL.


A DataFrame that returns a number is easy. A DataFrame that returns the correct number is the whole job.

Today we released PandRS 0.4.0 — the release where PandRS’s analytics and machine learning genuinely compute correct answers. This is the biggest PandRS release to date: a deep, correctness-forward minor that rips out fabricated and stubbed results across the ML and statistics layers and replaces them with real, tested algorithms, while hard-wiring PandRS to the Pure Rust scientific stack through a now-mandatory SciRS2-Core dependency.

No C. No Cython. No pandas or scikit-learn/SciPy C-extensions. No Python GIL.
No conda environments, no platform-specific wheels, no native segfaults hiding under a friendly API.
Just a memory-safe, high-performance DataFrame that compiles to a single static binary (or WASM) and computes the same answer everywhere — laptop, server, edge, browser.

Why PandRS 0.4.0 is a game changer

There is a quiet trap in the analytics world: libraries that return numbers without actually computing them. A clustering routine that hands back an array of zeros. A logistic regression whose evaluate reports a hardcoded accuracy = 0.85. A ROC-AUC that is always 0.75. A silhouette score frozen at 0.75. A PCA that emits zero components. These pass a smoke test, they look plausible in a demo, and they are completely, silently wrong.

PandRS had its share of these stubs. Version 0.4.0 is the release that pays the debt. Every one of those placeholders has been replaced with a genuine implementation, validated by correctness assertions.

And underneath all of it, the data path is now Pure Rust end to end: DataFrame → statistics → linear algebra runs on SciRS2, with scirs2-core promoted to a non-optional core dependency. The whole chain is genuine, and it is sovereign.

Technical Deep Dive: Four Layers That Now Tell the Truth

1. The real ML engine (unsupervised + supervised).
The pandrs::ml layer was rebuilt around algorithms, not stand-ins. PCA computes the covariance matrix, decomposes it via Jacobi rotations, stores real principal components and explained-variance ratios, projects with transform, and reports a real reconstruction MSE from evaluate. t-SNE builds perplexity-tuned high-dimensional affinities, symmetrizes the P matrix, applies a Student-t Q kernel, and descends with momentum and early exaggeration.

DBSCAN::fit runs ε-neighborhood queries, core-point detection, BFS region growing, and noise labelling (-1) under Euclidean, Manhattan, or Cosine metrics. AgglomerativeClustering::fit performs bottom-up hierarchical clustering across all four linkages (Single / Complete / Average / Ward) over a precomputed pairwise distance matrix. compute_silhouette returns a real mean silhouette coefficient (intra-cluster a vs. nearest-cluster b).

On the supervised side, LogisticRegression is fit by real IRLS honoring max_iter / tol / c / fit_intercept, with a real sigmoid predict/predict_proba and confusion-matrix evaluate (accuracy/precision/recall/F1), and cross_validate runs real k-fold. LinearRegression::cross_validate is likewise real k-fold.

Anomaly detection got the same treatment: IsolationForest builds real isolation trees with random feature/value splits and 2^(-E[h]/c(n)) scores; LocalOutlierFactor::fit computes k-distances, reachability distances, and LRD ratios; OneClassSVM does real SVDD kernel-distance scoring with an RBF kernel. All three now expose a public labels field (-1/1). Scorer::RocAuc computes a real AUC via the Mann-Whitney U rank formula (ties handled; perfect = 1.0, random = 0.5).
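To make the rank formula concrete, here is a minimal standalone sketch of ROC-AUC computed from the Mann-Whitney U statistic, with tied scores sharing an averaged rank. It is not the source of Scorer::RocAuc; the function name roc_auc_rank and its signature are assumptions made purely for illustration.

```rust
/// ROC-AUC from the Mann-Whitney U statistic (illustrative sketch, not PandRS code).
/// `labels[i]` is true for the positive class. Returns None if either class is empty.
fn roc_auc_rank(scores: &[f64], labels: &[bool]) -> Option<f64> {
    assert_eq!(scores.len(), labels.len());
    let n_pos = labels.iter().filter(|&&l| l).count();
    let n_neg = labels.len() - n_pos;
    if n_pos == 0 || n_neg == 0 {
        return None;
    }

    // Indices sorted by ascending score.
    let mut order: Vec<usize> = (0..scores.len()).collect();
    order.sort_by(|&a, &b| scores[a].total_cmp(&scores[b]));

    // Walk runs of equal scores; every member of a run gets the run's
    // average 1-based rank (the "midrank"), which is how ties are handled.
    let mut pos_rank_sum = 0.0;
    let mut i = 0;
    while i < order.len() {
        let mut j = i;
        while j + 1 < order.len() && scores[order[j + 1]] == scores[order[i]] {
            j += 1;
        }
        let midrank = (i + j) as f64 / 2.0 + 1.0;
        for &idx in &order[i..=j] {
            if labels[idx] {
                pos_rank_sum += midrank;
            }
        }
        i = j + 1;
    }

    // U = R_pos - n_pos * (n_pos + 1) / 2, and AUC = U / (n_pos * n_neg).
    let u = pos_rank_sum - (n_pos * (n_pos + 1)) as f64 / 2.0;
    Some(u / (n_pos * n_neg) as f64)
}
```

Under this formulation a score vector that separates the classes perfectly gives 1.0, and a vector where every score is tied gives 0.5, which matches the anchor values quoted above.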

2. The corrected statistical inference core.
P-values are now real. PandRS ships native chi2_sf (regularized incomplete gamma via Lanczos plus a Lentz continued fraction) and normal_sf (erfc approximation), and uses them to replace binary-threshold p-values across Ljung-Box, Box-Pierce, Breusch-Godfrey, Friedman, and Kruskal-Wallis with genuine χ²-distributed p-values. Kruskal-Wallis now computes its own H statistic; Shapiro-Wilk is improved via the Royston log-transform; and the erroneous ×0.95 scaling in Phillips-Perron is gone. Feature selection became data-dependent too: SelectKBest now computes real χ² contingency statistics and histogram-based mutual information, and select_features implements all four strategies for real (iterative OLS RFE, L1 coefficient ranking, tree-based variance×correlation, histogram MI). On top of that, a tier of descriptive statistics is available without any feature flag: skewness, kurtosis_excess, variance(data, ddof), std_dev(data, ddof), quantile(data, q) (NumPy-compatible linear interpolation), and anova_by_group(values, groups) one-way ANOVA F-statistic.
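As a rough illustration of what the descriptive tier computes, the sketch below reimplements three of the listed quantities from scratch: variance with a ddof parameter, a quantile with linear interpolation between order statistics, and the one-way ANOVA F statistic. The names variance_ddof, quantile_linear, and anova_f_oneway, along with the Option return types and the &str group keys, are hypothetical; the actual PandRS signatures may differ.

```rust
use std::collections::HashMap;

/// Variance with `ddof` delta degrees of freedom (0 = population, 1 = sample).
/// Illustrative sketch, not PandRS code.
fn variance_ddof(data: &[f64], ddof: usize) -> Option<f64> {
    if data.len() <= ddof {
        return None;
    }
    let n = data.len() as f64;
    let mean = data.iter().sum::<f64>() / n;
    let ss: f64 = data.iter().map(|x| (x - mean).powi(2)).sum();
    Some(ss / (n - ddof as f64))
}

/// Quantile with linear interpolation: fractional position h = q * (n - 1)
/// in the sorted data, interpolating between the two neighbouring values.
fn quantile_linear(data: &[f64], q: f64) -> Option<f64> {
    if data.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let mut sorted = data.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let h = q * (sorted.len() - 1) as f64;
    let (lo, hi) = (h.floor() as usize, h.ceil() as usize);
    Some(sorted[lo] + (h - lo as f64) * (sorted[hi] - sorted[lo]))
}

/// One-way ANOVA F statistic: between-group mean square divided by
/// within-group mean square. Needs at least two groups and n > k.
fn anova_f_oneway(values: &[f64], groups: &[&str]) -> Option<f64> {
    if values.len() != groups.len() || values.is_empty() {
        return None;
    }
    // Per-group (count, sum).
    let mut stats: HashMap<&str, (f64, f64)> = HashMap::new();
    for (&v, &g) in values.iter().zip(groups) {
        let entry = stats.entry(g).or_insert((0.0, 0.0));
        entry.0 += 1.0;
        entry.1 += v;
    }
    let n = values.len() as f64;
    let k = stats.len() as f64;
    if k < 2.0 || n <= k {
        return None;
    }
    let grand_mean = values.iter().sum::<f64>() / n;
    let ss_between: f64 = stats
        .values()
        .map(|&(count, sum)| count * (sum / count - grand_mean).powi(2))
        .sum();
    let ss_within: f64 = values
        .iter()
        .zip(groups)
        .map(|(&v, &g)| {
            let (count, sum) = stats[g];
            (v - sum / count).powi(2)
        })
        .sum();
    Some((ss_between / (k - 1.0)) / (ss_within / (n - k)))
}
```

For example, quantile_linear on [1, 2, 3, 4] with q = 0.5 lands at position 1.5 and interpolates to 2.5. If every group is internally constant, the within-group sum of squares is zero and the F statistic is not finite, so a caller would want to handle that case explicitly.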

3. The deepened SciRS2 tier.
This is the Pure Rust alignment milestone. scirs2-core is now a non-optional core dependency (always-on) with the random feature; every direct rand and ndarray usage in production — including the GPU code in src/gpu/ and src/temporal/gpu.rs — was migrated to scirs2_core::random / scirs2_core::ndarray. The rand_compat.rs shim is retired, and rand + ndarray are no longer direct dependencies (they come in transitively via SciRS2-Core). The optional scirs2 feature, now on the SciRS2 0.5.0 line for stats and linalg, gained a substantial surface: Spearman rank correlation, sample covariance, paired t-test, chi-square goodness-of-fit and independence, Mann-Whitney U, Wilcoxon signed-rank, Kruskal-Wallis, Shapiro-Wilk, and KS two-sample tests (with new Chi2TestResult / NormalityTestResult types); plus a full linalg set — QR, Cholesky, LU, least-squares lstsq, pseudoinverse pinv, matrix_norm, matrix_rank, and condition_number (with QrResult / LuResult / LstsqDataFrameResult). The SciRS2Ext DataFrame trait now exposes scirs2_spearman_corr, scirs2_cov, scirs2_qr, scirs2_lstsq, scirs2_matrix_rank, and scirs2_condition_number directly on your frames.
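For readers who want a feel for what a least-squares solve such as lstsq returns, here is a small dependency-free sketch that solves the normal equations with a Cholesky factorization. It makes no claim about how SciRS2 implements lstsq or Cholesky; the function names cholesky_lower and lstsq_normal and the Vec<Vec<f64>> matrix layout are assumptions chosen to keep the example short. Forming the normal equations squares the condition number, so a production routine would more likely rely on QR or SVD.

```rust
/// Cholesky factorization A = L * L^T of a symmetric positive-definite matrix
/// stored row by row. Returns the lower-triangular L, or None if A is not
/// square or not positive definite. Illustrative sketch only.
fn cholesky_lower(a: &[Vec<f64>]) -> Option<Vec<Vec<f64>>> {
    let n = a.len();
    if a.iter().any(|row| row.len() != n) {
        return None;
    }
    let mut l = vec![vec![0.0; n]; n];
    for i in 0..n {
        for j in 0..=i {
            let dot: f64 = (0..j).map(|k| l[i][k] * l[j][k]).sum();
            if i == j {
                let d = a[i][i] - dot;
                if !(d > 0.0) {
                    return None; // not positive definite (also catches NaN)
                }
                l[i][j] = d.sqrt();
            } else {
                l[i][j] = (a[i][j] - dot) / l[j][j];
            }
        }
    }
    Some(l)
}

/// Least squares min ||X b - y|| via the normal equations (X^T X) b = X^T y.
/// Returns None on shape mismatch or when X^T X is not positive definite
/// (for example, when the columns of X are linearly dependent).
fn lstsq_normal(x: &[Vec<f64>], y: &[f64]) -> Option<Vec<f64>> {
    let m = x.len();
    if m == 0 || m != y.len() {
        return None;
    }
    let p = x[0].len();
    if x.iter().any(|row| row.len() != p) {
        return None;
    }
    // Accumulate the Gram matrix X^T X and the right-hand side X^T y.
    let mut xtx = vec![vec![0.0; p]; p];
    let mut xty = vec![0.0; p];
    for (row, &yi) in x.iter().zip(y) {
        for i in 0..p {
            xty[i] += row[i] * yi;
            for j in 0..p {
                xtx[i][j] += row[i] * row[j];
            }
        }
    }
    let l = cholesky_lower(&xtx)?;
    // Forward substitution: L z = X^T y.
    let mut z = vec![0.0; p];
    for i in 0..p {
        let s: f64 = (0..i).map(|k| l[i][k] * z[k]).sum();
        z[i] = (xty[i] - s) / l[i][i];
    }
    // Back substitution: L^T b = z.
    let mut b = vec![0.0; p];
    for i in (0..p).rev() {
        let s: f64 = ((i + 1)..p).map(|k| l[k][i] * b[k]).sum();
        b[i] = (z[i] - s) / l[i][i];
    }
    Some(b)
}
```

Fitting y = 1 + 2x with an intercept column of ones, as in rows [1, 1], [1, 2], [1, 3] against targets 3, 5, 7, recovers coefficients close to [1, 2].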

4. The platform & I/O maturation.
The JIT layer’s execute_expression_tree is now a full recursive interpreter over every ExpressionNode variant (Constant / Variable / BinaryOp / UnaryOp / Reduction / Conditional / FunctionCall / ArrayAccess) with scalar-broadcast semantics. AutoML’s create_estimator instantiates real wrappers for LinearRegression / DecisionTree / RandomForest / GradientBoosting through a new SupervisedAdapter<M> type that bridges SupervisedModel to SklearnPredictor. Model serving matured as well: InMemoryModelRegistry::load_model returns Arc<dyn ModelServing> (storage moved from Box to Arc for reference-counted sharing), and update_metadata is implemented. Under the distributed feature, ArrowConverter now handles a wide span of Arrow types (Int8/16/32, UInt8/16/32/64, Float32, LargeUtf8, Binary, Date32/64, Timestamp, Duration, List, LargeList, FixedSizeList, Struct, Dictionary), and the LocalConnector implements Parquet read/write powering read_parquet/write_parquet. MultiIndex gained pandas-style missing-value support: it accepts code == -1 as an NA sentinel, get_level_values() returns Vec<Option<T>>, and a new get_tuple_opt() gives per-level Option access.
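The idea of a recursive interpreter with scalar broadcasting can be sketched in a few dozen lines. The Expr, BinOp, UnOp, Reduce, and Value types below are hypothetical stand-ins that cover only a subset of the variants named above (no Conditional, FunctionCall, or ArrayAccess); they are not the PandRS ExpressionNode definitions, and the String error type is an assumption.

```rust
use std::collections::HashMap;

// Hypothetical, simplified expression tree (not PandRS's ExpressionNode).
#[allow(dead_code)]
enum BinOp { Add, Mul }
#[allow(dead_code)]
enum UnOp { Neg }
#[allow(dead_code)]
enum Reduce { Sum, Mean }

#[allow(dead_code)]
enum Expr {
    Constant(f64),
    Variable(String),
    BinaryOp(BinOp, Box<Expr>, Box<Expr>),
    UnaryOp(UnOp, Box<Expr>),
    Reduction(Reduce, Box<Expr>),
}

#[derive(Debug, Clone, PartialEq)]
enum Value {
    Scalar(f64),
    Array(Vec<f64>),
}

/// Apply `f` elementwise; a scalar operand is broadcast against an array.
fn broadcast(a: &Value, b: &Value, f: impl Fn(f64, f64) -> f64) -> Result<Value, String> {
    match (a, b) {
        (Value::Scalar(x), Value::Scalar(y)) => Ok(Value::Scalar(f(*x, *y))),
        (Value::Scalar(x), Value::Array(ys)) => {
            Ok(Value::Array(ys.iter().map(|y| f(*x, *y)).collect()))
        }
        (Value::Array(xs), Value::Scalar(y)) => {
            Ok(Value::Array(xs.iter().map(|x| f(*x, *y)).collect()))
        }
        (Value::Array(xs), Value::Array(ys)) if xs.len() == ys.len() => {
            Ok(Value::Array(xs.iter().zip(ys).map(|(x, y)| f(*x, *y)).collect()))
        }
        _ => Err("array operands differ in length".to_string()),
    }
}

/// Recursively evaluate `expr`; variables are looked up as columns in `env`.
fn eval(expr: &Expr, env: &HashMap<String, Vec<f64>>) -> Result<Value, String> {
    match expr {
        Expr::Constant(c) => Ok(Value::Scalar(*c)),
        Expr::Variable(name) => env
            .get(name)
            .cloned()
            .map(Value::Array)
            .ok_or_else(|| format!("unknown variable: {}", name)),
        Expr::BinaryOp(op, lhs, rhs) => {
            let (l, r) = (eval(lhs, env)?, eval(rhs, env)?);
            match op {
                BinOp::Add => broadcast(&l, &r, |a, b| a + b),
                BinOp::Mul => broadcast(&l, &r, |a, b| a * b),
            }
        }
        Expr::UnaryOp(UnOp::Neg, inner) => Ok(match eval(inner, env)? {
            Value::Scalar(v) => Value::Scalar(-v),
            Value::Array(vs) => Value::Array(vs.into_iter().map(|v| -v).collect()),
        }),
        Expr::Reduction(kind, inner) => {
            // A reduction collapses its operand to a single scalar.
            let vs = match eval(inner, env)? {
                Value::Scalar(v) => vec![v],
                Value::Array(vs) => vs,
            };
            let sum: f64 = vs.iter().sum();
            Ok(Value::Scalar(match kind {
                Reduce::Sum => sum,
                Reduce::Mean => sum / vs.len() as f64,
            }))
        }
    }
}
```

With a column x = [1, 2, 3], the tree for x * 2 + sum(x) evaluates the product elementwise, reduces x to the scalar 6, and then broadcasts that scalar across the array during the addition.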

Getting Started

cargo add pandrs
use pandrs::{DataFrame, Series};
use pandrs::ml::dimensionality::PCA; // real Jacobi-rotation PCA in 0.4.0

fn main() -> pandrs::error::Result<()> {
    let mut df = DataFrame::new();
    df.add_column(
        "x".to_string(),
        Series::from_vec(vec![1.0, 2.0, 3.0, 4.0, 5.0], Some("x")),
    )?;
    df.add_column(
        "y".to_string(),
        Series::from_vec(vec![2.1, 3.9, 6.2, 8.1, 9.8], Some("y")),
    )?;

    // PCA now performs a real eigendecomposition (was a zero-component stub before 0.4.0)
    let mut pca = PCA::new(1);
    let projected = pca.fit_transform(&df)?;
    println!(
        "explained variance: {:?}",
        pca.explained_variance_ratio()
    );
    println!("{} components projected", projected.shape().1);
    Ok(())
}

Want to see the whole new engine at once? The release ships examples/ml_real_algorithms_example.rs — an end-to-end PCA / DBSCAN / LogisticRegression / IsolationForest / LOF / AgglomerativeClustering walkthrough with correctness assertions baked in.
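The kind of correctness assertion mentioned here can be illustrated without any PandRS API at all. The sketch below is a standalone mean silhouette coefficient that a test could compare against on data with an obvious answer; it is not the contents of the shipped example file, and the names euclidean and mean_silhouette, the usize labels, and the Option return type are assumptions.

```rust
/// Euclidean distance between two points of equal dimension.
fn euclidean(p: &[f64], q: &[f64]) -> f64 {
    p.iter().zip(q).map(|(a, b)| (a - b).powi(2)).sum::<f64>().sqrt()
}

/// Mean silhouette coefficient (illustrative sketch, not PandRS code).
/// For each point: a = mean distance to its own cluster, b = smallest mean
/// distance to any other cluster, s = (b - a) / max(a, b).
/// Returns None when fewer than two clusters are populated.
fn mean_silhouette(points: &[Vec<f64>], labels: &[usize]) -> Option<f64> {
    if points.len() != labels.len() || points.is_empty() {
        return None;
    }
    let k = labels.iter().max()? + 1;
    let mut total = 0.0;
    for i in 0..points.len() {
        // Distance sums and member counts per cluster, excluding point i.
        let mut dist_sum = vec![0.0; k];
        let mut count = vec![0usize; k];
        for j in 0..points.len() {
            if i != j {
                dist_sum[labels[j]] += euclidean(&points[i], &points[j]);
                count[labels[j]] += 1;
            }
        }
        let own = labels[i];
        if count[own] == 0 {
            continue; // singleton cluster: s(i) = 0 by convention
        }
        let a = dist_sum[own] / count[own] as f64;
        let b = (0..k)
            .filter(|&c| c != own && count[c] > 0)
            .map(|c| dist_sum[c] / count[c] as f64)
            .fold(f64::INFINITY, f64::min);
        if !b.is_finite() {
            return None; // no other populated cluster
        }
        let m = a.max(b);
        if m > 0.0 {
            total += (b - a) / m;
        }
    }
    Some(total / points.len() as f64)
}
```

A stub that always returns 0.75 cannot satisfy both directions of a check like this: two tight, well-separated clusters should score close to 1 under the correct labelling and below 0 when the labels are deliberately scrambled across the clusters.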

What’s New in 0.4.0

Real ML algorithms

Correct statistical inference

Deeper SciRS2 integration

Platform & I/O

Tips

This is the foundation

PandRS is the DataFrame layer of the mature COOLJAPAN scientific stack, and 0.4.0 is where that role becomes load-bearing.

Together, PandRS and the rest of that stack form a broad, mature, C-free analytics toolchain: load a DataFrame, run real PCA or clustering, fit a real model, compute a real p-value, and serialize through Pure Rust compression, all in one static binary, with no Python interpreter and no native dependency chain to fight.

Repository: https://github.com/cool-japan/pandrs

Star the repo if you want a DataFrame whose analytics actually compute the answers they report — and do it without pandas, scikit-learn C-extensions, or the GIL.

The era of trusting a stub because it returned a plausible-looking number is over.

Pure Rust data analytics is here — fast, safe, correct, and sovereign.

KitaSan at COOLJAPAN OÜ June 5, 2026
