evaluation
Evaluation harness for DoTime benchmark suites.
This module ports the metric functions and aggregation helpers from the
Do-Over-Time-PFN evaluation code (dotime/eval/metrics.py and the
scripts/tscm_identifiability.py reference harness) into a single
dependency-light surface (torch + numpy only — R² is computed directly rather
than via scikit-learn so it stays in the core install).
Public surface
- Metric functions: compute_rmse(), compute_mae(), compute_nmse(), compute_r2().
- direction_accuracy(): sign-consistent accuracy, near-zero targets excluded.
- bootstrap_ci(): bootstrap mean/std/CI over per-sample values.
- evaluate(): run a baseline over a suite, aggregating pooled and per-structure metrics.
- Results: holds the aggregated metrics, with .summary() and .to_dict().
- class dotime.evaluation.Results(suite, baseline, n_episodes, n_queries, pooled, per_structure=<factory>)
Bases: object
Aggregated evaluation results for one baseline on one suite.
- dotime.evaluation.bootstrap_ci(values, n=1000, alpha=0.05, seed=0)
Bootstrap (mean, std, ci_low, ci_high) over per-sample values.
Uses the percentile method at confidence level 1 - alpha. Returns NaNs for an empty input and a degenerate (v, 0, v, v) for a single value.
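The percentile method described above can be sketched in a few lines of NumPy. This is an illustrative re-implementation, not the library's code: the name bootstrap_ci_sketch is hypothetical, and it assumes that the returned mean and std are those of the bootstrap replicate means (the documentation does not say which convention bootstrap_ci() uses).

```python
import numpy as np


def bootstrap_ci_sketch(values, n=1000, alpha=0.05, seed=0):
    """Percentile bootstrap of the mean: (mean, std, ci_low, ci_high)."""
    values = np.asarray(values, dtype=float).ravel()

    # Edge cases documented for bootstrap_ci(): empty and single-value input.
    if values.size == 0:
        nan = float("nan")
        return nan, nan, nan, nan
    if values.size == 1:
        v = float(values[0])
        return v, 0.0, v, v

    rng = np.random.default_rng(seed)
    # One row of resampled indices (with replacement) per bootstrap replicate.
    idx = rng.integers(0, values.size, size=(n, values.size))
    means = values[idx].mean(axis=1)

    # Percentile interval at confidence level 1 - alpha.
    ci_low, ci_high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    # Assumption: mean/std summarize the replicate means, not the raw sample.
    return float(means.mean()), float(means.std(ddof=1)), float(ci_low), float(ci_high)
```

With the default alpha=0.05 the interval spans the 2.5th to 97.5th percentiles of the replicate means, and a fixed seed makes the result reproducible.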
- dotime.evaluation.compute_nmse(predictions, targets)
Normalized MSE: MSE / Var(targets).
Equals 1.0 for a predict-the-mean baseline, <1.0 when better, >1.0 when worse. Returns NaN when there are fewer than two targets or the variance is ~0.
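A minimal NumPy sketch of this definition follows. The function name nmse_sketch and the var_floor threshold are hypothetical, and the use of the population variance (ddof=0) is an assumption, chosen because it makes the predict-the-mean baseline score exactly 1.0 as stated above.

```python
import numpy as np


def nmse_sketch(predictions, targets, var_floor=1e-12):
    """MSE / Var(targets); NaN for < 2 targets or near-zero variance."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    targets = np.asarray(targets, dtype=float).ravel()

    if targets.size < 2:
        return float("nan")

    # Assumption: population variance (ddof=0); var_floor is a hypothetical cutoff.
    var = targets.var()
    if var < var_floor:
        return float("nan")

    mse = np.mean((predictions - targets) ** 2)
    return float(mse / var)
```

For targets [1, 2, 3, 4], predicting the mean 2.5 everywhere gives an MSE of 1.25 and a variance of 1.25, hence an NMSE of 1.0.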
- dotime.evaluation.compute_r2(predictions, targets)
Coefficient of determination, 1 - SS_res / SS_tot.
Computed directly (no scikit-learn) so it stays in the core install. Returns NaN when the target variance is ~0.
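The formula can be computed directly from the residual and total sums of squares, as in the sketch below. The name r2_sketch and the tol threshold are hypothetical, and the exact near-zero variance test used by compute_r2() is not documented, so the check here is an assumption.

```python
import numpy as np


def r2_sketch(predictions, targets, tol=1e-12):
    """1 - SS_res / SS_tot; NaN when the target variance is ~0."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    targets = np.asarray(targets, dtype=float).ravel()

    if targets.size == 0:
        return float("nan")

    ss_res = np.sum((targets - predictions) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)

    # Assumption: "variance ~0" means population variance below tol.
    if ss_tot / targets.size < tol:
        return float("nan")

    return float(1.0 - ss_res / ss_tot)
```

Because SS_res / SS_tot equals MSE / Var(targets) when the variance is the population variance, this quantity is 1 - NMSE under that convention: a predict-the-mean baseline scores 0.0 and a perfect prediction scores 1.0.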
- dotime.evaluation.direction_accuracy(preds, targets, eps=0.1)
Sign-consistent direction accuracy, excluding near-zero targets.
Returns a dict with accuracy (fraction of |target| >= eps samples whose predicted sign matches), n_valid and n_excluded.
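The masking and sign comparison can be sketched as follows. The name direction_accuracy_sketch is hypothetical, and two behaviours are assumptions rather than documented facts: a prediction of exactly zero is counted as a miss, and the accuracy is NaN when no target passes the eps threshold.

```python
import numpy as np


def direction_accuracy_sketch(preds, targets, eps=0.1):
    """Fraction of |target| >= eps samples whose predicted sign matches."""
    preds = np.asarray(preds, dtype=float).ravel()
    targets = np.asarray(targets, dtype=float).ravel()

    # Keep only targets whose magnitude is at least eps.
    valid = np.abs(targets) >= eps
    n_valid = int(valid.sum())
    n_excluded = int(targets.size - n_valid)

    if n_valid == 0:
        accuracy = float("nan")  # assumption: undefined without valid targets
    else:
        # np.sign(0.0) is 0, so a zero prediction never matches a valid target.
        matches = np.sign(preds[valid]) == np.sign(targets[valid])
        accuracy = float(matches.mean())

    return {"accuracy": accuracy, "n_valid": n_valid, "n_excluded": n_excluded}
```

For example, with preds [0.5, -0.2, 0.3, 1.0] and targets [1.0, 0.4, 0.05, -2.0], the third target falls below eps=0.1 and is excluded, and one of the remaining three signs matches.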