evaluation

Evaluation harness for DoTime benchmark suites.

This module ports the metric functions and aggregation helpers from the Do-Over-Time-PFN evaluation code (dotime/eval/metrics.py and the scripts/tscm_identifiability.py reference harness) into a single dependency-light surface that requires only torch and numpy. R² is computed directly rather than via scikit-learn, so it stays in the core install.

Public surface

class dotime.evaluation.Results(suite, baseline, n_episodes, n_queries, pooled, per_structure=<factory>)

Bases: object

Aggregated evaluation results for one baseline on one suite.

Parameters:
suite: str
baseline: str
n_episodes: int
n_queries: int
pooled: dict[str, float]
per_structure: dict[str, dict[str, float]]

to_dict()

JSON-serializable view of the results.

Return type:

dict

summary()

Human-readable results table.

Return type:

str

dotime.evaluation.bootstrap_ci(values, n=1000, alpha=0.05, seed=0)

Bootstrap (mean, std, ci_low, ci_high) over per-sample values.

Uses the percentile method at confidence level 1 - alpha. Returns four NaNs for an empty input and the degenerate tuple (v, 0, v, v) for a single value v.
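The percentile method described above can be sketched in a few lines of numpy. This is an illustrative reimplementation, not the library's code: the name bootstrap_ci_sketch is hypothetical, and it assumes that mean is the plain sample mean, that std is the standard deviation of the resampled means, and that resampling is of the mean statistic.

```python
import numpy as np


def bootstrap_ci_sketch(values, n=1000, alpha=0.05, seed=0):
    """Percentile bootstrap of the mean: (mean, std, ci_low, ci_high)."""
    values = np.asarray(values, dtype=float).ravel()
    if values.size == 0:
        nan = float("nan")
        return nan, nan, nan, nan
    if values.size == 1:
        v = float(values[0])
        return v, 0.0, v, v

    rng = np.random.default_rng(seed)
    # Draw n resamples with replacement and record the mean of each one.
    idx = rng.integers(0, values.size, size=(n, values.size))
    boot_means = values[idx].mean(axis=1)

    # Percentile interval: cut alpha/2 of the bootstrap mass from each tail.
    ci_low, ci_high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    # Assumption: std is the spread of the bootstrap means (a standard error).
    return float(values.mean()), float(boot_means.std()), float(ci_low), float(ci_high)
```

With a fixed seed the result is reproducible, so repeated calls on the same per-sample values give the same interval.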

Return type:

tuple[float, float, float, float]

dotime.evaluation.compute_mae(predictions, targets)

Mean absolute error.

Return type:

float

dotime.evaluation.compute_nmse(predictions, targets)

Normalized MSE: MSE / Var(targets).

Equals 1.0 for a predict-the-mean baseline, <1.0 when better, >1.0 worse. Returns NaN when there are fewer than two targets or the variance is ~0.
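A minimal numpy sketch of this definition follows. The function name nmse_sketch and the var_floor threshold are hypothetical, and the use of the population variance (ddof=0) is an assumption, chosen because it makes the predict-the-mean baseline come out at exactly 1.0.

```python
import numpy as np


def nmse_sketch(predictions, targets, var_floor=1e-12):
    """MSE divided by the variance of the targets."""
    p = np.asarray(predictions, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    if t.size < 2:
        return float("nan")  # variance is undefined for fewer than two targets
    var = t.var()  # population variance (ddof=0), an assumption
    if var < var_floor:
        return float("nan")  # near-constant targets: ratio is meaningless
    return float(np.mean((p - t) ** 2) / var)


targets = [1.0, 2.0, 3.0, 4.0]
mean_baseline = [2.5] * 4            # always predict the target mean
print(nmse_sketch(mean_baseline, targets))  # 1.0
print(nmse_sketch(targets, targets))        # 0.0
```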

Return type:

float

dotime.evaluation.compute_r2(predictions, targets)

Coefficient of determination, 1 - SS_res / SS_tot.

Computed directly (no scikit-learn) so it stays in the core install. Returns NaN when the target variance is ~0.
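The formula is short enough to write out directly. The sketch below is illustrative only: r2_sketch and its tol threshold are hypothetical names, and the exact near-zero test used by the library is not documented here.

```python
import numpy as np


def r2_sketch(predictions, targets, tol=1e-12):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    p = np.asarray(predictions, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    ss_res = np.sum((t - p) ** 2)         # residual sum of squares
    ss_tot = np.sum((t - t.mean()) ** 2)  # total sum of squares around the mean
    if ss_tot < tol:
        return float("nan")  # targets are (nearly) constant
    return float(1.0 - ss_res / ss_tot)
```

When NMSE is normalized by the population variance, SS_res / SS_tot is the same ratio as MSE / Var(targets), so under that assumption R² equals 1 - NMSE.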

Return type:

float

dotime.evaluation.compute_rmse(predictions, targets)

Root mean squared error.
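For comparison with compute_mae, the two error metrics can be sketched side by side. The names mae_sketch and rmse_sketch are hypothetical stand-ins, not the library functions.

```python
import numpy as np


def mae_sketch(predictions, targets):
    """Mean of the absolute errors."""
    p = np.asarray(predictions, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    return float(np.mean(np.abs(p - t)))


def rmse_sketch(predictions, targets):
    """Square root of the mean squared error."""
    p = np.asarray(predictions, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    return float(np.sqrt(np.mean((p - t) ** 2)))


preds, targets = [1.0, 2.0, 3.0], [1.0, 2.0, 6.0]  # errors: 0, 0, 3
print(mae_sketch(preds, targets))   # 1.0
print(rmse_sketch(preds, targets))  # sqrt(3), about 1.732
```

RMSE is never smaller than MAE on the same data, and the gap widens when a few errors dominate, as in this example.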

Return type:

float

dotime.evaluation.direction_accuracy(preds, targets, eps=0.1)

Sign-consistent direction accuracy, excluding near-zero targets.

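Returns a dict with accuracy (the fraction of samples with |target| >= eps whose predicted sign matches the target sign), n_valid (the number of samples kept), and n_excluded (the number dropped for having |target| < eps).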
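A small sketch of the masking logic, under stated assumptions: direction_accuracy_sketch is a hypothetical name, a prediction of exactly zero is counted as a mismatch, and accuracy is NaN when no target passes the eps filter. The library may handle both edge cases differently.

```python
import numpy as np


def direction_accuracy_sketch(preds, targets, eps=0.1):
    """Sign agreement between predictions and targets, ignoring |target| < eps."""
    p = np.asarray(preds, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    valid = np.abs(t) >= eps          # keep only targets with a clear direction
    n_valid = int(valid.sum())
    n_excluded = int(t.size - n_valid)
    if n_valid == 0:
        accuracy = float("nan")       # assumption: undefined when nothing is valid
    else:
        accuracy = float(np.mean(np.sign(p[valid]) == np.sign(t[valid])))
    return {"accuracy": accuracy, "n_valid": n_valid, "n_excluded": n_excluded}


out = direction_accuracy_sketch([0.5, -0.2, 0.3, 1.0], [1.0, -0.5, -0.4, 0.05])
print(out["n_valid"], out["n_excluded"])  # 3 1
```

In this example the last target (0.05) falls below eps and is excluded; of the remaining three, two predictions have the right sign.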

Return type:

dict[str, float | int]

dotime.evaluation.evaluate(model, suite, metrics=None)

Evaluate a baseline over every episode of a suite.

Calls model.predict(episode) for each episode, pools predictions and ground-truth targets across all queries, and reports pooled and per-structure metrics.
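The pooling step can be illustrated with a self-contained sketch. Everything about the data layout here is hypothetical: episodes are plain dicts with "context", "targets" and "structure" keys, metrics is a name-to-function mapping, and the return value is a dict rather than a Results object. Only the model.predict(episode) call and the pooled / per-structure split come from the description above.

```python
from collections import defaultdict

import numpy as np


class MeanBaseline:
    """Hypothetical baseline: predicts the context mean for every query."""

    def predict(self, episode):
        return np.full(len(episode["targets"]), np.mean(episode["context"]))


def evaluate_sketch(model, episodes, metrics):
    pooled_p, pooled_t = [], []
    by_structure = defaultdict(lambda: ([], []))
    for ep in episodes:
        preds = np.asarray(model.predict(ep), dtype=float).ravel()
        targets = np.asarray(ep["targets"], dtype=float).ravel()
        pooled_p.append(preds)
        pooled_t.append(targets)
        # Also group the same arrays by the episode's structure label.
        sp, st = by_structure[ep["structure"]]
        sp.append(preds)
        st.append(targets)

    # Metrics are computed once over all queries, not averaged per episode.
    p, t = np.concatenate(pooled_p), np.concatenate(pooled_t)
    pooled = {name: fn(p, t) for name, fn in metrics.items()}
    per_structure = {
        s: {name: fn(np.concatenate(sp), np.concatenate(st)) for name, fn in metrics.items()}
        for s, (sp, st) in by_structure.items()
    }
    return {
        "n_episodes": len(episodes),
        "n_queries": int(t.size),
        "pooled": pooled,
        "per_structure": per_structure,
    }


episodes = [
    {"structure": "chain", "context": [1.0, 3.0], "targets": [2.0, 4.0]},
    {"structure": "fork", "context": [0.0, 0.0], "targets": [1.0]},
]
mae = lambda p, t: float(np.mean(np.abs(p - t)))
result = evaluate_sketch(MeanBaseline(), episodes, {"mae": mae})
print(result["n_queries"], result["pooled"]["mae"])  # 3 1.0
```

Pooling before computing a metric weights every query equally, so episodes with more queries contribute more than they would under a per-episode average. The sketch does not handle an empty episode list.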

Return type:

Results
