evaluation

Evaluation harness for DoTime benchmark suites.

This module ports the metric functions and aggregation helpers from the Do-Over-Time-PFN evaluation code (dotime/eval/metrics.py and the scripts/tscm_identifiability.py reference harness) into a single dependency-light surface that requires only torch and numpy. R² is computed directly rather than via scikit-learn, so the module stays in the core install.

Public surface

compute_rmse(), compute_mae(), compute_nmse(), compute_r2(): metric functions.
direction_accuracy(): sign-consistent accuracy, near-zero targets excluded.
bootstrap_ci(): bootstrap mean/std/CI over per-sample values.
evaluate(): run a baseline over a suite, aggregating pooled and per-structure metrics.
Results: holds the aggregated metrics with .summary() and .to_dict().

class dotime.evaluation.Results(suite, baseline, n_episodes, n_queries, pooled, per_structure=<factory>)

Bases: object

Aggregated evaluation results for one baseline on one suite.

Parameters:

suite (str)
baseline (str)
n_episodes (int)
n_queries (int)
pooled (dict[str, float])
per_structure (dict[str, dict[str, float]])

suite: str

baseline: str

n_episodes: int

n_queries: int

pooled: dict[str, float]

per_structure: dict[str, dict[str, float]]

to_dict()

JSON-serializable view of the results.

Return type: dict

summary()

Human-readable results table.

Return type: str
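
To make the shape of the container concrete, here is a minimal sketch of a dataclass with the fields documented above. ResultsSketch is a hypothetical stand-in, not the library's class: the to_dict() body assumes a plain dataclasses.asdict() conversion, and the suite name, baseline name, and metric values are made-up placeholders.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ResultsSketch:
    """Hypothetical stand-in mirroring the documented Results fields."""

    suite: str
    baseline: str
    n_episodes: int
    n_queries: int
    pooled: dict[str, float]
    # Matches the documented per_structure=<factory> default (an empty dict).
    per_structure: dict[str, dict[str, float]] = field(default_factory=dict)

    def to_dict(self) -> dict:
        # Assumption: a plain recursive dataclass-to-dict conversion.
        return asdict(self)


results = ResultsSketch(
    suite="demo-suite",        # placeholder name
    baseline="mean-baseline",  # placeholder name
    n_episodes=2,
    n_queries=8,
    pooled={"rmse": 0.5, "mae": 0.4},  # placeholder values
)

# Every field is a str, int, float, or dict, so the view serializes cleanly.
payload = json.dumps(results.to_dict(), sort_keys=True)
```

Because every field is a built-in type, the dictionary round-trips through json without a custom encoder.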

dotime.evaluation.bootstrap_ci(values, n=1000, alpha=0.05, seed=0)

Bootstrap (mean, std, ci_low, ci_high) over per-sample values.

Uses the percentile method at confidence level 1 - alpha. Returns a tuple of NaNs for an empty input, and the degenerate tuple (v, 0, v, v) for a single value v.

Return type:

tuple[float, float, float, float]

Parameters:

values (Iterable[float])
n (int)
alpha (float)
seed (int)
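
The percentile method described above can be sketched with numpy alone. This is an illustration under stated assumptions rather than the library's implementation: bootstrap_ci_sketch is a hypothetical name, n is taken to be the number of resamples, and the returned mean and std are assumed to be those of the bootstrap distribution of the mean.

```python
import numpy as np


def bootstrap_ci_sketch(values, n=1000, alpha=0.05, seed=0):
    """Percentile bootstrap of the mean: (mean, std, ci_low, ci_high)."""
    x = np.asarray(list(values), dtype=float)
    if x.size == 0:
        # Empty input: nothing to resample.
        return (float("nan"),) * 4
    if x.size == 1:
        # Single value: every resample is identical.
        v = float(x[0])
        return (v, 0.0, v, v)
    rng = np.random.default_rng(seed)
    # Draw n resamples with replacement and record each resample's mean.
    idx = rng.integers(0, x.size, size=(n, x.size))
    means = x[idx].mean(axis=1)
    # Percentile interval at confidence 1 - alpha.
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    # Assumption: report the mean and std of the bootstrap distribution.
    return float(means.mean()), float(means.std()), float(lo), float(hi)


mean, std, ci_low, ci_high = bootstrap_ci_sketch([1.0, 2.0, 3.0, 4.0, 5.0])
```

Fixing the seed makes repeated calls on the same values return identical intervals, which keeps reported confidence intervals reproducible across runs.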

dotime.evaluation.compute_mae(predictions, targets)

Mean absolute error.

Return type:

float

Parameters:

predictions (Tensor)
targets (Tensor)

dotime.evaluation.compute_nmse(predictions, targets)

Normalized MSE: MSE / Var(targets).

Equals 1.0 for a predict-the-mean baseline; values below 1.0 are better than that baseline and values above 1.0 are worse. Returns NaN when there are fewer than two targets or the target variance is approximately zero.

Return type:

float

Parameters:

predictions (Tensor)
targets (Tensor)
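
The predict-the-mean property can be checked with a small numpy sketch. The helper nmse_sketch is hypothetical and makes two assumptions that the text above does not settle: it uses the population variance (ddof=0), which is what makes the mean baseline score exactly 1.0, and it treats a variance below the made-up threshold tol as "~0". Arrays stand in for tensors.

```python
import numpy as np


def nmse_sketch(predictions, targets, tol=1e-12):
    """MSE / Var(targets); NaN for fewer than two targets or ~zero variance."""
    p = np.asarray(predictions, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    if t.size < 2:
        return float("nan")
    var = t.var()  # population variance (ddof=0), an assumption
    if var < tol:  # tol is a hypothetical threshold for "~0"
        return float("nan")
    return float(np.mean((p - t) ** 2) / var)


targets = np.array([1.0, 2.0, 3.0, 4.0])
mean_baseline = np.full_like(targets, targets.mean())

nmse_mean = nmse_sketch(mean_baseline, targets)         # predict-the-mean
nmse_perfect = nmse_sketch(targets, targets)            # exact predictions
nmse_constant = nmse_sketch([0.0, 0.0], [3.0, 3.0])     # zero-variance targets
```

Under these assumptions nmse_mean is 1.0, nmse_perfect is 0.0, and nmse_constant is NaN.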

dotime.evaluation.compute_r2(predictions, targets)

Coefficient of determination, 1 - SS_res / SS_tot.

Computed directly (no scikit-learn) so it stays in the core install. Returns NaN when the target variance is ~0.

Return type:

float

Parameters:

predictions (Tensor)
targets (Tensor)
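
As an illustration of the direct computation, the formula 1 - SS_res / SS_tot can be written in a few lines of numpy. This is a sketch, not the library's code: r2_sketch is a hypothetical name, arrays stand in for tensors, and the tol threshold for "~0" variance is an assumed value.

```python
import numpy as np


def r2_sketch(predictions, targets, tol=1e-12):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    p = np.asarray(predictions, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    if t.size == 0:
        return float("nan")
    ss_res = np.sum((t - p) ** 2)         # residual sum of squares
    ss_tot = np.sum((t - t.mean()) ** 2)  # total sum of squares
    if ss_tot / t.size < tol:             # hypothetical "~0" variance check
        return float("nan")
    return float(1.0 - ss_res / ss_tot)


targets = np.array([1.0, 2.0, 3.0, 4.0])
r2_perfect = r2_sketch(targets, targets)
r2_mean = r2_sketch(np.full_like(targets, targets.mean()), targets)
r2_constant = r2_sketch([1.0, 2.0], [5.0, 5.0])
```

Perfect predictions give 1.0 and the predict-the-mean baseline gives 0.0. When the normalized MSE is taken over the population variance, SS_res / SS_tot equals that ratio, so R² is 1 minus the NMSE.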

dotime.evaluation.compute_rmse(predictions, targets)

Root mean squared error.

Return type:

float

Parameters:

predictions (Tensor)
targets (Tensor)
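
A short worked example shows how RMSE differs from mean absolute error on the same residuals. The numbers are made up for illustration, and the formulas are written out in numpy rather than calling the library functions.

```python
import numpy as np

# Three exact predictions and one large miss (illustrative values).
predictions = np.array([1.0, 2.0, 3.0, 10.0])
targets = np.array([1.0, 2.0, 3.0, 4.0])

errors = predictions - targets               # [0, 0, 0, 6]
rmse = float(np.sqrt(np.mean(errors ** 2)))  # sqrt(36 / 4) = 3.0
mae = float(np.mean(np.abs(errors)))         # 6 / 4 = 1.5
```

Squaring before averaging weights the single large miss more heavily, so RMSE is never smaller than MAE on the same data.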

dotime.evaluation.direction_accuracy(preds, targets, eps=0.1)

Sign-consistent direction accuracy, excluding near-zero targets.

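Returns a dict with accuracy (the fraction of samples with |target| >= eps whose predicted sign matches the target sign), n_valid (the number of samples with |target| >= eps), and n_excluded (the number of samples with |target| < eps).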

Return type:

dict[str, float | int]

Parameters:

preds (Tensor)
targets (Tensor)
eps (float)
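
The masking and sign comparison can be sketched as follows. direction_accuracy_sketch is a hypothetical numpy re-statement of the description above, not the library's code; in particular, returning NaN when no target passes the eps filter is an assumption, and the sample values are made up.

```python
import numpy as np


def direction_accuracy_sketch(preds, targets, eps=0.1):
    """Fraction of sign matches among samples with |target| >= eps."""
    p = np.asarray(preds, dtype=float).ravel()
    t = np.asarray(targets, dtype=float).ravel()
    valid = np.abs(t) >= eps              # keep only clearly non-zero targets
    n_valid = int(valid.sum())
    n_excluded = int(t.size - n_valid)
    if n_valid == 0:
        accuracy = float("nan")           # assumption: undefined with no valid samples
    else:
        accuracy = float(np.mean(np.sign(p[valid]) == np.sign(t[valid])))
    return {"accuracy": accuracy, "n_valid": n_valid, "n_excluded": n_excluded}


out = direction_accuracy_sketch(
    preds=[0.5, -0.2, 0.3, 1.0],
    targets=[1.0, -1.0, -0.5, 0.05],  # the last target is below eps
)
```

Here three targets pass the filter, two of the three predicted signs match, and the near-zero target 0.05 is counted in n_excluded rather than scored.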

dotime.evaluation.evaluate(model, suite, metrics=None)

Evaluate a baseline over every episode of a suite.

Calls model.predict(episode) for each episode, pools predictions and ground-truth targets across all queries, and reports pooled and per-structure metrics.

Return type:

Results

Parameters:

model (Baseline)
suite (BenchmarkSuite)
metrics (dict[str, Callable[[Tensor, Tensor], float]] | None)
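
The pooling described above can be sketched as a plain loop. Everything in this block other than the model.predict(episode) call and the shape of the metrics mapping is hypothetical: the Episode fields (structure, context, targets), the MeanBaseline toy model, the structure names, and the evaluate_sketch helper are illustrative assumptions, and numpy arrays stand in for tensors.

```python
from collections import defaultdict
from dataclasses import dataclass

import numpy as np


@dataclass
class Episode:
    """Hypothetical episode; real suites may expose different fields."""

    structure: str
    context: np.ndarray
    targets: np.ndarray


class MeanBaseline:
    """Toy baseline that predicts the context mean for every query."""

    def predict(self, episode):
        return np.full_like(episode.targets, episode.context.mean())


def evaluate_sketch(model, episodes, metrics):
    """Pool predictions across episodes, then score pooled and per structure."""
    all_p, all_t = [], []
    groups = defaultdict(lambda: ([], []))
    for ep in episodes:
        preds = np.asarray(model.predict(ep), dtype=float)
        all_p.append(preds)
        all_t.append(ep.targets)
        groups[ep.structure][0].append(preds)
        groups[ep.structure][1].append(ep.targets)

    def score(ps, ts):
        p, t = np.concatenate(ps), np.concatenate(ts)
        return {name: fn(p, t) for name, fn in metrics.items()}

    pooled = score(all_p, all_t)
    per_structure = {s: score(ps, ts) for s, (ps, ts) in groups.items()}
    return pooled, per_structure


metrics = {"mae": lambda p, t: float(np.mean(np.abs(p - t)))}
episodes = [
    Episode("chain", np.array([1.0, 3.0]), np.array([2.0, 4.0])),
    Episode("fork", np.array([0.0, 2.0]), np.array([1.0, 1.0])),
]
pooled, per_structure = evaluate_sketch(MeanBaseline(), episodes, metrics)
```

Pooling concatenates all queries before scoring, so the pooled value weights each query equally instead of averaging per-episode scores.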