Frozen Benchmark Suites

DoTime ships four versioned, immutable suites for reproducible evaluation. Each has a Zenodo DOI and Croissant metadata.

Suites

dot-Identifiability-v1 — ~10.8k trajectories across eight named structures: back_door, observed_confounder, confounder_mediator (back-door family); front_door, mediator (front-door family); instrumental_variable (IV); rct_no_confounding (trivially identified); unobserved_confounder (non-identifiable, robustness check). Counterfactuals are exact.
dot-RegimeSwitch-v1 — regime-switching trajectories with controllable break density.
dot-Continuous-v1 — continuous-time intervention windows, multiple query offsets.
dot-Generic-100k — 100 000 trajectories from the full diverse prior. Training-scale.

Loader

from dotime.benchmarks import load_benchmark

suite = load_benchmark("dot-Identifiability-v1", version="1.0.0")

On first access the suite is fetched into ~/.cache/dotime/ — from the Hugging Face mirror (thummd/dot-*) by default, falling back to the Zenodo archive of record (DOIs 10.5281/zenodo.20846064, .20846074, .20845981, .20845983) — and md5-verified against the manifest. Pass force_download=True to re-fetch. Override the cache with $DOTIME_CACHE or cache_dir=.

Evaluation protocol

The default evaluation reports RMSE, NMSE, MAE, direction accuracy, lift-over-naive, and effect-error correlation, computed per-structure and pooled.

from dotime.evaluation import evaluate

results = evaluate(model, suite)

See the API Reference reference for the full benchmarks, baselines, and evaluation module documentation.