Eight evaluation tasks spanning pointwise prediction, pairwise comparison, and listwise SAR winner identification.
Each task is designed to mirror a concrete decision point in a medicinal chemistry programme where predicting binding affinity is useful.
The task metrics are chosen to be interpretable as decision utility rather than as abstract benchmark scores: hit enrichment, pairwise decision correctness, and winner-selection quality.
In short: the benchmark asks whether a model helps choose better compounds to make and test at the right stage of discovery.
The benchmark uses three classes of tasks. Together they test whether models can score single compounds, compare alternatives directly, and prioritize whole analogue sets.
Pointwise (3 tasks): Predict an outcome for a single ligand-target pair (classification, regression, or ordinal rank).
Pairwise (3 tasks): Compare two ligands directly and decide which one is better on affinity, LLE, or ordinal rank.
Listwise (2 tasks): Choose the best candidate from a set of analogues, mirroring medicinal chemistry next-step decisions.
Predict an outcome for a single ligand-target pair (classification, regression, or ordinal rank).
Predict whether a ligand binds the target.
Metrics: AUROC, AUPRC, EF 1%, EF 5%
When important: Early hit-finding and triage, when many virtual or designed compounds must be filtered before assay spend.
Simulates: A go/no-go call on whether a ligand is likely to bind strongly enough to be worth progressing.
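Metric utility: AUROC and AUPRC summarise how well actives are ranked above inactives across the whole list, while EF 1% and EF 5% report how strongly actives are enriched at the very top, which is what matters for practical screening queues.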
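As an illustration of the enrichment metrics, the sketch below computes an enrichment factor at a chosen fraction of the ranked list. It assumes the common definition (hit rate in the top-scored slice divided by the overall hit rate); the function name, the rounding of the slice size, and the toy data are assumptions for this example and may differ from the benchmark's own implementation.

```python
def enrichment_factor(scores, labels, fraction):
    """Hit rate in the top-scored slice divided by the overall hit rate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Size of the top slice; always inspect at least one compound (assumption).
    n_top = max(1, int(round(n * fraction)))
    # Rank compounds from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_actives = sum(label for _, label in ranked[:n_top])
    return (top_actives / n_top) / (total_actives / n)


# Toy screen (made-up data): 100 compounds, 10 actives, 4 of them ranked in the top 5.
scores = [100 - i for i in range(100)]
active_positions = {0, 1, 2, 3, 20, 30, 40, 50, 60, 70}
labels = [1 if i in active_positions else 0 for i in range(100)]

ef_1 = enrichment_factor(scores, labels, 0.01)
ef_5 = enrichment_factor(scores, labels, 0.05)
print(f"EF 1% = {ef_1:.1f}, EF 5% = {ef_5:.1f}")
```

In this toy screen the top 5 compounds contain 4 actives, an 80% hit rate against a 10% base rate, so testing only the top 5% finds actives about eight times more often than random picking would.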
Predict continuous binding affinity (pX).
Metrics: R², RMSE, Spearman ρ
When important: Hit-to-lead and lead optimisation, when teams want estimates of absolute potency changes, not only rank order.
Simulates: Forecasting expected pX for a proposed analogue before synthesis and testing.
Metric utility: R² and RMSE measure how closely predicted values match measured pX (variance explained and typical error size), while Spearman ρ checks whether useful ordering is preserved.
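The difference between absolute-error metrics and rank metrics can be seen in a small sketch. The helper names and the four made-up compounds below are assumptions for illustration; the formulas are the standard ones (R² as one minus residual over total sum of squares, Spearman ρ as the Pearson correlation of average ranks).

```python
import math


def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman_rho(y_true, y_pred):
    """Pearson correlation of the ranks (undefined if either input is constant)."""
    rx, ry = average_ranks(y_true), average_ranks(y_pred)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Made-up example: every prediction is exactly one log unit too high.
measured_px = [5.0, 6.0, 7.0, 8.0]
predicted_px = [6.0, 7.0, 8.0, 9.0]

print(f"RMSE = {rmse(measured_px, predicted_px):.2f}")
print(f"R2 = {r_squared(measured_px, predicted_px):.2f}")
print(f"Spearman rho = {spearman_rho(measured_px, predicted_px):.2f}")
```

Here the ordering is perfect (ρ of 1) while the constant one-unit offset gives an RMSE of 1 and an R² of only 0.2, which is why the task reports both kinds of metric.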
Predict potency bin (rank 1–5).
Metrics: Exact Accuracy, Adjacent Accuracy, Concordance Index, Kendall τ
When important: Programme stages where assay outputs are bucketed (for example patent-style bins) rather than reported as precise IC50 values.
Simulates: Choosing compounds using coarse potency brackets when exact numerical readouts are unavailable.
Metric utility: Exact and adjacent accuracy are interpretable for medicinal chemistry triage; the concordance index and Kendall τ assess ordering quality.
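A minimal sketch of the two accuracy metrics follows. It assumes that "adjacent" means the predicted bin is within one bin of the true bin; that tolerance, the function names, and the example bins are assumptions for this illustration.

```python
def exact_accuracy(true_bins, pred_bins):
    """Fraction of compounds whose predicted bin equals the true bin."""
    return sum(t == p for t, p in zip(true_bins, pred_bins)) / len(true_bins)


def adjacent_accuracy(true_bins, pred_bins, tolerance=1):
    """Fraction of compounds predicted within `tolerance` bins of the true bin."""
    return sum(abs(t - p) <= tolerance for t, p in zip(true_bins, pred_bins)) / len(true_bins)


# Made-up potency bins on the 1-5 scale.
true_bins = [1, 2, 3, 4, 5]
pred_bins = [1, 3, 3, 2, 4]

print(exact_accuracy(true_bins, pred_bins))     # 2 of 5 bins match exactly
print(adjacent_accuracy(true_bins, pred_bins))  # 4 of 5 are within one bin
```

Under this reading, adjacent accuracy forgives near misses across a bin boundary but still penalises the compound predicted two bins away.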
Compare two ligands directly and decide which one is better on affinity, LLE, or ordinal rank.
Which ligand binds more strongly?
Metrics: Pairwise Accuracy, Kendall τ
When important: Routine analogue prioritisation when chemists compare two close modifications in the same series.
Simulates: The direct decision: which of these two compounds should be made or tested first for potency.
Metric utility: Pairwise accuracy directly maps to decision correctness for binary choose-A-or-B comparisons.
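Pairwise accuracy can be sketched as the fraction of ligand pairs for which the model prefers the same ligand as the measured data. The tuple layout and the tie handling below (pairs with equal measured values are skipped, predicted ties count as wrong) are assumptions for this example, not the benchmark's documented rules.

```python
def pairwise_accuracy(pairs):
    """pairs: iterable of (pred_a, pred_b, true_a, true_b) affinity values."""
    correct = 0
    counted = 0
    for pred_a, pred_b, true_a, true_b in pairs:
        if true_a == true_b:
            continue  # assumption: no measured winner, so the pair is skipped
        counted += 1
        # Same sign means the model prefers the ligand that is truly stronger.
        if (pred_a - pred_b) * (true_a - true_b) > 0:
            correct += 1
    if counted == 0:
        raise ValueError("no pairs with a defined winner")
    return correct / counted


# Made-up pX values for four ligand pairs.
pairs = [
    (7.1, 6.4, 7.5, 6.0),  # model prefers A, A is stronger: correct
    (5.9, 6.3, 6.8, 6.1),  # model prefers B, A is stronger: wrong
    (8.0, 7.2, 7.9, 7.0),  # correct
    (6.5, 6.5, 6.0, 6.9),  # predicted tie: counted as wrong (assumption)
]
print(pairwise_accuracy(pairs))
```

The result reads directly as the share of choose-A-or-B decisions the model would have got right, here 2 of 4.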
Which ligand has better lipophilic efficiency?
Metrics: Pairwise Accuracy, Kendall τ
When important: Lead optimisation when potency must be balanced against lipophilicity and developability risk.
Simulates: Choosing between two analogues where one may be potent but too greasy and the other more efficient.
Metric utility: Pairwise LLE accuracy reveals whether the model can prioritise potency gained per unit of lipophilicity rather than raw potency alone.
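To make the potency-versus-efficiency trade-off concrete, the sketch below assumes the widely used definition LLE = pX − logP. The text above does not state which lipophilicity measure the benchmark uses (logP, logD, calculated or measured), so treat the formula, the function names, and the two made-up analogues as assumptions.

```python
def lle(px, logp):
    """Lipophilic ligand efficiency under the assumed definition pX - logP."""
    return px - logp


def better_by_lle(ligand_a, ligand_b):
    """Each ligand is a (pX, logP) tuple; returns 'a', 'b', or 'tie'."""
    lle_a, lle_b = lle(*ligand_a), lle(*ligand_b)
    if lle_a > lle_b:
        return "a"
    if lle_b > lle_a:
        return "b"
    return "tie"


# Made-up analogues: A is more potent but much greasier than B.
analogue_a = (8.0, 5.0)  # pX 8.0, logP 5.0
analogue_b = (7.2, 2.5)  # pX 7.2, logP 2.5

print(f"LLE A = {lle(*analogue_a):.1f}, LLE B = {lle(*analogue_b):.1f}")
print("more potent:", "a" if analogue_a[0] > analogue_b[0] else "b")
print("better LLE:", better_by_lle(analogue_a, analogue_b))
```

Analogue A wins on raw potency, but B wins on LLE (about 4.7 against 3.0), so a model that only tracks potency would get this pair wrong.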
Which ligand has a better activity rank?
Metrics: Pairwise Accuracy
When important: SAR table review using bucketed activity labels, especially in early and mid-cycle design reviews.
Simulates: A quick side-by-side call between two analogues when only ordinal strength information is available.
Metric utility: Pairwise accuracy is easy to interpret as percent correct in realistic head-to-head ranking decisions.
Choose the best candidate from a set of analogues, mirroring medicinal chemistry next-step decisions.
Identify the potency-improving modification.
Metrics: Top-1 Accuracy, Kendall τ
When important: Hit-to-lead and lead optimisation planning, when selecting a small set of analogues to synthesize next.
Simulates: Selecting the likely best analogue from a mini design set under practical make/test constraints.
Metric utility: Top-1 accuracy reports how often the model's top-ranked analogue is the measured best in its set, that is, the compound a chemist would want to prioritise next.
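Top-1 accuracy over analogue sets can be sketched as follows. The data layout (each set is a list of predicted and measured values) and the tie rule (the first maximum wins) are assumptions for this illustration.

```python
def top1_accuracy(analogue_sets):
    """analogue_sets: list of sets, each a list of (predicted, measured) tuples."""
    if not analogue_sets:
        raise ValueError("need at least one analogue set")
    hits = 0
    for candidates in analogue_sets:
        indices = range(len(candidates))
        # max() returns the first maximum, so ties go to the earlier analogue.
        predicted_winner = max(indices, key=lambda i: candidates[i][0])
        true_winner = max(indices, key=lambda i: candidates[i][1])
        hits += predicted_winner == true_winner
    return hits / len(analogue_sets)


# Made-up design sets of (predicted pX, measured pX).
analogue_sets = [
    [(7.0, 6.8), (7.5, 7.9), (6.2, 6.0)],              # winner found
    [(8.1, 7.0), (7.9, 7.6)],                          # winner missed
    [(6.0, 6.1), (6.4, 6.9), (6.3, 6.5), (5.8, 5.9)],  # winner found
]
print(f"Top-1 accuracy = {top1_accuracy(analogue_sets):.2f}")
```

Each set contributes one decision regardless of its size, so the score counts design rounds won rather than compounds ranked.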
Identify the LLE-improving modification.
Metrics: Top-1 Accuracy, Kendall τ
When important: Later lead optimisation when multi-parameter quality matters and potency alone is not enough.
Simulates: Picking the best analogue under potency-efficiency pressure (harder but closer to real project utility).
Metric utility: Top-1 LLE winner accuracy and rank correlation show whether recommendations remain useful under stricter optimisation objectives.
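Rank correlation within one analogue set can be illustrated with Kendall τ. The sketch uses the τ-a variant (concordant minus discordant pairs over all pairs, with ties counted as neither); whether the benchmark uses τ-a or τ-b is not stated above, so the variant, the function name, and the LLE values are assumptions.

```python
from itertools import combinations


def kendall_tau_a(y_true, y_pred):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(y_true)
    if n < 2:
        raise ValueError("need at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        product = (y_true[i] - y_true[j]) * (y_pred[i] - y_pred[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # product == 0 is a tie and counts as neither
    return (concordant - discordant) / (n * (n - 1) // 2)


# Made-up LLE values for one set of four analogues.
measured_lle = [3.0, 4.7, 4.1, 2.2]
predicted_lle = [3.4, 4.0, 4.5, 2.0]

tau = kendall_tau_a(measured_lle, predicted_lle)
top1_hit = (predicted_lle.index(max(predicted_lle))
            == measured_lle.index(max(measured_lle)))
print(f"Kendall tau-a = {tau:.2f}, top-1 hit = {top1_hit}")
```

In this set five of six pairs are ordered correctly (τ of about 0.67), yet the single swapped pair involves the two best analogues, so the top-1 pick is missed. Reporting both metrics separates "mostly right ordering" from "right winner".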