Benchmark Tasks

Seven evaluation tasks spanning pointwise prediction, pairwise comparison, and listwise SAR winner identification.

Why these tasks exist

Each task is designed to mirror a concrete decision point in a medicinal chemistry programme where predicting binding affinity is useful.

Rather than rewarding abstract benchmark scores, the task metrics are chosen so that each can be read as decision utility: hit enrichment, pairwise decision correctness, and winner-selection quality.

In short: the benchmark asks whether a model helps choose better compounds to make and test at the right stage of discovery.

Task Classes

The benchmark uses three classes of tasks. Together they test whether models can score single compounds, compare alternatives directly, and prioritise whole analogue sets.

Pointwise Tasks

Predict an outcome for a single ligand-target pair (classification or regression).

2 tasks

Pairwise Tasks

Compare two ligands directly and decide which one is better on affinity, lipophilic efficiency (LLE), or ordinal rank.

3 tasks

Listwise Tasks

Choose the best candidate from a set of analogues, mirroring medicinal chemistry next-step decisions.

2 tasks

Tasks by Category

Pointwise Tasks

01
Binary Classification
Predict whether a ligand binds the target.
Metrics: AUROC, AUPRC, EF 1%, EF 5%
When important: Early hit-finding and triage, when many virtual or designed compounds must be filtered before assay spend.
Simulates: A go/no-go call on whether a ligand is likely to bind strongly enough to be worth progressing.
Metric utility: AUROC summarises ranking quality over the whole list, while AUPRC and EF at 1% and 5% emphasise how well actives are enriched near the top, which is what matters for practical screening queues.
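As an illustration of the enrichment idea, here is a minimal sketch of an enrichment factor calculation. The function name `enrichment_factor`, the rounding rule for the top slice, and the toy data are assumptions made for this example; the benchmark's own implementation may differ.

```python
def enrichment_factor(scores, labels, fraction):
    """Hit rate in the top `fraction` of the ranked list divided by the overall hit rate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = len(scores)
    total_actives = sum(labels)
    if n == 0 or total_actives == 0:
        raise ValueError("need at least one compound and one active")
    # Rank compounds by predicted score, best first.
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    # Assumption: round to the nearest compound, but always inspect at least one.
    n_top = max(1, int(round(fraction * n)))
    actives_top = sum(labels[i] for i in order[:n_top])
    return (actives_top / n_top) / (total_actives / n)


# Toy screen (made-up data): 10 compounds, 2 actives, both ranked first.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
ef_20 = enrichment_factor(scores, labels, 0.2)
print(ef_20)
```

An EF of 1 means the top slice is no richer in actives than a random pick; larger values mean the screening queue is front-loaded with binders.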
02
Regression
Predict continuous binding affinity as pX, the negative log10 of the measured affinity value (for example pIC50, pKi, or pKd).
Metrics: R², RMSE, Spearman ρ
When important: Hit-to-lead and lead optimisation, when teams want estimates of absolute potency changes, not only rank order.
Simulates: Forecasting expected pX for a proposed analogue before synthesis and testing.
Metric utility: R² and RMSE measure how closely predicted pX values match measured ones, while Spearman ρ checks whether useful ordering is preserved.
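The difference between absolute accuracy and rank order can be seen in a small standard-library sketch. The helper names and the toy pX values below are illustrative assumptions, not the benchmark's code.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as pX."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def average_ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(y_true, y_pred):
    """Spearman rho as the Pearson correlation of the rank vectors."""
    rt, rp = average_ranks(y_true), average_ranks(y_pred)
    mt, mp = sum(rt) / len(rt), sum(rp) / len(rp)
    cov = sum((a - mt) * (b - mp) for a, b in zip(rt, rp))
    var_t = sum((a - mt) ** 2 for a in rt)
    var_p = sum((b - mp) ** 2 for b in rp)
    return cov / math.sqrt(var_t * var_p)


# Made-up example: every prediction is one log unit too high.
measured = [5.0, 6.0, 7.0, 8.0]
predicted = [6.0, 7.0, 8.0, 9.0]
print(rmse(measured, predicted), r_squared(measured, predicted), spearman(measured, predicted))
```

In this toy case the ordering is perfect (ρ = 1) while RMSE is a full log unit and R² is low, which is why the task reports both kinds of metric.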

Pairwise Tasks

01
Pairwise Affinity
Which ligand binds more strongly?
Metrics: Pairwise Accuracy, Kendall τ
When important: Routine analogue prioritisation when chemists compare two close modifications in the same series.
Simulates: The direct decision: which of these two compounds should be made or tested first for potency.
Metric utility: Pairwise accuracy directly maps to decision correctness for binary choose-A-or-B comparisons.
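A minimal sketch of pairwise accuracy, assuming each pair carries measured and predicted pX values for ligands A and B. The tie handling (measured ties skipped, predicted ties counted as wrong) is an assumption for this example; the benchmark's convention is not stated here.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def pairwise_accuracy(measured_pairs, predicted_pairs):
    """Fraction of pairs where the predicted stronger binder is the measured one."""
    correct = 0
    counted = 0
    for (m_a, m_b), (p_a, p_b) in zip(measured_pairs, predicted_pairs):
        truth = sign(m_a - m_b)
        if truth == 0:
            continue  # assumption: no measured winner, so the pair is skipped
        counted += 1
        # A predicted tie has sign 0 and therefore never matches the truth.
        if sign(p_a - p_b) == truth:
            correct += 1
    if counted == 0:
        raise ValueError("no pairs with a measured winner")
    return correct / counted


# Made-up pX values: (ligand A, ligand B) per pair.
measured_pairs = [(7.2, 6.5), (5.9, 6.8), (8.1, 7.9), (6.0, 6.0)]
predicted_pairs = [(7.0, 6.9), (6.4, 6.1), (7.5, 7.2), (6.3, 6.1)]
acc = pairwise_accuracy(measured_pairs, predicted_pairs)
print(acc)
```

Here two of the three decidable pairs are called correctly, so the score reads directly as "two out of three choose-A-or-B decisions were right".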
02
Pairwise LLE
Which ligand has better lipophilic efficiency?
Metrics: Pairwise Accuracy, Kendall τ
When important: Lead optimisation when potency must be balanced against lipophilicity and developability risk.
Simulates: Choosing between two analogues where one may be potent but too greasy and the other more efficient.
Metric utility: Pairwise LLE accuracy shows whether the model can prioritise lipophilic efficiency (potency relative to lipophilicity) over raw potency alone.
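The sketch below uses the common definition LLE = pX − logP to show how an affinity winner and an LLE winner can differ. Whether the benchmark uses logP or logD, and how it handles ties, are assumptions here; the compound values are made up.

```python
def lle(px, logp):
    """Lipophilic efficiency under the common definition pX - logP (assumed)."""
    return px - logp


def winner(value_a, value_b):
    """Return 'A', 'B', or 'tie' for a higher-is-better comparison."""
    if value_a > value_b:
        return "A"
    if value_b > value_a:
        return "B"
    return "tie"


# Hypothetical analogues: A is more potent but much more lipophilic.
ligand_a = {"px": 8.0, "logp": 5.0}
ligand_b = {"px": 7.2, "logp": 2.5}

affinity_winner = winner(ligand_a["px"], ligand_b["px"])
lle_winner = winner(
    lle(ligand_a["px"], ligand_a["logp"]),
    lle(ligand_b["px"], ligand_b["logp"]),
)
print(affinity_winner, lle_winner)
```

Ligand A wins on affinity while ligand B wins on LLE, which is the "potent but too greasy" situation this task is meant to probe.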
03
Pairwise Ordinal
Which ligand has a better activity rank?
Metrics: Pairwise Accuracy
When important: SAR table review using bucketed activity labels, especially in early and mid-cycle design reviews.
Simulates: A quick side-by-side call between two analogues when only ordinal strength information is available.
Metric utility: Pairwise accuracy is easy to interpret as percent correct in realistic head-to-head ranking decisions.

Listwise Tasks

01
SAR Winner (Ordinal)
Identify the potency-improving modification.
Metrics: Top-1 Accuracy, Kendall τ
When important: Hit-to-lead and lead optimisation planning, when selecting a small set of analogues to synthesize next.
Simulates: Selecting the likely best analogue from a mini design set under practical make/test constraints.
Metric utility: Top-1 winner accuracy reflects whether the model picks the same next compound a chemist would want to prioritise.
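A minimal sketch of Top-1 accuracy over analogue sets, assuming each set is given as parallel lists of measured and predicted values. Counting a pick as correct when it matches any measured co-winner is an assumption of this example.

```python
def top1_accuracy(groups):
    """groups: list of (measured, predicted) lists, one pair of lists per analogue set."""
    if not groups:
        raise ValueError("need at least one analogue set")
    hits = 0
    for measured, predicted in groups:
        # The model's pick is the analogue with the highest predicted value.
        pick = max(range(len(predicted)), key=lambda i: predicted[i])
        # Assumption: a pick is correct if it is (one of) the measured best.
        if measured[pick] == max(measured):
            hits += 1
    return hits / len(groups)


# Two made-up analogue sets of three compounds each.
groups = [
    ([6.1, 7.4, 6.8], [6.0, 7.0, 6.9]),  # model picks index 1, the measured best
    ([5.5, 5.9, 6.3], [6.2, 5.8, 6.0]),  # model picks index 0, not the measured best
]
top1 = top1_accuracy(groups)
print(top1)
```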
02
SAR Winner (LLE)
Identify the LLE-improving modification.
Metrics: Top-1 Accuracy, Kendall τ
When important: Later lead optimisation when multi-parameter quality matters and potency alone is not enough.
Simulates: Picking the best analogue under potency-efficiency pressure (harder but closer to real project utility).
Metric utility: Top-1 LLE winner accuracy and rank correlation show whether recommendations remain useful under stricter optimisation objectives.
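To show how rank correlation complements Top-1 accuracy, here is a small sketch of Kendall τ within one analogue set. It implements the τ-a variant with no tie correction; which variant the benchmark reports is not stated here, so treat that choice and the LLE values as assumptions.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    n = len(x)
    if n < 2 or n != len(y):
        raise ValueError("need two equal-length lists with at least two items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
        # Pairs tied in either list contribute to neither count.
    return (concordant - discordant) / (n * (n - 1) // 2)


# Made-up LLE values for one analogue set of four compounds.
measured_lle = [3.0, 4.7, 4.1, 2.2]
predicted_lle = [3.2, 4.0, 4.4, 2.5]
tau = kendall_tau_a(measured_lle, predicted_lle)
print(tau)
```

In this toy set the model misses the Top-1 winner (it swaps the two best analogues) yet orders five of the six pairs correctly, so τ stays clearly positive. Reporting both metrics separates "wrong winner, sensible ranking" from a ranking that is poor throughout.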
