Data

How the PatentBind dataset was constructed, what it contains, and the role of synthetic negatives.

Dataset Composition

Canonical data tables extracted from medicinal chemistry patents.

Targets

PKMYT1 kinase (primary), CDK1 (sparse)

Ligands

600+

Small molecules with SMILES, InChIKey, descriptors

Measurements

1000+

IC₅₀ values with censoring flags and ordinal ranks

Assays

Biochemical IC₅₀ assays with full condition metadata

Patents

WO2024112853A1 — PKMYT1 inhibitors

Synthetic Ligands

25+

Generated negatives for rebalancing
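The censoring flags attached to measurements can be illustrated with a small parser. This is a minimal sketch, assuming raw activity values arrive as strings such as "> 10 µM" or "35 nM"; the function name `parse_ic50`, the `ParsedIC50` fields, and the unit table are hypothetical and not taken from the PatentBind code.

```python
import re
from dataclasses import dataclass

# Hypothetical unit table; the real pipeline may accept other units.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "\u00b5M": 1e-6, "nM": 1e-9, "pM": 1e-12}

# Optional qualifier (< or >), a number, then a unit written with letters,
# the micro sign (U+00B5) or the Greek letter mu (U+03BC).
_PATTERN = re.compile(
    r"^\s*([<>]?)=?\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)\s*([a-zA-Z\u00b5\u03bc]+)\s*$"
)


@dataclass
class ParsedIC50:
    value_molar: float  # concentration converted to mol/L
    censor: str         # "=" exact, ">" true value is higher, "<" true value is lower


def parse_ic50(text: str) -> ParsedIC50:
    """Split a raw IC50 string into a molar value and a censoring flag."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised IC50 string: {text!r}")
    qualifier, number, unit = match.groups()
    unit = unit.replace("\u03bc", "\u00b5")  # normalise Greek mu to the micro sign
    if unit not in UNIT_TO_MOLAR:
        raise ValueError(f"unknown unit in {text!r}")
    return ParsedIC50(float(number) * UNIT_TO_MOLAR[unit], qualifier or "=")


right_censored = parse_ic50("> 10 µM")
print(right_censored.censor)  # ">"
```

Keeping the qualifier as a separate field means a value reported as "> 10 µM" is stored as a lower bound on the IC₅₀ and is not mistaken for an exact measurement.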

Data Pipeline

From raw patent extractions to benchmark task datasets.

Raw Extractions → Ingest & Validate → Canonical CSVs → Generate Tasks → 7 Task Datasets

Step 1: Extraction — Patent SAR tables are extracted into structured JSON, capturing SMILES, assay conditions, and activity values.

Step 2: Ingestion — Raw extractions are normalised into canonical CSV tables (targets, ligands, assays, measurements). RDKit computes molecular descriptors.

Step 3: Synthetic negatives — Plausible non-binders are generated to rebalance the dataset.

Step 4: Task generation — Seven benchmark tasks are derived: classification, regression, pairwise (affinity, LLE, ordinal), and SAR winner (ordinal, LLE).
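The pairwise tasks from Step 4 can be sketched as follows. This is an illustration under stated assumptions, not the project's task generator: LLE is taken to mean lipophilic ligand efficiency, computed with the common definition pX − logP; the record fields (`ligand_id`, `px`, `logp`) and the `min_gap` threshold are hypothetical.

```python
from itertools import combinations


def lle(px: float, logp: float) -> float:
    """Lipophilic ligand efficiency, assumed here to be pX - logP."""
    return px - logp


def pairwise_examples(records, score, min_gap=0.5):
    """Build (id_a, id_b, label) tuples; label is 1 when A scores higher than B.

    Pairs whose scores differ by less than min_gap are skipped, on the
    assumption that such small gaps fall within assay noise.
    """
    pairs = []
    for a, b in combinations(records, 2):
        gap = score(a) - score(b)
        if abs(gap) < min_gap:
            continue
        pairs.append((a["ligand_id"], b["ligand_id"], 1 if gap > 0 else 0))
    return pairs


records = [
    {"ligand_id": "L1", "px": 8.0, "logp": 4.5},
    {"ligand_id": "L2", "px": 7.0, "logp": 2.0},
    {"ligand_id": "L3", "px": 6.8, "logp": 3.0},
]

affinity_pairs = pairwise_examples(records, score=lambda r: r["px"])
lle_pairs = pairwise_examples(records, score=lambda r: lle(r["px"], r["logp"]))
print(affinity_pairs)  # [('L1', 'L2', 1), ('L1', 'L3', 1)]
print(lle_pairs)       # [('L1', 'L2', 0), ('L2', 'L3', 1)]
```

The example shows why affinity and LLE are worth separating into different tasks: L1 is the more potent compound of the L1/L2 pair, but L2 wins once lipophilicity is taken into account.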

Activity Scale

All activities are expressed as pX = −log₁₀(value in mol/L), so an IC₅₀ of 1 µM corresponds to pX = 6.0.

Binder: pX ≥ 6.0 (≤ 1 µM)

Gray Zone: 5.0 < pX < 6.0 (1–10 µM, excluded)

Non-binder: pX ≤ 5.0 (≥ 10 µM)
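The activity scale above translates directly into code. The sketch below converts a molar concentration to pX and applies the three thresholds; the function names and class labels are illustrative choices, not identifiers from the dataset.

```python
import math

BINDER_MIN_PX = 6.0      # pX >= 6.0, i.e. 1 µM or more potent
NON_BINDER_MAX_PX = 5.0  # pX <= 5.0, i.e. 10 µM or weaker


def to_px(value_molar: float) -> float:
    """Convert a concentration in mol/L to pX = -log10(value)."""
    if value_molar <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value_molar)


def activity_class(px: float) -> str:
    """Map a pX value onto the binder / gray-zone / non-binder scale."""
    if px >= BINDER_MIN_PX:
        return "binder"
    if px <= NON_BINDER_MAX_PX:
        return "non-binder"
    return "gray-zone"  # excluded from the classification labels


print(round(to_px(3.5e-8), 2))        # 7.46
print(activity_class(to_px(3.5e-8)))  # binder
print(activity_class(5.5))            # gray-zone
```

Concentrations that sit exactly on a threshold (1 µM or 10 µM) are sensitive to floating-point rounding in the conversion, so a real pipeline might round pX to a fixed number of decimals before classifying.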

Synthetic Negatives

Plausible non-binders generated from known actives to rebalance the inherently active-biased patent data.

Patents report mostly active compounds, so the extracted data contain few non-binders. Without non-binders, classification becomes trivial and models may learn simple correlations (size, lipophilicity) rather than genuine protein–ligand interactions. Synthetic negatives recreate realistic failure modes while preserving core scaffold chemistry.

Generation Methods

R-group Inflation

Replaces small substituents (methyl, ethyl, chloro) with bulky groups (tert-butyl, phenyl, benzyl) to introduce steric clashes in the binding pocket.

CH₃ → C(CH₃)₃

Lipophilicity Inflation

Adds hydrophobic substituents (alkyl chains, aromatic rings) to exceed the pocket's lipophilicity tolerance, disrupting polar interactions and increasing desolvation penalties.

—OH → —OC₆H₅

SAR-breaking Substitutions

Replaces functional groups critical for binding (H-bond donors, heterocycles, amides) with chemically related but interaction-disrupting groups.

—NH₂ → —CH₃

Substituent Permutation

Moves large substituents to positions where the patent does not report them, creating molecules that are likely too bulky or poorly oriented to bind.

R₁=Ph, R₂=H → R₁=H, R₂=Ph
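Two of the generation methods above, R-group inflation and substituent permutation, can be sketched at the level of an R-group table. This is a simplified illustration that works on substituent names rather than on molecular structures; a real generator would presumably edit SMILES with a cheminformatics toolkit. The lookup tables, the pairing of each small group with a bulky one, and the function names are all assumptions.

```python
# Hypothetical lookup tables built from the substituents named in the text.
# Only methyl -> tert-butyl is given as an example there; the other pairings
# are arbitrary choices for this sketch.
INFLATE = {"methyl": "tert-butyl", "ethyl": "phenyl", "chloro": "benzyl"}
BULKY = {"tert-butyl", "phenyl", "benzyl"}


def inflate_r_groups(r_groups: dict) -> list:
    """Return one candidate per small substituent, swapped for a bulky group."""
    candidates = []
    for position, substituent in r_groups.items():
        if substituent in INFLATE:
            candidate = dict(r_groups)
            candidate[position] = INFLATE[substituent]
            candidates.append(candidate)
    return candidates


def permute_substituents(r_groups: dict, observed: set) -> list:
    """Swap each bulky substituent into positions where it was never reported.

    observed holds (position, substituent) pairs seen in the source data.
    """
    candidates = []
    for src, substituent in r_groups.items():
        if substituent not in BULKY:
            continue
        for dst, other in r_groups.items():
            if dst == src or (dst, substituent) in observed:
                continue
            candidate = dict(r_groups)
            candidate[src], candidate[dst] = other, substituent
            if candidate not in candidates:  # avoid duplicate swaps
                candidates.append(candidate)
    return candidates


observed = {("R1", "phenyl"), ("R2", "H")}
print(inflate_r_groups({"R1": "methyl", "R2": "H"}))
# [{'R1': 'tert-butyl', 'R2': 'H'}]
print(permute_substituents({"R1": "phenyl", "R2": "H"}, observed))
# [{'R1': 'H', 'R2': 'phenyl'}]
```

Checking candidates against the observed pairs keeps the permutation from recreating a compound that the patent actually reports, which could well be active.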

Important Caveat

Synthetic negatives are not experimental measurements. They represent plausible failed designs, not confirmed non-binders. They are used only for classification and ranking tasks, and never for regression, since they carry no measured IC₅₀ value.
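One way to enforce this rule in code is to filter rows by task kind before a task dataset is built. The sketch below assumes each row carries a boolean `is_synthetic` field and that tasks are grouped into kinds named "classification", "ranking", and "regression"; both are hypothetical conventions.

```python
# Task kinds that may include synthetic negatives (assumed naming).
SYNTHETIC_ALLOWED = {"classification", "ranking"}


def select_rows(rows, task_kind):
    """Drop synthetic ligands unless the task kind explicitly allows them."""
    if task_kind in SYNTHETIC_ALLOWED:
        return list(rows)
    # Any other kind, including regression, sees measured compounds only.
    return [row for row in rows if not row.get("is_synthetic", False)]


rows = [
    {"ligand_id": "L1", "is_synthetic": False},
    {"ligand_id": "S1", "is_synthetic": True},
]
print(len(select_rows(rows, "classification")))  # 2
print(len(select_rows(rows, "regression")))      # 1
```

Excluding synthetic rows by default means a newly added task kind cannot pick up unmeasured compounds by accident.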