Data
How the PatentBind dataset was constructed, what it contains, and the role of synthetic negatives.
Dataset Composition
Canonical data tables extracted from medicinal chemistry patents.
Data Pipeline
From raw patent extractions to benchmark task datasets.
Step 1: Extraction — Patent SAR tables are extracted into structured JSON, capturing SMILES, assay conditions, and activity values.
Step 2: Ingestion — Raw extractions are normalised into canonical CSV tables (targets, ligands, assays, measurements). RDKit computes molecular descriptors.
Step 3: Synthetic negatives — Plausible non-binders are generated to rebalance the dataset.
Step 4: Task generation — Eight benchmark tasks are derived: classification, regression, ordinal, pairwise (affinity, LLE, ordinal), and SAR winner (ordinal, LLE).
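The exact task definitions are not spelled out in this section, so the following is only a minimal sketch of how pairwise examples might be derived from measurement rows. It assumes that a pairwise task compares two ligands measured against the same target, and that LLE is the usual lipophilic ligand efficiency, pX minus logP. The row layout, the sample values, and the names `lle` and `pairwise_examples` are hypothetical and do not come from the PatentBind pipeline.

```python
from itertools import combinations

# Hypothetical measurement rows: (target_id, ligand_id, pX, logP).
# The values are illustrative only, not taken from the dataset.
measurements = [
    ("T1", "L1", 8.2, 3.1),
    ("T1", "L2", 6.9, 1.5),
    ("T1", "L3", 7.4, 4.0),
    ("T2", "L4", 5.8, 2.2),
]


def lle(px: float, logp: float) -> float:
    """Lipophilic ligand efficiency, LLE = pX - logP."""
    return px - logp


def pairwise_examples(rows, score):
    """Yield (target, ligand_a, ligand_b, label) for ligand pairs sharing a target.

    label is 1 when ligand_a scores higher than ligand_b, else 0.
    Pairs with identical scores are skipped.
    """
    by_target = {}
    for target, ligand, px, logp in rows:
        by_target.setdefault(target, []).append((ligand, score(px, logp)))
    for target, ligands in by_target.items():
        for (a, score_a), (b, score_b) in combinations(ligands, 2):
            if score_a != score_b:
                yield target, a, b, int(score_a > score_b)


# Assumed pairwise-affinity variant: compare raw pX.
affinity_pairs = list(pairwise_examples(measurements, lambda px, logp: px))
# Assumed pairwise-LLE variant: compare pX - logP.
lle_pairs = list(pairwise_examples(measurements, lle))
```

In this toy data, L1 beats L2 on affinity but loses to it on LLE, which shows why the two pairwise variants can disagree. Target T2 has a single ligand and therefore contributes no pairs.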
Activity Scale
All activities are expressed as pX = −log₁₀(value in molar).
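As a worked illustration of the formula above, the sketch below converts a concentration to pX. The unit table and the function name `to_px` are assumptions made for this example and are not part of the dataset tooling.

```python
import math

# Multipliers that convert common concentration units to molar.
UNIT_TO_MOLAR = {"M": 1.0, "mM": 1e-3, "uM": 1e-6, "nM": 1e-9, "pM": 1e-12}


def to_px(value: float, unit: str = "nM") -> float:
    """Return pX = -log10(value in molar) for a positive concentration."""
    if value <= 0:
        raise ValueError("concentration must be positive")
    return -math.log10(value * UNIT_TO_MOLAR[unit])


print(round(to_px(10, "nM"), 2))  # 10 nM is 1e-8 M
print(round(to_px(1, "uM"), 2))   # 1 uM is 1e-6 M
```

On this scale a 10 nM measurement corresponds to pX 8 and a 1 µM measurement to pX 6, so higher pX means a more potent compound.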
Synthetic Negatives
Plausible non-binders generated from known actives to rebalance the inherently active-biased patent data.
Patent datasets are heavily biased toward active compounds. Without non-binders, classification becomes trivial and models may learn simple correlations (size, lipophilicity) rather than genuine protein–ligand interactions. Synthetic negatives recreate realistic failure modes while preserving core scaffold chemistry.
Generation Methods
Replaces small substituents (methyl, ethyl, chloro) with bulky groups (tert-butyl, phenyl, benzyl) to introduce steric clashes in the binding pocket.
Example: CH₃ → C(CH₃)₃
Adds hydrophobic substituents (alkyl chains, aromatic rings) to exceed the pocket's lipophilicity tolerance, disrupting polar interactions and increasing desolvation penalties.
Example: -OH → -OC₆H₅
Replaces functional groups critical for binding (H-bond donors, heterocycles, amides) with chemically related but interaction-disrupting groups.
Example: -NH₂ → -CH₃
Moves large substituents to positions where they weren't originally reported, creating molecules that are likely too bulky or poorly oriented for binding.
Example: R₁=Ph, R₂=H → R₁=H, R₂=Ph
Important Caveat
Synthetic negatives are not experimental measurements. They represent plausible failed designs, not confirmed non-binders. They are used only for classification and ranking tasks, and never for regression.
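One way to enforce this rule in code is to filter rows by task before training. The sketch below assumes each row carries a boolean `is_synthetic` flag; that field, the task labels, and the helper `rows_for_task` are hypothetical names chosen for illustration, not the dataset's actual schema.

```python
# Hypothetical rows. Synthetic negatives carry no measured pX.
rows = [
    {"ligand_id": "L1", "pX": 8.2, "is_synthetic": False},
    {"ligand_id": "L2", "pX": 6.9, "is_synthetic": False},
    {"ligand_id": "L1-neg1", "pX": None, "is_synthetic": True},
]

# Which task families may include synthetic negatives, per the caveat above.
ALLOWS_SYNTHETIC = {"classification": True, "ranking": True, "regression": False}


def rows_for_task(rows, task):
    """Return the rows usable for a task, dropping synthetic negatives where disallowed."""
    if ALLOWS_SYNTHETIC[task]:
        return list(rows)
    return [row for row in rows if not row["is_synthetic"]]


regression_rows = rows_for_task(rows, "regression")
classification_rows = rows_for_task(rows, "classification")
```

Filtering at load time keeps unmeasured compounds out of any loss that depends on a numeric pX value.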