PatentBind

Data

How the PatentBind dataset was constructed, what it contains, and the role of synthetic negatives.

Dataset Composition

Canonical data tables extracted from medicinal chemistry patents.

Targets (2): PKMYT1 kinase (primary), CDK1 (sparse)
Ligands (600+): Small molecules with SMILES, InChIKey, descriptors
Measurements (1000+): IC₅₀ values with censoring flags and ordinal ranks
Assays (2): Biochemical IC₅₀ assays with full condition metadata
Patents (1): WO2024112853A1, PKMYT1 inhibitors
Synthetic Ligands (25+): Generated negatives for rebalancing

Data Pipeline

From raw patent extractions to benchmark task datasets.

Raw Extractions → Ingest & Validate → Canonical CSVs → Generate Tasks → 8 Task Datasets

Step 1: Extraction — Patent SAR tables are extracted into structured JSON, capturing SMILES, assay conditions, and activity values.

Step 2: Ingestion — Raw extractions are normalised into canonical CSV tables (targets, ligands, assays, measurements). RDKit computes molecular descriptors.

Step 3: Synthetic negatives — Plausible non-binders are generated to rebalance the dataset.

Step 4: Task generation — Eight benchmark tasks are derived: classification, regression, ordinal, pairwise (affinity, LLE, ordinal), and SAR winner (ordinal, LLE).
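The pairwise tasks can be pictured with a minimal sketch. It assumes LLE is the standard lipophilic ligand efficiency (pX minus a calculated logP) and that each pair is labelled by which ligand scores higher. The field names (`id`, `px`, `clogp`), the helper `pairwise_labels`, and the `min_gap` tie threshold are hypothetical and are not taken from the PatentBind code.

```python
from itertools import combinations


def lle(px: float, clogp: float) -> float:
    """Lipophilic ligand efficiency: potency corrected for lipophilicity."""
    return px - clogp


def pairwise_labels(ligands, key, min_gap=0.0):
    """Return (id_a, id_b, label) tuples; label is 1 if ligand a scores higher.

    Pairs whose scores differ by min_gap or less are skipped as ties.
    """
    pairs = []
    for a, b in combinations(ligands, 2):
        diff = a[key] - b[key]
        if abs(diff) <= min_gap:
            continue  # no clear winner, so no training pair
        pairs.append((a["id"], b["id"], 1 if diff > 0 else 0))
    return pairs


# Illustrative values only, not rows from the dataset.
ligands = [
    {"id": "L1", "px": 7.5, "clogp": 4.0},
    {"id": "L2", "px": 6.5, "clogp": 2.0},
    {"id": "L3", "px": 6.5, "clogp": 3.5},
]
for lig in ligands:
    lig["lle"] = lle(lig["px"], lig["clogp"])

affinity_pairs = pairwise_labels(ligands, "px")
lle_pairs = pairwise_labels(ligands, "lle")
```

In this toy example L1 wins both of its affinity comparisons but loses to L2 on LLE, which is why the two pairwise variants can give different labels for the same pair of molecules.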

Activity Scale

All activities are expressed as pX = −log₁₀(value in molar).

Binder: pX ≥ 6.0 (≤ 1 µM)
Gray Zone: 5.0 < pX < 6.0 (1–10 µM, excluded)
Non-binder: pX ≤ 5.0 (≥ 10 µM)
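The conversion and the three activity classes can be expressed in a few lines. This is a minimal sketch of the thresholds stated above; the function names `to_px` and `activity_class` and the label strings are illustrative assumptions, not identifiers from the PatentBind code.

```python
import math

BINDER_MIN_PX = 6.0      # pX >= 6.0 corresponds to a value <= 1 µM
NON_BINDER_MAX_PX = 5.0  # pX <= 5.0 corresponds to a value >= 10 µM


def to_px(value_molar: float) -> float:
    """Convert an activity value in molar units to pX = -log10(value)."""
    if value_molar <= 0:
        raise ValueError("activity value must be positive")
    return -math.log10(value_molar)


def activity_class(px: float) -> str:
    """Map a pX value to binder, gray-zone, or non-binder."""
    if px >= BINDER_MIN_PX:
        return "binder"
    if px <= NON_BINDER_MAX_PX:
        return "non-binder"
    return "gray-zone"  # 5.0 < pX < 6.0, excluded per the scale above


# Illustrative values: 50 nM, 3 µM, and 25 µM expressed in molar.
examples = [activity_class(to_px(v)) for v in (50e-9, 3e-6, 25e-6)]
```

Values are assumed to be converted to molar before calling `to_px`; an IC₅₀ reported in nM would first be multiplied by 1e-9.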

Synthetic Negatives

Plausible non-binders generated from known actives to rebalance the inherently active-biased patent data.

Patent datasets are heavily biased toward active compounds. Without non-binders, classification becomes trivial and models may learn simple correlations (size, lipophilicity) rather than genuine protein–ligand interactions. Synthetic negatives recreate realistic failure modes while preserving core scaffold chemistry.

Generation Methods

R-group Inflation

Replaces small substituents (methyl, ethyl, chloro) with bulky groups (tert-butyl, phenyl, benzyl) to introduce steric clashes in the binding pocket.

Example: CH₃ → C(CH₃)₃

Lipophilicity Inflation

Adds hydrophobic substituents (alkyl chains, aromatic rings) to exceed the pocket's lipophilicity tolerance, disrupting polar interactions and increasing desolvation penalties.

Example: -OH → -OC₆H₅

SAR-breaking Substitutions

Replaces functional groups critical for binding (H-bond donors, heterocycles, amides) with chemically related but interaction-disrupting groups.

Example: -NH₂ → -CH₃

Substituent Permutation

Moves large substituents to positions where the patent does not report them, creating molecules that are likely too bulky or poorly oriented for binding.

Example: R₁=Ph, R₂=H → R₁=H, R₂=Ph

Important Caveat

Synthetic negatives are not experimental measurements. They represent plausible failed designs, not confirmed non-binders. They are used only for classification and ranking tasks, and never for regression.
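One way to enforce this rule when assembling task data is to filter on a provenance flag. The sketch below assumes each row carries a boolean `is_synthetic` field and that tasks are tagged with a coarse kind such as "classification", "ranking", or "regression". Both the field name and the task labels are hypothetical.

```python
# Task kinds that may include synthetic negatives, per the caveat above.
SYNTHETIC_OK = {"classification", "ranking"}


def rows_for_task(rows, task_kind):
    """Return the rows a task may use, dropping synthetic ligands where required."""
    if task_kind in SYNTHETIC_OK:
        return list(rows)
    # Regression (and any unrecognised kind) sees experimental rows only.
    return [r for r in rows if not r["is_synthetic"]]


# Illustrative rows only; synthetic ligands carry no measured pX.
rows = [
    {"ligand_id": "L1", "px": 7.2, "is_synthetic": False},
    {"ligand_id": "S1", "px": None, "is_synthetic": True},
]
regression_rows = rows_for_task(rows, "regression")
classification_rows = rows_for_task(rows, "classification")
```

Defaulting unknown task kinds to the experimental-only subset is a deliberately conservative choice in this sketch: a mislabelled task then loses negatives rather than training on values that were never measured.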