Data augmentation of negative data of organic reactions
for improved machine learning yield predictions.
PAYN (Positivity Is All You Need) is an open-source Python framework designed to address the "missing negative data" problem in organic reaction datasets (e.g., Reaxys, SciFinder, reaction scopes). Due to publication bias, literature datasets are heavily skewed towards successful reactions, severely limiting the generalization capability of Machine Learning models.
PAYN leverages Positive-Unlabeled (PU) Learning and the Spy Technique to recover "Reliable Negatives" from unlabeled data spaces. By statistically identifying failed reactions hidden within sparse datasets, PAYN constructs balanced, high-quality training sets that significantly improve yield prediction accuracy.
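As a rough illustration of the Spy Technique described above, the sketch below hides a fraction of known positives ("spies") in the unlabeled pool, derives a score threshold from the spies, and keeps the unlabeled samples that score below it as reliable negatives. It is not PAYN's implementation: the function names, the `spy_ratio` parameter, and the reading of "tolerance" as the fraction of spies allowed to score below the threshold are assumptions made for illustration. Classifier scores are taken as given; in PAYN they would come from the Spy Model.

```python
import random


def inject_spies(positive_ids, unlabeled_ids, spy_ratio, seed=0):
    """Move a random fraction of known positives into the unlabeled pool."""
    rng = random.Random(seed)
    pool = sorted(positive_ids)
    n_spies = min(len(pool), max(1, round(len(pool) * spy_ratio)))
    spies = set(rng.sample(pool, n_spies))
    remaining_positives = [i for i in pool if i not in spies]
    mixed_unlabeled = list(unlabeled_ids) + sorted(spies)
    return remaining_positives, mixed_unlabeled, spies


def spy_threshold(spy_scores, tolerance):
    """Return a threshold that at most `tolerance` of the spies score below."""
    ranked = sorted(spy_scores)
    # Number of spies we accept losing below the threshold (assumed meaning).
    n_allowed_below = min(int(len(ranked) * tolerance), len(ranked) - 1)
    return ranked[n_allowed_below]


def reliable_negatives(scores_by_id, spies, threshold):
    """Unlabeled, non-spy samples scoring strictly below the spy threshold."""
    return [
        sample_id
        for sample_id, score in scores_by_id.items()
        if sample_id not in spies and score < threshold
    ]
```

The idea is that spies are true positives, so any unlabeled sample that scores lower than nearly all of them is unlikely to be a hidden positive. A larger tolerance raises the threshold and yields more, but less pure, reliable negatives.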
- Spy-Based PU Learning: Implements controlled-ratio splitting and dynamic thresholding to identify reliable negatives with high precision.
- Featurization: Supports ECFP, our proprietary Multi-Feature Fingerprinting (MFF) with automated bit condensation, and custom pregenerated features.
- Reproducible Configuration: "Config-as-Code" architecture using hierarchical YAML/JSON files and strict schema validation.
- Automated Optimization: Integrated Bayesian Optimization (via Optuna) for both CatBoost classifiers (Spy Model) and regressors (Yield Prediction Model).
- Scientific Logging: Centralized MLflow integration for artifact serialization, hyperparameter tracking, and reproducibility snapshots.
- HPC Ready: Automatic detection of SLURM environments for optimized thread allocation.
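To make the HPC point concrete, here is a minimal sketch of how a thread count could be derived from a SLURM allocation. `SLURM_CPUS_PER_TASK` is a standard SLURM environment variable; the function name `detect_thread_count` and the fallback order are hypothetical and may differ from what PAYN actually does.

```python
import os


def detect_thread_count(env=None, default=None):
    """Pick a worker-thread count, preferring the SLURM allocation if present."""
    env = os.environ if env is None else env
    slurm_cpus = env.get("SLURM_CPUS_PER_TASK")
    # Inside a SLURM job with --cpus-per-task set, respect the allocation.
    if slurm_cpus and slurm_cpus.isdigit() and int(slurm_cpus) > 0:
        return int(slurm_cpus)
    # Outside SLURM, use the caller's default or all local cores.
    if default is not None:
        return default
    return os.cpu_count() or 1
```

The resulting value could then be handed to a model's thread setting, for example CatBoost's `thread_count` parameter, so that a job does not oversubscribe the cores it was allocated.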
Full documentation, including API references, is available online at https://GloriusGroup.github.io/PAYN/.
PAYN requires Python 3.12. We assume the use of Poetry for strict dependency management and reproducibility.
- Clone the repository:

      git clone https://github.com/GloriusGroup/PAYN.git
      cd PAYN

- Install dependencies:

      poetry install

- Activate the environment:

      poetry shell
The core workflow is orchestrated by run_spy.py. You can trigger a run using a configuration file, and optionally override parameters via the CLI (useful for SLURM job arrays or quick experiments).
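The CLI override mechanism can be pictured as follows. This is a sketch under assumptions, not the code in payn.ConfigLoader: it assumes that a flag such as `--spy_splitting_spy_tolerance` maps to a nested key (`spy_splitting` → `spy_tolerance`) by joining the key path with underscores, and the helper names and the example `spy_ratio` and `seed` keys are hypothetical.

```python
import argparse
import copy


def flatten(config, path=()):
    """Map 'section_key' flag names to (key path, current value) pairs."""
    flat = {}
    for key, value in config.items():
        current = path + (key,)
        if isinstance(value, dict):
            flat.update(flatten(value, current))
        else:
            flat["_".join(current)] = (current, value)
    return flat


def _parser_for(value):
    """Choose a string-to-value converter based on the default's type."""
    if isinstance(value, bool):  # checked first: bool is a subclass of int
        return lambda text: text.lower() in {"1", "true", "yes"}
    if isinstance(value, (int, float, str)):
        return type(value)
    return str  # lists, None, etc. are left as raw strings in this sketch


def apply_cli_overrides(config, argv):
    """Return a copy of `config` with values replaced by matching CLI flags."""
    flat = flatten(config)
    parser = argparse.ArgumentParser()
    for name, (_, value) in flat.items():
        parser.add_argument(f"--{name}", dest=name, type=_parser_for(value), default=None)
    args = vars(parser.parse_args(argv))
    merged = copy.deepcopy(config)
    for name, (path, _) in flat.items():
        if args[name] is None:
            continue  # flag not given: keep the value from the config file
        node = merged
        for key in path[:-1]:
            node = node[key]
        node[path[-1]] = args[name]
    return merged


# Hypothetical config contents, standing in for a parsed config.yaml.
config = {"spy_splitting": {"spy_tolerance": 0.05, "spy_ratio": 0.2}, "seed": 42}
merged = apply_cli_overrides(config, ["--spy_splitting_spy_tolerance", "0.10"])
```

Generating the flags from the config itself keeps the file as the single source of truth, while a SLURM job array can still vary one parameter per task without editing the file.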
1. Prepare your config:
   Ensure you have a config.yaml defined in the root directory.
2. Run the pipeline:

       # Run with default config
       poetry run python run_spy.py
       # Run with CLI overrides (e.g., changing the spy tolerance)
       poetry run python run_spy.py --spy_splitting_spy_tolerance 0.10

3. View Results: Check the mlruns/ directory or launch the MLflow UI to visualize metrics:

       poetry run mlflow ui

The repository is structured into modular components to enforce a strict separation of concerns, ensuring reproducibility and extensibility:
| Module | Description |
|---|---|
| payn.ConfigLoader | Manages hierarchical configuration (YAML/JSON) and generates dynamic CLI arguments for SLURM integration. |
| payn.Logging | Centralizes experimental tracking via MLflow, enforcing artifact serialization and parameter provenance. |
| payn.DataSchema | Enforces runtime schema validation and mathematically verifies index disjointness to prevent data leakage. |
| payn.Featurisation | Orchestrates SMILES-to-fingerprint transformation, supporting ECFP, Multi-Feature Fingerprinting (MFF), and custom precalculated features. |
| payn.Splitting | Reproducibly partitions and cross-validates data using Random, Scaffold, or Butina clustering strategies. |
| payn.SpySplitting | Transforms fully labeled data into PU data and injects known positives ("Spies") into the Unlabeled set. |
| payn.AugmentationModels | Contains the SpyModel for PU classification and the engine for identifying Reliable Negatives via dynamic thresholding. |
| payn.Optimization | Performs hyperparameter tuning (Bayesian TPE or Grid Search) with deterministic state handling for reproducibility. |
| payn.Recombination | Constructs balanced datasets for downstream tasks by merging verified positives with identified reliable negatives. |
| payn.RegModel | Wraps CatBoostRegressor for the final yield prediction task, handling categorical features and parallel execution. |
| payn.Evaluator | Computes specialized PU metrics (Negative Precision/Recall) to assess the purity of the identified negative set. |
| payn.Visualisation | Generates diagnostic plots for data distributions, hyperparameter importance, and optimization history. |
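Because the spy split starts from fully labeled data, the true negatives are known, so the purity of the recovered negative set can be measured directly. The sketch below assumes that Negative Precision is the share of identified reliable negatives that are truly negative, and that Negative Recall is the share of true negatives that were recovered. These definitions and the function name are assumptions for illustration and are not taken from payn.Evaluator.

```python
def negative_precision_recall(true_negative_ids, predicted_negative_ids):
    """Precision and recall computed with 'negative' as the class of interest."""
    true_neg = set(true_negative_ids)
    pred_neg = set(predicted_negative_ids)
    hits = len(true_neg & pred_neg)  # correctly identified negatives
    precision = hits / len(pred_neg) if pred_neg else 0.0
    recall = hits / len(true_neg) if true_neg else 0.0
    return precision, recall
```

High negative precision matters most here: a hidden positive that is mislabeled as a reliable negative would inject wrong yield information into the recombined training set.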
PAYN includes a comprehensive test suite of >130 tests focusing on reproducibility and determinism.
- Determinism: All stochastic processes (splitting, featurization) are verified to be bit-for-bit reproducible given a fixed seed.
- Logic Isolation: External heavy dependencies (CatBoost, RDKit, Optuna) are mocked to verify orchestration logic.
- Data Invariants: Verifies that splitting is lossless and that the train, validation, and test sets are disjoint.
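The determinism and data-invariant checks listed above can be sketched in a few lines. This is an illustrative stand-in, not a test from PAYN's suite: `random_split` is a hypothetical seeded splitter and `check_split_invariants` is a hypothetical helper.

```python
import random


def random_split(indices, fractions=(0.8, 0.1, 0.1), seed=0):
    """Seeded train/validation/test split; the test set takes the remainder."""
    rng = random.Random(seed)
    shuffled = sorted(indices)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def check_split_invariants(indices, *parts):
    """Raise AssertionError if the partitions overlap or lose any index."""
    seen = set()
    for part in parts:
        part_set = set(part)
        assert len(part_set) == len(part), "duplicate index within a partition"
        assert seen.isdisjoint(part_set), "partitions overlap (data leakage)"
        seen |= part_set
    assert seen == set(indices), "split dropped or invented indices"


train, val, test = random_split(range(100), seed=7)
check_split_invariants(range(100), train, val, test)
```

Calling `random_split` twice with the same seed should return identical partitions, which is the property a determinism test would assert.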
Run the test suite:

    poetry run pytest tests/

This project is licensed under the MIT License. See the LICENSE file for details.
The preprint is available free of charge: https://doi.org/10.26434/chemrxiv-2025-hq4rx
@article{Boser2025,
title = {Positivity is All You Need (PAYN): A PU Learning Framework for Yield Prediction in Organic Chemistry},
author = {Boser, Florian and Spies, Jan Christopher and Glorius, Frank},
journal = {ChemRxiv},
year = {2025},
doi = {10.26434/chemrxiv-2025-hq4rx},
url = {https://doi.org/10.26434/chemrxiv-2025-hq4rx}
}