PAYN: Positivity is All You Need: Framework for augmentation of reliable negatives from biased organic datasets.

PAYN: Positivity is All You Need

Data augmentation of negative data of organic reactions
for improved machine learning yield predictions.

License: MIT | Python 3.12 | Docs | Testing | Code Style: Google | Preprint

Description | Key Features | Documentation | Installation | Quick Start | Architecture | Citation


Description

PAYN (Positivity Is All You Need) is an open-source Python framework designed to address the "missing negative data" problem in organic reaction datasets (e.g., Reaxys, SciFinder, reaction scopes). Because of publication bias, literature datasets are heavily skewed towards successful reactions, which severely limits how well machine learning models trained on them generalize.

PAYN leverages Positive-Unlabeled (PU) Learning and the Spy Technique to recover "Reliable Negatives" from unlabeled data spaces. By statistically identifying failed reactions hidden within sparse datasets, PAYN constructs balanced, high-quality training sets that significantly improve yield prediction accuracy.
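To make the Spy Technique concrete, the following is a minimal sketch of the general idea, not PAYN's actual implementation. A fraction of the known positives is hidden in the unlabeled pool as spies, a classifier scores every unlabeled item, and any unlabeled item that scores below (almost) all spies is treated as a reliable negative. The function names, the spy ratio, and the hard-coded scores are assumptions made for illustration; in a real run the scores would come from a trained classifier.

```python
import random


def inject_spies(positive_ids, unlabeled_ids, spy_ratio, seed):
    """Move a fixed fraction of known positives into the unlabeled pool."""
    rng = random.Random(seed)
    n_spies = max(1, round(spy_ratio * len(positive_ids)))
    spies = set(rng.sample(sorted(positive_ids), n_spies))
    kept_positives = [i for i in positive_ids if i not in spies]
    mixed_unlabeled = list(unlabeled_ids) + sorted(spies)
    return kept_positives, mixed_unlabeled, spies


def spy_threshold(spy_scores, tolerance):
    """Return a score such that at most `tolerance` of the spies fall below it."""
    ordered = sorted(spy_scores)
    allowed_below = min(int(tolerance * len(ordered)), len(ordered) - 1)
    return ordered[allowed_below]


def reliable_negatives(scores, mixed_unlabeled, spies, tolerance):
    """Unlabeled items (spies excluded) that score below the spy threshold."""
    threshold = spy_threshold([scores[i] for i in spies], tolerance)
    return [i for i in mixed_unlabeled
            if i not in spies and scores[i] < threshold]


positives = [f"p{k}" for k in range(1, 11)]
unlabeled = ["u1", "u2", "u3", "u4", "u5"]
kept, mixed, spies = inject_spies(positives, unlabeled, spy_ratio=0.2, seed=0)

# Stand-in scores (assumption): a real pipeline would train a classifier on
# `kept` (label 1) versus `mixed` (label 0) and score every item in `mixed`.
scores = {f"p{k}": 0.70 + 0.02 * k for k in range(1, 11)}
scores.update({"u1": 0.05, "u2": 0.15, "u3": 0.95, "u4": 0.93, "u5": 0.20})

negatives = reliable_negatives(scores, mixed, spies, tolerance=0.0)
print(negatives)  # ['u1', 'u2', 'u5']
```

With a tolerance of 0 the threshold is the lowest spy score, so no spy is ever misclassified as a negative. Raising the tolerance lets a small fraction of spies fall below the threshold, which yields more reliable negatives at the cost of purity.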

Key Features

  • Spy-Based PU Learning: Implements controlled-ratio splitting and dynamic thresholding to identify reliable negatives with high precision.
  • Featurization: Supports ECFP, our proprietary Multi-Feature Fingerprinting (MFF) with automated bit condensation, and custom pregenerated features.
  • Reproducible Configuration: "Config-as-Code" architecture using hierarchical YAML/JSON files and strict schema validation.
  • Automated Optimization: Integrated Bayesian Optimization (via Optuna) for both CatBoost classifiers (Spy Model) and regressors (Yield Prediction Model).
  • Scientific Logging: Centralized MLflow integration for artifact serialization, hyperparameter tracking, and reproducibility snapshots.
  • HPC Ready: Automatic detection of SLURM environments for optimized thread allocation.
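As an illustration of the HPC Ready point, thread allocation under SLURM is commonly derived from the SLURM_CPUS_PER_TASK environment variable, which SLURM sets when a job requests CPUs per task. The sketch below shows that pattern; whether PAYN reads this exact variable is an assumption, and detect_thread_count is a hypothetical helper name.

```python
import os


def detect_thread_count(environ=None):
    """Pick a thread count from SLURM if available, else from the local machine."""
    env = os.environ if environ is None else environ
    value = env.get("SLURM_CPUS_PER_TASK")
    if value and value.isdigit() and int(value) > 0:
        # Inside a SLURM allocation: respect the CPUs granted to this task.
        return int(value)
    # Outside SLURM (or variable unset): fall back to the local CPU count.
    return os.cpu_count() or 1


print(detect_thread_count({"SLURM_CPUS_PER_TASK": "8"}))  # 8
```

The resulting value could then be passed to a model's thread setting, for example CatBoost's thread_count parameter.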

Documentation

Full documentation, including API references, is available online at https://GloriusGroup.github.io/PAYN/.

Installation

PAYN requires Python 3.12. We assume the use of Poetry for strict dependency management and reproducibility.

Poetry (Recommended)

  1. Clone the repository:

    git clone https://github.com/GloriusGroup/PAYN.git
    cd PAYN
  2. Install dependencies:

    poetry install
  3. Activate the environment (on Poetry 2.x, the shell command is provided by the separate poetry-plugin-shell plugin):

    poetry shell

Quick Start

The core workflow is orchestrated by run_spy.py. You can trigger a run using a configuration file, and optionally override parameters via the CLI (useful for SLURM job arrays or quick experiments).

1. Prepare your config: Ensure you have a config.yaml defined in the root directory.

2. Run the pipeline:

# Run with default config
poetry run python run_spy.py

# Run with CLI overrides (e.g., changing the spy tolerance)
poetry run python run_spy.py --spy_splitting_spy_tolerance 0.10

3. View Results: Check the mlruns/ directory or launch the MLflow UI to visualize metrics:

poetry run mlflow ui
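The override flag shown above, --spy_splitting_spy_tolerance, suggests that nested configuration keys are flattened into underscore-joined CLI arguments. The following sketch shows one way such a mapping can work with argparse. The config layout, the key names other than the flag itself, and the apply_cli_overrides helper are assumptions for illustration and do not describe payn.ConfigLoader's real interface.

```python
import argparse
import copy


def flatten(config, path=()):
    """Yield (path, value) pairs for every leaf of a nested dict."""
    for key, value in config.items():
        if isinstance(value, dict):
            yield from flatten(value, path + (key,))
        else:
            yield path + (key,), value


def apply_cli_overrides(config, argv):
    """Expose every leaf as --section_key and return the merged config."""
    leaves = dict(flatten(config))
    parser = argparse.ArgumentParser(allow_abbrev=False)
    for path, default in leaves.items():
        # The default's type is reused as the parser; this simple sketch
        # only handles int, float, and str leaves.
        parser.add_argument("--" + "_".join(path),
                            type=type(default), default=default)
    args = vars(parser.parse_args(argv))

    merged = copy.deepcopy(config)
    for path in leaves:
        node = merged
        for key in path[:-1]:
            node = node[key]
        node[path[-1]] = args["_".join(path)]
    return merged


# Hypothetical config; only the resulting flag name comes from the README.
config = {
    "spy_splitting": {"spy_tolerance": 0.05, "spy_ratio": 0.2},
    "general": {"seed": 42},
}
merged = apply_cli_overrides(config, ["--spy_splitting_spy_tolerance", "0.10"])
print(merged["spy_splitting"]["spy_tolerance"])  # 0.1
```

Generating the flags from the config keeps the file as the single source of truth: each SLURM array task can pass a different override without editing config.yaml.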

Module Architecture

The repository is structured into modular components to enforce a strict separation of concerns, ensuring reproducibility and extensibility:

| Module | Description |
|---|---|
| payn.ConfigLoader | Manages hierarchical configuration (YAML/JSON) and generates dynamic CLI arguments for SLURM integration. |
| payn.Logging | Centralizes experimental tracking via MLflow, enforcing artifact serialization and parameter provenance. |
| payn.DataSchema | Enforces runtime schema validation and verifies that split indices are disjoint to prevent data leakage. |
| payn.Featurisation | Orchestrates SMILES-to-fingerprint transformation, supporting ECFP, Multi-Feature Fingerprinting (MFF), and custom precalculated features. |
| payn.Splitting | Reproducibly partitions and cross-validates data using Random, Scaffold, or Butina clustering strategies. |
| payn.SpySplitting | Transforms fully labeled data into PU data and injects known positives ("Spies") into the unlabeled set. |
| payn.AugmentationModels | Contains the SpyModel for PU classification and the engine for identifying Reliable Negatives via dynamic thresholding. |
| payn.Optimization | Performs hyperparameter tuning (Bayesian TPE or Grid Search) with deterministic state handling for reproducibility. |
| payn.Recombination | Constructs balanced datasets for downstream tasks by merging verified positives with identified reliable negatives. |
| payn.RegModel | Wraps CatBoostRegressor for the final yield prediction task, handling categorical features and parallel execution. |
| payn.Evaluator | Computes specialized PU metrics (Negative Precision/Recall) to assess the purity of the identified negative set. |
| payn.Visualisation | Generates diagnostic plots for data distributions, hyperparameter importance, and optimization history. |
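The Negative Precision/Recall metrics named for payn.Evaluator can be understood through the usual set-based definitions, sketched below. This assumes that ground-truth negatives are known, which is the case when fully labeled data has been converted into PU data. The definitions and the function name are assumptions based on standard precision and recall, not a copy of PAYN's evaluator.

```python
def negative_precision_recall(true_negative_ids, predicted_negative_ids):
    """Precision and recall of the identified negative set.

    Precision: share of identified negatives that are truly negative.
    Recall: share of all true negatives that were identified.
    """
    truth = set(true_negative_ids)
    predicted = set(predicted_negative_ids)
    hits = len(truth & predicted)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(truth) if truth else 0.0
    return precision, recall


# Four true negatives; three items flagged, two of them correctly.
precision, recall = negative_precision_recall([1, 2, 3, 4], [2, 3, 9])
print(round(precision, 3), recall)  # 0.667 0.5
```

High negative precision indicates a pure reliable-negative set, while negative recall shows how much of the hidden negative data was recovered.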

Testing & Validation

PAYN includes a test suite of more than 130 tests that focuses on reproducibility and determinism.

  • Determinism: All stochastic processes (splitting, featurization) are verified to be bit-for-bit reproducible given a fixed seed.
  • Logic Isolation: External heavy dependencies (CatBoost, RDKit, Optuna) are mocked to verify orchestration logic.
  • Data Invariants: Tests verify that splitting is lossless and that the train, validation, and test sets are disjoint.
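The determinism and data-invariant checks listed above can be pictured with a small sketch. It uses a plain seeded random split rather than PAYN's own splitting module, and the helper names random_split and check_split are hypothetical.

```python
import random


def random_split(ids, fractions, seed):
    """Shuffle with a fixed seed and cut into train / validation / test."""
    rng = random.Random(seed)
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def check_split(ids, parts):
    """Raise AssertionError if the split overlaps or loses samples."""
    combined = [i for part in parts for i in part]
    assert len(combined) == len(set(combined)), "splits overlap"
    assert sorted(combined) == sorted(ids), "split is not lossless"


ids = list(range(100))
first = random_split(ids, (0.8, 0.1), seed=7)
second = random_split(ids, (0.8, 0.1), seed=7)

check_split(ids, first)          # disjoint and lossless
print(first == second)           # True: same seed, same split
print([len(p) for p in first])   # [80, 10, 10]
```

Giving the test set whatever remains after the train and validation cuts guarantees that rounding never drops a sample.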

Run the test suite:

poetry run pytest tests/

License

This project is licensed under the MIT License. See the LICENSE file for details.

Authors

  • Florian Boser - University of Münster

  • Jan Christopher Spies - University of Münster

  • Frank Glorius - University of Münster

Citation

The preprint is available free of charge at https://doi.org/10.26434/chemrxiv-2025-hq4rx

@article{Boser2025,
  title = {Positivity is All You Need (PAYN): A PU Learning Framework for Yield Prediction in Organic Chemistry},
  author = {Boser, Florian and Spies, Jan Christopher and Glorius, Frank},
  journal = {ChemRxiv},
  year = {2025},
  doi = {10.26434/chemrxiv-2025-hq4rx},
  url = {https://doi.org/10.26434/chemrxiv-2025-hq4rx}
}
