PAYN: Positivity is All You Need

Data augmentation of negative data of organic reactions
for improved machine learning yield predictions.

Description • Key Features • Documentation • Installation • Quick Start • Architecture • Citation

Description

PAYN (Positivity Is All You Need) is an open-source Python framework designed to address the "missing negative data" problem in organic reaction datasets (e.g., Reaxys, SciFinder, reaction scopes). Because of publication bias, literature datasets are heavily skewed towards successful reactions, which severely limits the generalization capability of machine learning models trained on them.

PAYN leverages Positive-Unlabeled (PU) Learning and the Spy Technique to recover "Reliable Negatives" from unlabeled data spaces. By statistically identifying failed reactions hidden within sparse datasets, PAYN constructs balanced, high-quality training sets that significantly improve yield prediction accuracy.
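To make the Spy Technique concrete, here is a minimal sketch of the general idea, not PAYN's implementation. It assumes NumPy only, uses a toy logistic regression as a stand-in for the CatBoost classifier, and all names (`spy_reliable_negatives`, `spy_fraction`, `spy_tolerance`) are hypothetical. A fraction of known positives is hidden in the unlabeled set as "spies", a positive-vs-unlabeled classifier is trained, and the threshold is set to the score that all but a tolerated share of spies exceed. Unlabeled samples scoring below it are treated as reliable negatives.

```python
import numpy as np


def fit_logistic(X, y, lr=0.5, steps=500):
    """Tiny logistic regression via gradient descent (stand-in classifier)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_proba(w, X):
    """Probability of the positive class under the fitted weights."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


def spy_reliable_negatives(X_pos, X_unl, spy_fraction=0.2, spy_tolerance=0.1, seed=0):
    """Return (indices into X_unl below the spy threshold, threshold)."""
    rng = np.random.default_rng(seed)
    n_spies = max(1, int(round(spy_fraction * len(X_pos))))
    spy_idx = rng.choice(len(X_pos), size=n_spies, replace=False)
    keep = np.setdiff1d(np.arange(len(X_pos)), spy_idx)

    # Train "positive vs. unlabeled + spies"; spies are labeled 0 on purpose.
    X_train = np.vstack([X_pos[keep], X_unl, X_pos[spy_idx]])
    y_train = np.concatenate([np.ones(len(keep)), np.zeros(len(X_unl) + n_spies)])
    w = fit_logistic(X_train, y_train)

    # Threshold: the score that all but `spy_tolerance` of the spies exceed.
    threshold = np.quantile(predict_proba(w, X_pos[spy_idx]), spy_tolerance)
    return np.flatnonzero(predict_proba(w, X_unl) < threshold), threshold


# Toy data: the unlabeled set hides 50 positives (rows 0-49) and 50 negatives (rows 50-99).
rng = np.random.default_rng(42)
X_pos = rng.normal(loc=2.0, size=(100, 2))
X_unl = np.vstack([rng.normal(loc=2.0, size=(50, 2)),
                   rng.normal(loc=-2.0, size=(50, 2))])
rn_idx, thr = spy_reliable_negatives(X_pos, X_unl)
print(f"threshold={thr:.3f}, reliable negatives found: {len(rn_idx)}")
```

Raising `spy_tolerance` moves the threshold up, which returns more candidate negatives at the cost of admitting more hidden positives.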

Key Features

Spy-Based PU Learning: Implements controlled-ratio splitting and dynamic thresholding to identify reliable negatives with high precision.
Featurization: Supports ECFP, our in-house Multi-Feature Fingerprinting (MFF) with automated bit condensation, and custom pregenerated features.
Reproducible Configuration: "Config-as-Code" architecture using hierarchical YAML/JSON files and strict schema validation.
Automated Optimization: Integrated Bayesian Optimization (via Optuna) for both CatBoost classifiers (Spy Model) and regressors (Yield Prediction Model).
Scientific Logging: Centralized MLflow integration for artifact serialization, hyperparameter tracking, and reproducibility snapshots.
HPC Ready: Automatic detection of SLURM environments for optimized thread allocation.
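
As an illustration of the SLURM-aware thread allocation mentioned in the last feature, the sketch below reads the `SLURM_CPUS_PER_TASK` environment variable, which SLURM sets when a job requests CPUs per task. Which variables PAYN actually inspects is not stated here, so the function name `detect_thread_count` and its fallback order are assumptions.

```python
import os


def detect_thread_count(env=None, default=None):
    """Pick a worker-thread count, preferring SLURM's allocation when present."""
    env = os.environ if env is None else env
    slurm_cpus = env.get("SLURM_CPUS_PER_TASK")
    # Only trust the variable if it is a positive integer.
    if slurm_cpus is not None and slurm_cpus.isdigit() and int(slurm_cpus) > 0:
        return int(slurm_cpus)
    if default is not None:
        return default
    # Outside SLURM, fall back to the machine's CPU count.
    return os.cpu_count() or 1


print(detect_thread_count({"SLURM_CPUS_PER_TASK": "8"}))  # 8
print(detect_thread_count({}, default=4))  # 4
```

The resulting value could then be passed to a model's thread setting, for example CatBoost's `thread_count` parameter, so a job does not oversubscribe the cores it was allocated.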

Documentation

Full documentation, including API references, is available online at https://GloriusGroup.github.io/PAYN/.

Installation

PAYN requires Python 3.12. We assume the use of Poetry for strict dependency management and reproducibility.

Poetry (Recommended)

Clone the repository:

git clone https://github.com/GloriusGroup/PAYN.git
cd PAYN

Install dependencies:
```
poetry install
```
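Activate the environment (with Poetry 2.x, `poetry shell` is provided by the separate shell plugin; prefixing commands with `poetry run`, as in Quick Start, needs no activation):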
```
poetry shell
```

Quick Start

The core workflow is orchestrated by run_spy.py. You can trigger a run using a configuration file, and optionally override parameters via the CLI (useful for SLURM job arrays or quick experiments).

1. Prepare your config: Ensure you have a config.yaml defined in the root directory.

2. Run the pipeline:

# Run with default config
poetry run python run_spy.py

# Run with CLI overrides (e.g., changing the spy tolerance)
poetry run python run_spy.py --spy_splitting_spy_tolerance 0.10

3. View Results: Check the mlruns/ directory or launch the MLflow UI to visualize metrics:

poetry run mlflow ui
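
The override flag in step 2, `--spy_splitting_spy_tolerance`, suggests that nested configuration keys are joined with underscores to form CLI arguments. The sketch below shows one way such a mapping could work using only the standard library. It is an assumption about the mechanism, not PAYN's `ConfigLoader`; the example config keys other than `spy_tolerance` are hypothetical.

```python
import argparse
import copy


def iter_leaves(config, path=()):
    """Yield (key_path, value) for every non-dict leaf of a nested dict."""
    for key, value in config.items():
        if isinstance(value, dict):
            yield from iter_leaves(value, path + (key,))
        else:
            yield path + (key,), value


def apply_cli_overrides(config, argv):
    """Return a copy of `config` with leaves replaced by matching CLI flags."""
    parser = argparse.ArgumentParser()
    paths = {}
    for path, value in iter_leaves(config):
        flag = "_".join(path)
        paths[flag] = path
        # Reuse the default's type so "0.10" is parsed as a float.
        # Note: bool and None leaves would need special handling.
        parser.add_argument(f"--{flag}", type=type(value), default=None)
    args = vars(parser.parse_args(argv))

    merged = copy.deepcopy(config)
    for flag, value in args.items():
        if value is None:  # flag not given on the command line
            continue
        node = merged
        *parents, leaf = paths[flag]
        for key in parents:
            node = node[key]
        node[leaf] = value
    return merged


# Hypothetical config; the real schema is defined by PAYN's config.yaml.
config = {"spy_splitting": {"spy_tolerance": 0.05, "spy_ratio": 0.2}, "seed": 42}
updated = apply_cli_overrides(config, ["--spy_splitting_spy_tolerance", "0.10"])
print(updated["spy_splitting"]["spy_tolerance"])  # 0.1
```

Generating flags from the config itself keeps the file and the CLI in sync, which is convenient for SLURM job arrays where each task varies a single parameter.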

Module Architecture

The repository is structured into modular components to enforce a strict separation of concerns, ensuring reproducibility and extensibility:

| Module | Description |
| --- | --- |
| `payn.ConfigLoader` | Manages hierarchical configuration (YAML/JSON) and generates dynamic CLI arguments for SLURM integration. |
| `payn.Logging` | Centralizes experimental tracking via MLflow, enforcing artifact serialization and parameter provenance. |
| `payn.DataSchema` | Enforces runtime schema validation and verifies index disjointness to prevent data leakage. |
| `payn.Featurisation` | Orchestrates SMILES-to-fingerprint transformation, supporting ECFP, Multi-Feature Fingerprinting (MFF), and custom precalculated features. |
| `payn.Splitting` | Reproducibly partitions and cross-validates data using Random, Scaffold, or Butina clustering strategies. |
| `payn.SpySplitting` | Transforms fully labeled data into PU data and injects known positives ("Spies") into the unlabeled set. |
| `payn.AugmentationModels` | Contains the `SpyModel` for PU classification and the engine for identifying Reliable Negatives via dynamic thresholding. |
| `payn.Optimization` | Performs hyperparameter tuning (Bayesian TPE or Grid Search) with deterministic state handling for reproducibility. |
| `payn.Recombination` | Constructs balanced datasets for downstream tasks by merging verified positives with identified reliable negatives. |
| `payn.RegModel` | Wraps `CatBoostRegressor` for the final yield prediction task, handling categorical features and parallel execution. |
| `payn.Evaluator` | Computes specialized PU metrics (Negative Precision/Recall) to assess the purity of the identified negative set. |
| `payn.Visualisation` | Generates diagnostic plots for data distributions, hyperparameter importance, and optimization history. |
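
Because `payn.SpySplitting` starts from fully labeled data, the true negatives hidden in the unlabeled set are known at evaluation time, so the purity of the recovered negatives can be scored. The sketch below shows how Negative Precision and Negative Recall are commonly defined. The function name and the exact definitions used by `payn.Evaluator` are assumptions.

```python
def negative_precision_recall(predicted_negative_ids, true_negative_ids):
    """Score a set of identified negatives against the known true negatives.

    precision = share of identified negatives that are truly negative
    recall    = share of true negatives that were identified
    """
    predicted = set(predicted_negative_ids)
    actual = set(true_negative_ids)
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall


# Four reactions flagged as negatives; five reactions are truly negative.
precision, recall = negative_precision_recall([3, 5, 8, 9], [5, 8, 9, 11, 12])
print(precision, recall)  # 0.75 0.6
```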

Testing & Validation

PAYN includes a test suite of more than 130 tests focusing on reproducibility and determinism.

Determinism: All stochastic processes (splitting, featurization) are verified to be bit-for-bit reproducible given a fixed seed.
Logic Isolation: External heavy dependencies (CatBoost, RDKit, Optuna) are mocked to verify orchestration logic.
Data Invariants: Splitting is tested to be lossless, and the train, validation, and test sets are verified to be disjoint.
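
The determinism and data-invariant checks listed above can be sketched as follows. This is a simplified illustration with a seeded random split from the standard library. The helper names `random_split` and `check_split_invariants` are hypothetical and do not come from PAYN's test suite.

```python
import random


def random_split(indices, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle indices with a seeded RNG and cut them into train/val/test."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    n_train = int(fractions[0] * len(shuffled))
    n_val = int(fractions[1] * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def check_split_invariants(indices, parts):
    """Raise AssertionError if the parts overlap or lose/duplicate any index."""
    seen = set()
    for part in parts:
        part_set = set(part)
        assert len(part_set) == len(part), "duplicate index within a partition"
        assert seen.isdisjoint(part_set), "partitions overlap (data leakage)"
        seen |= part_set
    assert seen == set(indices), "split is not lossless"


indices = range(100)
first = random_split(indices, seed=7)
second = random_split(indices, seed=7)
check_split_invariants(indices, first)  # disjoint and lossless
assert first == second  # same seed, same partitions
```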

Run the test suite:

poetry run pytest tests/

License

This project is licensed under the MIT License. See the LICENSE file for details.

Authors

Florian Boser - University of Münster
Jan Christopher Spies - University of Münster
Frank Glorius - University of Münster

Citation

The preprint is available free of charge: https://doi.org/10.26434/chemrxiv-2025-hq4rx

@article{Boser2025,
  title = {Positivity is All You Need (PAYN): A PU Learning Framework for Yield Prediction in Organic Chemistry},
  author = {Boser, Florian and Spies, Jan Christopher and Glorius, Frank},
  journal = {ChemRxiv},
  year = {2025},
  doi = {10.26434/chemrxiv-2025-hq4rx},
  url = {https://doi.org/10.26434/chemrxiv-2025-hq4rx}
}
