Minimal ML template with a simple data pipeline and model registry.
uv sync --group devdata/<dataset_name>/
├── raw/
│ ├── full.parquet
│ ├── train.parquet
│ └── test.parquet
└── preprocessed/
├── full.parquet
├── train.parquet
└── test.parquet
outputs/
├── gridsearch/ # cv_summary.json
├── models/ # index.json, current_best.json, *.skops
└── metrics/{model_id}/ # predictions.parquet, metrics.json, plots/
Edit configs/default.toml:
[data]
dataset_name = "example_data" # name of dataset, gets own directory
target_name = "target" # column name of the target variable
[eval]
random_state = 123
test_size = 0.20
cv_splits = 5
# Choose ONE scorer you intend to use, by default the MAX is chosen
scoring = "neg_root_mean_squared_error" # regression example
# scoring = "roc_auc" # classification example
n_jobs = -1
[search]
model_keys = ["<modelname>"]
[train] # optional explicit spec; if unset, train reuses registry "best"
## example for support vector machine
# model_key = "svm_rbf"
# [train.params]
# model__C = 3.0
# model__gamma = "scale"
[predict]
model_id = "best" # "best" or a concrete model id from the registryGrid search uses the single metric specified in [eval].scoring. The registry records cv_score_type (scorer name) and cv_score (numeric value).
- Add models to
src/package/models.py - Add grid to
src/package/grid.pyasGRID_SPACES[<model_key>] - Include desired model keys in
[search].model_keys - Set the scoring method for cv in
configs/*.toml - Define the metrics used to evaluate predictions in
src/package/eval/metrics.py - Define the plots used to visualize predictions in
src/package/eval/plots.py
# Ingest data into raw/full
uv run package register-data --in <path-of-data>
# Copy raw to preprocessed (add transformations here)
uv run package preprocess
# Split preprocessed full into train/test
uv run package split --stage pre
# Grid search on preprocessed train
uv run package search
# Predict on preprocessed test with best model
uv run package predict --plots
# View registry (sorted by cv_score desc, then recency)
uv run package models --top 5All commands accept --config or -c to specify a config file (default: configs/default.toml).
| Command | Description |
|---|---|
uv run package register-data --in <file> |
Read CSV/Parquet and write to raw/full.parquet |
uv run package preprocess |
Preprocess and copy raw/full.parquet → preprocessed/full.parquet |
uv run package split --stage {raw|pre} |
Split of <stage>/full.parquet into train/test |
uv run package search |
GridSearchCV on preprocessed/train; registers best estimators trained on full training data |
uv run package train |
Fit the best or specified model on preprocessed/train; registers model artifact as .skops |
uv run package predict [--model-id <id>] [--plots] [--plots-out <dir>] |
Predict on preprocessed/test; saves metrics and plots |
uv run package models [--top K] |
Show registry: id, model, cv_score_type, cv_score, created_at, params |
The registry stores each model with its evaluation scorer. When using --model-id best, the best model among entries matching the current [eval].scoring in the provided config file is selected. If no model exists for that scorer, run package search (or package train --model-id) with that scorer first.