
Conversation

@AbasKhan (Collaborator)

This PR introduces a configurable uniform split sampler for JSONL datasets, along with its configuration and tests. 🎯


Adds UniformSplitSampler, which:

  • 📥 Ingests one or more input JSONL files.
  • 🔢 Requires an integer score field per example and normalizes these scores.
  • ✂️ Produces balanced train/validation JSONL outputs according to the configured split ratio.
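As a rough sketch of the ingest-and-normalize step (the function names, the `score` field name, and min-max normalization are illustrative assumptions, not the actual `UniformSplitSampler` API):

```python
import json


def load_jsonl(path):
    """Read one JSONL file into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def normalize_scores(examples, score_field="score"):
    """Min-max normalize an integer score field to [0, 1] in place."""
    scores = [ex[score_field] for ex in examples]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1  # avoid division by zero when all scores are equal
    for ex in examples:
        ex[score_field] = (ex[score_field] - lo) / span
    return examples
```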

`_build_splits`:

  • 🧩 Groups examples by label and computes per-label quotas for train/validation.
  • 🧷 Uses `split_label_pools` to partition each label’s pool into train/val.
  • 📈 Applies sampling with a `max_oversampling_ratio` cap so rare labels are upsampled but not duplicated arbitrarily.
  • 🔀 Shuffles the resulting splits to avoid ordering artifacts.
  • 📊 Logs overall and per-label distributions for transparency and easier debugging.
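The quota-and-cap idea above can be illustrated with a minimal pure-Python stand-in (the real implementation operates on DataFrames, and these helper signatures are assumptions for illustration only):

```python
import math
import random


def split_label_pools(pool, train_ratio, rng):
    """Shuffle one label's example pool and cut it into train/val partitions."""
    pool = pool[:]
    rng.shuffle(pool)
    cut = int(len(pool) * train_ratio)
    return pool[:cut], pool[cut:]


def sample_with_cap(pool, target, max_oversampling_ratio):
    """Upsample a rare label's pool toward `target`, capped so the pool is
    never repeated beyond `max_oversampling_ratio` times its own size."""
    if not pool or target <= 0:
        return []
    max_allowed = math.ceil(len(pool) * max_oversampling_ratio)
    n = min(target, max_allowed)
    reps = -(-n // len(pool))  # ceiling division
    return (pool * reps)[:n]
```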

In short: this sampler helps you build balanced, reproducible, and well-logged train/val splits from JSONL.

@AbasKhan AbasKhan requested a review from ajude2s December 11, 2025 00:47
Comment on lines 75 to 76
```python
if max_allowed <= 0:
    return pool.head(0).copy()
```

Redundant: there is already a check above, `if pool.empty or target <= 0`. `max_allowed` will be less than or equal to 0 only if `max_oversampling_ratio` is less than 0, which should never be the case, right?
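For concreteness, assuming `max_allowed` is derived as something like `math.ceil(len(pool) * max_oversampling_ratio)` (an assumption about the surrounding code, not a quote from it), the guard only fires for a non-positive ratio:

```python
import math


def max_allowed(pool_size, max_oversampling_ratio):
    # Hypothetical reconstruction of how max_allowed might be computed.
    return math.ceil(pool_size * max_oversampling_ratio)


# Any positive ratio on a non-empty pool yields at least 1, so the
# `max_allowed <= 0` branch is unreachable once the ratio is validated > 0.
```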

@ajude2s (Collaborator) left a comment

Awesome work, Abbas. 👍
I have added minor changes and some suggestions.

Also, shouldn't we add the original sampler (the fixed distribution that yields the "best" performance) as well?

@AbasKhan AbasKhan requested a review from ajude2s December 12, 2025 16:32