Updated regression sampler #251
base: master
    if max_allowed <= 0:
        return pool.head(0).copy()
Redundant. There is already a check above for `pool.empty or target <= 0`.
`max_allowed` will be less than or equal to 0 only if `max_oversampling_ratio` is less than 0,
which would never be the case, right?
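To make the reasoning in this comment concrete, here is a minimal sketch. It assumes `max_allowed` is simply the pool size multiplied by `max_oversampling_ratio`; the diff excerpt does not show the actual formula, so `max_allowed_rows` and `guard_can_fire` are hypothetical helpers, and plain integers stand in for the pandas pool.

```python
def max_allowed_rows(pool_size: int, max_oversampling_ratio: float) -> float:
    # Hypothetical formula: the diff under review does not show how
    # max_allowed is actually derived.
    return pool_size * max_oversampling_ratio


def guard_can_fire(pool_size: int, target: int, max_oversampling_ratio: float) -> bool:
    # Mirrors the earlier check mentioned in the review: an empty pool or a
    # non-positive target returns before max_allowed is ever computed.
    if pool_size == 0 or target <= 0:
        return False
    # Only reached with a non-empty pool and a positive target.
    return max_allowed_rows(pool_size, max_oversampling_ratio) <= 0


# A sensible ratio never reaches the extra guard.
print(guard_can_fire(pool_size=10, target=5, max_oversampling_ratio=2.0))
# A non-positive ratio is the only way to reach it under this assumption.
print(guard_can_fire(pool_size=10, target=5, max_oversampling_ratio=-1.0))
```

Under this assumed formula, a ratio of exactly 0 would also trigger the guard, so validating `max_oversampling_ratio` once in the configuration may be a cleaner place for the check than inside the sampling function.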
ajude2s left a comment
Awesome work, Abbas. 👍
I have added some minor changes and suggestions.
Also, shouldn't we add the original sampler (the fixed distribution that yields the "best" performance) as well?
This PR introduces a configurable uniform split sampler for JSONL datasets, along with its configuration and tests. 🎯
Adds
`UniformSplitSampler`, which:

- Uses a `score` field per example and normalizes these scores.
- Uses `_build_splits` and `split_label_pools` to partition each label's pool into train/val.
- Applies a `max_oversampling_ratio` cap so rare labels are upsampled but not duplicated arbitrarily.

In short: this sampler helps you build balanced, reproducible, and well-logged train/val splits from JSONL.
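The per-label split and the oversampling cap described above can be sketched roughly as follows. This is an illustration only, not the PR's implementation: `split_pool` and `sample_with_cap` are hypothetical names, plain Python lists stand in for the pandas pools, the cap formula (`len(pool) * max_oversampling_ratio`) is an assumption, and score normalization is left out.

```python
import random


def split_pool(pool, val_fraction, rng):
    """Shuffle a copy of one label's pool and cut it into (train, val)."""
    shuffled = list(pool)
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def sample_with_cap(pool, target, max_oversampling_ratio, rng):
    """Draw up to `target` items, duplicating only up to the assumed cap."""
    if not pool or target <= 0:
        return []
    # Assumed cap: at most ratio * pool size items for this label.
    max_allowed = int(len(pool) * max_oversampling_ratio)
    n = min(target, max_allowed)
    if n <= 0:
        return []
    if n <= len(pool):
        # Enough unique items: sample without replacement.
        return rng.sample(pool, n)
    # Keep every unique item once, then fill the rest with duplicates.
    return list(pool) + rng.choices(pool, k=n - len(pool))


rng = random.Random(0)  # fixed seed keeps the split reproducible
rare = [{"label": "rare", "score": s} for s in (0.1, 0.4, 0.9, 0.7)]
train_pool, val_pool = split_pool(rare, val_fraction=0.25, rng=rng)
train = sample_with_cap(train_pool, target=10, max_oversampling_ratio=2.0, rng=rng)
print(len(train_pool), len(val_pool), len(train))
```

Splitting before oversampling, as in this sketch, keeps duplicated rare examples from appearing on both sides of the train/val boundary.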