Sequential sample packing #2404
Conversation
10c7f21 to 3e30f45 (compare)
@DreamGenX can you make this PR editable by maintainers? I'm happy to help get the linting and tests passing for this.
@winglian done, thank you
cd8460b to 8ffbda0 (compare)
```
@@ -455,13 +455,18 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
    else:
        sampler_batch_size = cfg.micro_batch_size
        batch_max_len = cfg.sequence_len
    if cfg.curriculum_sampling:
```
Would it make sense to have pydantic validation warn if curriculum sampling is not enabled when sample_packing_sequentially is enabled?
There's a LOG.warn in the dataset class:
```python
if self.sequential and not isinstance(sampler, SequentialSampler):
    LOG.warn(
        "using sequential sample packing with non-sequential sampler, did you want to also enable curriculum_sampling?"
    )
```
Though it also makes sense to use sequential packing without curriculum sampling: that way you get the same order as if you did not do packing. It can be a useful way to get the benefits of packing with a minimal diff.
8ffbda0 to f70d84d (compare)
f70d84d to f107f54 (compare)
I rebased again against the latest main since there were still some conflicts and linting issues. Hopefully I didn't screw up the implementation.
```yaml
datasets:
  - path: mhenrichsen/alpaca_2k_test
    type: alpaca
  - path: mhenrichsen/alpaca_2k_test
```
Is this double dataset intentional?
that's how it was in the original config, it just doubles the data
Oh, I think you were basing off a config meant for deduplication, `examples/llama-3/lora-1b-deduplicate-sft.yml`. We won't need this for your case.
```yaml
val_set_size: 0.0
output_dir: ./outputs/lora-out

test_value: true
```
Maybe forgot to remove this?
Yep, seems unnecessary -- it's a leftover from the config I forked.
```yaml
sample_packing_sequentially: true
curriculum_sampling: true
```
Could we add both of these to the documentation (`docs/config.qmd`) to explain them?
```python
if self.sequential and not isinstance(sampler, SequentialSampler):
    LOG.warn(
        "using sequential sample packing with non-sequential sampler, did you want to also enable curriculum_sampling?"
    )
```
Similar to the earlier discussion, could we move this validation to the schema (`src/axolotl/utils/schemas/config.py`) and not here?
You can change the check to be `if sample_pack_sequential and not curriculum_sampling` instead.
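For illustration, here is a minimal sketch of how that check could live in a pydantic schema rather than the dataset class. The class and field names below are placeholders, not the actual schema in `src/axolotl/utils/schemas/config.py`:

```python
# Hypothetical sketch of a schema-level warning; not the real axolotl config class.
import logging

from pydantic import BaseModel, model_validator

LOG = logging.getLogger(__name__)


class PackingConfig(BaseModel):
    sample_packing_sequentially: bool = False
    curriculum_sampling: bool = False

    @model_validator(mode="after")
    def warn_sequential_packing_without_curriculum(self):
        # Warn rather than error: sequential packing without curriculum sampling
        # is still a valid setup (it simply follows the underlying sampler's order).
        if self.sample_packing_sequentially and not self.curriculum_sampling:
            LOG.warning(
                "sample_packing_sequentially is enabled without curriculum_sampling; "
                "packing will follow the sampler order"
            )
        return self
```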
@winglian @NanoCode012 BTW, when I run
f107f54 to 51f607d (compare)
Description
This change adds support for next-fit bin packing in `MultipackBatchSampler`, which means sample packing can now be used while preserving the order of the examples.
Motivation and Context
Current sample packing affects the order of examples in ways that can change training results. This PR adds a new flag, `--sample_pack_sequentially`, which uses a simple greedy / sequential next-fit bin packing. If you use `--sample_pack_sequentially` alone, the order of the examples is determined by the underlying `RandomSampler`. If you use `--sample_pack_sequentially` together with `--curriculum_sampling`, the order of the examples is the same as in the training data. Both options make sense and differ from the current behavior. For example, previously it was not possible to do proper curriculum learning, since the bin packing algorithm would reorder the examples anyway.
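To sketch the idea (this is an illustration of next-fit packing, not the actual `MultipackBatchSampler` code): the packer walks samples in sampler order and starts a new bin whenever the current one would overflow, so earlier bins are never revisited and the original order is preserved.

```python
# Minimal sketch of sequential next-fit bin packing over token lengths.
# Assumes each bin may hold at most batch_max_len tokens; returns bins of sample indices.
def next_fit_pack(lengths: list[int], batch_max_len: int) -> list[list[int]]:
    bins: list[list[int]] = []
    current: list[int] = []
    current_len = 0
    for idx, length in enumerate(lengths):
        if current and current_len + length > batch_max_len:
            # Close the current bin; never revisit it, so sample order is kept.
            bins.append(current)
            current, current_len = [], 0
        current.append(idx)
        current_len += length
    if current:
        bins.append(current)
    return bins


# Example: lengths [60, 50, 40, 30] with batch_max_len=100 pack as [[0], [1, 2], [3]],
# whereas a length-sorting packer would be free to reorder the samples.
```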
How has this been tested?
I tested it by running the new config `examples/llama-3/lora-1b-sample-packing-sequentially.yml`. On this dataset, packing efficiency is high (>97%).
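For context on what that efficiency figure measures, a rough sketch (not necessarily the exact bookkeeping used in the sampler): efficiency is the share of the packed token budget occupied by real tokens.

```python
# Rough sketch: fraction of the packed token budget (num_bins * batch_max_len)
# that is filled with real tokens rather than padding.
def packing_efficiency(lengths: list[int], bins: list[list[int]], batch_max_len: int) -> float:
    used = sum(lengths[i] for bin_indices in bins for i in bin_indices)
    return used / (len(bins) * batch_max_len)


# e.g. lengths [60, 50, 40, 30] packed as [[0], [1, 2], [3]] with batch_max_len=100
# gives 180 / 300 = 0.6; many short samples per bin push this toward 1.0.
```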
I would also like to bring attention to a potential bug in the multipack code: `group_size` and `bin_size`