
Sequential sample packing #2404

Merged: 2 commits merged into axolotl-ai-cloud:main from sequential_packing on Mar 31, 2025

Conversation

DreamGenX (Contributor)

Description

This change adds support for next-fit bin packing in MultipackBatchSampler, which means sample packing can now preserve the order of the examples.

Motivation and Context

The current sample packing reorders examples in ways that can affect training results. This PR adds a new option, sample_packing_sequentially, which uses a simple greedy, sequential next-fit bin packing.

If you use sample_packing_sequentially alone, the order of the examples is determined by the underlying RandomSampler. If you use it together with curriculum_sampling, the order of the examples matches the order in the training data. Both options are useful and differ from the current behavior.

For example, proper curriculum learning was previously not possible because the bin packing algorithm would reorder the examples anyway.
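
For illustration, here is a minimal sketch of next-fit bin packing over token lengths (a simplified example, not the actual MultipackBatchSampler implementation):

def next_fit_pack(sample_lengths, batch_max_len):
    """Greedily pack samples, in order, into bins of capacity batch_max_len.

    Each sample goes into the current bin if it fits; otherwise a new bin is
    started. The original sample order is preserved within and across bins.
    """
    bins = []
    current_bin, current_len = [], 0
    for idx, length in enumerate(sample_lengths):
        if current_bin and current_len + length > batch_max_len:
            bins.append(current_bin)
            current_bin, current_len = [], 0
        current_bin.append(idx)
        current_len += length
    if current_bin:
        bins.append(current_bin)
    return bins

# With batch_max_len=10, lengths [4, 3, 5, 2, 6] pack to [[0, 1], [2, 3], [4]];
# the order is preserved, unlike packing strategies that sort samples by length.
next_fit_pack([4, 3, 5, 2, 6], 10)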

How has this been tested?

I tested it by running the new config: examples/llama-3/lora-1b-sample-packing-sequentially.yml
On this dataset, packing efficiency is high (>97%).


I would also like to bring attention to a potential bug in the multipack code:

DreamGenX force-pushed the sequential_packing branch from 10c7f21 to 3e30f45 on March 16, 2025 09:57
winglian (Collaborator)

@DreamGenX can you make this PR editable by maintainers? I'm happy to help get the linting and tests passing for this.

DreamGenX (Contributor, Author)

@winglian done, thank you

winglian force-pushed the sequential_packing branch from cd8460b to 8ffbda0 on March 21, 2025 14:20
@@ -455,13 +455,18 @@ def calculate_total_num_steps(cfg, train_dataset, update=True):
else:
sampler_batch_size = cfg.micro_batch_size
batch_max_len = cfg.sequence_len
if cfg.curriculum_sampling:
Collaborator

would it make sense to have pydantic validation to warn if curriculum sampling is not enabled when sample_packing_sequentially is enabled?

DreamGenX (Contributor, Author)

There's a LOG.warn in the dataset class:

        if self.sequential and not isinstance(sampler, SequentialSampler):
            LOG.warn(
                "using sequential sample packing with non-sequential sampler, did you want to also enable curriculum_sampling?"
            )

Though it also makes sense to use sequential packing without curriculum sampling: that way you get the same order as if you did not do packing. It can be a useful way to get the benefits of packing with a minimal diff.
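
To make the relationship concrete, here is a sketch of how the sampler choice maps to curriculum_sampling (an assumed shape based on this discussion, not the PR's exact code in calculate_total_num_steps):

from torch.utils.data import Dataset, RandomSampler, SequentialSampler

def choose_sampler(cfg, train_dataset: Dataset):
    # With curriculum_sampling the examples are visited in their original order;
    # otherwise a RandomSampler decides the order. Sequential packing then bins
    # examples in whatever order the chosen sampler yields them.
    if cfg.curriculum_sampling:
        return SequentialSampler(train_dataset)
    return RandomSampler(train_dataset)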

DreamGenX force-pushed the sequential_packing branch from 8ffbda0 to f70d84d on March 22, 2025 08:37
winglian force-pushed the sequential_packing branch from f70d84d to f107f54 on March 23, 2025 12:52
winglian (Collaborator)

I rebased again against the latest main since there were still some conflicts and linting issues. Hopefully I didn't screw up the implementation.

winglian requested a review from NanoCode012 on March 24, 2025 07:51
datasets:
- path: mhenrichsen/alpaca_2k_test
type: alpaca
- path: mhenrichsen/alpaca_2k_test
Collaborator

Is this double dataset intentional?

DreamGenX (Contributor, Author)

That's how it was in the original config; it just doubles the data.

Collaborator

Oh, I think you were basing this off a config meant for deduplication (examples/llama-3/lora-1b-deduplicate-sft.yml). We won't need this for your case.

val_set_size: 0.0
output_dir: ./outputs/lora-out

test_value: true
Collaborator

Maybe forgot to remove this?

DreamGenX (Contributor, Author)

Yep, it seems unnecessary; it's a leftover from the config I forked.

Comment on lines +25 to +26
sample_packing_sequentially: true
curriculum_sampling: true
NanoCode012 (Collaborator), Mar 24, 2025

Could we add both of these to the documentation (docs/config.qmd) to explain them?

Comment on lines +190 to +193
if self.sequential and not isinstance(sampler, SequentialSampler):
LOG.warn(
"using sequential sample packing with non-sequential sampler, did you want to also enable curriculum_sampling?"
)

NanoCode012 (Collaborator), Mar 24, 2025

Similar to the earlier discussion, could we move this validation to the schema src/axolotl/utils/schemas/config.py and not here?

Collaborator

You can change the check to: if sample_packing_sequentially and not curriculum_sampling instead.
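
For reference, a schema-level check along these lines might look like the following. This is a minimal sketch assuming pydantic v2; the model name PackingConfig and the field defaults are placeholders, not the actual classes in src/axolotl/utils/schemas/config.py:

import logging

from pydantic import BaseModel, model_validator

LOG = logging.getLogger(__name__)


class PackingConfig(BaseModel):  # hypothetical stand-in for the real config schema
    sample_packing_sequentially: bool = False
    curriculum_sampling: bool = False

    @model_validator(mode="after")
    def warn_sequential_packing_without_curriculum(self):
        # Mirrors the suggested check: warn when sequential packing is enabled
        # but curriculum sampling is not, so ordering still follows a random sampler.
        if self.sample_packing_sequentially and not self.curriculum_sampling:
            LOG.warning(
                "sample_packing_sequentially is enabled without curriculum_sampling; "
                "bins are packed sequentially, but the sampler order is still random"
            )
        return self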

DreamGenX (Contributor, Author)

@winglian @NanoCode012 BTW, when I run train, I see both "using non-sequential sample packing" and "using sequential sample packing" in the logs (I think due to eval, which is strange), but during pre-process I only see "using sequential sample packing", as expected. I don't know why, because all occurrences of MultipackBatchSampler get the sequential flag (even the instance used for eval), except for the one used for pre-train... but maybe I missed something.

DreamGenX force-pushed the sequential_packing branch from f107f54 to 51f607d on March 29, 2025 14:09
winglian merged commit 4d36ecc into axolotl-ai-cloud:main on Mar 31, 2025
13 checks passed