
Commit 19b34bc

[megatron/dpo] fix megatron packing_cache & update DPOTrainer (#4556)
1 parent cb4ac9b commit 19b34bc

10 files changed: +56 -56 lines changed

docs/source/Instruction/Megatron-SWIFT训练.md

Lines changed: 5 additions & 4 deletions
@@ -172,7 +172,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - seq_length: 默认为None,即设置为`max_length`。对数据集长度进行限制请使用基本参数中的`--max_length`控制,无需设置此参数。
 - use_cpu_initialization: 在cpu上初始化权重,默认为False。在进行HF和MCore权重转换时会被使用。
 - no_create_attention_mask_in_dataloader: 在dataloader中不创建attention mask,默认为True。
-- extra_megatron_kwargs: Additional parameters passed to Megatron, provided as a JSON object. Defaults to None.
+- extra_megatron_kwargs: 传入megatron的其他参数,使用json传递。默认为None。
 
 **学习率参数**:
 - 🔥lr: 初始学习率,最终会根据学习率预热策略和衰减策略决定每个迭代的学习率,默认为1e-5。
@@ -221,7 +221,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 
 **日志参数**:
 - log_params_norm: 记录参数的norm。默认为False。
-- log_throughput: 记录每个GPU的吞吐量。默认为True
+- log_throughput: 记录每个GPU的吞吐量。默认为False
 - 注意:在非packing情况下,log_throughput并不准确,因为`seq_length`并不等于真实序列长度。
 - tensorboard_log_interval: 记录到tensorboard的间隔(steps),默认为1。
 - tensorboard_queue_size: 队列长度(与磁盘IO相关),类似于写入的间隔。默认为50。
@@ -235,7 +235,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 - wandb_save_dir: 本地保存 wandb 结果的路径。默认为''。
 
 **评估参数**:
-- 🔥eval_iters: 评估的迭代次数,默认为100。
+- 🔥eval_iters: 评估的迭代次数,默认为-1,根据验证数据集的数量设置合适的值。
+- 注意:若使用流式数据集,该值需要手动设置。
 - 🔥eval_interval: 评估的间隔(steps),默认为None,即设置为save_interval。
 
 **混合精度参数**:
@@ -295,7 +296,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I
 Megatron训练参数继承自Megatron参数和基本参数。基本参数的内容可以参考[这里](./命令行参数.md#基本参数)。此外还包括以下参数:
 
 - add_version: 在`save`上额外增加目录`'<版本号>-<时间戳>'`防止权重覆盖,默认为True。
-- 🔥packing: 是否使用序列packing,默认为False。
+- 🔥packing: 是否使用序列packing,默认为False。当前支持`megatron pt/sft`
 - 🔥packing_cache: 指定 packing 缓存目录。默认值为`None`,表示缓存将存储在环境变量 `$MODELSCOPE_CACHE`所指定的路径下。在跨节点使用 packing 功能时,需确保所有节点的 packing 缓存路径共享且一致。你可以通过设置`MODELSCOPE_CACHE`环境变量,或在命令行中添加 `--packing_cache <shared_path>`参数来实现这一要求。
 - 🔥streaming: 流式读取并处理数据集,默认False。通常在处理大型数据集时,设置为True。更多流式的参数查看命令行参数文档。
 - lazy_tokenize: 默认为False。若该参数设置为False,则在训练之前对所有的数据集样本进行tokenize(这可以避免在训练中出现报错);设置为True,则在训练中对数据集进行tokenize(这可以节约内存)。

docs/source_en/Instruction/Megatron-SWIFT-Training.md

Lines changed: 5 additions & 4 deletions
@@ -175,7 +175,7 @@ The speed comparison of full-parameter training for Dense/MoE models using `mega
 seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the dataset length, please use the `--max_length` parameter in the basic arguments; there is no need to set this parameter.
 - use_cpu_initialization: Initializes weights on the CPU, default is False. Used during HF and MCore weight conversion.
 - no_create_attention_mask_in_dataloader: Does not create an attention mask in the dataloader, default is True.
-- extra_megatron_kwargs: 传入megatron的其他参数,使用json传递。默认为None。
+- extra_megatron_kwargs: Additional parameters passed to Megatron, provided as a JSON object. Defaults to None.
 
 **Learning Rate Parameters**:
 
@@ -229,7 +229,7 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
 **Logging Parameters**:
 
 - log_params_norm: Logs the norm of parameters. Default is False.
-- log_throughput: Logs throughput per GPU. Default is True.
+- log_throughput: Logs throughput per GPU. Default is False.
 - Note: In non-packing scenarios, log_throughput is not accurate because `seq_length` does not equal the actual sequence length.
 - tensorboard_log_interval: Interval (steps) for logging to TensorBoard, default is 1.
 - tensorboard_queue_size: Queue length (related to disk I/O), similar to write intervals. Default is 50.
@@ -244,7 +244,8 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
 
 **Evaluation Parameters**:
 
-- 🔥eval_iters: Number of evaluation iterations, default is 100.
+- 🔥eval_iters: The number of iterations for evaluation. Defaults to -1, and a suitable value will be set based on the size of the validation dataset.
+- Note: If using a streaming dataset, this value needs to be set manually.
 - 🔥eval_interval: Evaluation interval (steps), default is None, meaning it will be set to save_interval.
 
 **Mixed Precision Parameters**:
@@ -306,7 +307,7 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
 Megatron training parameters inherit from Megatron parameters and basic parameters. For information on basic parameters, see [here](./Command-line-parameters.md#base-arguments). Additionally, the following parameters are included:
 
 - add_version: Adds a directory `<version>-<timestamp>` to `save` to prevent overwriting weights, default is True.
-- 🔥packing: Whether to use sequence packing, defaults to False.
+- 🔥packing: Whether to use sequence packing, defaults to False. Currently supports `megatron pt/sft`.
 - 🔥packing_cache: Specifies the directory for packing cache. The default value is `None`, which means the cache will be stored in the path defined by the environment variable `$MODELSCOPE_CACHE`. When using the packing feature across multiple nodes, ensure that all nodes share the same packing cache directory. You can achieve this by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument in the command line.
 - 🔥streaming: Stream reading and processing of the dataset, default is False. It is typically set to True when handling large datasets. For more information on streaming parameters, refer to the command-line parameters documentation.
 - lazy_tokenize: Default is False. If this parameter is set to False, all dataset samples are tokenized before training (this avoids errors during training); if set to True, tokenization occurs during training (this saves memory).
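
As a concrete illustration of the shared `packing_cache` requirement documented above, the sketch below shows one way a node could resolve the directory: an explicit `--packing_cache` value wins, otherwise `$MODELSCOPE_CACHE` is used. The helper name `resolve_packing_cache` and the fallback path are assumptions for this sketch, not ms-swift's actual implementation.

```python
import os
from typing import Optional


def resolve_packing_cache(packing_cache: Optional[str] = None) -> str:
    """Illustrative sketch: return the directory used for the packing cache."""
    if packing_cache is not None:
        cache_dir = packing_cache
    else:
        # Fallback path is an assumption made for this sketch.
        cache_dir = os.environ.get('MODELSCOPE_CACHE', os.path.expanduser('~/.cache/modelscope'))
    os.makedirs(cache_dir, exist_ok=True)
    return cache_dir


# For multi-node packing, every node must resolve to the same shared path,
# e.g. by exporting MODELSCOPE_CACHE=/mnt/shared/cache on all nodes.
print(resolve_packing_cache())
```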

swift/llm/model/model/qwen.py

Lines changed: 1 addition & 0 deletions
@@ -936,4 +936,5 @@ def update(self, key_states: torch.Tensor, value_states: torch.Tensor, layer_idx
         ],
         TemplateType.qwen3_emb,
         get_model_tokenizer_with_flash_attn,
+        additional_saved_files=['config_sentence_transformers.json', '1_Pooling', 'modules.json'],
         architectures=['Qwen3ForCausalLM']))
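
This registration change appears to pair with the removal of the hard-coded `copy_files_by_pattern` calls from `swift/trainers/mixin.py` later in this commit: the extra SentenceTransformers artifacts are now declared per model via `additional_saved_files`. How the trainer consumes that list is not shown in this diff; the helper below is a hypothetical sketch of the copy step, with made-up paths.

```python
import os
import shutil
from typing import List


def copy_additional_saved_files(model_dir: str, output_dir: str, additional_saved_files: List[str]) -> None:
    """Hypothetical sketch: copy the declared extra files/directories next to the saved weights."""
    for name in additional_saved_files:
        src = os.path.join(model_dir, name)
        dst = os.path.join(output_dir, name)
        if os.path.isdir(src):
            shutil.copytree(src, dst, dirs_exist_ok=True)  # e.g. the '1_Pooling' directory
        elif os.path.isfile(src):
            shutil.copy2(src, dst)  # e.g. 'modules.json'


copy_additional_saved_files(
    './Qwen3-Embedding-0.6B', './output/checkpoint-100',
    ['config_sentence_transformers.json', '1_Pooling', 'modules.json'])
```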

swift/megatron/argument/megatron_args.py

Lines changed: 2 additions & 2 deletions
@@ -150,7 +150,7 @@ class MegatronArguments(ExtraMegatronArguments):
 
     # logging
     log_params_norm: bool = False
-    log_throughput: bool = True
+    log_throughput: bool = False
     tensorboard_log_interval: int = 1
     tensorboard_queue_size: int = 50
     log_timers_to_tensorboard: bool = True
@@ -163,7 +163,7 @@ class MegatronArguments(ExtraMegatronArguments):
     wandb_save_dir: Optional[str] = None
 
     # evaluate
-    eval_iters: int = 100
+    eval_iters: int = -1
     eval_interval: Optional[int] = None
 
     # other

swift/megatron/train/sft.py

Lines changed: 18 additions & 9 deletions
@@ -37,19 +37,28 @@ def __init__(self, args: Union[List[str], MegatronTrainArguments, None] = None)
         args.save_args(args.save)
 
     @contextmanager
-    def _get_train_iters(self, train_dataset):
-        from megatron.training import training
+    def _get_iters(self, train_dataset, val_dataset):
         origin_initialize_megatron = training.initialize_megatron
 
         def initialize_megatron(*_args, **kwargs):
             res = origin_initialize_megatron(*_args, **kwargs)
             args = get_args()
-            if args.train_iters is None and hasattr(train_dataset, '__len__'):
-                data_parallel_size = mpu.get_data_parallel_world_size()
-                step_batch_size = \
-                    args.micro_batch_size * data_parallel_size
-                dataset_sample = len(train_dataset) // step_batch_size * step_batch_size
-                args.train_iters = (dataset_sample * args.max_epochs // args.global_batch_size) + 1
+            data_parallel_size = mpu.get_data_parallel_world_size()
+            step_batch_size = args.micro_batch_size * data_parallel_size
+            if args.train_iters is None:
+                if hasattr(train_dataset, '__len__'):
+                    dataset_sample = len(train_dataset) // step_batch_size * step_batch_size
+                    args.train_iters = (dataset_sample * args.max_epochs // args.global_batch_size) + 1
+                else:
+                    raise ValueError(
+                        'You are using a streaming training dataset. Please explicitly specify `--train_iters`.')
+            if val_dataset is not None and args.eval_iters < 0:
+                if hasattr(val_dataset, '__len__'):
+                    dataset_sample = len(val_dataset) // step_batch_size * step_batch_size
+                    args.eval_iters = max(dataset_sample // args.global_batch_size, 1)
+                else:
+                    raise ValueError(
+                        'You are using a streaming validation dataset. Please explicitly specify `--eval_iters`.')
             return res
 
         training.initialize_megatron = initialize_megatron
@@ -136,7 +145,7 @@ def run(self):
         logging_path = os.path.join(args.save, 'logging.jsonl')
         logger.info(f'The logging file will be saved in: {logging_path}')
         try:
-            with patch_megatron_data_collator(data_collator), self._get_train_iters(train_dataset):
+            with patch_megatron_data_collator(data_collator), self._get_iters(train_dataset, val_dataset):
                 extra_args_provider = args.megatron_model_meta.extra_args_provider
                 pretrain(
                     datasets_provider,
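
To make the `train_iters` / `eval_iters` arithmetic in `_get_iters` above concrete, here is a small standalone sketch with made-up numbers (the dataset sizes, batch sizes and `max_epochs` below are illustrative assumptions, not values from this commit):

```python
# Standalone illustration of the iteration arithmetic in _get_iters;
# in real training these values come from Megatron's parsed args.
micro_batch_size = 1
data_parallel_size = 8
global_batch_size = 16
max_epochs = 1

train_samples = 10_000
val_samples = 200

step_batch_size = micro_batch_size * data_parallel_size             # 8

# Drop the tail that does not fill a whole step batch, then convert to iterations.
train_sample = train_samples // step_batch_size * step_batch_size   # 10000
train_iters = (train_sample * max_epochs // global_batch_size) + 1  # 626

# eval_iters < 0 means "derive it from the validation set"; at least 1 iteration.
val_sample = val_samples // step_batch_size * step_batch_size       # 200
eval_iters = max(val_sample // global_batch_size, 1)                # 12

print(train_iters, eval_iters)  # 626 12
```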

swift/megatron/train/utils.py

Lines changed: 2 additions & 0 deletions
@@ -83,11 +83,13 @@ def _broadcast(item):
             _broadcast(batch['position_ids'])
 
         elif mpu.is_pipeline_first_stage():
+            batch['labels'] = None
             _broadcast(batch['input_ids'])
             _broadcast(batch['attention_mask'])
             _broadcast(batch['position_ids'])
 
         elif mpu.is_pipeline_last_stage():
+            batch['input_ids'] = None
             _broadcast(batch['labels'])
             _broadcast(batch['attention_mask'])
             _broadcast(batch['position_ids'])
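
A toy illustration of why the hunk above clears the unused field on each pipeline stage: the first stage only needs `input_ids` to compute embeddings, while the last stage only needs `labels` to compute the loss, so the other tensor can be dropped before broadcasting. The helper below is illustrative only and simulates the stage checks with booleans.

```python
def prune_batch_for_stage(batch: dict, is_first_stage: bool, is_last_stage: bool) -> dict:
    """Illustrative only: drop the tensor a pipeline stage never reads."""
    batch = dict(batch)
    if is_first_stage and not is_last_stage:
        batch['labels'] = None      # the loss is computed on the last stage
    if is_last_stage and not is_first_stage:
        batch['input_ids'] = None   # embeddings were computed on the first stage
    return batch


example = {'input_ids': [1, 2, 3], 'labels': [2, 3, 4], 'attention_mask': None, 'position_ids': [0, 1, 2]}
print(prune_batch_for_stage(example, is_first_stage=True, is_last_stage=False))
```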

swift/trainers/mixin.py

Lines changed: 0 additions & 5 deletions
@@ -223,11 +223,6 @@ def _save_model(self, output_dir: Optional[str] = None, state_dict=None):
         else:
             if self.model.__class__.__name__ != 'SentenceTransformer':
                 self.model.save_pretrained(output_dir, state_dict=state_dict, safe_serialization=save_safetensors)
-                # For embedding models, they should copy extra sentence_transformers files
-                from swift.utils import copy_files_by_pattern
-                copy_files_by_pattern(self.model.model_dir, output_dir, 'config_sentence_transformers.json')
-                copy_files_by_pattern(self.model.model_dir, output_dir, '1_Pooling/config.json')
-                copy_files_by_pattern(self.model.model_dir, output_dir, 'modules.json')
             else:
 
     @contextmanager

swift/trainers/rlhf_trainer/dpo_trainer.py

Lines changed: 14 additions & 15 deletions
@@ -39,17 +39,11 @@ def __init__(self,
 
         super().__init__(model, ref_model, *_args, **kwargs)
 
-    def get_nll_loss(self, logits, labels):
-        # Flatten the tokens
-        loss_fct = nn.CrossEntropyLoss(ignore_index=self.label_pad_token_id)
-        logits = logits.view(-1, logits.shape[-1])
-        labels = labels.view(-1)
-        # Enable model parallelism
-        labels = labels.to(logits.device)
-        return loss_fct(logits, labels)
-
     def concatenated_forward(
-        self, model: nn.Module, batch: Dict[str, Union[List, torch.LongTensor]], **kwargs
+            self,
+            model: nn.Module,
+            batch: Dict[str, Union[List, torch.LongTensor]],
+            is_ref_model: bool = False
     ) -> Tuple[torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor, torch.FloatTensor]:
         batch = batch.copy()
         labels = batch.pop('labels', None)
@@ -76,7 +70,9 @@ def concatenated_forward(
         if not self.is_encoder_decoder and self.template.sequence_parallel_size == 1:
             # Shift so that tokens < n predict n
             labels = torch.roll(labels, shifts=-1, dims=1)
-        per_token_logps, mean_all_logits, loss_mask = self.get_per_token_logps(all_logits, labels)
+        per_token_logps, mean_all_logits, loss_mask = self.get_per_token_logps(
+            all_logits, labels, label_pad_token_id=self.label_pad_token_id)
+        origin_per_token_logps = per_token_logps
         if self.loss_type == 'ipo':
             size_completion = loss_mask.sum(dim=-1)
             per_token_logps = per_token_logps / size_completion
@@ -90,15 +86,17 @@ def concatenated_forward(
                 all_logps[i] = per_token_logps[:, start:end].sum()
             num_examples = all_logps.shape[0] // 2
             num_tokens = cu_seqlens[num_examples]
-            output['nll_loss'] = self.get_nll_loss(all_logits[:, :num_tokens], labels[:, :num_tokens])
+            if not is_ref_model:
+                output['nll_loss'] = -origin_per_token_logps[:, :num_tokens][loss_mask[:, :num_tokens]].mean()
             output['chosen_logps'] = all_logps[:num_examples]
             output['rejected_logps'] = all_logps[num_examples:]
             output['mean_chosen_logits'] = mean_all_logits[:, :num_tokens][loss_mask[:, :num_tokens]].mean()
             output['mean_rejected_logits'] = mean_all_logits[:, num_tokens:][loss_mask[:, num_tokens:]].mean()
         else:
             all_logps = per_token_logps.sum(-1)
             num_examples = labels.shape[0] // 2
-            output['nll_loss'] = self.get_nll_loss(all_logits[:num_examples], labels[:num_examples])
+            if not is_ref_model:
+                output['nll_loss'] = -origin_per_token_logps[:num_examples][loss_mask[:num_examples]].mean()
             output['chosen_logps'] = all_logps[:num_examples]
             output['rejected_logps'] = all_logps[num_examples:]
             output['mean_chosen_logits'] = mean_all_logits[:num_examples][loss_mask[:num_examples]].mean()
@@ -107,15 +105,16 @@ def concatenated_forward(
             output['aux_loss'] = outputs.aux_loss
         return output
 
+    @staticmethod
     def get_per_token_logps(
-        self,
             logits: torch.FloatTensor,
             labels: torch.LongTensor,
+            label_pad_token_id=-100,
     ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
         if logits.shape[:-1] != labels.shape:
             raise ValueError(f'Logits (batch and sequence length dim) {logits.shape[:-1]}'
                              'and labels must have the same shape {labels.shape}')
-        loss_mask = labels != self.label_pad_token_id
+        loss_mask = labels != label_pad_token_id
         labels = labels.clone()
         labels[~loss_mask] = 0
         # https://github.com/huggingface/trl/pull/2799
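
The removal of `get_nll_loss` above relies on the negative mean of the masked per-token log-probabilities being equal to the token-averaged cross-entropy over non-padded labels. A small self-contained check with random tensors (illustrative only; the trainer's real inputs are already shifted and may be packed):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
label_pad_token_id = -100
logits = torch.randn(2, 6, 11)        # (batch, seq, vocab)
labels = torch.randint(0, 11, (2, 6))
labels[:, -2:] = label_pad_token_id   # pad a few positions

# Old formulation: flatten and let CrossEntropyLoss ignore padded labels.
ce = nn.CrossEntropyLoss(ignore_index=label_pad_token_id)
old_nll = ce(logits.reshape(-1, logits.shape[-1]), labels.reshape(-1))

# New formulation: gather per-token log-probs, mask, negate the mean.
loss_mask = labels != label_pad_token_id
gather_labels = labels.clone()
gather_labels[~loss_mask] = 0
per_token_logps = torch.log_softmax(logits, dim=-1).gather(-1, gather_labels.unsqueeze(-1)).squeeze(-1)
new_nll = -per_token_logps[loss_mask].mean()

assert torch.allclose(old_nll, new_nll, atol=1e-6)
print(old_nll.item(), new_nll.item())
```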

swift/trainers/sequence_parallel/ulysses.py

Lines changed: 4 additions & 9 deletions
@@ -176,13 +176,13 @@ def old_policy(self):
 
 
 # For DPO
-def get_per_token_logps(self,
-                        logits: torch.FloatTensor,
+def get_per_token_logps(logits: torch.FloatTensor,
                         labels: torch.LongTensor,
+                        label_pad_token_id=-100,
                         ulysses=None) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
     if labels.shape[1] > logits.shape[1]:
         _, _, labels, _, _, _ = ulysses.pad_and_split_inputs(None, None, labels, None, None, None)
-    loss_mask = labels != self.label_pad_token_id
+    loss_mask = labels != label_pad_token_id
     labels = labels.clone() # No need to shift, pad and split has shifted the inputs.
     labels[~loss_mask] = 0
     labels = labels.to(logits.device)
@@ -840,12 +840,7 @@ def prepare_trainer(self, trainer):
         elif trainer.__class__.__name__ == 'DPOTrainer':
             trainer._origin_prepare_inputs = trainer._prepare_inputs
             trainer._prepare_inputs = MethodType(partial(_prepare_inputs, ulysses=self), trainer)
-            trainer.get_per_token_logps = MethodType(partial(get_per_token_logps, ulysses=self), trainer)
-
-            def rlhf_loss_scale_sp_func(_, *args, **kwargs):
-                return loss_scale_sp_func(*args, ulysses=self, **kwargs)
-
-            trainer.get_nll_loss = MethodType(rlhf_loss_scale_sp_func, trainer)
+            trainer.get_per_token_logps = partial(get_per_token_logps, ulysses=self)
 
         elif trainer.__class__.__name__ == 'GRPOTrainer':
             assert version.parse(trl.__version__) >= version.parse('0.18.0')
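
A toy illustration of the patching change above: the old code attached `get_per_token_logps` as a bound method via `MethodType` (so it received the trainer as `self`), while the new code attaches a plain `functools.partial`, so the function no longer needs a `self` parameter. The classes and functions below are stand-ins, not ms-swift code.

```python
from functools import partial
from types import MethodType


def scaled(value, factor=1):
    return value * factor


class Trainer:
    pass


trainer = Trainer()

# Old style: a bound method; the wrapped callable needs an explicit `self` slot.
trainer.scaled_method = MethodType(lambda self, value: scaled(value, factor=2), trainer)

# New style: a plain partial stored as an attribute; no implicit first argument.
trainer.scaled_fn = partial(scaled, factor=2)

print(trainer.scaled_method(3), trainer.scaled_fn(3))  # 6 6
```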

swift/utils/torch_utils.py

Lines changed: 5 additions & 8 deletions
@@ -413,21 +413,18 @@ def check_shared_disk(error, cache_dir: Optional[str] = None):
     os.makedirs(cache_dir, exist_ok=True)
     tmp_path = os.path.join(cache_dir, 'check_shared_disk.tmp')
     is_shared_disk = True
-    with safe_ddp_context(None, True):
-        if os.path.exists(tmp_path):
-            os.remove(tmp_path)
+
     try:
         with safe_ddp_context(None, True):
             if is_master():
                 with open(tmp_path, 'w'):
                     pass
-            else:
-                if not os.path.exists(tmp_path):
-                    is_shared_disk = False
+            if not os.path.exists(tmp_path):
+                is_shared_disk = False
+            shared_state = [None] * dist.get_world_size()
+            dist.all_gather_object(shared_state, is_shared_disk)
     finally:
         if is_master() and os.path.exists(tmp_path):
             os.remove(tmp_path)
-    shared_state = [None] * dist.get_world_size()
-    dist.all_gather_object(shared_state, is_shared_disk)
     if not all(shared_state):
         raise error
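
The same idea as `check_shared_disk` above in a standalone form: rank 0 writes a marker file into the candidate cache directory, every rank checks whether it can see the file, and the results are gathered so all ranks agree. This sketch uses a plain `dist.barrier()` instead of swift's `safe_ddp_context`, is meant to be launched with `torchrun`, and is illustrative rather than the library's code.

```python
import os
import torch.distributed as dist


def check_shared_dir(cache_dir: str) -> None:
    rank = dist.get_rank()
    os.makedirs(cache_dir, exist_ok=True)
    tmp_path = os.path.join(cache_dir, 'check_shared_disk.tmp')
    try:
        if rank == 0:
            with open(tmp_path, 'w'):
                pass
        dist.barrier()                      # ensure rank 0 wrote the marker first
        visible = os.path.exists(tmp_path)  # other ranks only see it on a shared disk
        states = [None] * dist.get_world_size()
        dist.all_gather_object(states, visible)
    finally:
        dist.barrier()
        if rank == 0 and os.path.exists(tmp_path):
            os.remove(tmp_path)
    if not all(states):
        raise OSError(f'{cache_dir} is not shared across all ranks.')


if __name__ == '__main__':
    dist.init_process_group('gloo')
    check_shared_dir(os.environ.get('MODELSCOPE_CACHE', '/tmp/packing_cache'))
    dist.destroy_process_group()
```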
