[megatron] support DPO #4193

Merged

merged 49 commits into from
Jun 11, 2025

49 commits
2200de1
support megatron dpo
Jintao-Huang May 13, 2025
3a21dca
update
Jintao-Huang May 13, 2025
21d044a
update
Jintao-Huang May 13, 2025
29714f1
update
Jintao-Huang May 13, 2025
48bdef9
update
Jintao-Huang May 13, 2025
6806521
update
Jintao-Huang May 13, 2025
cc6b0f6
Merge branch 'main' into support_megatron_dpo
Jintao-Huang May 13, 2025
ebfe9d1
update
Jintao-Huang May 13, 2025
5802d8e
update
Jintao-Huang May 13, 2025
fef4b9b
update
Jintao-Huang May 13, 2025
0a69513
Merge branch 'main' into support_megatron_dpo
Jintao-Huang May 14, 2025
7a595d3
update
Jintao-Huang May 14, 2025
9dde75a
update
Jintao-Huang May 15, 2025
2552bda
Merge branch 'main' into support_megatron_dpo
Jintao-Huang May 28, 2025
007c9ed
update
Jintao-Huang May 28, 2025
b56b601
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 1, 2025
96e5f3a
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 1, 2025
54244e5
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 3, 2025
3382b0c
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 5, 2025
1f0f411
update
Jintao-Huang Jun 5, 2025
615befc
update
Jintao-Huang Jun 5, 2025
ab5bdfa
update
Jintao-Huang Jun 5, 2025
515d476
update
Jintao-Huang Jun 5, 2025
15db9af
update
Jintao-Huang Jun 6, 2025
f8d29a7
update shell
Jintao-Huang Jun 6, 2025
c96cef7
update
Jintao-Huang Jun 6, 2025
b6fc6d9
update
Jintao-Huang Jun 6, 2025
95f74fa
update
Jintao-Huang Jun 6, 2025
ac1c33b
update
Jintao-Huang Jun 6, 2025
7a75d93
update
Jintao-Huang Jun 6, 2025
d981bc5
update
Jintao-Huang Jun 6, 2025
9b089b5
fix dpo emoji dataset
Jintao-Huang Jun 7, 2025
93a7486
Merge branch 'fix_emoji_dpo_dataset' into support_megatron_dpo
Jintao-Huang Jun 7, 2025
db6fc3a
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 9, 2025
a6067bd
update
Jintao-Huang Jun 9, 2025
bd46a59
update
Jintao-Huang Jun 9, 2025
f8c8dcf
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 9, 2025
568b3aa
update
Jintao-Huang Jun 9, 2025
0dff938
update
Jintao-Huang Jun 9, 2025
c57a29e
update
Jintao-Huang Jun 10, 2025
79eca73
update
Jintao-Huang Jun 10, 2025
2ad6fd2
update
Jintao-Huang Jun 11, 2025
c330851
update
Jintao-Huang Jun 11, 2025
f3b5003
update
Jintao-Huang Jun 11, 2025
13c8696
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 11, 2025
d110ed3
fix
Jintao-Huang Jun 11, 2025
e92f79c
update
Jintao-Huang Jun 11, 2025
4fe4e4e
update
Jintao-Huang Jun 11, 2025
7e8df50
update
Jintao-Huang Jun 11, 2025
27 changes: 22 additions & 5 deletions docs/source/Instruction/Megatron-SWIFT训练.md
@@ -172,7 +172,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the dataset length, use the `--max_length` parameter in the basic arguments; there is no need to set this parameter.
- use_cpu_initialization: Initializes weights on the CPU; default is False. Used during HF and MCore weight conversion.
- no_create_attention_mask_in_dataloader: Does not create an attention mask in the dataloader; default is True.
- extra_megatron_kwargs: Additional parameters passed to Megatron, provided as a JSON object. Defaults to None.
- extra_megatron_kwargs: Other parameters passed into Megatron, provided via JSON. Defaults to None.

**Learning Rate Parameters**:
- 🔥lr: The initial learning rate; the learning rate of each iteration is ultimately determined by the warmup and decay strategies. Default is 1e-5.
@@ -221,7 +221,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I

**Logging Parameters**:
- log_params_norm: Logs the norm of parameters. Default is False.
- log_throughput: Logs throughput per GPU. Default is True.
- log_throughput: Logs throughput per GPU. Default is False.
- Note: In non-packing scenarios, log_throughput is not accurate because `seq_length` does not equal the actual sequence length.
- tensorboard_log_interval: Interval (steps) for logging to TensorBoard; default is 1.
- tensorboard_queue_size: Queue length (related to disk I/O), similar to the write interval. Default is 50.
@@ -235,7 +235,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- wandb_save_dir: Local path for saving wandb results. Default is ''.

**Evaluation Parameters**:
- 🔥eval_iters: Number of evaluation iterations; default is 100.
- 🔥eval_iters: Number of evaluation iterations. Defaults to -1; an appropriate value is set based on the size of the validation dataset.
- Note: If using a streaming dataset, this value needs to be set manually.
- 🔥eval_interval: Evaluation interval (steps); default is None, meaning it is set to save_interval.

**Mixed Precision Parameters**:
@@ -289,15 +290,31 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- moe_expert_capacity_factor: Capacity factor for each expert; None means no tokens will be dropped. Default is None.
- moe_shared_expert_overlap: Enables overlap between shared expert computation and dispatcher communication. If this option is not enabled, shared experts execute after the routed experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.

**DPO Parameters**:
- ref_load: The path from which to load the reference model. Defaults to None, meaning it is set to `load`.
- beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model; a higher beta means less deviation from the reference model. For the IPO loss (`loss_type="ipo"`), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e., the SFT loss) in the loss function: `loss = dpo_loss + rpo_alpha * nll_loss`. Default is 1.
- reference_free: Whether to ignore the provided reference model and implicitly use a reference model that assigns equal probability to all responses. Default is False.
- label_smoothing: Default is 0.
- f_divergence_type: Defaults to `reverse_kl`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
- loss_type: Defaults to 'sigmoid'. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
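For reference, here is a minimal sketch of how the default `loss_type='sigmoid'` combines `beta`, `label_smoothing`, and `reference_free`, assuming per-sequence log-probabilities already summed over response tokens. This is illustrative only and not the trainer's actual implementation; the parameter semantics follow the TRL documentation linked above.

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps,
                     beta=0.1, label_smoothing=0.0, reference_free=False):
    # Log-ratio of chosen vs. rejected under the policy and the reference model.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    if reference_free:
        # Behave as if the reference assigns equal probability to all responses.
        ref_logratios = torch.zeros_like(pi_logratios)
    logits = beta * (pi_logratios - ref_logratios)
    # label_smoothing softens the assumption that the preference labels are correct.
    losses = (-F.logsigmoid(logits) * (1 - label_smoothing)
              - F.logsigmoid(-logits) * label_smoothing)
    return losses.mean()
```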

### Megatron Training Parameters

### Training Parameters

Megatron training parameters inherit from the Megatron parameters and the basic parameters. For the basic parameters, see [here](./命令行参数.md#基本参数). The following parameters are also included:

- add_version: Adds an extra directory `'<version>-<timestamp>'` under `save` to prevent overwriting weights; default is True.
- 🔥packing: Whether to use sequence packing; default is False.
- 🔥packing: Whether to use sequence packing; default is False. Currently supports `megatron pt/sft`.
- 🔥packing_cache: Specifies the packing cache directory. Defaults to `None`, meaning the cache is stored in the path given by the `$MODELSCOPE_CACHE` environment variable. When using packing across nodes, make sure all nodes share the same packing cache path. You can do this by setting the `MODELSCOPE_CACHE` environment variable or by adding `--packing_cache <shared_path>` on the command line.
- 🔥streaming: Stream reading and processing of the dataset; default is False. Usually set to True when handling large datasets. See the command-line parameters documentation for more streaming parameters.
- lazy_tokenize: Default is False. If set to False, all dataset samples are tokenized before training (which avoids errors appearing during training); if set to True, the dataset is tokenized during training (which saves memory).
- max_epochs: Forces training to exit once `max_epochs` is reached, validating and saving the weights. This parameter is useful when using a streaming dataset. Default is None.
- Note: If you use a non-streaming dataset, this parameter computes train_iters for you automatically, so there is no need to pass `train_iters` manually.
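As a rough illustration of the note above, for a non-streaming dataset the number of training iterations implied by `max_epochs` can be estimated from the dataset size and `global_batch_size`. The helper below is hypothetical; the exact formula used internally may differ.

```python
import math

def infer_train_iters(num_samples: int, global_batch_size: int, max_epochs: int) -> int:
    # One optimizer step consumes `global_batch_size` samples.
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    return steps_per_epoch * max_epochs

# e.g. the 20k-sample DPO dataset from the example script, global_batch_size=16, 1 epoch:
print(infer_train_iters(20000, 16, 1))  # 1250
```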


### RLHF Parameters
In addition to inheriting the training parameters, the following parameters are supported:
- rlhf_type: Default is 'dpo'. Currently only 'dpo' is available.
- loss_scale: Overrides the loss_scale in [basic parameters](./命令行参数.md). Default is 'last_round'.
- calculate_per_token_loss: Overrides the Megatron parameter; default is False.
30 changes: 25 additions & 5 deletions docs/source_en/Instruction/Megatron-SWIFT-Training.md
@@ -175,7 +175,7 @@ The speed comparison of full-parameter training for Dense/MoE models using `mega
- seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the dataset length, please use the `--max_length` parameter in the basic arguments; there is no need to set this parameter.
- use_cpu_initialization: Initializes weights on the CPU, default is False. Used during HF and MCore weight conversion.
- no_create_attention_mask_in_dataloader: Does not create an attention mask in the dataloader, default is True.
- extra_megatron_kwargs: Other parameters passed into Megatron, provided via JSON. Defaults to None.
- extra_megatron_kwargs: Additional parameters passed to Megatron, provided as a JSON object. Defaults to None.

**Learning Rate Parameters**:

@@ -229,7 +229,7 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
**Logging Parameters**:

- log_params_norm: Logs the norm of parameters. Default is False.
- log_throughput: Logs throughput per GPU. Default is True.
- log_throughput: Logs throughput per GPU. Default is False.
- Note: In non-packing scenarios, log_throughput is not accurate because `seq_length` does not equal the actual sequence length.
- tensorboard_log_interval: Interval (steps) for logging to TensorBoard, default is 1.
- tensorboard_queue_size: Queue length (related to disk I/O), similar to write intervals. Default is 50.
@@ -244,7 +244,8 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the

**Evaluation Parameters**:

- 🔥eval_iters: Number of evaluation iterations, default is 100.
- 🔥eval_iters: The number of iterations for evaluation. Defaults to -1, and a suitable value will be set based on the size of the validation dataset.
- Note: If using a streaming dataset, this value needs to be set manually.
- 🔥eval_interval: Evaluation interval (steps), default is None, meaning it will be set to save_interval.
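To make the eval_iters default of -1 concrete, the sketch below shows one plausible way such a value could be resolved from the validation split. This is purely illustrative; the heuristic actually used by Megatron-SWIFT may differ, and with `--streaming true` the split size is unknown, which is why the value must then be set manually.

```python
import math

def resolve_eval_iters(eval_iters: int, val_samples: int, global_batch_size: int) -> int:
    # A non-negative value is taken as-is; -1 means "derive from the validation set".
    if eval_iters >= 0:
        return eval_iters
    return max(1, math.ceil(val_samples / global_batch_size))
```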

**Mixed Precision Parameters**:
@@ -301,14 +302,33 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
- moe_expert_capacity_factor: Capacity factor for each expert, None means no tokens will be dropped. Default is None.
- moe_shared_expert_overlap: Enable overlapping of shared expert computation with scheduler communication. If this option is not enabled, shared experts will execute after the routing experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.

### Megatron Training Parameters
**DPO Parameters**
- ref_load: The path to load the reference model. Defaults to `None`, which means it will be set to `load`.
- beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model. A higher beta value indicates less deviation from the reference model. For the IPO loss function (`loss_type="ipo"`), beta is the regularization parameter as mentioned in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) used to control the weight of the NLL term (i.e., SFT loss) in the loss function. The total loss is calculated as `loss = dpo_loss + rpo_alpha * nll_loss`. Default is 1.
- reference_free: Whether to ignore the provided reference model and implicitly use a reference model that assigns equal probability to all responses. Default is `False`.
- label_smoothing: Default is 0.
- f_divergence_type: Default is `reverse_kl`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
- loss_type: Default is `'sigmoid'`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
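A small sketch of how `rpo_alpha` folds the SFT/NLL term into the total loss, as documented above. Tensor names here are placeholders rather than swift internals.

```python
import torch

def rpo_total_loss(dpo_loss: torch.Tensor,
                   chosen_token_logps: torch.Tensor,  # per-token log-probs of the chosen response
                   rpo_alpha: float = 1.0) -> torch.Tensor:
    nll_loss = -chosen_token_logps.mean()       # SFT-style negative log-likelihood
    return dpo_loss + rpo_alpha * nll_loss      # loss = dpo_loss + rpo_alpha * nll_loss
```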


### Training Parameters

Megatron training parameters inherit from Megatron parameters and basic parameters. For information on basic parameters, see [here](./Command-line-parameters.md#base-arguments). Additionally, the following parameters are included:

- add_version: Adds a directory `<version>-<timestamp>` to `save` to prevent overwriting weights, default is True.
- 🔥packing: Whether to use sequence packing, defaults to False.
- 🔥packing: Whether to use sequence packing, defaults to False. Currently supports `megatron pt/sft`.
- 🔥packing_cache: Specifies the directory for packing cache. The default value is `None`, which means the cache will be stored in the path defined by the environment variable `$MODELSCOPE_CACHE`. When using the packing feature across multiple nodes, ensure that all nodes share the same packing cache directory. You can achieve this by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument in the command line.
- 🔥streaming: Stream reading and processing of the dataset, default is False. It is typically set to True when handling large datasets. For more information on streaming parameters, refer to the command-line parameters documentation.
- lazy_tokenize: Default is False. If this parameter is set to False, all dataset samples are tokenized before training (this avoids errors during training); if set to True, tokenization occurs during training (this saves memory).
- max_epochs: Forces the training to exit after reaching `max_epochs`, and performs validation and saving of the model weights. This parameter is especially useful when using a streaming dataset. Default is None.
- Note: If you use a non-streaming dataset, this parameter will automatically calculate train_iters for you, so there is no need to pass `train_iters` manually.


### RLHF Parameters

In addition to inheriting the training parameters, the following parameters are also supported:

- rlhf_type: Default is 'dpo'. Currently, only 'dpo' is available.
- loss_scale: Overrides the `loss_scale` in [basic parameters](./Command-line-parameters.md). Default is 'last_round'.
- calculate_per_token_loss: Overrides the Megatron parameter. Default is False.
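To illustrate what 'last_round' is generally understood to mean: only tokens belonging to the final assistant response contribute to the loss, while earlier rounds are masked out. The mask builder below is hypothetical and not swift's actual loss_scale plugin.

```python
from typing import List

def last_round_loss_mask(round_ids: List[int]) -> List[float]:
    # round_ids labels each response token with the assistant round it belongs to.
    last = max(round_ids)
    return [1.0 if r == last else 0.0 for r in round_ids]

print(last_round_loss_mask([0, 0, 1, 1, 1]))  # [0.0, 0.0, 1.0, 1.0, 1.0]
```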
File renamed without changes.
33 changes: 33 additions & 0 deletions examples/train/megatron/rlhf/dpo.sh
@@ -0,0 +1,33 @@
# 4 * 60GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
megatron rlhf \
--rlhf_type dpo \
--load Qwen3-8B-Base-mcore \
--dataset 'hjh0119/shareAI-Llama3-DPO-zh-en-emoji#20000' \
--tensor_model_parallel_size 4 \
--micro_batch_size 8 \
--global_batch_size 16 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_iters 50 \
--min_lr 1e-6 \
--save megatron_output/Qwen3-8B-Base \
--eval_interval 200 \
--save_interval 200 \
--max_length 8192 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--attention_backend flash \
--beta 0.1 \
--rpo_alpha 1 \
--loss_type sigmoid
36 changes: 36 additions & 0 deletions examples/train/megatron/rlhf/moe.sh
@@ -0,0 +1,36 @@
# 8 * 64GiB
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron rlhf \
--rlhf_type dpo \
--load Qwen1.5-MoE-A2.7B-mcore \
--dataset 'hjh0119/shareAI-Llama3-DPO-zh-en-emoji#20000' \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 4 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 0.01 \
--micro_batch_size 4 \
--global_batch_size 16 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_iters 100 \
--min_lr 1e-6 \
--save megatron_output/Qwen1.5-MoE-A2.7B \
--eval_interval 200 \
--save_interval 200 \
--max_length 8192 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--attention_backend flash \
--beta 0.1 \
--rpo_alpha 1 \
--loss_type sigmoid
1 change: 1 addition & 0 deletions swift/cli/_megatron/main.py
@@ -9,6 +9,7 @@
ROUTE_MAPPING: Dict[str, str] = {
'pt': 'swift.cli._megatron.pt',
'sft': 'swift.cli._megatron.sft',
'rlhf': 'swift.cli._megatron.rlhf',
}


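The new `'rlhf'` entry means that a `megatron rlhf ...` invocation resolves to the `swift.cli._megatron.rlhf` module. Below is a minimal sketch of how such a route table can dispatch a subcommand; it is illustrative only, and the actual swift CLI wiring may differ.

```python
import runpy
import sys

ROUTE_MAPPING = {
    'pt': 'swift.cli._megatron.pt',
    'sft': 'swift.cli._megatron.sft',
    'rlhf': 'swift.cli._megatron.rlhf',
}

def main():
    subcommand = sys.argv[1]                # e.g. 'rlhf'
    module = ROUTE_MAPPING[subcommand]
    sys.argv = [module] + sys.argv[2:]      # forward the remaining CLI flags
    runpy.run_module(module, run_name='__main__')

if __name__ == '__main__':
    main()
```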
5 changes: 5 additions & 0 deletions swift/cli/_megatron/rlhf.py
@@ -0,0 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from swift.megatron import megatron_rlhf_main

if __name__ == '__main__':
megatron_rlhf_main()
8 changes: 4 additions & 4 deletions swift/megatron/__init__.py
@@ -12,15 +12,15 @@
from swift.utils.import_utils import _LazyModule

if TYPE_CHECKING:
from .train import megatron_sft_main, megatron_pt_main
from .train import megatron_sft_main, megatron_pt_main, megatron_rlhf_main
from .utils import convert_hf2mcore, convert_mcore2hf
from .argument import MegatronTrainArguments
from .argument import MegatronTrainArguments, MegatronRLHFArguments
from .model import MegatronModelType, MegatronModelMeta, get_megatron_model_meta, register_megatron_model
else:
_import_structure = {
'train': ['megatron_sft_main', 'megatron_pt_main'],
'train': ['megatron_sft_main', 'megatron_pt_main', 'megatron_rlhf_main'],
'utils': ['convert_hf2mcore', 'convert_mcore2hf'],
'argument': ['MegatronTrainArguments'],
'argument': ['MegatronTrainArguments', 'MegatronRLHFArguments'],
'model': ['MegatronModelType', 'MegatronModelMeta', 'get_megatron_model_meta', 'register_megatron_model']
}

1 change: 1 addition & 0 deletions swift/megatron/argument/__init__.py
@@ -1,3 +1,4 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from .megatron_args import MegatronArguments
from .rlhf_args import MegatronRLHFArguments
from .train_args import MegatronTrainArguments
18 changes: 15 additions & 3 deletions swift/megatron/argument/megatron_args.py
@@ -14,7 +14,19 @@


@dataclass
class ExtraMegatronArguments:
class RLHFMegatronArgumentsMixin:
ref_load: Optional[str] = None

beta: float = 0.1
rpo_alpha: float = 1.
reference_free: bool = False
label_smoothing: float = 0.
f_divergence_type: str = 'reverse_kl'
loss_type: str = 'sigmoid'


@dataclass
class ExtraMegatronArguments(RLHFMegatronArgumentsMixin):
padded_vocab_size: Optional[int] = None
rope_scaling: Optional[Union[dict, str]] = None
torch_dtype: Optional[torch.dtype] = None
@@ -150,7 +162,7 @@ class MegatronArguments(ExtraMegatronArguments):

# logging
log_params_norm: bool = False
log_throughput: bool = True
log_throughput: bool = False
tensorboard_log_interval: int = 1
tensorboard_queue_size: int = 50
log_timers_to_tensorboard: bool = True
@@ -163,7 +175,7 @@ class MegatronArguments(ExtraMegatronArguments):
wandb_save_dir: Optional[str] = None

# evaluate
eval_iters: int = 100
eval_iters: int = -1
eval_interval: Optional[int] = None

# other
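The mixin above only declares the DPO-related fields; per the documentation earlier in this PR, `ref_load` is expected to fall back to `load` when unset. A minimal sketch of that fallback, using hypothetical names rather than swift's actual argument post-processing:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DPOArgsSketch:
    load: Optional[str] = None
    ref_load: Optional[str] = None  # reference model path; documented to default to `load`
    beta: float = 0.1
    rpo_alpha: float = 1.0

    def __post_init__(self):
        if self.ref_load is None:
            self.ref_load = self.load

args = DPOArgsSketch(load='Qwen3-8B-Base-mcore')
assert args.ref_load == 'Qwen3-8B-Base-mcore'
```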
13 changes: 13 additions & 0 deletions swift/megatron/argument/rlhf_args.py
@@ -0,0 +1,13 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from dataclasses import dataclass
from typing import Literal

from .train_args import MegatronTrainArguments


@dataclass
class MegatronRLHFArguments(MegatronTrainArguments):
rlhf_type: Literal['dpo'] = 'dpo'
loss_scale: str = 'last_round'

calculate_per_token_loss: bool = False