[megatron] support DPO #4193

Merged

merged 49 commits into from
Jun 11, 2025

49 commits
2200de1
support megatron dpo
Jintao-Huang May 13, 2025
3a21dca
update
Jintao-Huang May 13, 2025
21d044a
update
Jintao-Huang May 13, 2025
29714f1
update
Jintao-Huang May 13, 2025
48bdef9
update
Jintao-Huang May 13, 2025
6806521
update
Jintao-Huang May 13, 2025
cc6b0f6
Merge branch 'main' into support_megatron_dpo
Jintao-Huang May 13, 2025
ebfe9d1
update
Jintao-Huang May 13, 2025
5802d8e
update
Jintao-Huang May 13, 2025
fef4b9b
update
Jintao-Huang May 13, 2025
0a69513
Merge branch 'main' into support_megatron_dpo
Jintao-Huang May 14, 2025
7a595d3
update
Jintao-Huang May 14, 2025
9dde75a
update
Jintao-Huang May 15, 2025
2552bda
Merge branch 'main' into support_megatron_dpo
Jintao-Huang May 28, 2025
007c9ed
update
Jintao-Huang May 28, 2025
b56b601
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 1, 2025
96e5f3a
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 1, 2025
54244e5
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 3, 2025
3382b0c
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 5, 2025
1f0f411
update
Jintao-Huang Jun 5, 2025
615befc
update
Jintao-Huang Jun 5, 2025
ab5bdfa
update
Jintao-Huang Jun 5, 2025
515d476
update
Jintao-Huang Jun 5, 2025
15db9af
update
Jintao-Huang Jun 6, 2025
f8d29a7
update shell
Jintao-Huang Jun 6, 2025
c96cef7
update
Jintao-Huang Jun 6, 2025
b6fc6d9
update
Jintao-Huang Jun 6, 2025
95f74fa
update
Jintao-Huang Jun 6, 2025
ac1c33b
update
Jintao-Huang Jun 6, 2025
7a75d93
update
Jintao-Huang Jun 6, 2025
d981bc5
update
Jintao-Huang Jun 6, 2025
9b089b5
fix dpo emoji dataset
Jintao-Huang Jun 7, 2025
93a7486
Merge branch 'fix_emoji_dpo_dataset' into support_megatron_dpo
Jintao-Huang Jun 7, 2025
db6fc3a
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 9, 2025
a6067bd
update
Jintao-Huang Jun 9, 2025
bd46a59
update
Jintao-Huang Jun 9, 2025
f8c8dcf
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 9, 2025
568b3aa
update
Jintao-Huang Jun 9, 2025
0dff938
update
Jintao-Huang Jun 9, 2025
c57a29e
update
Jintao-Huang Jun 10, 2025
79eca73
update
Jintao-Huang Jun 10, 2025
2ad6fd2
update
Jintao-Huang Jun 11, 2025
c330851
update
Jintao-Huang Jun 11, 2025
f3b5003
update
Jintao-Huang Jun 11, 2025
13c8696
Merge branch 'main' into support_megatron_dpo
Jintao-Huang Jun 11, 2025
d110ed3
fix
Jintao-Huang Jun 11, 2025
e92f79c
update
Jintao-Huang Jun 11, 2025
4fe4e4e
update
Jintao-Huang Jun 11, 2025
7e8df50
update
Jintao-Huang Jun 11, 2025
27 changes: 22 additions & 5 deletions docs/source/Instruction/Megatron-SWIFT训练.md
@@ -172,7 +172,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the dataset length, use the `--max_length` parameter in the basic arguments; there is no need to set this parameter.
- use_cpu_initialization: Initializes weights on the CPU; default is False. Used during HF and MCore weight conversion.
- no_create_attention_mask_in_dataloader: Does not create an attention mask in the dataloader; default is True.
- extra_megatron_kwargs: Additional parameters passed to Megatron, provided as a JSON object. Defaults to None.
- extra_megatron_kwargs: Other parameters passed into Megatron, provided via JSON. Defaults to None.

**Learning Rate Parameters**:
- 🔥lr: The initial learning rate; the learning rate of each iteration is ultimately determined by the warmup and decay strategies. Default is 1e-5.
@@ -221,7 +221,7 @@ I am a language model developed by swift, you can call me swift-robot. How can I

**Logging Parameters**:
- log_params_norm: Logs the norm of parameters. Default is False.
- log_throughput: Logs throughput per GPU. Default is True.
- log_throughput: Logs throughput per GPU. Default is False.
- Note: In non-packing scenarios, log_throughput is not accurate because `seq_length` does not equal the actual sequence length.
- tensorboard_log_interval: Interval (steps) for logging to TensorBoard; default is 1.
- tensorboard_queue_size: Queue length (related to disk I/O), similar to the write interval. Default is 50.
@@ -235,7 +235,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- wandb_save_dir: Local path for saving wandb results. Default is ''.

**Evaluation Parameters**:
- 🔥eval_iters: Number of evaluation iterations; default is 100.
- 🔥eval_iters: Number of evaluation iterations. Defaults to -1; an appropriate value is set based on the size of the validation dataset.
- Note: If using a streaming dataset, this value needs to be set manually.
- 🔥eval_interval: Evaluation interval (steps); default is None, meaning it is set to save_interval.

**Mixed Precision Parameters**:
@@ -289,15 +290,31 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- moe_expert_capacity_factor: Capacity factor for each expert; None means no tokens will be dropped. Default is None.
- moe_shared_expert_overlap: Enables overlap between shared expert computation and dispatcher communication. If this option is not enabled, shared experts execute after the routed experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.

**DPO Parameters**:
- ref_load: The path from which to load the reference model. Defaults to None, meaning it is set to `load`.
- beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model; a higher beta means less deviation from the reference model. For the IPO loss (`loss_type="ipo"`), beta is the regularization parameter referred to in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) that controls the weight of the NLL term (i.e., the SFT loss) in the loss function: `loss = dpo_loss + rpo_alpha * nll_loss`. Default is 1.
- reference_free: Whether to ignore the provided reference model and implicitly use a reference model that assigns equal probability to all responses. Default is False.
- label_smoothing: Default is 0.
- f_divergence_type: Defaults to `reverse_kl`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
- loss_type: Defaults to 'sigmoid'. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
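For reference, here is a minimal sketch of how the default `loss_type='sigmoid'` combines `beta`, `label_smoothing`, and `reference_free`, assuming per-sequence log-probabilities already summed over response tokens. This is illustrative only and not the trainer's actual implementation; the parameter semantics follow the TRL documentation linked above.

```python
import torch
import torch.nn.functional as F

def dpo_sigmoid_loss(policy_chosen_logps, policy_rejected_logps,
                     ref_chosen_logps, ref_rejected_logps,
                     beta=0.1, label_smoothing=0.0, reference_free=False):
    # Log-ratio of chosen vs. rejected under the policy and the reference model.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    if reference_free:
        # Behave as if the reference assigns equal probability to all responses.
        ref_logratios = torch.zeros_like(pi_logratios)
    logits = beta * (pi_logratios - ref_logratios)
    # label_smoothing softens the assumption that the preference labels are correct.
    losses = (-F.logsigmoid(logits) * (1 - label_smoothing)
              - F.logsigmoid(-logits) * label_smoothing)
    return losses.mean()
```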

### Megatron Training Parameters

### Training Parameters

Megatron training parameters inherit from the Megatron parameters and the basic parameters. For the basic parameters, see [here](./命令行参数.md#基本参数). The following parameters are also included:

- add_version: Adds an extra directory `'<version>-<timestamp>'` under `save` to prevent overwriting weights; default is True.
- 🔥packing: Whether to use sequence packing; default is False.
- 🔥packing: Whether to use sequence packing; default is False. Currently supports `megatron pt/sft`.
- 🔥packing_cache: Specifies the packing cache directory. Defaults to `None`, meaning the cache is stored in the path given by the `$MODELSCOPE_CACHE` environment variable. When using packing across nodes, make sure all nodes share the same packing cache path. You can do this by setting the `MODELSCOPE_CACHE` environment variable or by adding `--packing_cache <shared_path>` on the command line.
- 🔥streaming: Stream reading and processing of the dataset; default is False. Usually set to True when handling large datasets. See the command-line parameters documentation for more streaming parameters.
- lazy_tokenize: Default is False. If set to False, all dataset samples are tokenized before training (which avoids errors appearing during training); if set to True, the dataset is tokenized during training (which saves memory).
- max_epochs: Forces training to exit once `max_epochs` is reached, validating and saving the weights. This parameter is useful when using a streaming dataset. Default is None.
- Note: If you use a non-streaming dataset, this parameter computes train_iters for you automatically, so there is no need to pass `train_iters` manually.
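As a rough illustration of the note above, for a non-streaming dataset the number of training iterations implied by `max_epochs` can be estimated from the dataset size and `global_batch_size`. The helper below is hypothetical; the exact formula used internally may differ.

```python
import math

def infer_train_iters(num_samples: int, global_batch_size: int, max_epochs: int) -> int:
    # One optimizer step consumes `global_batch_size` samples.
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    return steps_per_epoch * max_epochs

# e.g. the 20k-sample DPO dataset from the example script, global_batch_size=16, 1 epoch:
print(infer_train_iters(20000, 16, 1))  # 1250
```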


### RLHF Parameters
In addition to inheriting the training parameters, the following parameters are supported:
- rlhf_type: Default is 'dpo'. Currently only 'dpo' is available.
- loss_scale: Overrides the loss_scale in [basic parameters](./命令行参数.md). Default is 'last_round'.
- calculate_per_token_loss: Overrides the Megatron parameter; default is False.
30 changes: 25 additions & 5 deletions docs/source_en/Instruction/Megatron-SWIFT-Training.md
@@ -175,7 +175,7 @@ The speed comparison of full-parameter training for Dense/MoE models using `mega
- seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the dataset length, please use the `--max_length` parameter in the basic arguments; there is no need to set this parameter.
- use_cpu_initialization: Initializes weights on the CPU, default is False. Used during HF and MCore weight conversion.
- no_create_attention_mask_in_dataloader: Does not create an attention mask in the dataloader, default is True.
- extra_megatron_kwargs: Other parameters passed into Megatron, provided via JSON. Defaults to None.
- extra_megatron_kwargs: Additional parameters passed to Megatron, provided as a JSON object. Defaults to None.

**Learning Rate Parameters**:

@@ -229,7 +229,7 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
**Logging Parameters**:

- log_params_norm: Logs the norm of parameters. Default is False.
- log_throughput: Logs throughput per GPU. Default is True.
- log_throughput: Logs throughput per GPU. Default is False.
- Note: In non-packing scenarios, log_throughput is not accurate because `seq_length` does not equal the actual sequence length.
- tensorboard_log_interval: Interval (steps) for logging to TensorBoard, default is 1.
- tensorboard_queue_size: Queue length (related to disk I/O), similar to write intervals. Default is 50.
@@ -244,7 +244,8 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the

**Evaluation Parameters**:

- 🔥eval_iters: Number of evaluation iterations, default is 100.
- 🔥eval_iters: The number of iterations for evaluation. Defaults to -1, and a suitable value will be set based on the size of the validation dataset.
- Note: If using a streaming dataset, this value needs to be set manually.
- 🔥eval_interval: Evaluation interval (steps), default is None, meaning it will be set to save_interval.
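To make the eval_iters default of -1 concrete, the sketch below shows one plausible way such a value could be resolved from the validation split. This is purely illustrative; the heuristic actually used by Megatron-SWIFT may differ, and with `--streaming true` the split size is unknown, which is why the value must then be set manually.

```python
import math

def resolve_eval_iters(eval_iters: int, val_samples: int, global_batch_size: int) -> int:
    # A non-negative value is taken as-is; -1 means "derive from the validation set".
    if eval_iters >= 0:
        return eval_iters
    return max(1, math.ceil(val_samples / global_batch_size))
```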

**Mixed Precision Parameters**:
@@ -301,14 +302,33 @@ seq_length: Defaults to None, meaning it is set to `max_length`. To restrict the
- moe_expert_capacity_factor: Capacity factor for each expert, None means no tokens will be dropped. Default is None.
- moe_shared_expert_overlap: Enable overlapping of shared expert computation with scheduler communication. If this option is not enabled, shared experts will execute after the routing experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.

### Megatron Training Parameters
**DPO Parameters**
- ref_load: The path to load the reference model. Defaults to `None`, which means it will be set to `load`.
- beta: Has the same meaning as in [TRL](https://huggingface.co/docs/trl/main/en/dpo_trainer#trl.DPOConfig). It controls the degree of deviation from the reference model. A higher beta value indicates less deviation from the reference model. For the IPO loss function (`loss_type="ipo"`), beta is the regularization parameter as mentioned in the [paper](https://huggingface.co/papers/2310.12036). Default is 0.1.
- rpo_alpha: A parameter from the [RPO paper](https://huggingface.co/papers/2404.19733) used to control the weight of the NLL term (i.e., SFT loss) in the loss function. The total loss is calculated as `loss = dpo_loss + rpo_alpha * nll_loss`. Default is 1.
- reference_free: Whether to ignore the provided reference model and implicitly use a reference model that assigns equal probability to all responses. Default is `False`.
- label_smoothing: Default is 0.
- f_divergence_type: Default is `reverse_kl`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
- loss_type: Default is `'sigmoid'`. See the [TRL documentation](https://huggingface.co/docs/trl/main/en/dpo_trainer) for possible values.
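A small sketch of how `rpo_alpha` folds the SFT/NLL term into the total loss, as documented above. Tensor names here are placeholders rather than swift internals.

```python
import torch

def rpo_total_loss(dpo_loss: torch.Tensor,
                   chosen_token_logps: torch.Tensor,  # per-token log-probs of the chosen response
                   rpo_alpha: float = 1.0) -> torch.Tensor:
    nll_loss = -chosen_token_logps.mean()       # SFT-style negative log-likelihood
    return dpo_loss + rpo_alpha * nll_loss      # loss = dpo_loss + rpo_alpha * nll_loss
```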


### Training Parameters

Megatron training parameters inherit from Megatron parameters and basic parameters. For information on basic parameters, see [here](./Command-line-parameters.md#base-arguments). Additionally, the following parameters are included:

- add_version: Adds a directory `<version>-<timestamp>` to `save` to prevent overwriting weights, default is True.
- 🔥packing: Whether to use sequence packing, defaults to False.
- 🔥packing: Whether to use sequence packing, defaults to False. Currently supports `megatron pt/sft`.
- 🔥packing_cache: Specifies the directory for packing cache. The default value is `None`, which means the cache will be stored in the path defined by the environment variable `$MODELSCOPE_CACHE`. When using the packing feature across multiple nodes, ensure that all nodes share the same packing cache directory. You can achieve this by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument in the command line.
- 🔥streaming: Stream reading and processing of the dataset, default is False. It is typically set to True when handling large datasets. For more information on streaming parameters, refer to the command-line parameters documentation.
- lazy_tokenize: Default is False. If this parameter is set to False, all dataset samples are tokenized before training (this avoids errors during training); if set to True, tokenization occurs during training (this saves memory).
- max_epochs: Forces the training to exit after reaching `max_epochs`, and performs validation and saving of the model weights. This parameter is especially useful when using a streaming dataset. Default is None.
- Note: If you use a non-streaming dataset, this parameter will automatically calculate train_iters for you, so there is no need to pass `train_iters` manually.


### RLHF Parameters

In addition to inheriting the training parameters, the following parameters are also supported:

- rlhf_type: Default is 'dpo'. Currently, only 'dpo' is available.
- loss_scale: Overrides the `loss_scale` in [basic parameters](./Command-line-parameters.md). Default is 'last_round'.
- calculate_per_token_loss: Overrides the Megatron parameter. Default is False.
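To illustrate what 'last_round' is generally understood to mean: only tokens belonging to the final assistant response contribute to the loss, while earlier rounds are masked out. The mask builder below is hypothetical and not swift's actual loss_scale plugin.

```python
from typing import List

def last_round_loss_mask(round_ids: List[int]) -> List[float]:
    # round_ids labels each response token with the assistant round it belongs to.
    last = max(round_ids)
    return [1.0 if r == last else 0.0 for r in round_ids]

print(last_round_loss_mask([0, 0, 1, 1, 1]))  # [0.0, 0.0, 1.0, 1.0, 1.0]
```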
File renamed without changes.
33 changes: 33 additions & 0 deletions examples/train/megatron/rlhf/dpo.sh
@@ -0,0 +1,33 @@
# 4 * 60GiB
PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
megatron rlhf \
--rlhf_type dpo \
--load Qwen3-8B-Base-mcore \
--dataset 'hjh0119/shareAI-Llama3-DPO-zh-en-emoji#20000' \
--tensor_model_parallel_size 4 \
--micro_batch_size 8 \
--global_batch_size 16 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_iters 50 \
--min_lr 1e-6 \
--save megatron_output/Qwen3-8B-Base \
--eval_interval 200 \
--save_interval 200 \
--max_length 8192 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--attention_backend flash \
--beta 0.1 \
--rpo_alpha 1 \
--loss_type sigmoid
36 changes: 36 additions & 0 deletions examples/train/megatron/rlhf/moe.sh
@@ -0,0 +1,36 @@
# 8 * 64GiB
NPROC_PER_NODE=8 \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
megatron rlhf \
--rlhf_type dpo \
--load Qwen1.5-MoE-A2.7B-mcore \
--dataset 'hjh0119/shareAI-Llama3-DPO-zh-en-emoji#20000' \
--tensor_model_parallel_size 2 \
--expert_model_parallel_size 4 \
--moe_grouped_gemm true \
--moe_shared_expert_overlap true \
--moe_aux_loss_coeff 0.01 \
--micro_batch_size 4 \
--global_batch_size 16 \
--recompute_granularity full \
--recompute_method uniform \
--recompute_num_layers 1 \
--max_epochs 1 \
--finetune true \
--cross_entropy_loss_fusion true \
--lr 1e-5 \
--lr_warmup_iters 100 \
--min_lr 1e-6 \
--save megatron_output/Qwen1.5-MoE-A2.7B \
--eval_interval 200 \
--save_interval 200 \
--max_length 8192 \
--num_workers 8 \
--dataset_num_proc 8 \
--no_save_optim true \
--no_save_rng true \
--sequence_parallel true \
--attention_backend flash \
--beta 0.1 \
--rpo_alpha 1 \
--loss_type sigmoid
1 change: 1 addition & 0 deletions swift/cli/_megatron/main.py
@@ -9,6 +9,7 @@
ROUTE_MAPPING: Dict[str, str] = {
'pt': 'swift.cli._megatron.pt',
'sft': 'swift.cli._megatron.sft',
'rlhf': 'swift.cli._megatron.rlhf',
}


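The new `'rlhf'` entry means that a `megatron rlhf ...` invocation resolves to the `swift.cli._megatron.rlhf` module. Below is a minimal sketch of how such a route table can dispatch a subcommand; it is illustrative only, and the actual swift CLI wiring may differ.

```python
import runpy
import sys

ROUTE_MAPPING = {
    'pt': 'swift.cli._megatron.pt',
    'sft': 'swift.cli._megatron.sft',
    'rlhf': 'swift.cli._megatron.rlhf',
}

def main():
    subcommand = sys.argv[1]                # e.g. 'rlhf'
    module = ROUTE_MAPPING[subcommand]
    sys.argv = [module] + sys.argv[2:]      # forward the remaining CLI flags
    runpy.run_module(module, run_name='__main__')

if __name__ == '__main__':
    main()
```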
5 changes: 5 additions & 0 deletions swift/cli/_megatron/rlhf.py
@@ -0,0 +1,5 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from swift.megatron import megatron_rlhf_main

if __name__ == '__main__':
megatron_rlhf_main()
8 changes: 4 additions & 4 deletions swift/megatron/__init__.py
@@ -12,15 +12,15 @@
from swift.utils.import_utils import _LazyModule

if TYPE_CHECKING:
from .train import megatron_sft_main, megatron_pt_main
from .train import megatron_sft_main, megatron_pt_main, megatron_rlhf_main
from .utils import convert_hf2mcore, convert_mcore2hf
from .argument import MegatronTrainArguments
from .argument import MegatronTrainArguments, MegatronRLHFArguments
from .model import MegatronModelType, MegatronModelMeta, get_megatron_model_meta, register_megatron_model
else:
_import_structure = {
'train': ['megatron_sft_main', 'megatron_pt_main'],
'train': ['megatron_sft_main', 'megatron_pt_main', 'megatron_rlhf_main'],
'utils': ['convert_hf2mcore', 'convert_mcore2hf'],
'argument': ['MegatronTrainArguments'],
'argument': ['MegatronTrainArguments', 'MegatronRLHFArguments'],
'model': ['MegatronModelType', 'MegatronModelMeta', 'get_megatron_model_meta', 'register_megatron_model']
}

1 change: 1 addition & 0 deletions swift/megatron/argument/__init__.py
@@ -1,3 +1,4 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from .megatron_args import MegatronArguments
from .rlhf_args import MegatronRLHFArguments
from .train_args import MegatronTrainArguments
18 changes: 15 additions & 3 deletions swift/megatron/argument/megatron_args.py
@@ -14,7 +14,19 @@


@dataclass
class ExtraMegatronArguments:
class RLHFMegatronArgumentsMixin:
ref_load: Optional[str] = None

beta: float = 0.1
rpo_alpha: float = 1.
reference_free: bool = False
label_smoothing: float = 0.
f_divergence_type: str = 'reverse_kl'
loss_type: str = 'sigmoid'


@dataclass
class ExtraMegatronArguments(RLHFMegatronArgumentsMixin):
padded_vocab_size: Optional[int] = None
rope_scaling: Optional[Union[dict, str]] = None
torch_dtype: Optional[torch.dtype] = None
@@ -150,7 +162,7 @@ class MegatronArguments(ExtraMegatronArguments):

# logging
log_params_norm: bool = False
log_throughput: bool = True
log_throughput: bool = False
tensorboard_log_interval: int = 1
tensorboard_queue_size: int = 50
log_timers_to_tensorboard: bool = True
@@ -163,7 +175,7 @@ class MegatronArguments(ExtraMegatronArguments):
wandb_save_dir: Optional[str] = None

# evaluate
eval_iters: int = 100
eval_iters: int = -1
eval_interval: Optional[int] = None

# other
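The mixin above only declares the DPO-related fields; per the documentation earlier in this PR, `ref_load` is expected to fall back to `load` when unset. A minimal sketch of that fallback, using hypothetical names rather than swift's actual argument post-processing:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DPOArgsSketch:
    load: Optional[str] = None
    ref_load: Optional[str] = None  # reference model path; documented to default to `load`
    beta: float = 0.1
    rpo_alpha: float = 1.0

    def __post_init__(self):
        if self.ref_load is None:
            self.ref_load = self.load

args = DPOArgsSketch(load='Qwen3-8B-Base-mcore')
assert args.ref_load == 'Qwen3-8B-Base-mcore'
```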
13 changes: 13 additions & 0 deletions swift/megatron/argument/rlhf_args.py
@@ -0,0 +1,13 @@
# Copyright (c) Alibaba, Inc. and its affiliates.
from dataclasses import dataclass
from typing import Literal

from .train_args import MegatronTrainArguments


@dataclass
class MegatronRLHFArguments(MegatronTrainArguments):
rlhf_type: Literal['dpo'] = 'dpo'
loss_scale: str = 'last_round'

calculate_per_token_loss: bool = False