
Commit c9c994a

Author: weikaiwen (committed)
Merge remote-tracking branch 'origin/main' into feat/channel_loss
# Conflicts:
#	swift/trainers/trainers.py

2 parents: 379cfb9 + 0dc2045

30 files changed: +223 −90 lines changed


README.md

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ You can contact us and communicate with us by adding our group:
 
 
 ## 🎉 News
+- 🎁 2025.05.29: Support sequence parallel in pt, sft, dpo and grpo, check script [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text).
 - 🎁 2025.05.11: GRPO now supports custom processing logic for reward models. See the GenRM example [here](./docs/source_en/Instruction/GRPO.md#customized-reward-models).
 - 🎁 2025.04.15: The ms-swift paper has been accepted by AAAI 2025. You can find the paper at [this link](https://ojs.aaai.org/index.php/AAAI/article/view/35383).
 - 🎁 2025.03.23: Multi-round GRPO is now supported for training multi-turn dialogue scenarios (e.g., agent tool calling). Please refer to the [training script](examples/train/grpo/internal/vllm_multi_round.sh).
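To make the 2025.05.29 item above concrete, the following is a minimal sketch of what enabling sequence parallelism in an SFT run might look like; the model id, dataset placeholder, and parallel size are illustrative assumptions rather than values taken from the linked long_text scripts.

```shell
# Hypothetical example: long-text SFT with sequence parallelism.
# Model, dataset, and --sequence_parallel_size are placeholders, not values from this commit.
NPROC_PER_NODE=4 \
swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <dataset_id> \
    --max_length 16384 \
    --attn_impl flash_attn \
    --sequence_parallel_size 4
```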

README_CN.md

Lines changed: 1 addition & 0 deletions
@@ -70,6 +70,7 @@
 - **Model Quantization**: Supports quantized export with AWQ, GPTQ, and BNB; the exported models support accelerated inference with vLLM/LmDeploy and can be trained further.
 
 ## 🎉 News
+- 🎁 2025.05.29: Sequence parallelism is now supported for pt, sft, dpo, and grpo; see the [script](https://github.com/modelscope/ms-swift/tree/main/examples/train/long_text) for details.
 - 🎁 2025.05.11: The reward model in GRPO supports custom processing logic; see the GenRM example [here](./docs/source/Instruction/GRPO.md#自定义奖励模型).
 - 🎁 2025.04.15: The ms-swift paper has been accepted by AAAI 2025; the paper is available [here](https://ojs.aaai.org/index.php/AAAI/article/view/35383).
 - 🎁 2025.03.23: Multi-round GRPO is supported for building multi-turn dialogue training scenarios (e.g., agent tool calling); see the [training script](examples/train/grpo/internal/vllm_multi_round.sh).

docs/source/Instruction/GRPO.md

Lines changed: 6 additions & 3 deletions
@@ -13,9 +13,10 @@ pip install -U trl
 GRPOTrainer was refactored in swift 3.5.dev. If you are using a swift version < 3.5, please refer to the [stable docs](https://github.com/modelscope/ms-swift/blob/v3.4.1/docs/source/Instruction/GRPO.md).
 
 **Dev Log**
-- **2025-05-23** — Support for custom sampling batch sizes; see the generation_batch_size / steps_per_generation parameters
-- **2025-05-22** — swift rollout supports the data_parallel_size parameter
-- **2025-05-16** - Added ref_model synchronization logic; see the sync_ref_model parameter
+- **2025-05-29** — Added padding_free (--padding_free true) and sequence parallelism (--sequence_parallel_size N).
+- **2025-05-23** — Support for custom sampling batch sizes; see the generation_batch_size / steps_per_generation parameters.
+- **2025-05-22** — swift rollout supports the data_parallel_size parameter.
+- **2025-05-16** - Added ref_model synchronization logic; see the sync_ref_model parameter.
 - **2025-05-13** — The GRPOTrainer code was refactored for readability and maintainability; internal mode supports vLLM>=0.8.
 - **2025-05-11** — Support for generative reward models; customize the reward-model logic via reward_model_plugin. For more details, see the [Custom Reward Models](#自定义奖励模型) section.
 - **2025-04-30** — The launch command for the external vLLM server has been changed to `swift rollout`.
@@ -236,6 +237,8 @@ A conversation between User and Assistant. The user asks a question, and the Ass
 - dynamic_sample: Filter out data within a group whose reward standard deviation is 0 and sample additional new data. Default is False.
 - max_resample_times: Limits the number of resampling attempts when dynamic_sample is enabled. Default is 3.
 - overlong_filter: Skip overlong truncated samples, which are excluded from the loss computation. Default is False.
+- padding_free: Remove all padding tokens and concatenate the valid tokens into a single batch; only supported with flash_attn.
+- sequence_parallel_size: Number of segments for sequence parallelism.
 
 For the reward function parameters, see [Built-in Reward Functions](#内置奖励函数).

docs/source/Instruction/Megatron-SWIFT训练.md

Lines changed: 1 addition & 0 deletions
@@ -296,6 +296,7 @@ Megatron training parameters inherit from Megatron parameters and basic parameters.
 
 - add_version: Adds an extra directory `'<version>-<timestamp>'` under `save` to prevent weights from being overwritten. Default is True.
 - 🔥packing: Whether to use sequence packing. Default is False.
+- 🔥packing_cache: Specifies the packing cache directory. The default is `None`, meaning the cache is stored under the path given by the `$MODELSCOPE_CACHE` environment variable. When using packing across multiple nodes, make sure all nodes share the same packing cache path. You can do this by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument in the command line.
 - 🔥streaming: Stream reading and processing of the dataset. Default is False. Typically set to True when handling large datasets. See the command-line parameters documentation for more streaming parameters.
 - lazy_tokenize: Default is False. If set to False, all dataset samples are tokenized before training (which avoids errors appearing during training); if set to True, tokenization happens during training (which saves memory).
 - max_epochs: Forces training to exit once `max_epochs` is reached, and validates and saves the weights. This parameter is especially useful with streaming datasets. Default is None.

docs/source/Instruction/命令行参数.md

Lines changed: 3 additions & 1 deletion
@@ -354,11 +354,13 @@ Vera uses the three parameters `target_modules`, `target_regex`, and `modules_to_save`.
 - 🔥packing: Whether to use sequence packing to improve computational efficiency. Default is False. Currently supports `swift pt/sft`.
 - Note: When using packing, combine it with `--attn_impl flash_attn` and ensure "transformers>=4.44"; see [this PR](https://github.com/huggingface/transformers/pull/31629) for details.
 - Supported multimodal models reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/qwen2_5_vl.sh
+- packing_cache: Specifies the packing cache directory. The default is `None`, meaning the cache is stored under the path given by the `$MODELSCOPE_CACHE` environment variable. When using packing across multiple nodes, make sure all nodes share the same packing cache path, either by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument in the command line.
 - 🔥lazy_tokenize: Whether to use lazy tokenization. If set to False, all dataset samples are tokenized before training (for multimodal models this includes reading images from disk). Defaults to False for LLM training and True for MLLM training, to save memory.
+- use_logits_to_keep: Pass logits_to_keep in `forward` based on the labels to reduce the computation and storage of unnecessary logits, thereby reducing memory usage and speeding up training. Default is None, which selects automatically.
 - acc_strategy: Strategy for computing accuracy during training and validation. Options are `seq`-level and `token`-level accuracy; default is `token`.
 - max_new_tokens: Generation parameter override. Maximum number of generated tokens when predict_with_generate=True; default is 64.
 - temperature: Generation parameter override. Temperature when predict_with_generate=True; default is 0.
-- optimizer: Name of the custom optimizer from the plugin. Default is None.
+- optimizer: Name of the custom optimizer from the plugin. Default is None. For the available optimizers, see [here](https://github.com/modelscope/ms-swift/blob/main/swift/plugin/optimizer.py).
 - metric: Name of the custom metric from the plugin. Default is None, which sets it to 'acc' when predict_with_generate=False and to 'nlg' when predict_with_generate=True.
 - eval_use_evalscope: Whether to use evalscope for evaluation during training; this parameter must be set to enable evaluation. See the [example](../Instruction/评测.md#训练中评测) for usage.
 - eval_datasets: Evaluation datasets; multiple datasets can be set, separated by spaces.

docs/source/Instruction/支持的模型和数据集.md

Lines changed: 3 additions & 1 deletion
@@ -398,14 +398,16 @@
 |[deepseek-ai/DeepSeek-Prover-V2-671B](https://modelscope.cn/models/deepseek-ai/DeepSeek-Prover-V2-671B)|deepseek_v2_5|deepseek_v2_5|transformers>=4.39.3|&#x2718;|-|[deepseek-ai/DeepSeek-Prover-V2-671B](https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-671B)|
 |[deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1)|deepseek_r1|deepseek_r1|transformers>=4.39.3|&#x2718;|-|[deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1)|
 |[deepseek-ai/DeepSeek-R1-Zero](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Zero)|deepseek_r1|deepseek_r1|transformers>=4.39.3|&#x2718;|-|[deepseek-ai/DeepSeek-R1-Zero](https://huggingface.co/deepseek-ai/DeepSeek-R1-Zero)|
+|[deepseek-ai/DeepSeek-R1-0528](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528)|deepseek_r1|deepseek_r1|transformers>=4.39.3|&#x2718;|-|[deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)|
 |[cognitivecomputations/DeepSeek-R1-awq](https://modelscope.cn/models/cognitivecomputations/DeepSeek-R1-awq)|deepseek_r1|deepseek_r1|transformers>=4.39.3|&#x2718;|-|[cognitivecomputations/DeepSeek-R1-AWQ](https://huggingface.co/cognitivecomputations/DeepSeek-R1-AWQ)|
 |[deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)|deepseek_r1_distill|deepseek_r1|transformers>=4.37|&#x2714;|-|[deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B)|
 |[deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)|deepseek_r1_distill|deepseek_r1|transformers>=4.37|&#x2714;|-|[deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)|
 |[deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)|deepseek_r1_distill|deepseek_r1|transformers>=4.37|&#x2714;|-|[deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B)|
 |[deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)|deepseek_r1_distill|deepseek_r1|transformers>=4.37|&#x2714;|-|[deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B)|
-|[iic/QwenLong-L1-32B](https://modelscope.cn/models/iic/QwenLong-L1-32B)|deepseek_r1_distill|deepseek_r1|transformers>=4.37|&#x2718;|-|[Tongyi-Zhiwen/QwenLong-L1-32B](https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B)|
+|[iic/QwenLong-L1-32B](https://modelscope.cn/models/iic/QwenLong-L1-32B)|deepseek_r1_distill|deepseek_r1|transformers>=4.37|&#x2714;|-|[Tongyi-Zhiwen/QwenLong-L1-32B](https://huggingface.co/Tongyi-Zhiwen/QwenLong-L1-32B)|
 |[deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)|deepseek_r1_distill|deepseek_r1|-|&#x2714;|-|[deepseek-ai/DeepSeek-R1-Distill-Llama-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-8B)|
 |[deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)|deepseek_r1_distill|deepseek_r1|-|&#x2714;|-|[deepseek-ai/DeepSeek-R1-Distill-Llama-70B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Llama-70B)|
+|[deepseek-ai/DeepSeek-R1-0528-Qwen3-8B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B)|deepseek_r1_distill|deepseek_r1|-|&#x2714;|-|[deepseek-ai/DeepSeek-R1-0528-Qwen3-8B](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528-Qwen3-8B)|
 |[OpenBuddy/openbuddy-llama-65b-v8-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama-65b-v8-bf16)|openbuddy_llama|openbuddy|-|&#x2714;|-|[OpenBuddy/openbuddy-llama-65b-v8-bf16](https://huggingface.co/OpenBuddy/openbuddy-llama-65b-v8-bf16)|
 |[OpenBuddy/openbuddy-llama2-13b-v8.1-fp16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-13b-v8.1-fp16)|openbuddy_llama|openbuddy|-|&#x2714;|-|[OpenBuddy/openbuddy-llama2-13b-v8.1-fp16](https://huggingface.co/OpenBuddy/openbuddy-llama2-13b-v8.1-fp16)|
 |[OpenBuddy/openbuddy-llama2-70b-v10.1-bf16](https://modelscope.cn/models/OpenBuddy/openbuddy-llama2-70b-v10.1-bf16)|openbuddy_llama|openbuddy|-|&#x2714;|-|[OpenBuddy/openbuddy-llama2-70b-v10.1-bf16](https://huggingface.co/OpenBuddy/openbuddy-llama2-70b-v10.1-bf16)|

docs/source_en/Instruction/Command-line-parameters.md

Lines changed: 3 additions & 1 deletion
@@ -363,11 +363,13 @@ Training arguments include the [base arguments](#base-arguments), [Seq2SeqTraine
 - 🔥packing: Whether to use sequence packing to improve computational efficiency. The default value is False. Currently supports `swift pt/sft`.
 - Note: When using packing, please combine it with `--attn_impl flash_attn` and ensure "transformers>=4.44". For details, see [this PR](https://github.com/huggingface/transformers/pull/31629).
 - Supported multimodal models reference: https://github.com/modelscope/ms-swift/blob/main/examples/train/packing/qwen2_5_vl.sh
+- packing_cache: Specifies the directory for the packing cache. The default value is `None`, which means the cache will be stored in the path defined by the environment variable `$MODELSCOPE_CACHE`. When using the packing feature across multiple nodes, ensure that all nodes share the same packing cache directory. You can achieve this by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument in the command line (see the sketch after this list).
 - 🔥lazy_tokenize: Whether to use lazy tokenization. If set to False, all dataset samples are tokenized before training (for multimodal models, this includes reading images from disk). This parameter defaults to False for LLM training and True for MLLM training, to save memory.
+- use_logits_to_keep: Pass `logits_to_keep` in the `forward` method based on labels to reduce the computation and storage of unnecessary logits, thereby reducing memory usage and accelerating training. The default is `None`, which enables automatic selection.
 - acc_strategy: Strategy for calculating accuracy during training and validation. Options are `seq`-level and `token`-level accuracy, with `token` as the default.
 - max_new_tokens: Generation parameter override. The maximum number of tokens to generate when `predict_with_generate=True`, defaulting to 64.
 - temperature: Generation parameter override. The temperature setting when `predict_with_generate=True`, defaulting to 0.
-- optimizer: Custom optimizer name for the plugin, defaults to None.
+- optimizer: Custom optimizer name for the plugin, defaults to None. For the available optimizers, see [here](https://github.com/modelscope/ms-swift/blob/main/swift/plugin/optimizer.py).
 - metric: Custom metric name for the plugin. Defaults to None, with the default set to 'acc' when `predict_with_generate=False` and 'nlg' when `predict_with_generate=True`.
 - eval_use_evalscope: Whether to use evalscope for evaluation during training; this parameter needs to be set to enable evaluation. Refer to the [example](../Instruction/Evaluation.md#evaluation-during-training). Default is False.
 - eval_datasets: Evaluation datasets; multiple datasets can be set, separated by spaces.
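
The sketch below illustrates the packing_cache note above for a multi-node setup. It is an assumption-laden example: the shared path, model id, and dataset placeholder are illustrative and not taken from this commit.

```shell
# Hypothetical multi-node run with packing enabled and a shared packing cache.
# Either set MODELSCOPE_CACHE on every node, or pass --packing_cache explicitly;
# the paths below are placeholders and must point to storage visible to all nodes.
export MODELSCOPE_CACHE=/mnt/shared/modelscope_cache

swift sft \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <dataset_id> \
    --packing true \
    --attn_impl flash_attn \
    --packing_cache /mnt/shared/packing_cache
```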

docs/source_en/Instruction/GRPO.md

Lines changed: 3 additions & 0 deletions
@@ -13,6 +13,7 @@ pip install -U trl
 The GRPOTrainer has been refactored in swift 3.5.dev. If you are using a version of Swift < 3.5, please refer to the [stable doc](https://github.com/modelscope/ms-swift/blob/v3.4.1/docs/source_en/Instruction/GRPO.md).
 
 **Dev Log**
+- **2025-05-29** — Added support for padding_free (`--padding_free true`) and sequence parallelism (`--sequence_parallel_size N`).
 - **2025-05-23** — Added support for custom sampling batch size (see parameters: generation_batch_size / steps_per_generation).
 - **2025-05-22** — swift rollout now supports the data_parallel_size parameter.
 - **2025-05-16** - Implemented ref_model synchronization logic (see parameter: sync_ref_model).
@@ -247,6 +248,8 @@ Arguments
 - dynamic_sample: Exclude data within the group where the reward standard deviation is 0, and additionally sample new data. Default is False.
 - max_resample_times: Limits the number of resampling attempts under the dynamic_sample setting. Default is 3.
 - overlong_filter: Skip overlong truncated samples, which will not be included in the loss calculation. Default is False.
+- padding_free: Removes all padding tokens and concatenates the valid tokens into a single batch; only supported with flash_attn.
+- sequence_parallel_size: Number of segments for sequence parallelism.
 The hyperparameters for the reward function can be found in the [Built-in Reward Functions section](#built-in-reward-functions).
 
 You can use vLLM as sampling backends to accelerate training.
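
As a rough illustration of the two new arguments, here is a minimal sketch of a GRPO run that enables them. It assumes the usual `swift rlhf --rlhf_type grpo` entry point; the model id, dataset placeholder, and reward function are illustrative assumptions, not values from this commit.

```shell
# Hypothetical GRPO run with padding_free and sequence parallelism enabled.
# padding_free requires flash_attn; all values here are illustrative placeholders.
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen2.5-7B-Instruct \
    --dataset <dataset_id> \
    --reward_funcs accuracy \
    --attn_impl flash_attn \
    --padding_free true \
    --sequence_parallel_size 2
```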

docs/source_en/Instruction/Megatron-SWIFT-Training.md

Lines changed: 1 addition & 0 deletions
@@ -307,6 +307,7 @@ Megatron training parameters inherit from Megatron parameters and basic paramete
 
 - add_version: Adds a directory `<version>-<timestamp>` to `save` to prevent overwriting weights, default is True.
 - 🔥packing: Whether to use sequence packing, defaults to False.
+- 🔥packing_cache: Specifies the directory for the packing cache. The default value is `None`, which means the cache will be stored in the path defined by the environment variable `$MODELSCOPE_CACHE`. When using the packing feature across multiple nodes, ensure that all nodes share the same packing cache directory. You can achieve this by setting the `MODELSCOPE_CACHE` environment variable or by adding the `--packing_cache <shared_path>` argument in the command line.
 - 🔥streaming: Stream reading and processing of the dataset, default is False. It is typically set to True when handling large datasets. For more information on streaming parameters, refer to the command-line parameters documentation.
 - lazy_tokenize: Default is False. If this parameter is set to False, all dataset samples are tokenized before training (this avoids errors during training); if set to True, tokenization occurs during training (this saves memory).
 - max_epochs: Forces the training to exit after reaching `max_epochs`, and performs validation and saving of the model weights. This parameter is especially useful when using a streaming dataset. Default is None.
