Commit
Showing 8 changed files with 305 additions and 86 deletions.
@@ -0,0 +1,36 @@
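# Incremental Pre-training Tutorial

# Introduction to Incremental Pre-training
Incremental pre-training continues training an existing model on additional data in order to improve its capability in a specific domain or task.
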
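# Pre-training Workflow
- Step 1: Process the data
- Step 2: Set up the config (full-parameter, LoRA, or QLoRA)
- Step 3: Launch training (single GPU or multiple GPUs, with or without DeepSpeed)
- Step 4: Merge the model
- Step 5: Test the model
- Step 6: Upload the model
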
# EmoLLM Incremental Pre-training Tutorial
The pre-training data is adapted from the fine-tuning [datasets](../../datasets).

- Step 1: Edit the file paths in `ft2pt.py`
  Taking [output2.json](../../datasets/processed/output2.json) as an example, run the script to generate [pt.json](../../datasets/pt/pt.json).

- Step 2: [config](./internlm2_chat_1_8b_qlora_e3_pt.py)
  Note: this config uses **Variable Length Attention**, which requires flash_attn to be installed:
  `MAX_JOBS=4 pip install flash-attn --no-build-isolation`

- Step 3: Train
```
# On a single GPU
xtuner train internlm2_chat_1_8b_qlora_e3_pt.py --deepspeed deepspeed_zero2
# On multiple GPUs
(DIST) NPROC_PER_NODE=${GPU_NUM} xtuner train internlm2_chat_1_8b_qlora_e3_pt.py --deepspeed deepspeed_zero2
(SLURM) srun ${SRUN_ARGS} xtuner train internlm2_chat_1_8b_qlora_e3_pt.py --launcher slurm --deepspeed deepspeed_zero2
```

- For the remaining steps, see the [fine-tuning tutorial](../../xtuner_config/README.md)
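
Before launching training, it can be useful to confirm that the `flash_attn` module needed by the variable-length-attention config is importable in the current environment. The following is a minimal sketch using only the Python standard library. The helper name `has_module` is hypothetical (it is not part of XTuner or flash-attn), and the check assumes the `flash-attn` pip package exposes a top-level module named `flash_attn`.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be located (hypothetical helper)."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    if has_module("flash_attn"):
        print("flash_attn found")
    else:
        print("flash_attn not found; install it with the pip command above before training")
```

This only checks that the module can be located; it does not verify that the build matches the installed CUDA or PyTorch version.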
@@ -0,0 +1,48 @@
# Convert fine-tuning data into the pre-training format
import json


def convert(data_path: str, target_path: str):
    # Load the fine-tuning dataset: a list of {"conversation": [...]} objects
    with open(data_path, 'rt', encoding='utf-8') as file:
        data = json.load(file)

    converted_data = []

    # Walk through every conversation group in the original data
    for conversation_group in data:
        # Each dialog turn becomes its own single-turn pre-training sample
        for dialog in conversation_group["conversation"]:
            new_conversation_group = {
                "conversation": []
            }
            # Leave the input empty and fold the question and answer into the output
            # ("问题" means "question", "答案" means "answer")
            new_dialog = {
                "input": '',
                "output": f'问题:{dialog["input"]}\n答案:{dialog["output"]}',
            }
            new_conversation_group["conversation"].append(new_dialog)

            # Add the new single-turn sample to the converted data
            converted_data.append(new_conversation_group)

    # Write the converted data as indented JSON, keeping non-ASCII characters readable
    with open(target_path, 'wt', encoding='utf-8') as file:
        json.dump(converted_data, file, indent=4, ensure_ascii=False)


if __name__ == '__main__':
    convert(data_path='./output2.json', target_path='pt.json')
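
To make the format change concrete, the sketch below applies the same per-turn transformation to a small in-memory sample instead of files. The function name `to_pretrain_records` and the sample texts are hypothetical, made up for illustration. It assumes, as the script appears to, that each input item has a `conversation` list of `input`/`output` turns and that every turn becomes its own record.

```python
import json


def to_pretrain_records(data):
    """Turn each fine-tuning dialog turn into one pre-training record (hypothetical helper)."""
    records = []
    for group in data:
        for dialog in group["conversation"]:
            # Empty input; question and answer are folded into the output text
            text = f'问题:{dialog["input"]}\n答案:{dialog["output"]}'
            records.append({"conversation": [{"input": "", "output": text}]})
    return records


# Made-up sample: one conversation containing two turns
sample = [
    {
        "conversation": [
            {"input": "Q1", "output": "A1"},
            {"input": "Q2", "output": "A2"},
        ]
    }
]

records = to_pretrain_records(sample)
print(json.dumps(records, indent=4, ensure_ascii=False))
```

Two turns in one conversation yield two single-turn records, so multi-turn context is not carried across records in this format.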