submit deepspeed task hangs #5775

Open
Silver-Glacier opened this issue May 22, 2025 · 2 comments
@Silver-Glacier

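Hi, I deployed FATE with AnsibleFATE_2.1.1_LLM_2.1.0_release_offline on two GPU machines. The flow test passes, but when I run my own training code the task hangs. The logs show no errors, and nvidia-smi shows no running processes. In the output log, nothing further is written after the line "start submit deepspeed task deepspeed_202505221013311247640_nn_0_0_guest_9999". The full log follows.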

@Silver-Glacier
Author

[INFO][2025-05-22 10:13:36,571][1021][_profile.profile_ends][line:290]:
+----------+------------------------------------------+
+----------+------------------------------------------+
| total    | n=0, sum=0.0000, mean=0.0000, max=0.0000 |
+----------+------------------------------------------+

Federation:
+--------+------------------------------------------+
| get    |                                          |
+--------+------------------------------------------+
| remote |                                          |
+--------+------------------------------------------+
| total  | n=0, sum=0.0000, mean=0.0000, max=0.0000 |
+--------+------------------------------------------+

[INFO][2025-05-22 10:13:36,634][960][_wraps.run_component][line:168]: finish task, return code 0
[INFO][2025-05-22 10:13:36,635][960][_wraps._push_data][line:216]: save data
[INFO][2025-05-22 10:13:36,635][960][_wraps._push_data][line:226]: save data tracking to experiment, ad
[INFO][2025-05-22 10:13:36,682][960][_wraps.log_response][line:331]: {'code': 0, 'message': 'success'}
[INFO][2025-05-22 10:13:36,683][960][_wraps._push_metric][line:309]: output metric: uri='http://xxx.xxx.xxx.xxx:9380/v2/worker/metric/save/202505221013311247640_reader_0_0_guest_9999' metadata=Metadata(metadata={}, name=None, namespace=None, model_overview={}, data_overview=None, source=ArtifactSource(task_id='202505221013311247640_reader_0', party_task_id='202505221013311247640_reader_0_0_guest_9999', task_name='reader_0', component='reader', output_artifact_key='metric', output_index=None), model_key=None, type_name=None, index=None) type_name='json_metric' consumed=False
[INFO][2025-05-22 10:13:36,784][960][_wraps.log_response][line:331]: {'code': 0, 'message': 'success'}
[INFO][2025-05-22 10:13:38,857][1244][_base._cleanup2][line:77]: start clean task, config: {'computing': {'type': 'eggroll', 'metadata': {'computing_id': '202505221013311247640_reader_0_0_guest_9999', 'host': 'xxx.xxx.xxx.xxx', 'port': 4670, 'config_options': None, 'config_properties_file': None, 'options': {}}}, 'federation': {'type': 'rollsite', 'metadata': {'federation_id': '202505221013311247640_reader_0_0', 'parties': {'local': {'role': 'guest', 'partyid': '9999'}, 'parties': [{'role': 'guest', 'partyid': '9999'}, {'role': 'host', 'partyid': '10000'}]}, 'rollsite_config': {'host': 'xxx.xxx.xxx.xxx', 'port': 9370}}}}
[INFO][2025-05-22 10:13:39,667][1304][_wraps.preprocess][line:113]: start generating input artifacts
[INFO][2025-05-22 10:13:39,667][1304][_wraps.preprocess][line:114]: data={'train_data': RuntimeTaskOutputChannelSpec(producer_task='reader_0', output_artifact_key='output_data', output_artifact_type_alias=None, parties=[PartySpec(role='guest', party_id=['9999'])])} model=None
[INFO][2025-05-22 10:13:39,832][1244][_base._cleanup2][line:85]: clean success
[INFO][2025-05-22 10:13:43,653][1304][_wraps._intput_data_artifacts][line:453]: get key[train_data] channel[producer_task='reader_0' output_artifact_key='output_data' output_artifact_type_alias=None parties=[PartySpec(role='guest', party_id=['9999'])]]
[INFO][2025-05-22 10:13:43,653][1304][_wraps._intput_data_artifacts][line:479]: query data: [{'job_id': '202505221013311247640', 'role': 'guest', 'party_id': '9999', 'task_name': 'reader_0', 'output_key': 'output_data'}]
[INFO][2025-05-22 10:13:43,692][1304][_wraps._intput_data_artifacts][line:491]: intput data artifacts are ready
[INFO][2025-05-22 10:13:43,692][1304][_wraps.preprocess][line:116]: input artifacts are ready
[INFO][2025-05-22 10:13:43,692][1304][_wraps.preprocess][line:118]: PYTHON PATH: /data/projects/fate/fate_flow/python:/data/projects/fate/fate/python:/data/projects/fate/fate_flow/python:/data/projects/fate/eggroll/python
[INFO][2025-05-22 10:13:43,692][1304][_wraps.preprocess][line:121]: start generating output artifacts
[INFO][2025-05-22 10:13:45,401][1304][_eggroll_deepspeed.start_submit][line:116]: command_arguments: ['component', 'execute', '--env-name', 'FATE_TASK_CONFIG', '--execution-final-meta-path', 'EGGROLL_DEEPSPEED_RESULT_DIR/task_result.yaml']
[INFO][2025-05-22 10:13:45,401][1304][_eggroll_deepspeed.start_submit][line:117]: environment_variables: {'FATE_TASK_CONFIG': '{"job_id": "202505221013311247640", "task_id": "202505221013311247640_nn_0", "party_task_id": "202505221013311247640_nn_0_0_guest_9999", "task_name": "nn_0", "component": "homo_nn", "role": "guest", "party_id": "9999", "stage": "train", "parameters": {"runner_class": "Seq2SeqRunner", "runner_conf": {"algo": "fedavg", "data_collator_conf": {"item_name": "get_seq2seq_data_collator", "kwargs": {"tokenizer_name_or_path": "/app/chatglm3-6b/", "trust_remote_code": true}, "module_name": "fate_llm.data.data_collator.cust_data_collator", "source": null}, "dataset_conf": {"item_name": "PromptDataset", "kwargs": {"tokenizer_name_or_path": "/app/chatglm3-6b/", "trust_remote_code": true}, "module_name": "fate_llm.dataset.prompt_dataset", "source": null}, "fed_args_conf": {"aggregate_freq": 1, "aggregate_strategy": "epoch", "aggregator": "secure_aggregate"}, "model_conf": {"item_name": "ChatGLM", "kwargs": {"peft_config": {"alpha_pattern": {}, "auto_mapping": null, "base_model_name_or_path": null, "bias": "none", "fan_in_fan_out": false, "inference_mode": false, "init_lora_weights": true, "layers_pattern": null, "layers_to_transform": null, "loftq_config": {}, "lora_alpha": 32, "lora_dropout": 0.1, "megatron_config": null, "megatron_core": "megatron.core", "modules_to_save": null, "peft_type": "LORA", "r": 8, "rank_pattern": {}, "revision": null, "target_modules": ["query_key_value"], "task_type": "CAUSAL_LM", "use_rslora": false}, "peft_type": "LoraConfig", "pretrained_path": "/app/chatglm3-6b/", "trust_remote_code": true}, "module_name": "fate_llm.model_zoo.pellm.chatglm", "source": null}, "optimizer_conf": null, "save_trainable_weights_only": true, "task_type": "causal_lm", "tokenizer_conf": null, "training_args_conf": {"dataloader_pin_memory": true, "deepspeed": {"fp16": {"enabled": true}, "gradient_accumulation_steps": 1, "optimizer": {"params": 
{"adam_w_mode": false, "lr": 0.0005, "torch_adam": true}, "type": "Adam"}, "train_micro_batch_size_per_gpu": 1, "zero_optimization": {"allgather_bucket_size": 1xxx.xxx.xxx.xxx0.0, "allgather_partitions": true, "contiguous_gradients": true, "offload_optimizer": {"device": "cpu"}, "offload_param": {"device": "cpu"}, "overlap_comm": true, "reduce_bucket_size": 1xxx.xxx.xxx.xxx0.0, "reduce_scatter": true, "stage": 2}}, "fp16": true, "learning_rate": 0.0005, "num_train_epochs": 1, "per_device_train_batch_size": 1, "remove_unused_columns": false, "use_cpu": false}}, "runner_module": "homo_seq2seq_runner"}, "input_artifacts": {"train_data": {"uri": "file:///app/train.json", "metadata": {"metadata": {"options": {"partitions": 8}, "schema": {}}, "name": null, "namespace": null, "model_overview": {}, "data_overview": null, "source": null, "model_key": null, "type_name": null, "index": null}, "type_name": "data_directory"}}, "output_artifacts": {"train_output_data": {"uri": "eggroll:///202505221013311247640_nn_0/62ba3bc236b211f0b8aafa163e402514", "type_name": "dataframe"}, "output_model": {"uri": "file://EGGROLL_DEEPSPEED_MODEL_DIR/202505221013311247640/guest/9999/nn_0/0/output/output_model/model_directory", "type_name": "model_directory"}, "metric": {"uri": "http://xxx.xxx.xxx.xxx:9380/v2/worker/metric/save/202505221013311247640_nn_0_0_guest_9999", "type_name": "json_metric"}}, "conf": {"device": {"type": "CPU", "metadata": {}}, "computing": {"type": "eggroll", "metadata": {"computing_id": "202505221013311247640_nn_0_0_guest_9999", "host": "xxx.xxx.xxx.xxx", "port": 4670, "config_options": null, "config_properties_file": null, "options": {}}}, "storage": "eggroll", "federation": {"type": "rollsite", "metadata": {"federation_id": "202505221013311247640_nn_0_0", "parties": {"local": {"role": "guest", "partyid": "9999"}, "parties": [{"role": "guest", "partyid": "9999"}, {"role": "host", "partyid": "10000"}, {"role": "arbiter", "partyid": "10000"}]}, "rollsite_config": {"host": 
"xxx.xxx.xxx.xxx", "port": 9370}}}, "logger": {"config": {"disable_existing_loggers": false, "filters": {"component_profile_filter": {"()": "logging.Filter", "name": "fate.arch.computing._profile"}}, "formatters": {"component": {"format": "[%(levelname)s][%(asctime)-8s][%(process)s][%(module)s.%(funcName)s][line:%(lineno)d]: %(message)s"}, "root": {"format": "[%(levelname)s][%(asctime)-8s][%(process)s][%(module)s.%(funcName)s][line:%(lineno)d]: %(message)s"}}, "handlers": {"component_error": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/nn_0/component/ERROR", "filters": [], "formatter": "component", "level": "ERROR"}, "component_info": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/nn_0/component/INFO", "filters": [], "formatter": "component", "level": "INFO"}, "component_profile": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/nn_0/component/PROFILE", "filters": ["component_profile_filter"], "formatter": "component", "level": "DEBUG"}, "component_warning": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/nn_0/component/WARNING", "filters": [], "formatter": "component", "level": "WARNING"}, "global_error": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/ERROR", "filters": [], "formatter": "root", "level": "ERROR"}, "root_error": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/nn_0/root/ERROR", "filters": [], "formatter": "root", "level": "ERROR"}, "root_info": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/nn_0/root/INFO", "filters": [], "formatter": "root", 
"level": "INFO"}, "root_party_error": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/ERROR", "filters": [], "formatter": "root", "level": "ERROR"}, "root_party_info": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/INFO", "filters": [], "formatter": "root", "level": "INFO"}, "root_party_warning": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/WARNING", "filters": [], "formatter": "root", "level": "WARNING"}, "root_warning": {"class": "logging.FileHandler", "delay": true, "filename": "EGGROLL_DEEPSPEED_LOGS_DIR/202505221013311247640/guest/9999/nn_0/root/WARNING", "filters": [], "formatter": "root", "level": "WARNING"}}, "loggers": {"fate": {"handlers": ["component_info", "component_warning", "component_error", "component_profile"], "level": "INFO"}}, "root": {"handlers": ["root_info", "root_warning", "root_error", "root_party_info", "root_party_warning", "root_party_error", "global_error"], "level": "INFO"}, "version": 1}}}}', 'DEEPSPEED_LOGS_DIR_PLACEHOLDER': 'EGGROLL_DEEPSPEED_LOGS_DIR', 'DEEPSPEED_MODEL_DIR_PLACEHOLDER': 'EGGROLL_DEEPSPEED_MODEL_DIR', 'DEEPSPEED_RESULT_PLACEHOLDER': 'EGGROLL_DEEPSPEED_RESULT_DIR'}
[INFO][2025-05-22 10:13:45,401][1304][_eggroll_deepspeed.start_submit][line:118]: resource_options: {'timeout_seconds': 21600, 'resource_exhausted_strategy': 'waiting', 'cores': 1, 'nodes': 1, 'task_cores_per_node': 1}
[INFO][2025-05-22 10:13:45,401][1304][_eggroll_deepspeed.start_submit][line:119]: options: {'eggroll.container.deepspeed.script.path': '/data/projects/fate/fate_flow/python/fate_flow/manager/worker/fate_ds_executor.py'}
[INFO][2025-05-22 10:13:45,401][1304][_eggroll_deepspeed.start_submit][line:120]: start submit deepspeed task deepspeed_202505221013311247640_nn_0_0_guest_9999
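
To see where each worker process stopped logging, the lines above can be parsed with a regular expression that follows the logger format string shown in the task config (`[%(levelname)s][%(asctime)-8s][%(process)s][%(module)s.%(funcName)s][line:%(lineno)d]: %(message)s`). The following is only a minimal sketch: the helper names `parse_line` and `last_entry_per_process` are hypothetical and are not part of FATE, and the sample lines are copied from the log in this issue.

```python
import re
from datetime import datetime

# Mirrors the logger format string from the task config in the log above.
LOG_LINE = re.compile(
    r"^\[(?P<level>[A-Z]+)\]"
    r"\[(?P<time>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]"
    r"\[(?P<pid>\d+)\]"
    r"\[(?P<where>[^\]]+)\]"
    r"\[line:(?P<lineno>\d+)\]:\s?(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for a log line, or None for non-log lines (e.g. table rows)."""
    match = LOG_LINE.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%Y-%m-%d %H:%M:%S,%f")
    record["pid"] = int(record["pid"])
    record["lineno"] = int(record["lineno"])
    return record


def last_entry_per_process(lines):
    """Map each process id to the last log record it wrote."""
    last = {}
    for line in lines:
        record = parse_line(line)
        if record is not None:
            last[record["pid"]] = record
    return last


# Sample lines copied from the log in this issue.
sample = [
    "[INFO][2025-05-22 10:13:36,634][960][_wraps.run_component][line:168]: finish task, return code 0",
    "[INFO][2025-05-22 10:13:39,832][1244][_base._cleanup2][line:85]: clean success",
    "[INFO][2025-05-22 10:13:45,401][1304][_eggroll_deepspeed.start_submit][line:120]: "
    "start submit deepspeed task deepspeed_202505221013311247640_nn_0_0_guest_9999",
]

for pid, record in sorted(last_entry_per_process(sample).items()):
    print(pid, record["time"], record["where"], record["message"][:40])
```

Run over the full log file, this shows process 1304 going silent immediately after `start_submit`, which matches the hang described above.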

@Silver-Glacier
Author

After running for 6 hours, it failed with: debug_error_string = "UNKNOWN:Error received from peer {grpc_message:"dispatch resource timeout",grpc_status:13,
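
The 6-hour delay lines up with the `resource_options` printed in the log: `timeout_seconds` is 21600 and `resource_exhausted_strategy` is `waiting`. The short sketch below just does that arithmetic, using values copied from the log. Reading it as "the task waited for resources that were never dispatched until the timeout expired" is an inference from the error text, not something confirmed in this thread.

```python
from datetime import datetime, timedelta

# Values copied from the resource_options log line in this issue.
resource_options = {
    "timeout_seconds": 21600,
    "resource_exhausted_strategy": "waiting",
    "cores": 1,
    "nodes": 1,
    "task_cores_per_node": 1,
}

# Timestamp of the "start submit deepspeed task" log line (milliseconds dropped).
submitted_at = datetime(2025, 5, 22, 10, 13, 45)

timeout = timedelta(seconds=resource_options["timeout_seconds"])
timeout_hours = timeout.total_seconds() / 3600
expected_timeout_at = submitted_at + timeout

print(f"timeout: {timeout_hours} hours")
print(f"expected timeout at: {expected_timeout_at}")
```

If that reading is right, the request for 1 node and 1 core per node was never satisfied, so checking how many GPU resources the eggroll cluster reports as available would be a reasonable next step.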
