Training hangs after "Building trainer" #46
This happens because the sample test set contains very little data: after 23 batches one of the domains runs out, so training on one of the GPUs stops. You need to process the original RedPajama data to meet the data requirements.
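To check whether your processed data can actually cover the planned run, a rough sanity check along these lines may help. This is only a minimal sketch: the global batch size, per-domain sample counts, and the run length are placeholder assumptions (the sampling proportions are taken from the `*_weight` values in the log below); substitute the numbers from your own config and processed shards.

```python
# Rough sanity check: estimate how many batches each domain can supply before
# it runs out. All numbers below are assumptions for illustration -- replace
# them with your actual global batch size, run length, and per-domain counts.

global_batch_size = 9          # samples per optimizer step across all GPUs (assumed)
max_duration_batches = 3200    # planned run length, as in [batch=23/3200]

# Sampling proportions (taken from the *_weight values in the training log).
weights = {
    "cc": 0.67, "github": 0.045, "book": 0.045, "stackexchange": 0.02,
    "wiki": 0.045, "arxiv": 0.025, "c4-rp": 0.15,
}

# Number of processed samples you actually have per domain (assumed values --
# fill in by counting the samples your data-processing step produced).
available_samples = {
    "cc": 1500, "github": 60, "book": 60, "stackexchange": 30,
    "wiki": 70, "arxiv": 50, "c4-rp": 400,
}

# The domain that runs out first bounds how far training can get.
for domain, w in weights.items():
    samples_per_batch = w * global_batch_size
    batches_supported = available_samples[domain] / max(samples_per_batch, 1e-9)
    status = "OK" if batches_supported >= max_duration_batches else "WILL RUN OUT"
    print(f"{domain:14s} supports ~{batches_supported:8.0f} batches  [{status}]")
```

If any domain reports far fewer batches than the configured duration, that domain is the one that will exhaust first and stall its rank.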
Hi, I'm using the sample test set and trying to run through the README, but training hangs partway through and then times out:
[batch=23/3200]:
Train time/batch: 22
Train time/sample: 198
Train time/batch_in_epoch: 6
Train time/sample_in_epoch: 54
Train time/token: 811008
Train time/token_in_epoch: 221184
Train metrics/train/cc_weight: 0.6700
Train metrics/train/github_weight: 0.0450
Train metrics/train/book_weight: 0.0450
Train metrics/train/stackexchange_weight: 0.0200
Train metrics/train/wiki_weight: 0.0450
Train metrics/train/arxiv_weight: 0.0250
Train metrics/train/c4-rp_weight: 0.1500
Train memory/current_allocated_mem: 36.8820
Train memory/current_active_mem: 36.8820
Train memory/current_inactive_mem: 0.1744
Train memory/current_reserved_mem: 55.9060
Train memory/peak_allocated_mem: 42.9380
Train memory/peak_active_mem: 42.9380
Train memory/peak_inactive_mem: 7.8742
Train memory/peak_reserved_mem: 55.9060
Train memory/alloc_retries: 0
Train metrics/train/expected_head_sparsity: 0.0039
Train metrics/train/target_head_sparsity: 0.0129
Train metrics/train/expected_intermediate_sparsity: 0.0039
Train metrics/train/target_intermediate_sparsity: 0.0128
Train metrics/train/expected_layer_sparsity: 0.0039
Train metrics/train/target_layer_sparsity: 0.0000
Train metrics/train/expected_hidden_sparsity: 0.0039
Train metrics/train/target_hidden_sparsity: 0.0129
Train metrics/train/expected_sparsity: 0.0117
Train metrics/train/target_sparsity: 0.0209
Train trainer/device_train_microbatch_size: 3
Train loss/train/total: 1.4801
Train loss/train/ce_loss: 1.4716
Train loss/train/lag_loss: 0.0085
Train metrics/train/LanguageCrossEntropy: 1.4716
Train metrics/train/Perplexity: 4.3561
Train metrics/train/cc_LanguageCrossEntropy: 1.1558
Train metrics/train/cc_count: 65
Train metrics/train/github_LanguageCrossEntropy: nan
Train metrics/train/github_count: 7
Train metrics/train/book_LanguageCrossEntropy: nan
Train metrics/train/book_count: 7
Train metrics/train/stackexchange_LanguageCrossEntropy: 2.1491
Train metrics/train/stackexchange_count: 3
Train metrics/train/wiki_LanguageCrossEntropy: 1.5306
Train metrics/train/wiki_count: 8
Train metrics/train/arxiv_LanguageCrossEntropy: nan
Train metrics/train/arxiv_count: 6
Train metrics/train/c4-rp_LanguageCrossEntropy: 1.6471
Train metrics/train/c4-rp_count: 111
Train throughput/batches_per_sec: 0.0914
Train throughput/samples_per_sec: 0.8223
Train throughput/device/batches_per_sec: 0.0305
Train throughput/device/samples_per_sec: 0.2741
Train throughput/tokens_per_sec: 3368.2385
Train throughput/device/tokens_per_sec: 1122.7462
Train throughput/flops_per_sec: 157886485043818.8125
Train throughput/device/flops_per_sec: 52628828347939.6016
Train throughput/device/mfu: 0.1687
Train time/train: 0.0709
Train time/val: 0.0000
Train time/total: 0.0709
Train lr-DecoupledAdamW/group0: 0.0000
Train lr-DecoupledAdamW/group1: 0.0688
Train lr-DecoupledAdamW/group2: -0.0688
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3777, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1802129 milliseconds before timing out.
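For reference, the `Timeout(ms)=1800000` in the watchdog message is PyTorch's default 30-minute collective timeout: rank 2 sat in an `_ALLGATHER_BASE` that the stalled rank never joined. If you just need more headroom while debugging the data, the timeout can be raised where the process group is created. This is a stopgap sketch, not a fix for the missing data, and it assumes you control process-group initialization yourself (the trainer normally does this for you) and launch with a torchrun-style setup that sets the rendezvous environment variables.

```python
import datetime
import torch.distributed as dist

# Stopgap only: raise the collective timeout so a stalled rank surfaces later
# instead of killing the job after 30 minutes. It does not fix the underlying
# problem of one data domain running out of samples.
# Assumes MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are set by the launcher.
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),  # default is 30 minutes (1800000 ms)
)
```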