`README.md` (33 additions, 2 deletions)

## News and Updates

* 2023.11.30 We release the new version (v0.2):
  * Provide the actually used training scheme for [Aquila2-70B-Expr](./examples/aquila/70B), including the parallel strategies, optimizations and hyper-parameter settings.
  * Support heterogeneous training on chips of different generations with the same architecture or compatible architectures, including NVIDIA GPUs and Iluvatar CoreX chips.
  * Support training on Chinese domestic hardware, including Iluvatar CoreX and Baidu KUNLUN chips.

* 2023.10.11 We release the initial version (v0.1), supporting the Aquila models and providing our actually used training schemes for [Aquila2-7B](./examples/aquila/7B/pretrain_aquila_7b_distributed_A800_12n_80g.sh) and [Aquila2-34B](./examples/aquila/34B/pretrain_aquila_34b_distributed_A100_64n_40g.sh), including the parallel strategies, optimizations and hyper-parameter settings.

## Quick Start

```
cd FlagScale
pip install -r requirements.txt
```

### Pretrain the Aquila model

1. Change to the aquila directory

Before running `dist_stop.sh`, you should provide the required information:
* `HOSTFILE`: the hostfile of the nodes for the current training.

### Do the heterogeneous training

It is straightforward to run heterogeneous training on chips of different generations with the same architecture or compatible architectures. You only need to follow the steps below; everything else remains the same as the homogeneous training above. You can also refer to the examples [1](./examples/aquila/34B/pretrain_aquila_34b_distributed_A800_16n_80g_A100_48n_40g_hetero_pp.sh), [2](./examples/aquila/34B/pretrain_aquila_34b_distributed_A800_16n_80g_A100_48n_40g_hetero_dp.sh) and [3](./examples/aquila/70B/pretrain_aquila_70b_distributed_A800_16n_80g_A100_48n_40g_hetero_pp.sh) for a better understanding.

1. Extend the hostfile

   Before doing the heterogeneous training, you should extend the hostfile by adding the device types. You are free to choose the identifier strings for these device types, but please ensure they are not duplicated.

   ```
   hostnames-1/IP-1 slots=8 typeA
   hostnames-2/IP-2 slots=8 typeB
   ```
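
   As a purely illustrative sketch of the extended format (this is not FlagScale's actual hostfile parsing code; the helper name and return value are made up), the device-type column can be consumed like this:

   ```python
   def read_extended_hostfile(path):
       """Return {device_type: total_slots} for a hostfile in the extended format above."""
       slots_by_type = {}
       with open(path) as f:
           for line in f:
               fields = line.split()
               if len(fields) != 3:                  # expect: host, slots=N, device type
                   continue
               _host, slots, device_type = fields
               num_slots = int(slots.split("=")[1])  # "slots=8" -> 8
               slots_by_type[device_type] = slots_by_type.get(device_type, 0) + num_slots
       return slots_by_type

   # For the example hostfile above: {"typeA": 8, "typeB": 8}
   ```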

2. Add the heterogeneous configuration

   * If you choose the heterogeneous pipeline parallelism mode, please set the following configurations:
     * `hetero-mode`: specify the heterogeneous training mode `pp`.
     * `hetero-current-device-type`: specify the device type of the current node.
     * `hetero-device-types`: specify all the device types used in this training.
     * `hetero-pipeline-stages`: specify the stage splitting configuration. For example, given `2 4 4 3 5 5 5`, the total pipeline parallel size is `2 + 3 = 5`, the total number of model layers is `4 + 4 + 5 + 5 + 5 = 23`, the pipeline parallel size for the first device type in the `hetero-device-types` list is `2`, and the pipeline parallel size for the second device type in the `hetero-device-types` list is `3` (see the sketch after this list).

   * If you choose the heterogeneous data parallelism mode, please set the following configurations:
     * `hetero-mode`: specify the heterogeneous training mode `dp`.
     * `hetero-current-device-type`: specify the device type of the current node.
     * `hetero-device-types`: specify all the device types used in this training.
     * `hetero-micro-batch-sizes`: specify the micro batch size splitting configuration. For example, given `2 1 3 2`, the total data parallel size is `2 + 3 = 5` and the micro batch size for each training iteration is `2 * 1 + 3 * 2 = 8`; the data parallel size for the first device type in the `hetero-device-types` list is `2` and the data parallel size for the second device type in the list is `3` (see the sketch after this list).
     * **Remove** the `micro-batch-size` configuration, because `hetero-micro-batch-sizes` serves the same purpose.
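
To make the two splitting configurations concrete, here is a small Python sketch (not part of FlagScale; the helper functions and the `typeA`/`typeB` names are made up for illustration) that interprets the example values from the bullets above and reproduces the arithmetic described there:

```python
# Illustrative sketch only: interpret the example hetero configurations above.

def parse_hetero_pipeline_stages(values, device_types):
    """values = [pp_size, layers_per_stage..., pp_size, layers_per_stage..., ...]."""
    result, i = {}, 0
    for dev in device_types:
        pp_size = values[i]
        layers = values[i + 1 : i + 1 + pp_size]
        result[dev] = {"pp_size": pp_size, "layers_per_stage": layers}
        i += 1 + pp_size
    return result

def parse_hetero_micro_batch_sizes(values, device_types):
    """values = [dp_size, micro_batch_size, dp_size, micro_batch_size, ...]."""
    pairs = zip(values[0::2], values[1::2])
    return {dev: {"dp_size": dp, "micro_batch_size": mbs}
            for dev, (dp, mbs) in zip(device_types, pairs)}

device_types = ["typeA", "typeB"]  # order must match `hetero-device-types`

pp = parse_hetero_pipeline_stages([2, 4, 4, 3, 5, 5, 5], device_types)
total_pp_size = sum(v["pp_size"] for v in pp.values())               # 2 + 3 = 5
total_layers = sum(sum(v["layers_per_stage"]) for v in pp.values())  # 4 + 4 + 5 + 5 + 5 = 23

dp = parse_hetero_micro_batch_sizes([2, 1, 3, 2], device_types)
total_dp_size = sum(v["dp_size"] for v in dp.values())               # 2 + 3 = 5
micro_batch_per_iter = sum(
    v["dp_size"] * v["micro_batch_size"] for v in dp.values()        # 2*1 + 3*2 = 8
)

print(total_pp_size, total_layers, total_dp_size, micro_batch_per_iter)  # 5 23 5 8
```

Running the sketch prints `5 23 5 8`, matching the totals worked out in the bullets above.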