@@ -68,7 +68,7 @@ You always need the root node and you can add 2^n - 1 worker nodes to speed up t
| `--model <path>` | Path to the model. | `dllama_model_meta-llama-3-8b_q40.m` |
| `--tokenizer <path>` | Path to the tokenizer. | `dllama_tokenizer_llama3.t` |
| `--buffer-float-type <type>` | Float precision of synchronization. | `q80` |
- | `--workers <workers>` | Addresses of workers (ip:port), separated by spaces. | `10.0.0.1:9991 10.0.0.2:9991` |
+ | `--workers <workers>` | Addresses of workers (ip:port), separated by spaces. | `10.0.0.1:9999 10.0.0.2:9999` |
| `--max-seq-len <n>` | The maximum sequence length; it helps to reduce RAM usage. | `4096` |

Inference, Chat, Worker, API
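For reference, the flags above compose into a single launch command. A minimal sketch, assuming the illustrative model path, tokenizer path, and worker addresses from the table (adjust all of them to your own setup):

```sh
# Sketch only: combines the documented flags; the paths and the
# worker addresses are the example values from the table above.
./dllama inference \
  --model dllama_model_meta-llama-3-8b_q40.m \
  --tokenizer dllama_tokenizer_llama3.t \
  --buffer-float-type q80 \
  --prompt "Hello world" \
  --steps 16 \
  --nthreads 4 \
  --max-seq-len 4096 \
  --workers 10.0.0.1:9999 10.0.0.2:9999
```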
@@ -145,17 +145,17 @@ Continue to point 3.
3. Transfer weights and the tokenizer file to the root computer.
4. Run worker nodes on worker computers:
```sh
- ./dllama worker --port 9998 --nthreads 4
+ ./dllama worker --port 9999 --nthreads 4
```
5. Run root node on the root computer:
```sh
- ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
+ ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9999
```

To add more worker nodes, just add more addresses to the `--workers` argument.

```
- ./dllama inference ... --workers 192.168.0.1:9998 192.168.0.2:9998 192.168.0.3:9998
+ ./dllama inference ... --workers 192.168.0.1:9999 192.168.0.2:9999 192.168.0.3:9999
```

</details>
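Every worker has to be listening on the port named in `--workers` before the root node starts. A quick reachability probe with `nc` (standard netcat, not part of dllama; the addresses are the example values above) can catch a mismatched port early:

```sh
# Assumption: workers were started with --port 9999 as in step 4.
# Probe each ip:port pair with netcat before running the root node.
for w in 192.168.0.1:9999 192.168.0.2:9999 192.168.0.3:9999; do
  if nc -z -w 2 "${w%:*}" "${w#*:}"; then
    echo "worker $w reachable"
  else
    echo "worker $w NOT reachable"
  fi
done
```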
@@ -192,17 +192,17 @@ sudo ip addr add 10.0.0.2/24 dev eth0 # 2nd device
```
8. Run worker nodes on worker devices:
```sh
- sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
+ sudo nice -n -20 ./dllama worker --port 9999 --nthreads 4
```
9. Run root node on the root device:
```sh
- sudo nice -n -20 ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998
+ sudo nice -n -20 ./dllama inference --model dllama_model_meta-llama-3-8b_q40.m --tokenizer dllama_tokenizer_llama3.t --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9999
```

To add more worker nodes, just add more addresses to the `--workers` argument.

```
- ./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
+ ./dllama inference ... --workers 10.0.0.2:9999 10.0.0.3:9999 10.0.0.4:9999
```

</details>