
Commit 742cb6d

[ChatQnA] Switch to vLLM as default llm backend on Xeon (#1403)
Switching from TGI to vLLM as the default LLM serving backend on Xeon for the ChatQnA example to enhance performance (#1213).

Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
1 parent 00e9da9 commit 742cb6d

13 files changed: +259, -254 lines
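In practice, the switch means the default Compose file now brings up vLLM, while TGI moves to an alternative file. A minimal usage sketch, assuming the paths and file names referenced in the updated README below:

```bash
cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon/

# Default: vLLM as the LLM serving backend (with rerank)
docker compose -f compose.yaml up -d

# Alternative: keep TGI as the LLM serving backend
docker compose -f compose_tgi.yaml up -d
```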

ChatQnA/docker_compose/intel/cpu/xeon/README.md

Lines changed: 25 additions & 21 deletions
@@ -1,6 +1,8 @@
# Build Mega Service of ChatQnA on Xeon

-This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`. We will publish the Docker images to Docker Hub soon, it will simplify the deployment process for this service.
+This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`.
+
+The default pipeline deploys with vLLM as the LLM serving component and leverages rerank component. It also provides options of not using rerank in the pipeline and using TGI backend for LLM microservice, please refer to [start-all-the-services-docker-containers](#start-all-the-services-docker-containers) section in this page. Besides, refer to [Build with Pinecone VectorDB](./README_pinecone.md) and [Build with Qdrant VectorDB](./README_qdrant.md) for other deployment variants.

Quick Start:

@@ -186,14 +188,17 @@ By default, the embedding, reranking and LLM models are set to a default value a

Change the `xxx_MODEL_ID` below for your needs.

-For users in China who are unable to download models directly from Huggingface, you can use [ModelScope](https://www.modelscope.cn/models) or a Huggingface mirror to download models. TGI can load the models either online or offline as described below:
+For users in China who are unable to download models directly from Huggingface, you can use [ModelScope](https://www.modelscope.cn/models) or a Huggingface mirror to download models. The vLLM/TGI can load the models either online or offline as described below:

1. Online

```bash
export HF_TOKEN=${your_hf_token}
export HF_ENDPOINT="https://hf-mirror.com"
model_name="Intel/neural-chat-7b-v3-3"
+# Start vLLM LLM Service
+docker run -p 8008:80 -v ./data:/data --name vllm-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 128g opea/vllm:latest --model $model_name --host 0.0.0.0 --port 80
+# Start TGI LLM Service
docker run -p 8008:80 -v ./data:/data --name tgi-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 1g ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu --model-id $model_name
```

@@ -203,12 +208,15 @@ For users in China who are unable to download models directly from Huggingface,

- Click on `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.

-- Run the following command to start TGI service.
+- Run the following command to start the LLM service.

```bash
export HF_TOKEN=${your_hf_token}
export model_path="/path/to/model"
-docker run -p 8008:80 -v $model_path:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu --model-id /data
+# Start vLLM LLM Service
+docker run -p 8008:80 -v $model_path:/data --name vllm-service --shm-size 128g opea/vllm:latest --model /data --host 0.0.0.0 --port 80
+# Start TGI LLM Service
+docker run -p 8008:80 -v $model_path:/data --name tgi-service --shm-size 1g ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu --model-id /data
```

### Setup Environment Variables
@@ -246,7 +254,7 @@ For users in China who are unable to download models directly from Huggingface,
cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon/
```

-If use TGI backend.
+If use vLLM as the LLM serving backend.

```bash
# Start ChatQnA with Rerank Pipeline
@@ -255,10 +263,10 @@ docker compose -f compose.yaml up -d
docker compose -f compose_without_rerank.yaml up -d
```

-If use vLLM backend.
+If use TGI as the LLM serving backend.

```bash
-docker compose -f compose_vllm.yaml up -d
+docker compose -f compose_tgi.yaml up -d
```

### Validate Microservices
@@ -305,37 +313,34 @@ For details on how to verify the correctness of the response, refer to [how-to-v

4. LLM backend Service

-In first startup, this service will take more time to download the model files. After it's finished, the service will be ready.
+In the first startup, this service will take more time to download, load and warm up the model. After it's finished, the service will be ready.

Try the command below to check whether the LLM serving is ready.

```bash
-docker logs tgi-service | grep Connected
+# vLLM service
+docker logs vllm-service 2>&1 | grep complete
+# If the service is ready, you will get the response like below.
+INFO: Application startup complete.
```

-If the service is ready, you will get the response like below.
-
-```
+```bash
+# TGI service
+docker logs tgi-service | grep Connected
+# If the service is ready, you will get the response like below.
2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
```

Then try the `cURL` command below to validate services.

```bash
-# TGI service
+# either vLLM or TGI service
curl http://${host_ip}:9009/v1/chat/completions \
  -X POST \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
  -H 'Content-Type: application/json'
```

-```bash
-# vLLM Service
-curl http://${host_ip}:9009/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}]}'
-```
-
5. MegaService

```bash
@@ -362,7 +367,6 @@ Or run this command to get the file on a terminal.

```bash
wget https://raw.githubusercontent.com/opea-project/GenAIComps/v1.1/comps/retrievers/redis/data/nke-10k-2023.pdf
-
```

Upload:
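Condensing the validation hunks above into one place, a hedged sketch of the readiness check and the OpenAI-compatible request as the updated README describes them (exact log text can vary between image versions):

```bash
# vLLM (default): wait for model download/load/warm-up to finish
docker logs vllm-service 2>&1 | grep complete   # expect: INFO: Application startup complete.

# TGI (alternative)
docker logs tgi-service | grep Connected

# Either backend answers on host port 9009 with the OpenAI chat completions API
curl http://${host_ip}:9009/v1/chat/completions \
  -X POST \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
  -H 'Content-Type: application/json'
```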

ChatQnA/docker_compose/intel/cpu/xeon/README_pinecone.md

Lines changed: 11 additions & 10 deletions
@@ -1,6 +1,8 @@
# Build Mega Service of ChatQnA on Xeon

-This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`. We will publish the Docker images to Docker Hub soon, it will simplify the deployment process for this service.
+This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`.
+
+The default pipeline deploys with vLLM as the LLM serving component and leverages rerank component.

Quick Start:

@@ -189,15 +191,15 @@ By default, the embedding, reranking and LLM models are set to a default value a

Change the `xxx_MODEL_ID` below for your needs.

-For users in China who are unable to download models directly from Huggingface, you can use [ModelScope](https://www.modelscope.cn/models) or a Huggingface mirror to download models. TGI can load the models either online or offline as described below:
+For users in China who are unable to download models directly from Huggingface, you can use [ModelScope](https://www.modelscope.cn/models) or a Huggingface mirror to download models. The vLLM can load the models either online or offline as described below:

1. Online

```bash
export HF_TOKEN=${your_hf_token}
export HF_ENDPOINT="https://hf-mirror.com"
model_name="Intel/neural-chat-7b-v3-3"
-docker run -p 8008:80 -v ./data:/data --name tgi-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 1g ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu --model-id $model_name
+docker run -p 8008:80 -v ./data:/data --name vllm-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 128g opea/vllm:latest --model $model_name --host 0.0.0.0 --port 80
```

2. Offline
@@ -206,12 +208,12 @@ For users in China who are unable to download models directly from Huggingface,

- Click on `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.

-- Run the following command to start TGI service.
+- Run the following command to start the LLM service.

```bash
export HF_TOKEN=${your_hf_token}
export model_path="/path/to/model"
-docker run -p 8008:80 -v $model_path:/data --name tgi_service --shm-size 1g ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu --model-id /data
+docker run -p 8008:80 -v $model_path:/data --name vllm-service --shm-size 128g opea/vllm:latest --model /data --host 0.0.0.0 --port 80
```

### Setup Environment Variables
@@ -252,7 +254,7 @@ For users in China who are unable to download models directly from Huggingface,
cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon/
```

-If use TGI backend.
+If use vLLM backend.

```bash
# Start ChatQnA with Rerank Pipeline
@@ -303,24 +305,23 @@ For details on how to verify the correctness of the response, refer to [how-to-v

4. LLM backend Service

-In first startup, this service will take more time to download the model files. After it's finished, the service will be ready.
+In the first startup, this service will take more time to download, load and warm up the model. After it's finished, the service will be ready.

Try the command below to check whether the LLM serving is ready.

```bash
-docker logs tgi-service | grep Connected
+docker logs vllm-service 2>&1 | grep complete
```

If the service is ready, you will get the response like below.

```text
-2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
+INFO: Application startup complete.
```

Then try the `cURL` command below to validate services.

```bash
-# TGI service
curl http://${host_ip}:9009/v1/chat/completions \
  -X POST \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
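As an additional check not spelled out in this README (a hedged suggestion, assuming the vLLM OpenAI-compatible server started by this setup), the `/v1/models` route can confirm which model the vllm-service actually loaded:

```bash
# List the model(s) served by vLLM; should show the configured LLM_MODEL_ID
curl http://${host_ip}:9009/v1/models -H 'Content-Type: application/json'
```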

ChatQnA/docker_compose/intel/cpu/xeon/README_qdrant.md

Lines changed: 12 additions & 10 deletions
@@ -1,6 +1,8 @@
# Build Mega Service of ChatQnA (with Qdrant) on Xeon

-This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`. We will publish the Docker images to Docker Hub soon, it will simplify the deployment process for this service.
+This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Xeon server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`.
+
+The default pipeline deploys with vLLM as the LLM serving component and leverages rerank component.

## 🚀 Apply Xeon Server on AWS

@@ -44,7 +46,7 @@ reranking
=========
Port 6046 - Open to 0.0.0.0/0

-tgi-service
+vllm-service
===========
Port 6042 - Open to 0.0.0.0/0

@@ -170,7 +172,7 @@ export your_hf_api_token="Your_Huggingface_API_Token"

**Append the value of the public IP address to the no_proxy list if you are in a proxy environment**

```
-export your_no_proxy=${your_no_proxy},"External_Public_IP",chatqna-xeon-ui-server,chatqna-xeon-backend-server,dataprep-qdrant-service,tei-embedding-service,retriever,tei-reranking-service,tgi-service
+export your_no_proxy=${your_no_proxy},"External_Public_IP",chatqna-xeon-ui-server,chatqna-xeon-backend-server,dataprep-qdrant-service,tei-embedding-service,retriever,tei-reranking-service,tgi-service,vllm-service
```

```bash
@@ -233,23 +235,23 @@ For details on how to verify the correctness of the response, refer to [how-to-v
  -H 'Content-Type: application/json'
```

-4. TGI Service
+4. LLM Backend Service

-In first startup, this service will take more time to download the model files. After it's finished, the service will be ready.
+In the first startup, this service will take more time to download, load and warm up the model. After it's finished, the service will be ready.

-Try the command below to check whether the TGI service is ready.
+Try the command below to check whether the LLM service is ready.

```bash
-docker logs ${CONTAINER_ID} | grep Connected
+docker logs vllm-service 2>&1 | grep complete
```

If the service is ready, you will get the response like below.

-```
-2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
+```text
+INFO: Application startup complete.
```

-Then try the `cURL` command below to validate TGI.
+Then try the `cURL` command below to validate vLLM service.

```bash
curl http://${host_ip}:6042/v1/chat/completions \
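For the Qdrant variant, the same checks apply against the port used in this README; a minimal sketch, assuming the default model is unchanged (the request body mirrors the main Xeon README):

```bash
# Readiness check for the vLLM backend
docker logs vllm-service 2>&1 | grep complete   # expect: INFO: Application startup complete.

# Chat completion against the Qdrant-variant port 6042
curl http://${host_ip}:6042/v1/chat/completions \
  -X POST \
  -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
  -H 'Content-Type: application/json'
```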

ChatQnA/docker_compose/intel/cpu/xeon/compose.yaml

Lines changed: 9 additions & 10 deletions
@@ -74,32 +74,31 @@ services:
      HF_HUB_DISABLE_PROGRESS_BARS: 1
      HF_HUB_ENABLE_HF_TRANSFER: 0
    command: --model-id ${RERANK_MODEL_ID} --auto-truncate
-  tgi-service:
-    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
-    container_name: tgi-service
+  vllm-service:
+    image: ${REGISTRY:-opea}/vllm:${TAG:-latest}
+    container_name: vllm-service
    ports:
      - "9009:80"
    volumes:
      - "./data:/data"
-    shm_size: 1g
+    shm_size: 128g
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-      HF_HUB_DISABLE_PROGRESS_BARS: 1
-      HF_HUB_ENABLE_HF_TRANSFER: 0
-    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
+    command: --model $LLM_MODEL_ID --host 0.0.0.0 --port 80
  chatqna-xeon-backend-server:
    image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}
    container_name: chatqna-xeon-backend-server
    depends_on:
      - redis-vector-db
      - tei-embedding-service
-      - dataprep-redis-service
      - retriever
      - tei-reranking-service
-      - tgi-service
+      - vllm-service
    ports:
      - "8888:8888"
    environment:
@@ -112,7 +111,7 @@ services:
      - RETRIEVER_SERVICE_HOST_IP=retriever
      - RERANK_SERVER_HOST_IP=tei-reranking-service
      - RERANK_SERVER_PORT=${RERANK_SERVER_PORT:-80}
-      - LLM_SERVER_HOST_IP=tgi-service
+      - LLM_SERVER_HOST_IP=vllm-service
      - LLM_SERVER_PORT=${LLM_SERVER_PORT:-80}
      - LLM_MODEL=${LLM_MODEL_ID}
      - LOGFLAG=${LOGFLAG}
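For reference, the new vllm-service can also be started standalone; a hedged docker run equivalent of the service definition above, assuming the README's default model ID and the 9009 host port mapping from this file:

```bash
export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"   # assumed default; set your own model ID
docker run -p 9009:80 -v ./data:/data --name vllm-service --shm-size 128g \
  -e HF_TOKEN=${HUGGINGFACEHUB_API_TOKEN} -e VLLM_TORCH_PROFILER_DIR="/mnt" \
  ${REGISTRY:-opea}/vllm:${TAG:-latest} --model $LLM_MODEL_ID --host 0.0.0.0 --port 80
```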

ChatQnA/docker_compose/intel/cpu/xeon/compose_pinecone.yaml

Lines changed: 9 additions & 9 deletions
@@ -68,22 +68,22 @@ services:
      HF_HUB_DISABLE_PROGRESS_BARS: 1
      HF_HUB_ENABLE_HF_TRANSFER: 0
    command: --model-id ${RERANK_MODEL_ID} --auto-truncate
-  tgi-service:
-    image: ghcr.io/huggingface/text-generation-inference:2.4.0-intel-cpu
-    container_name: tgi-service
+  vllm-service:
+    image: ${REGISTRY:-opea}/vllm:${TAG:-latest}
+    container_name: vllm-service
    ports:
      - "9009:80"
    volumes:
      - "./data:/data"
-    shm_size: 1g
+    shm_size: 128g
    environment:
      no_proxy: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-      HF_HUB_DISABLE_PROGRESS_BARS: 1
-      HF_HUB_ENABLE_HF_TRANSFER: 0
-    command: --model-id ${LLM_MODEL_ID} --cuda-graphs 0
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
+    command: --model $LLM_MODEL_ID --host 0.0.0.0 --port 80
  chatqna-xeon-backend-server:
    image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}
    container_name: chatqna-xeon-backend-server
@@ -92,7 +92,7 @@ services:
      - dataprep-pinecone-service
      - retriever
      - tei-reranking-service
-      - tgi-service
+      - vllm-service
    ports:
      - "8888:8888"
    environment:
@@ -105,7 +105,7 @@ services:
      - RETRIEVER_SERVICE_HOST_IP=retriever
      - RERANK_SERVER_HOST_IP=tei-reranking-service
      - RERANK_SERVER_PORT=${RERANK_SERVER_PORT:-80}
-      - LLM_SERVER_HOST_IP=tgi-service
+      - LLM_SERVER_HOST_IP=vllm-service
      - LLM_SERVER_PORT=${LLM_SERVER_PORT:-80}
      - LOGFLAG=${LOGFLAG}
      - LLM_MODEL=${LLM_MODEL_ID}
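A quick, hedged way to double-check the re-wiring in either compose file before starting it: `docker compose config` renders the final values, so the mega-service should resolve its LLM backend to vllm-service. The grep pattern below is only illustrative.

```bash
cd GenAIExamples/ChatQnA/docker_compose/intel/cpu/xeon/
docker compose -f compose_pinecone.yaml config | grep -E 'vllm-service|LLM_SERVER_HOST_IP'
# Expect a vllm-service entry and LLM_SERVER_HOST_IP=vllm-service in the output
```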
