
Commit 00e9da9

[ChatQnA] Switch to vLLM as default llm backend on Gaudi (#1404)
Switching from TGI to vLLM as the default LLM serving backend on Gaudi for the ChatQnA example to enhance performance. #1213
Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
1 parent 277222a commit 00e9da9

11 files changed: +411 −401 lines

ChatQnA/README.md

Lines changed: 9 additions & 4 deletions
@@ -202,7 +202,7 @@ Gaudi default compose.yaml
 | Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
 | Retriever | Langchain, Redis | Xeon | 7000 | /v1/retrieval |
 | Reranking | Langchain, TEI | Gaudi | 8000 | /v1/reranking |
-| LLM | Langchain, TGI | Gaudi | 9000 | /v1/chat/completions |
+| LLM | Langchain, vLLM | Gaudi | 9000 | /v1/chat/completions |
 | Dataprep | Redis, Langchain | Xeon | 6007 | /v1/dataprep |

 ### Required Models
@@ -266,16 +266,21 @@ Refer to the [Intel Technology enabling for Openshift readme](https://github.com

 ### Check Service Status

-Before consuming ChatQnA Service, make sure the TGI/vLLM service is ready (which takes up to 2 minutes to start).
+Before consuming ChatQnA Service, make sure the vLLM/TGI service is ready, which takes some time.

 ```bash
+# vLLM example
+docker logs vllm-gaudi-server 2>&1 | grep complete
 # TGI example
-docker logs tgi-service | grep Connected
+docker logs tgi-gaudi-server | grep Connected
 ```

-Consume ChatQnA service until you get the TGI response like below.
+Consume ChatQnA service until you get the response like below.

 ```log
+# vLLM
+INFO: Application startup complete.
+# TGI
 2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
 ```
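A minimal sketch (not in this commit) of waiting for the vLLM backend before issuing requests, assuming the `vllm-gaudi-server` container name and log line from the hunk above:

```bash
# Poll the container log until vLLM reports startup completion (hypothetical helper, not part of the commit).
until docker logs vllm-gaudi-server 2>&1 | grep -q "Application startup complete"; do
  echo "Waiting for vllm-gaudi-server to finish loading and warming up the model..."
  sleep 10
done
echo "vLLM backend is ready."
```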

ChatQnA/docker_compose/intel/hpu/gaudi/README.md

Lines changed: 31 additions & 18 deletions
@@ -1,6 +1,8 @@
 # Build MegaService of ChatQnA on Gaudi

-This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on Intel Gaudi server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as embedding, retriever, rerank, and llm. We will publish the Docker images to Docker Hub, it will simplify the deployment process for this service.
+This document outlines the deployment process for a ChatQnA application utilizing the [GenAIComps](https://github.com/opea-project/GenAIComps.git) microservice pipeline on an Intel Gaudi server. The steps include Docker image creation, container deployment via Docker Compose, and service execution to integrate microservices such as `embedding`, `retriever`, `rerank`, and `llm`.
+
+The default pipeline deploys with vLLM as the LLM serving component and leverages the rerank component. It also provides options to run without rerank, to add guardrails, or to use the TGI backend for the LLM microservice; see the [start-all-the-services-docker-containers](#start-all-the-services-docker-containers) section on this page.

 Quick Start:

@@ -184,15 +186,18 @@ By default, the embedding, reranking and LLM models are set to a default value a

 Change the `xxx_MODEL_ID` below for your needs.

-For users in China who are unable to download models directly from Huggingface, you can use [ModelScope](https://www.modelscope.cn/models) or a Huggingface mirror to download models. TGI can load the models either online or offline as described below:
+For users in China who are unable to download models directly from Huggingface, you can use [ModelScope](https://www.modelscope.cn/models) or a Huggingface mirror to download models. vLLM or TGI can load the models either online or offline as described below:

 1. Online

 ```bash
 export HF_TOKEN=${your_hf_token}
 export HF_ENDPOINT="https://hf-mirror.com"
 model_name="Intel/neural-chat-7b-v3-3"
-docker run -p 8008:80 -v ./data:/data --name tgi-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id $model_name --max-input-tokens 1024 --max-total-tokens 2048
+# Start vLLM LLM Service
+docker run -p 8007:80 -v ./data:/data --name vllm-gaudi-server -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e VLLM_TORCH_PROFILER_DIR="/mnt" --cap-add=sys_nice --ipc=host opea/vllm-gaudi:latest --model $model_name --tensor-parallel-size 1 --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048
+# Start TGI LLM Service
+docker run -p 8005:80 -v ./data:/data --name tgi-gaudi-server -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id $model_name --max-input-tokens 1024 --max-total-tokens 2048
 ```

 2. Offline
@@ -201,12 +206,15 @@ For users in China who are unable to download models directly from Huggingface,

 - Click on `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.

-- Run the following command to start TGI service.
+- Run the following command to start the LLM service.

 ```bash
 export HF_TOKEN=${your_hf_token}
 export model_path="/path/to/model"
-docker run -p 8008:80 -v $model_path:/data --name tgi_service --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id /data --max-input-tokens 1024 --max-total-tokens 2048
+# Start vLLM LLM Service
+docker run -p 8007:80 -v $model_path:/data --name vllm-gaudi-server --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e VLLM_TORCH_PROFILER_DIR="/mnt" --cap-add=sys_nice --ipc=host opea/vllm-gaudi:latest --model /data --tensor-parallel-size 1 --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048
+# Start TGI LLM Service
+docker run -p 8005:80 -v $model_path:/data --name tgi-gaudi-server --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e ENABLE_HPU_GRAPH=true -e LIMIT_HPU_GRAPH=true -e USE_FLASH_ATTENTION=true -e FLASH_ATTENTION_RECOMPUTE=true --cap-add=sys_nice --ipc=host ghcr.io/huggingface/tgi-gaudi:2.0.6 --model-id /data --max-input-tokens 1024 --max-total-tokens 2048
 ```

 ### Setup Environment Variables
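A quick sanity check (not in this commit) of the standalone vLLM container started by the `docker run` command above, a sketch assuming its OpenAI-compatible API on host port 8007 and the `/data` model path from the offline command:

```bash
# List the served model, then request a short completion end to end (port and model path taken from the command above).
curl http://localhost:8007/v1/models
curl http://localhost:8007/v1/chat/completions \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"model": "/data", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens": 17}'
```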
@@ -242,7 +250,7 @@ For users in China who are unable to download models directly from Huggingface,
 cd GenAIExamples/ChatQnA/docker_compose/intel/hpu/gaudi/
 ```

-If use tgi for llm backend.
+If using vLLM as the LLM serving backend:

 ```bash
 # Start ChatQnA with Rerank Pipeline
@@ -251,10 +259,10 @@ docker compose -f compose.yaml up -d
 docker compose -f compose_without_rerank.yaml up -d
 ```

-If use vllm for llm backend.
+If using TGI as the LLM serving backend:

 ```bash
-docker compose -f compose_vllm.yaml up -d
+docker compose -f compose_tgi.yaml up -d
 ```

 If you want to enable guardrails microservice in the pipeline, please follow the below command instead:
@@ -309,35 +317,40 @@ For validation details, please refer to [how-to-validate_service](./how_to_valid

 4. LLM backend Service

-In first startup, this service will take more time to download the model files. After it's finished, the service will be ready.
+In the first startup, this service will take more time to download, load and warm up the model. After it's finished, the service will be ready.

 Try the command below to check whether the LLM serving is ready.

 ```bash
-docker logs tgi-gaudi-server | grep Connected
+# vLLM service
+docker logs vllm-gaudi-server 2>&1 | grep complete
+# If the service is ready, you will get the response like below.
+INFO: Application startup complete.
 ```

+```bash
+# TGI service
+docker logs tgi-gaudi-server | grep Connected
 If the service is ready, you will get the response like below.
-
-```
 2024-09-03T02:47:53.402023Z INFO text_generation_router::server: router/src/server.rs:2311: Connected
 ```

 Then try the `cURL` command below to validate services.

 ```bash
-# TGI service
-curl http://${host_ip}:8005/v1/chat/completions \
+# vLLM Service
+curl http://${host_ip}:8007/v1/chat/completions \
   -X POST \
   -d '{"model": ${LLM_MODEL_ID}, "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
   -H 'Content-Type: application/json'
 ```

 ```bash
-# vLLM Service
-curl http://${host_ip}:8007/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{"model": ${LLM_MODEL_ID}, "messages": [{"role": "user", "content": "What is Deep Learning?"}]}'
+# TGI service
+curl http://${host_ip}:8005/v1/chat/completions \
+  -X POST \
+  -d '{"model": ${LLM_MODEL_ID}, "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
+  -H 'Content-Type: application/json'
 ```

 5. MegaService
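One note on the cURL examples above: `${LLM_MODEL_ID}` sits inside single quotes, so the shell does not expand it and the request body is not valid JSON. A sketch of an invocation that expands and quotes the variable (same endpoint and payload as the hunk):

```bash
# Double quotes let the shell expand ${host_ip} and ${LLM_MODEL_ID}; escaped quotes keep the JSON valid.
curl "http://${host_ip}:8007/v1/chat/completions" \
  -X POST \
  -H 'Content-Type: application/json' \
  -d "{\"model\": \"${LLM_MODEL_ID}\", \"messages\": [{\"role\": \"user\", \"content\": \"What is Deep Learning?\"}], \"max_tokens\": 17}"
```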

ChatQnA/docker_compose/intel/hpu/gaudi/compose.yaml

Lines changed: 12 additions & 36 deletions
@@ -25,7 +25,6 @@ services:
       INDEX_NAME: ${INDEX_NAME}
       TEI_ENDPOINT: http://tei-embedding-service:80
       HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-      TELEMETRY_ENDPOINT: ${TELEMETRY_ENDPOINT}
   tei-embedding-service:
     image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.5
     container_name: tei-embedding-gaudi-server
@@ -38,7 +37,7 @@ services:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
-    command: --model-id ${EMBEDDING_MODEL_ID} --auto-truncate --otlp-endpoint $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
+    command: --model-id ${EMBEDDING_MODEL_ID} --auto-truncate
   retriever:
     image: ${REGISTRY:-opea}/retriever:${TAG:-latest}
     container_name: retriever-redis-server
@@ -56,9 +55,6 @@ services:
       INDEX_NAME: ${INDEX_NAME}
       TEI_EMBEDDING_ENDPOINT: http://tei-embedding-service:80
       HUGGINGFACEHUB_API_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-      TELEMETRY_ENDPOINT: ${TELEMETRY_ENDPOINT}
-      LOGFLAG: ${LOGFLAG}
-      RETRIEVER_COMPONENT_NAME: "OPEA_RETRIEVER_REDIS"
     restart: unless-stopped
   tei-reranking-service:
     image: ghcr.io/huggingface/tei-gaudi:1.5.0
@@ -80,47 +76,28 @@ services:
       HABANA_VISIBLE_DEVICES: all
       OMPI_MCA_btl_vader_single_copy_mechanism: none
       MAX_WARMUP_SEQUENCE_LENGTH: 512
-    command: --model-id ${RERANK_MODEL_ID} --auto-truncate --otlp-endpoint $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
-  tgi-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
-    container_name: tgi-gaudi-server
+    command: --model-id ${RERANK_MODEL_ID} --auto-truncate
+  vllm-service:
+    image: ${REGISTRY:-opea}/vllm-gaudi:${TAG:-latest}
+    container_name: vllm-gaudi-server
     ports:
-      - "8005:80"
+      - "8007:80"
     volumes:
       - "./data:/data"
     environment:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
-      HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-      HF_HUB_DISABLE_PROGRESS_BARS: 1
-      HF_HUB_ENABLE_HF_TRANSFER: 0
+      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
       HABANA_VISIBLE_DEVICES: all
       OMPI_MCA_btl_vader_single_copy_mechanism: none
-      ENABLE_HPU_GRAPH: true
-      LIMIT_HPU_GRAPH: true
-      USE_FLASH_ATTENTION: true
-      FLASH_ATTENTION_RECOMPUTE: true
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
     runtime: habana
     cap_add:
       - SYS_NICE
     ipc: host
-    command: --model-id ${LLM_MODEL_ID} --max-input-length 2048 --max-total-tokens 4096 --otlp-endpoint $OTEL_EXPORTER_OTLP_TRACES_ENDPOINT
-  jaeger:
-    image: jaegertracing/all-in-one:latest
-    container_name: jaeger
-    ports:
-      - "16686:16686"
-      - "4317:4317"
-      - "4318:4318"
-      - "9411:9411"
-    ipc: host
-    environment:
-      no_proxy: ${no_proxy}
-      http_proxy: ${http_proxy}
-      https_proxy: ${https_proxy}
-      COLLECTOR_ZIPKIN_HOST_PORT: 9411
-    restart: unless-stopped
+    command: --model $LLM_MODEL_ID --tensor-parallel-size 1 --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048
   chatqna-gaudi-backend-server:
     image: ${REGISTRY:-opea}/chatqna:${TAG:-latest}
     container_name: chatqna-gaudi-backend-server
@@ -129,7 +106,7 @@ services:
       - tei-embedding-service
       - retriever
       - tei-reranking-service
-      - tgi-service
+      - vllm-service
     ports:
       - "8888:8888"
     environment:
@@ -142,11 +119,10 @@ services:
       - RETRIEVER_SERVICE_HOST_IP=retriever
       - RERANK_SERVER_HOST_IP=tei-reranking-service
       - RERANK_SERVER_PORT=${RERANK_SERVER_PORT:-80}
-      - LLM_SERVER_HOST_IP=tgi-service
+      - LLM_SERVER_HOST_IP=vllm-service
       - LLM_SERVER_PORT=${LLM_SERVER_PORT:-80}
       - LLM_MODEL=${LLM_MODEL_ID}
       - LOGFLAG=${LOGFLAG}
-      - TELEMETRY_ENDPOINT=${TELEMETRY_ENDPOINT}
     ipc: host
     restart: always
   chatqna-gaudi-ui-server:
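A quick way (not in this commit) to confirm the new `vllm-service` wiring after exporting the environment variables, a sketch using the standard Compose CLI:

```bash
# Render the effective configuration for the new LLM backend service defined above.
docker compose -f compose.yaml config vllm-service
# After `docker compose -f compose.yaml up -d`, confirm the container and its published 8007:80 port.
docker compose -f compose.yaml ps vllm-service
docker port vllm-gaudi-server
```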

ChatQnA/docker_compose/intel/hpu/gaudi/compose_guardrails.yaml

Lines changed: 9 additions & 13 deletions
@@ -118,9 +118,9 @@ services:
       OMPI_MCA_btl_vader_single_copy_mechanism: none
       MAX_WARMUP_SEQUENCE_LENGTH: 512
     command: --model-id ${RERANK_MODEL_ID} --auto-truncate
-  tgi-service:
-    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
-    container_name: tgi-gaudi-server
+  vllm-service:
+    image: ${REGISTRY:-opea}/vllm-gaudi:${TAG:-latest}
+    container_name: vllm-gaudi-server
     ports:
       - "8008:80"
     volumes:
@@ -129,20 +129,16 @@ services:
       no_proxy: ${no_proxy}
       http_proxy: ${http_proxy}
       https_proxy: ${https_proxy}
-      HUGGING_FACE_HUB_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
-      HF_HUB_DISABLE_PROGRESS_BARS: 1
-      HF_HUB_ENABLE_HF_TRANSFER: 0
+      HF_TOKEN: ${HUGGINGFACEHUB_API_TOKEN}
       HABANA_VISIBLE_DEVICES: all
       OMPI_MCA_btl_vader_single_copy_mechanism: none
-      ENABLE_HPU_GRAPH: true
-      LIMIT_HPU_GRAPH: true
-      USE_FLASH_ATTENTION: true
-      FLASH_ATTENTION_RECOMPUTE: true
+      LLM_MODEL_ID: ${LLM_MODEL_ID}
+      VLLM_TORCH_PROFILER_DIR: "/mnt"
     runtime: habana
     cap_add:
       - SYS_NICE
     ipc: host
-    command: --model-id ${LLM_MODEL_ID} --max-input-length 1024 --max-total-tokens 2048
+    command: --model $LLM_MODEL_ID --tensor-parallel-size 1 --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048
   chatqna-gaudi-backend-server:
     image: ${REGISTRY:-opea}/chatqna-guardrails:${TAG:-latest}
     container_name: chatqna-gaudi-guardrails-server
@@ -153,7 +149,7 @@ services:
       - tei-embedding-service
       - retriever
       - tei-reranking-service
-      - tgi-service
+      - vllm-service
     ports:
       - "8888:8888"
     environment:
@@ -168,7 +164,7 @@ services:
       - RETRIEVER_SERVICE_HOST_IP=retriever
       - RERANK_SERVER_HOST_IP=tei-reranking-service
       - RERANK_SERVER_PORT=${RERANK_SERVER_PORT:-80}
-      - LLM_SERVER_HOST_IP=tgi-service
+      - LLM_SERVER_HOST_IP=vllm-service
       - LLM_SERVER_PORT=${LLM_SERVER_PORT:-80}
       - LLM_MODEL=${LLM_MODEL_ID}
       - LOGFLAG=${LOGFLAG}
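In the guardrails variant the same vLLM backend is published on host port 8008 (per the `8008:80` mapping above), so readiness and sanity checks target that port; a sketch (not in this commit):

```bash
# Check the vLLM container log for readiness, then query its OpenAI-compatible API on host port 8008.
docker logs vllm-gaudi-server 2>&1 | grep -m1 "Application startup complete"
curl http://${host_ip}:8008/v1/models
```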
