
Commit 3d3ac59

[ChatQnA] Update the default LLM to llama3-8B on cpu/gpu/hpu (#1430)
Update the default LLM to llama3-8B on cpu/nvgpu/amdgpu/gaudi for docker-compose deployment, to avoid the potential model-serving issue and the missing chat-template issue with neural-chat-7b. Slow serving issue of neural-chat-7b on ICX: #1420. Signed-off-by: Wang, Kai Lawrence <kai.lawrence.wang@intel.com>
1 parent f11ab45 commit 3d3ac59

25 files changed (+96 / -80 lines)

ChatQnA/README.md

Lines changed: 12 additions & 8 deletions
@@ -8,7 +8,7 @@ RAG bridges the knowledge gap by dynamically fetching relevant information from
 
 | Cloud Provider | Intel Architecture | Intel Optimized Cloud Module for Terraform | Comments |
 | -------------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
-| AWS | 4th Gen Intel Xeon with Intel AMX | [AWS Module](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-xeon-opea-chatqna) | Uses Intel/neural-chat-7b-v3-3 by default |
+| AWS | 4th Gen Intel Xeon with Intel AMX | [AWS Module](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-xeon-opea-chatqna) | Uses meta-llama/Meta-Llama-3-8B-Instruct by default |
 | AWS Falcon2-11B | 4th Gen Intel Xeon with Intel AMX | [AWS Module with Falcon11B](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-xeon-opea-chatqna-falcon11B) | Uses TII Falcon2-11B LLM Model |
 | GCP | 5th Gen Intel Xeon with Intel AMX | [GCP Module](https://github.com/intel/terraform-intel-gcp-vm/tree/main/examples/gen-ai-xeon-opea-chatqna) | Also supports Confidential AI by using Intel® TDX with 4th Gen Xeon |
 | Azure | 5th Gen Intel Xeon with Intel AMX | Work-in-progress | Work-in-progress |
@@ -25,7 +25,7 @@ Use this if you are not using Terraform and have provisioned your system with an
 
 ## Manually Deploy ChatQnA Service
 
-The ChatQnA service can be effortlessly deployed on Intel Gaudi2, Intel Xeon Scalable Processors and Nvidia GPU.
+The ChatQnA service can be effortlessly deployed on Intel Gaudi2, Intel Xeon Scalable Processors, Nvidia GPU and AMD GPU.
 
 Two types of ChatQnA pipeline are supported now: `ChatQnA with/without Rerank`. And the `ChatQnA without Rerank` pipeline (including Embedding, Retrieval, and LLM) is offered for Xeon customers who can not run rerank service on HPU yet require high performance and accuracy.
 
@@ -35,7 +35,11 @@ Quick Start Deployment Steps:
 2. Run Docker Compose.
 3. Consume the ChatQnA Service.
 
-Note: If you do not have docker installed you can run this script to install docker : `bash docker_compose/install_docker.sh`
+Note:
+
+1. If you do not have Docker installed, you can run this script to install it: `bash docker_compose/install_docker.sh`.
+
+2. The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, please make sure you have either requested and been granted access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).
 
 ### Quick Start: 1.Setup Environment Variable
 
@@ -209,11 +213,11 @@ Gaudi default compose.yaml
 
 By default, the embedding, reranking and LLM models are set to a default value as listed below:
 
-| Service   | Model                     |
-| --------- | ------------------------- |
-| Embedding | BAAI/bge-base-en-v1.5     |
-| Reranking | BAAI/bge-reranker-base    |
-| LLM       | Intel/neural-chat-7b-v3-3 |
+| Service   | Model                               |
+| --------- | ----------------------------------- |
+| Embedding | BAAI/bge-base-en-v1.5               |
+| Reranking | BAAI/bge-reranker-base              |
+| LLM       | meta-llama/Meta-Llama-3-8B-Instruct |
 
 Change the `xxx_MODEL_ID` in `docker_compose/xxx/set_env.sh` for your needs.
 
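Meta-Llama-3-8B-Instruct is a gated model, so the note above matters in practice. Below is a minimal sketch of checking access before bringing the stack up, assuming the `huggingface_hub` package (and its CLI) is installed; the `HF_TOKEN` variable name follows the `set_env.sh` files touched by this commit.

```bash
# Hedged sketch: confirm the token has been granted access to the gated model
# before running docker compose (assumes the huggingface_hub CLI is installed).
export HF_TOKEN=${your_hf_token}
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct config.json --token $HF_TOKEN
```

If downloading `config.json` succeeds, the serving containers should be able to pull the full model with the same token.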
ChatQnA/chatqna.py

Lines changed: 1 addition & 1 deletion
@@ -57,7 +57,7 @@ def generate_rag_prompt(question, documents):
 RERANK_SERVER_PORT = int(os.getenv("RERANK_SERVER_PORT", 80))
 LLM_SERVER_HOST_IP = os.getenv("LLM_SERVER_HOST_IP", "0.0.0.0")
 LLM_SERVER_PORT = int(os.getenv("LLM_SERVER_PORT", 80))
-LLM_MODEL = os.getenv("LLM_MODEL", "Intel/neural-chat-7b-v3-3")
+LLM_MODEL = os.getenv("LLM_MODEL", "meta-llama/Meta-Llama-3-8B-Instruct")
 
 
 def align_inputs(self, inputs, cur_node, runtime_graph, llm_parameters_dict, **kwargs):

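Since `chatqna.py` only falls back to this default when `LLM_MODEL` is unset, a deployment can keep a different model without patching the file. A minimal sketch for a local run follows; the alternative model ID is purely illustrative and the launch command is an assumption (in a compose deployment the same variable would instead be set on the mega-service container).

```bash
# Hedged sketch: point the gateway at a different model via the env var chatqna.py reads.
export LLM_MODEL="Qwen/Qwen2.5-7B-Instruct"   # illustrative alternative; use whatever your LLM server hosts
python chatqna.py                             # assumed local launch of the mega-service
```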
ChatQnA/docker_compose/amd/gpu/rocm/README.md

Lines changed: 8 additions & 6 deletions
@@ -10,6 +10,8 @@ Quick Start Deployment Steps:
 2. Run Docker Compose.
 3. Consume the ChatQnA Service.
 
+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, please make sure you have either requested and been granted access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).
+
 ## Quick Start: 1.Setup Environment Variable
 
 To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -155,11 +157,11 @@ Then run the command `docker images`, you will have the following 5 Docker Image
 
 By default, the embedding, reranking and LLM models are set to a default value as listed below:
 
-| Service   | Model                     |
-| --------- | ------------------------- |
-| Embedding | BAAI/bge-base-en-v1.5     |
-| Reranking | BAAI/bge-reranker-base    |
-| LLM       | Intel/neural-chat-7b-v3-3 |
+| Service   | Model                               |
+| --------- | ----------------------------------- |
+| Embedding | BAAI/bge-base-en-v1.5               |
+| Reranking | BAAI/bge-reranker-base              |
+| LLM       | meta-llama/Meta-Llama-3-8B-Instruct |
 
 Change the `xxx_MODEL_ID` below for your needs.
 
@@ -179,7 +181,7 @@ Change the `xxx_MODEL_ID` below for your needs.
 export CHATQNA_TGI_SERVICE_IMAGE="ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
 export CHATQNA_EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
 export CHATQNA_RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export CHATQNA_LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export CHATQNA_LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
 export CHATQNA_TGI_SERVICE_PORT=8008
 export CHATQNA_TEI_EMBEDDING_PORT=8090
 export CHATQNA_TEI_EMBEDDING_ENDPOINT="http://${HOST_IP}:${CHATQNA_TEI_EMBEDDING_PORT}"

ChatQnA/docker_compose/amd/gpu/rocm/set_env.sh

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@
 export CHATQNA_TGI_SERVICE_IMAGE="ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
 export CHATQNA_EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
 export CHATQNA_RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export CHATQNA_LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export CHATQNA_LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
 export CHATQNA_TGI_SERVICE_PORT=18008
 export CHATQNA_TEI_EMBEDDING_PORT=18090
 export CHATQNA_TEI_EMBEDDING_ENDPOINT="http://${HOST_IP}:${CHATQNA_TEI_EMBEDDING_PORT}"

ChatQnA/docker_compose/intel/cpu/xeon/README.md

Lines changed: 12 additions & 10 deletions
@@ -10,6 +10,8 @@ Quick Start:
 2. Run Docker Compose.
 3. Consume the ChatQnA Service.
 
+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, please make sure you have either requested and been granted access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).
+
 ## Quick Start: 1.Setup Environment Variable
 
 To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -180,11 +182,11 @@ Then run the command `docker images`, you will have the following 5 Docker Image
 
 By default, the embedding, reranking and LLM models are set to a default value as listed below:
 
-| Service   | Model                     |
-| --------- | ------------------------- |
-| Embedding | BAAI/bge-base-en-v1.5     |
-| Reranking | BAAI/bge-reranker-base    |
-| LLM       | Intel/neural-chat-7b-v3-3 |
+| Service   | Model                               |
+| --------- | ----------------------------------- |
+| Embedding | BAAI/bge-base-en-v1.5               |
+| Reranking | BAAI/bge-reranker-base              |
+| LLM       | meta-llama/Meta-Llama-3-8B-Instruct |
 
 Change the `xxx_MODEL_ID` below for your needs.
 
@@ -195,7 +197,7 @@ For users in China who are unable to download models directly from Huggingface,
 ```bash
 export HF_TOKEN=${your_hf_token}
 export HF_ENDPOINT="https://hf-mirror.com"
-model_name="Intel/neural-chat-7b-v3-3"
+model_name="meta-llama/Meta-Llama-3-8B-Instruct"
 # Start vLLM LLM Service
 docker run -p 8008:80 -v ./data:/data --name vllm-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 128g opea/vllm:latest --model $model_name --host 0.0.0.0 --port 80
 # Start TGI LLM Service
@@ -204,7 +206,7 @@ For users in China who are unable to download models directly from Huggingface,
 
 2. Offline
 
-   - Search your model name in ModelScope. For example, check [this page](https://www.modelscope.cn/models/ai-modelscope/neural-chat-7b-v3-1/files) for model `neural-chat-7b-v3-1`.
+   - Search your model name in ModelScope. For example, check [this page](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files) for model `Meta-Llama-3-8B-Instruct`.
 
    - Click on `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.
 
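For the offline route above, one possible way to fetch the model without the web UI is sketched below; it assumes the `modelscope` Python package and its `download` subcommand are available, and the target directory is an example.

```bash
# Hedged sketch: download the model from ModelScope to a local path for offline serving.
pip install modelscope
modelscope download --model LLM-Research/Meta-Llama-3-8B-Instruct --local_dir /path/to/model
```

The downloaded path can then be used in place of the Hugging Face model ID when the serving container is started.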
@@ -337,7 +339,7 @@ For details on how to verify the correctness of the response, refer to [how-to-v
 # either vLLM or TGI service
 curl http://${host_ip}:9009/v1/chat/completions \
   -X POST \
-  -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
+  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
   -H 'Content-Type: application/json'
 ```
 
@@ -450,7 +452,7 @@ Users could follow previous section to testing vLLM microservice or ChatQnA Mega
 ```bash
 curl http://${host_ip}:9009/start_profile \
   -H "Content-Type: application/json" \
-  -d '{"model": "Intel/neural-chat-7b-v3-3"}'
+  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct"}'
 ```
 
 Users would see below docker logs from vllm-service if profiling is started correctly.
@@ -473,7 +475,7 @@ By following command, users could stop vLLM profiling and generate a \*.pt.trace
 # vLLM Service
 curl http://${host_ip}:9009/stop_profile \
   -H "Content-Type: application/json" \
-  -d '{"model": "Intel/neural-chat-7b-v3-3"}'
+  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct"}'
 ```
 
 Users would see below docker logs from vllm-service if profiling is stopped correctly.

ChatQnA/docker_compose/intel/cpu/xeon/README_pinecone.md

Lines changed: 10 additions & 8 deletions
@@ -10,6 +10,8 @@ Quick Start:
 2. Run Docker Compose.
 3. Consume the ChatQnA Service.
 
+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, please make sure you have either requested and been granted access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).
+
 ## Quick Start: 1.Setup Environment Variable
 
 To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -183,11 +185,11 @@ Then run the command `docker images`, you will have the following 5 Docker Image
 
 By default, the embedding, reranking and LLM models are set to a default value as listed below:
 
-| Service   | Model                     |
-| --------- | ------------------------- |
-| Embedding | BAAI/bge-base-en-v1.5     |
-| Reranking | BAAI/bge-reranker-base    |
-| LLM       | Intel/neural-chat-7b-v3-3 |
+| Service   | Model                               |
+| --------- | ----------------------------------- |
+| Embedding | BAAI/bge-base-en-v1.5               |
+| Reranking | BAAI/bge-reranker-base              |
+| LLM       | meta-llama/Meta-Llama-3-8B-Instruct |
 
 Change the `xxx_MODEL_ID` below for your needs.
 
@@ -198,13 +200,13 @@ For users in China who are unable to download models directly from Huggingface,
 ```bash
 export HF_TOKEN=${your_hf_token}
 export HF_ENDPOINT="https://hf-mirror.com"
-model_name="Intel/neural-chat-7b-v3-3"
+model_name="meta-llama/Meta-Llama-3-8B-Instruct"
 docker run -p 8008:80 -v ./data:/data --name vllm-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 128g opea/vllm:latest --model $model_name --host 0.0.0.0 --port 80
 ```
 
 2. Offline
 
-   - Search your model name in ModelScope. For example, check [this page](https://www.modelscope.cn/models/ai-modelscope/neural-chat-7b-v3-1/files) for model `neural-chat-7b-v3-1`.
+   - Search your model name in ModelScope. For example, check [this page](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files) for model `Meta-Llama-3-8B-Instruct`.
 
    - Click on `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.
 
@@ -324,7 +326,7 @@ For details on how to verify the correctness of the response, refer to [how-to-v
 ```bash
 curl http://${host_ip}:9009/v1/chat/completions \
   -X POST \
-  -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
+  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
   -H 'Content-Type: application/json'
 ```

ChatQnA/docker_compose/intel/cpu/xeon/README_qdrant.md

Lines changed: 9 additions & 7 deletions
@@ -4,6 +4,8 @@ This document outlines the deployment process for a ChatQnA application utilizin
 
 The default pipeline deploys with vLLM as the LLM serving component and leverages rerank component.
 
+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, please make sure you have either requested and been granted access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).
+
 ## 🚀 Apply Xeon Server on AWS
 
 To apply a Xeon server on AWS, start by creating an AWS account if you don't have one already. Then, head to the [EC2 Console](https://console.aws.amazon.com/ec2/v2/home) to begin the process. Within the EC2 service, select the Amazon EC2 M7i or M7i-flex instance type to leverage the power of 4th Generation Intel Xeon Scalable processors. These instances are optimized for high-performance computing and demanding workloads.
@@ -141,11 +143,11 @@ Then run the command `docker images`, you will have the following 5 Docker Image
 
 By default, the embedding, reranking and LLM models are set to a default value as listed below:
 
-| Service   | Model                     |
-| --------- | ------------------------- |
-| Embedding | BAAI/bge-base-en-v1.5     |
-| Reranking | BAAI/bge-reranker-base    |
-| LLM       | Intel/neural-chat-7b-v3-3 |
+| Service   | Model                               |
+| --------- | ----------------------------------- |
+| Embedding | BAAI/bge-base-en-v1.5               |
+| Reranking | BAAI/bge-reranker-base              |
+| LLM       | meta-llama/Meta-Llama-3-8B-Instruct |
 
 Change the `xxx_MODEL_ID` below for your needs.
 
@@ -181,7 +183,7 @@ export http_proxy=${your_http_proxy}
 export https_proxy=${your_http_proxy}
 export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
 export RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
 export INDEX_NAME="rag-qdrant"
 ```
 

@@ -256,7 +258,7 @@ For details on how to verify the correctness of the response, refer to [how-to-v
 ```bash
 curl http://${host_ip}:6042/v1/chat/completions \
   -X POST \
-  -d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
+  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
   -H 'Content-Type: application/json'
 ```

ChatQnA/docker_compose/intel/cpu/xeon/set_env.sh

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ popd > /dev/null
 
 export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
 export RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
 export INDEX_NAME="rag-redis"
 # Set it as a non-null string, such as true, if you want to enable logging facility,
 # otherwise, keep it as "" to disable it.
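For users who would rather keep a different (for example, non-gated) model, the simplest route is to re-export the variable after sourcing this script rather than editing it. A minimal sketch follows; the alternative model ID is purely illustrative and the compose invocation is abbreviated.

```bash
# Hedged sketch: override the new default before bringing up the Xeon compose stack.
source ./set_env.sh
export LLM_MODEL_ID="mistralai/Mistral-7B-Instruct-v0.3"   # illustrative alternative model
docker compose up -d
```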

ChatQnA/docker_compose/intel/hpu/gaudi/README.md

Lines changed: 9 additions & 7 deletions
@@ -10,6 +10,8 @@ Quick Start:
 2. Run Docker Compose.
 3. Consume the ChatQnA Service.
 
+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, please make sure you have either requested and been granted access to it on [Huggingface](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).
+
 ## Quick Start: 1.Setup Environment Variable
 
 To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -178,11 +180,11 @@ If Guardrails docker image is built, you will find one more image:
 
 By default, the embedding, reranking and LLM models are set to a default value as listed below:
 
-| Service   | Model                     |
-| --------- | ------------------------- |
-| Embedding | BAAI/bge-base-en-v1.5     |
-| Reranking | BAAI/bge-reranker-base    |
-| LLM       | Intel/neural-chat-7b-v3-3 |
+| Service   | Model                               |
+| --------- | ----------------------------------- |
+| Embedding | BAAI/bge-base-en-v1.5               |
+| Reranking | BAAI/bge-reranker-base              |
+| LLM       | meta-llama/Meta-Llama-3-8B-Instruct |
 
 Change the `xxx_MODEL_ID` below for your needs.
 
@@ -193,7 +195,7 @@ For users in China who are unable to download models directly from Huggingface,
 ```bash
 export HF_TOKEN=${your_hf_token}
 export HF_ENDPOINT="https://hf-mirror.com"
-model_name="Intel/neural-chat-7b-v3-3"
+model_name="meta-llama/Meta-Llama-3-8B-Instruct"
 # Start vLLM LLM Service
 docker run -p 8007:80 -v ./data:/data --name vllm-gaudi-server -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e VLLM_TORCH_PROFILER_DIR="/mnt" --cap-add=sys_nice --ipc=host opea/vllm-gaudi:latest --model $model_name --tensor-parallel-size 1 --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048
 # Start TGI LLM Service
@@ -202,7 +204,7 @@ For users in China who are unable to download models directly from Huggingface,
 
 2. Offline
 
-   - Search your model name in ModelScope. For example, check [this page](https://www.modelscope.cn/models/ai-modelscope/neural-chat-7b-v3-1/files) for model `neural-chat-7b-v3-1`.
+   - Search your model name in ModelScope. For example, check [this page](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files) for model `Meta-Llama-3-8B-Instruct`.
 
    - Click on `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.
 