
Commit c8c6fa2

bjzhjing, letonghan, and pre-commit-ci[bot] authored and committed
Provide unified scalable deployment and benchmarking support for exam… (#1315)
Signed-off-by: Cathy Zhang <cathy.zhang@intel.com>
Signed-off-by: letonghan <letong.han@intel.com>
Co-authored-by: letonghan <letong.han@intel.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
(cherry picked from commit ed16308)
1 parent 905a510 commit c8c6fa2

6 files changed: +1470 −0 lines

ChatQnA/benchmark_chatqna.yaml (new file, +83 lines)
```yaml
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

deploy:
  device: gaudi
  version: 1.1.0
  modelUseHostPath: /mnt/models
  HUGGINGFACEHUB_API_TOKEN: ""
  node: [1, 2, 4, 8]
  namespace: ""

  services:
    backend:
      instance_num: [2, 2, 4, 8]
      cores_per_instance: ""
      memory_capacity: ""

    teirerank:
      enabled: True
      model_id: ""
      replicaCount: [1, 1, 1, 1]
      cards_per_instance: 1

    tei:
      model_id: ""
      replicaCount: [1, 2, 4, 8]
      cores_per_instance: ""
      memory_capacity: ""

    llm:
      engine: tgi
      model_id: ""
      replicaCount: [7, 15, 31, 63]
      max_batch_size: [1, 2, 4, 8]
      max_input_length: ""
      max_total_tokens: ""
      max_batch_total_tokens: ""
      max_batch_prefill_tokens: ""
      cards_per_instance: 1

    data-prep:
      replicaCount: [1, 1, 1, 1]
      cores_per_instance: ""
      memory_capacity: ""

    retriever-usvc:
      replicaCount: [2, 2, 4, 8]
      cores_per_instance: ""
      memory_capacity: ""

    redis-vector-db:
      replicaCount: [1, 1, 1, 1]
      cores_per_instance: ""
      memory_capacity: ""

    chatqna-ui:
      replicaCount: [1, 1, 1, 1]

    nginx:
      replicaCount: [1, 1, 1, 1]

benchmark:
  # HTTP request behavior related fields
  concurrency: [1, 2, 4]
  total_query_num: [2048, 4096]
  duration: [5, 10] # unit: minutes
  query_num_per_concurrency: [4, 8, 16]
  poisson: True
  poisson_arrival_rate: 1.0
  warmup_iterations: 10
  seed: 1024

  # workload; all of the listed test cases will run in the benchmark
  test_cases:
    - chatqnafixed
    - chatqna_qlist_pubmed:
        dataset: pub_med10 # pub_med10, pub_med100, pub_med1000
        user_queries: [1, 2, 4]
        query_token_size: 128 # if specified, a fixed query token size will be sent out

  llm:
    # specify the llm output token size
    max_token_size: [128, 256]
```
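The per-service lists in this file appear to be index-aligned with `deploy.node`: entry i of each `instance_num`, `replicaCount`, or `max_batch_size` list applies when deploying on the i-th node count. A minimal Python sketch of that expansion (our reading of the config, not the repo's actual parser; assumes PyYAML is installed):

```python
# Illustrative only (not the repo's parser): expand the index-aligned lists in
# benchmark_chatqna.yaml into one deployment plan per node count.
import yaml  # PyYAML: pip install pyyaml

with open("./ChatQnA/benchmark_chatqna.yaml") as f:
    cfg = yaml.safe_load(f)

nodes = cfg["deploy"]["node"]         # e.g. [1, 2, 4, 8]
services = cfg["deploy"]["services"]

for i, node_count in enumerate(nodes):
    plan = {}
    for name, spec in services.items():
        for key in ("instance_num", "replicaCount", "max_batch_size"):
            value = spec.get(key)
            if isinstance(value, list):
                plan[f"{name}.{key}"] = value[i]  # pick the entry for this node count
    print(f"{node_count} node(s): {plan}")
```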

README-deploy-benchmark.md (new file, +69 lines)
# ChatQnA Benchmarking

## Purpose

We aim to run these benchmarks and share them with the OPEA community for three primary reasons:

- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Data Preparation](#data-preparation)
- [Overview](#overview)
- [Using deploy_and_benchmark.py](#using-deploy_and_benchmarkpy-recommended)

## Prerequisites

Before running the benchmarks, ensure you have:

1. **Kubernetes Environment**

   - Kubernetes installation: use [kubespray](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md) or another official Kubernetes installation guide
   - (Optional) [Kubernetes setup guide for Intel Gaudi products](https://github.com/opea-project/GenAIInfra/blob/main/README.md#setup-kubernetes-cluster)

2. **Configuration YAML**

   The configuration file (e.g., `./ChatQnA/benchmark_chatqna.yaml`) consists of two main sections: deployment and benchmarking. Required fields, such as the Hugging Face token, must be filled with valid values; every other field can either be customized to your needs or left empty (`""`) to use the default values from the [helm charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts). A quick pre-flight check is sketched below.
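   A minimal pre-flight sketch of that check (our illustration, not a script shipped with this commit; it assumes PyYAML and the example path above):

   ```python
   # Hypothetical pre-flight check: fail fast if the required Hugging Face
   # token is still empty in the configuration YAML.
   import sys

   import yaml  # PyYAML: pip install pyyaml

   with open("./ChatQnA/benchmark_chatqna.yaml") as f:
       cfg = yaml.safe_load(f)

   if not cfg["deploy"].get("HUGGINGFACEHUB_API_TOKEN"):
       sys.exit("Set deploy.HUGGINGFACEHUB_API_TOKEN in the YAML before deploying.")
   print("Required fields look filled; empty optional fields fall back to the helm chart defaults.")
   ```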
## Data Preparation

Before running benchmarks, you need to:

1. **Prepare Test Data**

   - Download the retrieval file:

     ```bash
     wget https://raw.githubusercontent.com/opea-project/GenAIEval/main/evals/benchmark/data/upload_file.txt
     ```

   - For the `chatqna_qlist_pubmed` test case, prepare `pubmed_${max_lines}.txt` by following this [README](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/stresscli/README_Pubmed_qlist.md)

2. **Prepare Model Files (Recommended)**

   ```bash
   pip install -U "huggingface_hub[cli]"
   sudo mkdir -p /mnt/models
   sudo chmod 777 /mnt/models
   huggingface-cli download --cache-dir /mnt/models Intel/neural-chat-7b-v3-3
   ```
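   Note that `/mnt/models` here matches the `modelUseHostPath` value in `benchmark_chatqna.yaml`, so the deployment can mount the pre-downloaded model files from the host rather than fetching them again at startup.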
## Overview

The benchmarking process consists of two main components: deployment and benchmarking. We provide `deploy_and_benchmark.py` as a unified entry point that combines both steps.

### Using deploy_and_benchmark.py (Recommended)

The script `deploy_and_benchmark.py` serves as the main entry point. Here's an example using the ChatQnA configuration (you can replace it with any other example's configuration YAML file):

1. For a specific number of nodes:

   ```bash
   python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1
   ```

2. For all node configurations:

   ```bash
   python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml
   ```

   This will iterate through the node list in your configuration YAML file, performing deployment and benchmarking for each node count.
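For intuition, here is a hypothetical sketch of that node-selection behavior; the real `deploy_and_benchmark.py` may be organized differently, and the deploy/benchmark phases are shown as placeholders:

```python
# Hypothetical sketch of the node-selection behavior described above (not the
# actual script): --target-node limits the run to one node count, otherwise
# every entry in deploy.node is swept.
import argparse

import yaml  # PyYAML: pip install pyyaml

parser = argparse.ArgumentParser()
parser.add_argument("config", help="e.g. ./ChatQnA/benchmark_chatqna.yaml")
parser.add_argument("--target-node", type=int, default=None)
args = parser.parse_args()

with open(args.config) as f:
    cfg = yaml.safe_load(f)

selected = [args.target_node] if args.target_node is not None else cfg["deploy"]["node"]
for node_count in selected:
    # Placeholders for the two phases the README describes: deploy the example
    # on `node_count` nodes, then run the configured benchmark cases against it.
    print(f"Would deploy and benchmark on {node_count} node(s)")
```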
