Commit 218f21d

Potabk and xuedinge233 authored

[Benchmarks] Add qwen2.5-7b test (#763)

### What this PR does / why we need it?

- Add qwen2.5-7b test
- Optimize the documentation to be more developer-friendly

Signed-off-by: xuedinge233 <damow890@gmail.com>
Co-authored-by: xuedinge233 <damow890@gmail.com>

1 parent 19c8e13 commit 218f21d

File tree

4 files changed: +54 −6 lines changed


benchmarks/README.md

Lines changed: 9 additions & 6 deletions
@@ -1,41 +1,44 @@
 # Introduction
-This document outlines the benchmarking process for vllm-ascend, designed to evaluate its performance under various workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance.To maintain consistency with the vllm community, we have reused the vllm community [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) script.
+This document outlines the benchmarking methodology for vllm-ascend, aimed at evaluating its performance under a variety of workloads. The primary goal is to help developers assess whether their pull requests improve or degrade vllm-ascend's performance. To maintain alignment with vLLM, we use the [benchmark](https://github.com/vllm-project/vllm/tree/main/benchmarks) scripts provided by the vllm project.
+
 # Overview
 **Benchmarking Coverage**: We measure latency, throughput, and fixed-QPS serving on the Atlas 800I A2 (see [quick_start](../docs/source/quick_start.md) for the list of supported devices), with different models (coming soon).
 - Latency tests
   - Input length: 32 tokens.
   - Output length: 128 tokens.
   - Batch size: fixed (8).
-  - Models: llama-3.1 8B.
+  - Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
   - Evaluation metrics: end-to-end latency (mean, median, p99).

 - Throughput tests
   - Input length: 200 prompts randomly sampled from the ShareGPT dataset (with fixed random seed).
   - Output length: the corresponding output length of these 200 prompts.
   - Batch size: dynamically determined by vllm to achieve maximum throughput.
-  - Models: llama-3.1 8B .
+  - Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
   - Evaluation metrics: throughput.
 - Serving tests
   - Input length: 200 prompts randomly sampled from the ShareGPT dataset (with fixed random seed).
   - Output length: the corresponding output length of these 200 prompts.
   - Batch size: dynamically determined by vllm and the arrival pattern of the requests.
   - **Average QPS (queries per second)**: 1, 4, 16 and inf. QPS = inf means all requests arrive at once; for other QPS values, the arrival time of each query is determined by a random Poisson process (with fixed random seed).
-  - Models: llama-3.1 8B.
+  - Models: Meta-Llama-3.1-8B-Instruct, Qwen2.5-7B-Instruct.
   - Evaluation metrics: throughput, TTFT (time to first token; mean, median and p99), ITL (inter-token latency; mean, median and p99).

-**Benchmarking Duration**: about 800senond for single model.
+**Benchmarking Duration**: about 800 seconds per model.


 # Quick Use
 ## Prerequisites
 Before running the benchmarks, ensure the following:
+
 - vllm and vllm-ascend are installed and properly set up in an NPU environment, as these scripts are specifically designed for NPU devices.
+
 - Install the necessary dependencies for the benchmarks:
 ```
 pip install -r benchmarks/requirements-bench.txt
 ```

-- Models and datasets are cached locally to accelerate execution. Modify the paths in the JSON files located in benchmarks/tests accordingly. Feel free to add your own models and parameters in the JSON to run your customized benchmarks.
+- For performance benchmarks, it is recommended to set the [load-format](https://github.com/vllm-project/vllm-ascend/blob/5897dc5bbe321ca90c26225d0d70bff24061d04b/benchmarks/tests/latency-tests.json#L7) to `dummy`. This constructs random weights for the given model instead of downloading them from the internet, which greatly reduces benchmark time. Feel free to add your own models and parameters to the JSON files to run customized benchmarks.

 ## Run benchmarks
 The provided scripts automatically execute performance tests for serving, throughput, and latency. To start the benchmarking process, run the following command from the vllm-ascend root directory:
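The serving tests above state that, for finite QPS values, request arrival times are drawn from a Poisson process with a fixed random seed. A minimal sketch of how such a schedule can be generated (this is an illustrative helper, not the actual vllm benchmark code): inter-arrival gaps of a Poisson process are exponentially distributed with mean 1/QPS.

```python
import random


def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> list:
    """Generate request arrival times (in seconds) for a target average QPS.

    qps == float("inf") means all requests arrive at t=0, matching the
    "inf" entry in qps_list. A fixed seed makes the schedule reproducible
    across benchmark runs.
    """
    if qps == float("inf"):
        return [0.0] * num_requests
    rng = random.Random(seed)  # fixed seed => reproducible schedule
    t = 0.0
    arrivals = []
    for _ in range(num_requests):
        t += rng.expovariate(qps)  # exponential gap with mean 1/qps
        arrivals.append(t)
    return arrivals
```

With 200 prompts at QPS 4, the last arrival lands around the 50-second mark on average, which is one component of the quoted per-model benchmark duration.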

benchmarks/tests/latency-tests.json

Lines changed: 10 additions & 0 deletions
@@ -8,5 +8,15 @@
             "num_iters_warmup": 5,
             "num_iters": 15
         }
+    },
+    {
+        "test_name": "latency_qwen2_5_7B_tp1",
+        "parameters": {
+            "model": "Qwen/Qwen2.5-7B-Instruct",
+            "tensor_parallel_size": 1,
+            "load_format": "dummy",
+            "num_iters_warmup": 5,
+            "num_iters": 15
+        }
     }
 ]
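Each entry's `"parameters"` object maps onto command-line flags of the corresponding vllm benchmark script: JSON keys use underscores while the CLI uses dashes, and an empty-string value denotes a boolean switch. A hypothetical helper sketching that mapping (the function name and exact convention are assumptions, not the repository's actual code):

```python
def params_to_cli_args(parameters: dict) -> list:
    """Convert a benchmark JSON "parameters" mapping into CLI flags.

    Underscored keys become dashed flags; an empty-string value
    (e.g. "disable_log_stats": "") is treated as a boolean flag
    that takes no argument.
    """
    args = []
    for key, value in parameters.items():
        flag = "--" + key.replace("_", "-")
        if value == "":
            args.append(flag)  # boolean switch, no argument
        else:
            args.extend([flag, str(value)])
    return args
```

For example, the latency entry above would expand to `--model Qwen/Qwen2.5-7B-Instruct --tensor-parallel-size 1 --load-format dummy --num-iters-warmup 5 --num-iters 15`.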

benchmarks/tests/serving-tests.json

Lines changed: 24 additions & 0 deletions
@@ -22,5 +22,29 @@
             "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
             "num_prompts": 200
         }
+    },
+    {
+        "test_name": "serving_qwen2_5_7B_tp1",
+        "qps_list": [
+            1,
+            4,
+            16,
+            "inf"
+        ],
+        "server_parameters": {
+            "model": "Qwen/Qwen2.5-7B-Instruct",
+            "tensor_parallel_size": 1,
+            "swap_space": 16,
+            "disable_log_stats": "",
+            "disable_log_requests": "",
+            "load_format": "dummy"
+        },
+        "client_parameters": {
+            "model": "Qwen/Qwen2.5-7B-Instruct",
+            "backend": "vllm",
+            "dataset_name": "sharegpt",
+            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
+            "num_prompts": 200
+        }
     }
 ]
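The serving tests report TTFT and ITL as mean, median, and p99. A minimal sketch of how those three statistics can be computed from raw latency samples (illustrative only; the actual benchmark scripts may use a different percentile method):

```python
import statistics


def summarize_latency(samples_ms: list) -> dict:
    """Summarize latency samples (ms) as mean / median / p99.

    p99 is linearly interpolated between the two nearest order
    statistics, so small sample sets still yield a sensible value.
    """
    ordered = sorted(samples_ms)
    n = len(ordered)
    rank = 0.99 * (n - 1)          # fractional index of the 99th percentile
    lo = int(rank)
    hi = min(lo + 1, n - 1)
    p99 = ordered[lo] + (rank - lo) * (ordered[hi] - ordered[lo])
    return {
        "mean": statistics.fmean(ordered),
        "median": statistics.median(ordered),
        "p99": p99,
    }
```

Reporting the median and p99 alongside the mean matters here because serving latency distributions are typically long-tailed: a few slow requests can move the mean without affecting the median.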

benchmarks/tests/throughput-tests.json

Lines changed: 11 additions & 0 deletions
@@ -9,6 +9,17 @@
             "num_prompts": 200,
             "backend": "vllm"
         }
+    },
+    {
+        "test_name": "throughput_qwen2_5_7B_tp1",
+        "parameters": {
+            "model": "Qwen/Qwen2.5-7B-Instruct",
+            "tensor_parallel_size": 1,
+            "load_format": "dummy",
+            "dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
+            "num_prompts": 200,
+            "backend": "vllm"
+        }
     }
 ]
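All three suites in this commit share the same outer shape: a JSON array of test entries, each with a unique `"test_name"` and at least one parameters object. A small sketch of a sanity check one could run after editing these files (a hypothetical validator, not part of the repository):

```python
import json


def validate_test_configs(text: str) -> list:
    """Parse a benchmark test-suite JSON string and return its test names.

    Checks the shared invariants: the document is a JSON array, every
    entry has a unique "test_name", and every entry carries at least
    one key ending in "parameters" (parameters, server_parameters,
    or client_parameters).
    """
    entries = json.loads(text)
    assert isinstance(entries, list), "suite must be a JSON array"
    names = []
    for entry in entries:
        name = entry["test_name"]
        assert name not in names, f"duplicate test_name: {name}"
        assert any(k.endswith("parameters") for k in entry), name
        names.append(name)
    return names
```

Running such a check before a benchmark run catches trailing-comma and duplicate-name mistakes early, instead of mid-way through an 800-second sweep.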
