
Commit 257b633

yao531441 and chyundunovDatamonsters authored and committed

Refine readme of CodeGen (opea-project#1797)

Signed-off-by: Yao, Qing <qing.yao@intel.com>
Signed-off-by: Chingis Yundunov <c.yundunov@datamonsters.com>
1 parent 082f5d5 commit 257b633

File tree

8 files changed: +1081 / -1262 lines changed

CodeGen/README.md

Lines changed: 57 additions & 163 deletions
Large diffs are not rendered by default.

CodeGen/benchmark/accuracy/README.md

Lines changed: 50 additions & 44 deletions
@@ -1,73 +1,77 @@
-# CodeGen Accuracy
+# CodeGen Accuracy Benchmark
+
+## Table of Contents
+
+- [Purpose](#purpose)
+- [Evaluation Framework](#evaluation-framework)
+- [Prerequisites](#prerequisites)
+- [Environment Setup](#environment-setup)
+- [Running the Accuracy Benchmark](#running-the-accuracy-benchmark)
+- [Understanding the Results](#understanding-the-results)
+
+## Purpose
+
+This guide explains how to evaluate the accuracy of a deployed CodeGen service using standardized code generation benchmarks. It helps quantify the model's ability to generate correct and functional code based on prompts.

 ## Evaluation Framework

-We evaluate accuracy by [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness). It is a framework for the evaluation of code generation models.
+We utilize the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness), a framework specifically designed for evaluating code generation models. It supports various standard benchmarks such as [HumanEval](https://huggingface.co/datasets/openai_humaneval), [MBPP](https://huggingface.co/datasets/mbpp), and others.

-## Evaluation FAQs
+## Prerequisites

-### Launch CodeGen microservice
+- A running CodeGen service accessible via an HTTP endpoint. Refer to the main [CodeGen README](../../README.md) for deployment options.
+- Python 3.8+ environment.
+- Git installed.

-Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples/tree/main/CodeGen/README.md), follow the guide to deploy CodeGen megeservice.
+## Environment Setup

-Use `curl` command to test codegen service and ensure that it has started properly
+1. **Clone the Evaluation Repository:**

-```bash
-export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
-curl $CODEGEN_ENDPOINT \
-  -H "Content-Type: application/json" \
-  -d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'
+```shell
+git clone https://github.com/opea-project/GenAIEval
+cd GenAIEval
+```

-```
+2. **Install Dependencies:**
+```shell
+pip install -r requirements.txt
+pip install -e .
+```

-### Generation and Evaluation
+## Running the Accuracy Benchmark

-For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide the command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode are available.
+1. **Set Environment Variables:**
+Replace `{your_ip}` with the IP address of your deployed CodeGen service and `{your_model_identifier}` with the identifier of the model being tested (e.g., `Qwen/CodeQwen1.5-7B-Chat`).

-#### Environment
+```shell
+export CODEGEN_ENDPOINT="http://{your_ip}:7778/v1/codegen"
+export CODEGEN_MODEL="{your_model_identifier}"
+```
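Before running the harness, it can be worth confirming that the endpoint answers. A minimal sanity check along the lines of the curl example removed above (the prompt here is purely illustrative):

```bash
# Quick check that the CodeGen gateway responds on the configured endpoint
# (request shape mirrors the curl example from the previous revision of this README)
curl "$CODEGEN_ENDPOINT" \
  -H "Content-Type: application/json" \
  -d '{"messages": "Write a Python function that returns the n-th Fibonacci number."}'
```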

-```shell
-git clone https://github.com/opea-project/GenAIEval
-cd GenAIEval
-pip install -r requirements.txt
-pip install -e .
+_Note: Port `7778` is the default for the CodeGen gateway; adjust if you customized it._

-```
+2. **Execute the Benchmark Script:**
+The script will run the evaluation tasks (e.g., HumanEval by default) against the specified endpoint.

-#### Evaluation
+```shell
+bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
+```

-```
-export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
-export CODEGEN_MODEL=your_model
-bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
-```
+_Note: Currently, the framework runs the full task set by default. Using 'limit' parameters might affect result comparability._

-**_Note:_** Currently, our framework is designed to execute tasks in full. To ensure the accuracy of results, we advise against using the 'limit' or 'limit_start' parameters to restrict the number of test samples.
+## Understanding the Results

-### accuracy Result
+The results will be printed to the console and saved in `evaluation_results.json`. A key metric is `pass@k`, which represents the percentage of problems solved correctly within `k` generated attempts (e.g., `pass@1` means solved on the first try).
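Assuming `jq` is available and the results file follows the layout in the snippet below, the headline score can be pulled out directly:

```bash
# Extract the pass@1 score for the humaneval task from the saved results file
# (file name and JSON layout as described above; change the task key if you ran a different benchmark)
jq '.humaneval."pass@1"' evaluation_results.json
```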

-Here is the tested result for your reference
+Example output snippet:

 ```json
 {
   "humaneval": {
     "pass@1": 0.7195121951219512
   },
   "config": {
-    "prefix": "",
-    "do_sample": true,
-    "temperature": 0.2,
-    "top_k": 0,
-    "top_p": 0.95,
-    "n_samples": 1,
-    "eos": "<|endoftext|>",
-    "seed": 0,
     "model": "Qwen/CodeQwen1.5-7B-Chat",
-    "modeltype": "causal",
-    "peft_model": null,
-    "revision": null,
-    "use_auth_token": false,
-    "trust_remote_code": false,
     "tasks": "humaneval",
     "instruction_tokens": null,
     "batch_size": 1,
@@ -93,7 +97,9 @@ Here is the tested result for your reference
     "prompt": "prompt",
     "max_memory_per_gpu": null,
     "check_references": false,
-    "codegen_url": "http://192.168.123.104:31234/v1/codegen"
+    "codegen_url": "http://192.168.123.104:7778/v1/codegen"
   }
 }
 ```
+
+This indicates a `pass@1` score of approximately 72% on the HumanEval benchmark for the specified model via the CodeGen service endpoint.
Lines changed: 53 additions & 57 deletions
@@ -1,77 +1,73 @@
-# CodeGen Benchmarking
+# CodeGen Performance Benchmark

-This folder contains a collection of scripts to enable inference benchmarking by leveraging a comprehensive benchmarking tool, [GenAIEval](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/README.md), that enables throughput analysis to assess inference performance.
+## Table of Contents

-By following this guide, you can run benchmarks on your deployment and share the results with the OPEA community.
+- [Purpose](#purpose)
+- [Benchmarking Tool](#benchmarking-tool)
+- [Metrics Measured](#metrics-measured)
+- [Prerequisites](#prerequisites)
+- [Running the Performance Benchmark](#running-the-performance-benchmark)
+- [Data Collection](#data-collection)

 ## Purpose

-We aim to run these benchmarks and share them with the OPEA community for three primary reasons:
+This guide describes how to benchmark the inference performance (throughput and latency) of a deployed CodeGen service. The results help understand the service's capacity under load and compare different deployment configurations or models. This benchmark primarily targets Kubernetes deployments but can be adapted for Docker.

-- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
-- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
-- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading llms, serving frameworks etc.
+## Benchmarking Tool

-## Metrics
+We use the [GenAIEval](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/README.md) tool for performance benchmarking, which simulates concurrent users sending requests to the service endpoint.

-The benchmark will report the below metrics, including:
+## Metrics Measured

-- Number of Concurrent Requests
-- End-to-End Latency: P50, P90, P99 (in milliseconds)
-- End-to-End First Token Latency: P50, P90, P99 (in milliseconds)
-- Average Next Token Latency (in milliseconds)
-- Average Token Latency (in milliseconds)
-- Requests Per Second (RPS)
-- Output Tokens Per Second
-- Input Tokens Per Second
+The benchmark reports several key performance indicators:

-Results will be displayed in the terminal and saved as CSV file named `1_testspec.yaml`.
+- **Concurrency:** Number of concurrent requests simulated.
+- **End-to-End Latency:** Time from request submission to final response received (P50, P90, P99 in ms).
+- **End-to-End First Token Latency:** Time from request submission to first token received (P50, P90, P99 in ms).
+- **Average Next Token Latency:** Average time between subsequent generated tokens (in ms).
+- **Average Token Latency:** Average time per generated token (in ms).
+- **Requests Per Second (RPS):** Throughput of the service.
+- **Output Tokens Per Second:** Rate of token generation.
+- **Input Tokens Per Second:** Rate of token consumption.
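As a rough illustration of what percentile metrics like P50/P90/P99 mean (nearest-rank method; the exact method used by GenAIEval may differ), here is a sketch assuming a hypothetical one-column file `latencies_ms.txt` of end-to-end latencies in milliseconds:

```bash
# Illustrative only: nearest-rank P50/P90/P99 over a column of latency samples (ms)
sort -n latencies_ms.txt | awk '
  { samples[NR] = $1 }
  END {
    printf "P50=%sms P90=%sms P99=%sms\n",
      samples[int(NR * 0.50 + 0.5)],
      samples[int(NR * 0.90 + 0.5)],
      samples[int(NR * 0.99 + 0.5)]
  }'
```

The benchmark tool computes and reports these values itself; the sketch is only meant to clarify the metric definitions.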

-## Getting Started
+## Prerequisites

-We recommend using Kubernetes to deploy the CodeGen service, as it offers benefits such as load balancing and improved scalability. However, you can also deploy the service using Docker if that better suits your needs.
+- A running CodeGen service accessible via an HTTP endpoint. Refer to the main [CodeGen README](../../README.md) for deployment options (Kubernetes recommended for load balancing/scalability).
+- **If using Kubernetes:**
+  - A working Kubernetes cluster (refer to OPEA K8s setup guides if needed).
+  - `kubectl` configured to access the cluster from the node where the benchmark will run (typically the master node).
+  - Ensure sufficient `ulimit` for network connections on worker nodes hosting the service pods (e.g., `LimitNOFILE=65536` or higher in containerd/docker config).
+- **General:**
+  - Python 3.8+ on the node running the benchmark script.
+  - Network access from the benchmark node to the CodeGen service endpoint.
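For the `ulimit` item above, the previous revision of this README raised the limit for containerd-managed containers roughly as follows (values are examples; adapt them, and the steps, to your container runtime):

```bash
# Raise the open-file limit for containers managed by containerd
sudo systemctl edit containerd
# In the editor, add the following two lines:
#   [Service]
#   LimitNOFILE=65536:1048576
sudo systemctl daemon-reload
sudo systemctl restart containerd
```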

-### Prerequisites
+## Running the Performance Benchmark

-- Install Kubernetes by following [this guide](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md).
+1. **Deploy CodeGen Service:** Ensure your CodeGen service is deployed and accessible. Note the service endpoint URL (e.g., obtained via `kubectl get svc` or your ingress configuration if using Kubernetes, or `http://{host_ip}:{port}` for Docker).

-- Every node has direct internet access
-- Set up kubectl on the master node with access to the Kubernetes cluster.
-- Install Python 3.8+ on the master node for running GenAIEval.
-- Ensure all nodes have a local /mnt/models folder, which will be mounted by the pods.
-- Ensure that the container's ulimit can meet the the number of requests.
+2. **Configure Benchmark Parameters (Optional):**
+Set environment variables to customize the test queries and output directory. The `USER_QUERIES` variable defines the number of concurrent requests for each test run.

-```bash
-# The way to modify the containered ulimit:
-sudo systemctl edit containerd
-# Add two lines:
-[Service]
-LimitNOFILE=65536:1048576
+```bash
+# Example: Four runs with 128 concurrent requests each
+export USER_QUERIES="[128, 128, 128, 128]"
+# Example: Output directory
+export TEST_OUTPUT_DIR="/tmp/benchmark_output"
+# Set the target endpoint URL
+export CODEGEN_ENDPOINT_URL="http://{your_service_ip_or_hostname}:{port}/v1/codegen"
+```

-sudo systemctl daemon-reload; sudo systemctl restart containerd
-```
+_Replace `{your_service_ip_or_hostname}:{port}` with the actual accessible URL of your CodeGen gateway service._

-### Test Steps
+3. **Execute the Benchmark Script:**
+Run the script, optionally specifying the number of Kubernetes nodes involved if relevant for reporting context (the script itself runs from one node).
+```bash
+# Clone GenAIExamples if you haven't already
+# cd GenAIExamples/CodeGen/benchmark/performance
+bash benchmark.sh # Add '-n <node_count>' if desired for logging purposes
+```
+_Ensure the `benchmark.sh` script is adapted to use `CODEGEN_ENDPOINT_URL` and potentially `USER_QUERIES`, `TEST_OUTPUT_DIR`._

-Please deploy CodeGen service before benchmarking.
+## Data Collection

-#### Run Benchmark Test
-
-Before the benchmark, we can configure the number of test queries and test output directory by:
-
-```bash
-export USER_QUERIES="[128, 128, 128, 128]"
-export TEST_OUTPUT_DIR="/tmp/benchmark_output"
-```
-
-And then run the benchmark by:
-
-```bash
-bash benchmark.sh -n <node_count>
-```
-
-The argument `-n` refers to the number of test nodes.
-
-#### Data collection
-
-All the test results will come to this folder `/tmp/benchmark_output` configured by the environment variable `TEST_OUTPUT_DIR` in previous steps.
+Benchmark results will be displayed in the terminal upon completion. Detailed results, typically including raw data and summary statistics, will be saved in the directory specified by `TEST_OUTPUT_DIR` (defaulting to `/tmp/benchmark_output`). CSV files (e.g., `1_testspec.yaml.csv`) containing metrics for each run are usually generated here.
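Once a run finishes, the output directory can be inspected directly; a minimal example assuming the default `TEST_OUTPUT_DIR` and the CSV naming mentioned above (exact file names may vary by run):

```bash
# List the collected benchmark artifacts
ls -lh /tmp/benchmark_output
# Preview one of the per-run CSV summaries (file name shown is illustrative)
head /tmp/benchmark_output/1_testspec.yaml.csv
```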
