
Commit 257b633

yao531441 and chyundunovDatamonsters authored and committed

Refine readme of CodeGen (opea-project#1797)

Signed-off-by: Yao, Qing <qing.yao@intel.com>
Signed-off-by: Chingis Yundunov <c.yundunov@datamonsters.com>
1 parent 082f5d5 commit 257b633

File tree

8 files changed: +1081 / -1262 lines changed

CodeGen/README.md

Lines changed: 57 additions & 163 deletions
Large diffs are not rendered by default.

CodeGen/benchmark/accuracy/README.md

Lines changed: 50 additions & 44 deletions
@@ -1,73 +1,77 @@
-# CodeGen Accuracy
+# CodeGen Accuracy Benchmark
+
+## Table of Contents
+
+- [Purpose](#purpose)
+- [Evaluation Framework](#evaluation-framework)
+- [Prerequisites](#prerequisites)
+- [Environment Setup](#environment-setup)
+- [Running the Accuracy Benchmark](#running-the-accuracy-benchmark)
+- [Understanding the Results](#understanding-the-results)
+
+## Purpose
+
+This guide explains how to evaluate the accuracy of a deployed CodeGen service using standardized code generation benchmarks. It helps quantify the model's ability to generate correct and functional code based on prompts.

 ## Evaluation Framework

-We evaluate accuracy by [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness). It is a framework for the evaluation of code generation models.
+We utilize the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness), a framework specifically designed for evaluating code generation models. It supports various standard benchmarks such as [HumanEval](https://huggingface.co/datasets/openai_humaneval), [MBPP](https://huggingface.co/datasets/mbpp), and others.

-## Evaluation FAQs
+## Prerequisites

-### Launch CodeGen microservice
+- A running CodeGen service accessible via an HTTP endpoint. Refer to the main [CodeGen README](../../README.md) for deployment options.
+- Python 3.8+ environment.
+- Git installed.

-Please refer to [CodeGen Examples](https://github.com/opea-project/GenAIExamples/tree/main/CodeGen/README.md), follow the guide to deploy CodeGen megeservice.
+## Environment Setup

-Use `curl` command to test codegen service and ensure that it has started properly
+1. **Clone the Evaluation Repository:**

-```bash
-export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
-curl $CODEGEN_ENDPOINT \
-  -H "Content-Type: application/json" \
-  -d '{"messages": "Implement a high-level API for a TODO list application. The API takes as input an operation request and updates the TODO list in place. If the request is invalid, raise an exception."}'
+```shell
+git clone https://github.com/opea-project/GenAIEval
+cd GenAIEval
+```

-```
+2. **Install Dependencies:**
+```shell
+pip install -r requirements.txt
+pip install -e .
+```

-### Generation and Evaluation
+## Running the Accuracy Benchmark

-For evaluating the models on coding tasks or specifically coding LLMs, we follow the [bigcode-evaluation-harness](https://github.com/bigcode-project/bigcode-evaluation-harness) and provide the command line usage and function call usage. [HumanEval](https://huggingface.co/datasets/openai_humaneval), [HumanEval+](https://huggingface.co/datasets/evalplus/humanevalplus), [InstructHumanEval](https://huggingface.co/datasets/codeparrot/instructhumaneval), [APPS](https://huggingface.co/datasets/codeparrot/apps), [MBPP](https://huggingface.co/datasets/mbpp), [MBPP+](https://huggingface.co/datasets/evalplus/mbppplus), and [DS-1000](https://github.com/HKUNLP/DS-1000/) for both completion (left-to-right) and insertion (FIM) mode are available.
+1. **Set Environment Variables:**
+Replace `{your_ip}` with the IP address of your deployed CodeGen service and `{your_model_identifier}` with the identifier of the model being tested (e.g., `Qwen/CodeQwen1.5-7B-Chat`).

-#### Environment
+```shell
+export CODEGEN_ENDPOINT="http://{your_ip}:7778/v1/codegen"
+export CODEGEN_MODEL="{your_model_identifier}"
+```
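Before running the harness, it can be worth confirming that the endpoint answers. A minimal sanity check along the lines of the curl example removed above (the prompt here is purely illustrative):

```bash
# Quick check that the CodeGen gateway responds on the configured endpoint
# (request shape mirrors the curl example from the previous revision of this README)
curl "$CODEGEN_ENDPOINT" \
  -H "Content-Type: application/json" \
  -d '{"messages": "Write a Python function that returns the n-th Fibonacci number."}'
```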

-```shell
-git clone https://github.com/opea-project/GenAIEval
-cd GenAIEval
-pip install -r requirements.txt
-pip install -e .
+_Note: Port `7778` is the default for the CodeGen gateway; adjust if you customized it._

-```
+2. **Execute the Benchmark Script:**
+The script will run the evaluation tasks (e.g., HumanEval by default) against the specified endpoint.

-#### Evaluation
+```shell
+bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
+```

-```
-export CODEGEN_ENDPOINT="http://${your_ip}:7778/v1/codegen"
-export CODEGEN_MODEL=your_model
-bash run_acc.sh $CODEGEN_MODEL $CODEGEN_ENDPOINT
-```
+_Note: Currently, the framework runs the full task set by default. Using 'limit' parameters might affect result comparability._

-**_Note:_** Currently, our framework is designed to execute tasks in full. To ensure the accuracy of results, we advise against using the 'limit' or 'limit_start' parameters to restrict the number of test samples.
+## Understanding the Results

-### accuracy Result
+The results will be printed to the console and saved in `evaluation_results.json`. A key metric is `pass@k`, which represents the percentage of problems solved correctly within `k` generated attempts (e.g., `pass@1` means solved on the first try).
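Assuming `jq` is available and the results file follows the layout in the snippet below, the headline score can be pulled out directly:

```bash
# Extract the pass@1 score for the humaneval task from the saved results file
# (file name and JSON layout as described above; change the task key if you ran a different benchmark)
jq '.humaneval."pass@1"' evaluation_results.json
```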

-Here is the tested result for your reference
+Example output snippet:

 ```json
 {
   "humaneval": {
     "pass@1": 0.7195121951219512
   },
   "config": {
-    "prefix": "",
-    "do_sample": true,
-    "temperature": 0.2,
-    "top_k": 0,
-    "top_p": 0.95,
-    "n_samples": 1,
-    "eos": "<|endoftext|>",
-    "seed": 0,
     "model": "Qwen/CodeQwen1.5-7B-Chat",
-    "modeltype": "causal",
-    "peft_model": null,
-    "revision": null,
-    "use_auth_token": false,
-    "trust_remote_code": false,
     "tasks": "humaneval",
     "instruction_tokens": null,
     "batch_size": 1,
@@ -93,7 +97,9 @@ Here is the tested result for your reference
     "prompt": "prompt",
     "max_memory_per_gpu": null,
     "check_references": false,
-    "codegen_url": "http://192.168.123.104:31234/v1/codegen"
+    "codegen_url": "http://192.168.123.104:7778/v1/codegen"
   }
 }
 ```
+
+This indicates a `pass@1` score of approximately 72% on the HumanEval benchmark for the specified model via the CodeGen service endpoint.
Lines changed: 53 additions & 57 deletions
@@ -1,77 +1,73 @@
-# CodeGen Benchmarking
+# CodeGen Performance Benchmark

-This folder contains a collection of scripts to enable inference benchmarking by leveraging a comprehensive benchmarking tool, [GenAIEval](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/README.md), that enables throughput analysis to assess inference performance.
+## Table of Contents

-By following this guide, you can run benchmarks on your deployment and share the results with the OPEA community.
+- [Purpose](#purpose)
+- [Benchmarking Tool](#benchmarking-tool)
+- [Metrics Measured](#metrics-measured)
+- [Prerequisites](#prerequisites)
+- [Running the Performance Benchmark](#running-the-performance-benchmark)
+- [Data Collection](#data-collection)

 ## Purpose

-We aim to run these benchmarks and share them with the OPEA community for three primary reasons:
+This guide describes how to benchmark the inference performance (throughput and latency) of a deployed CodeGen service. The results help understand the service's capacity under load and compare different deployment configurations or models. This benchmark primarily targets Kubernetes deployments but can be adapted for Docker.

-- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
-- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
-- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading llms, serving frameworks etc.
+## Benchmarking Tool

-## Metrics
+We use the [GenAIEval](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/README.md) tool for performance benchmarking, which simulates concurrent users sending requests to the service endpoint.

-The benchmark will report the below metrics, including:
+## Metrics Measured

-- Number of Concurrent Requests
-- End-to-End Latency: P50, P90, P99 (in milliseconds)
-- End-to-End First Token Latency: P50, P90, P99 (in milliseconds)
-- Average Next Token Latency (in milliseconds)
-- Average Token Latency (in milliseconds)
-- Requests Per Second (RPS)
-- Output Tokens Per Second
-- Input Tokens Per Second
+The benchmark reports several key performance indicators:

-Results will be displayed in the terminal and saved as CSV file named `1_testspec.yaml`.
+- **Concurrency:** Number of concurrent requests simulated.
+- **End-to-End Latency:** Time from request submission to final response received (P50, P90, P99 in ms).
+- **End-to-End First Token Latency:** Time from request submission to first token received (P50, P90, P99 in ms).
+- **Average Next Token Latency:** Average time between subsequent generated tokens (in ms).
+- **Average Token Latency:** Average time per generated token (in ms).
+- **Requests Per Second (RPS):** Throughput of the service.
+- **Output Tokens Per Second:** Rate of token generation.
+- **Input Tokens Per Second:** Rate of token consumption.
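As a rough illustration of what percentile metrics like P50/P90/P99 mean (nearest-rank method; the exact method used by GenAIEval may differ), here is a sketch assuming a hypothetical one-column file `latencies_ms.txt` of end-to-end latencies in milliseconds:

```bash
# Illustrative only: nearest-rank P50/P90/P99 over a column of latency samples (ms)
sort -n latencies_ms.txt | awk '
  { samples[NR] = $1 }
  END {
    printf "P50=%sms P90=%sms P99=%sms\n",
      samples[int(NR * 0.50 + 0.5)],
      samples[int(NR * 0.90 + 0.5)],
      samples[int(NR * 0.99 + 0.5)]
  }'
```

The benchmark tool computes and reports these values itself; the sketch is only meant to clarify the metric definitions.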

-## Getting Started
+## Prerequisites

-We recommend using Kubernetes to deploy the CodeGen service, as it offers benefits such as load balancing and improved scalability. However, you can also deploy the service using Docker if that better suits your needs.
+- A running CodeGen service accessible via an HTTP endpoint. Refer to the main [CodeGen README](../../README.md) for deployment options (Kubernetes recommended for load balancing/scalability).
+- **If using Kubernetes:**
+  - A working Kubernetes cluster (refer to OPEA K8s setup guides if needed).
+  - `kubectl` configured to access the cluster from the node where the benchmark will run (typically the master node).
+  - Ensure sufficient `ulimit` for network connections on worker nodes hosting the service pods (e.g., `LimitNOFILE=65536` or higher in containerd/docker config).
+- **General:**
+  - Python 3.8+ on the node running the benchmark script.
+  - Network access from the benchmark node to the CodeGen service endpoint.
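For the `ulimit` item above, the previous revision of this README raised the limit for containerd-managed containers roughly as follows (values are examples; adapt them, and the steps, to your container runtime):

```bash
# Raise the open-file limit for containers managed by containerd
sudo systemctl edit containerd
# In the editor, add the following two lines:
#   [Service]
#   LimitNOFILE=65536:1048576
sudo systemctl daemon-reload
sudo systemctl restart containerd
```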

-### Prerequisites
+## Running the Performance Benchmark

-- Install Kubernetes by following [this guide](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md).
+1. **Deploy CodeGen Service:** Ensure your CodeGen service is deployed and accessible. Note the service endpoint URL (e.g., obtained via `kubectl get svc` or your ingress configuration if using Kubernetes, or `http://{host_ip}:{port}` for Docker).

-- Every node has direct internet access
-- Set up kubectl on the master node with access to the Kubernetes cluster.
-- Install Python 3.8+ on the master node for running GenAIEval.
-- Ensure all nodes have a local /mnt/models folder, which will be mounted by the pods.
-- Ensure that the container's ulimit can meet the the number of requests.
+2. **Configure Benchmark Parameters (Optional):**
+Set environment variables to customize the test queries and output directory. The `USER_QUERIES` variable defines the number of concurrent requests for each test run.

-```bash
-# The way to modify the containered ulimit:
-sudo systemctl edit containerd
-# Add two lines:
-[Service]
-LimitNOFILE=65536:1048576
+```bash
+# Example: Four runs with 128 concurrent requests each
+export USER_QUERIES="[128, 128, 128, 128]"
+# Example: Output directory
+export TEST_OUTPUT_DIR="/tmp/benchmark_output"
+# Set the target endpoint URL
+export CODEGEN_ENDPOINT_URL="http://{your_service_ip_or_hostname}:{port}/v1/codegen"
+```

-sudo systemctl daemon-reload; sudo systemctl restart containerd
-```
+_Replace `{your_service_ip_or_hostname}:{port}` with the actual accessible URL of your CodeGen gateway service._

-### Test Steps
+3. **Execute the Benchmark Script:**
+Run the script, optionally specifying the number of Kubernetes nodes involved if relevant for reporting context (the script itself runs from one node).
+```bash
+# Clone GenAIExamples if you haven't already
+# cd GenAIExamples/CodeGen/benchmark/performance
+bash benchmark.sh # Add '-n <node_count>' if desired for logging purposes
+```
+_Ensure the `benchmark.sh` script is adapted to use `CODEGEN_ENDPOINT_URL` and potentially `USER_QUERIES`, `TEST_OUTPUT_DIR`._

-Please deploy CodeGen service before benchmarking.
+## Data Collection

-#### Run Benchmark Test
-
-Before the benchmark, we can configure the number of test queries and test output directory by:
-
-```bash
-export USER_QUERIES="[128, 128, 128, 128]"
-export TEST_OUTPUT_DIR="/tmp/benchmark_output"
-```
-
-And then run the benchmark by:
-
-```bash
-bash benchmark.sh -n <node_count>
-```
-
-The argument `-n` refers to the number of test nodes.
-
-#### Data collection
-
-All the test results will come to this folder `/tmp/benchmark_output` configured by the environment variable `TEST_OUTPUT_DIR` in previous steps.
+Benchmark results will be displayed in the terminal upon completion. Detailed results, typically including raw data and summary statistics, will be saved in the directory specified by `TEST_OUTPUT_DIR` (defaulting to `/tmp/benchmark_output`). CSV files (e.g., `1_testspec.yaml.csv`) containing metrics for each run are usually generated here.
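Once a run finishes, the output directory can be inspected directly; a minimal example assuming the default `TEST_OUTPUT_DIR` and the CSV naming mentioned above (exact file names may vary by run):

```bash
# List the collected benchmark artifacts
ls -lh /tmp/benchmark_output
# Preview one of the per-run CSV summaries (file name shown is illustrative)
head /tmp/benchmark_output/1_testspec.yaml.csv
```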
