We provide a sample script to run the full pipeline:
```bash
bash run.sh
```
## 📊 Result Analysis
We provide a script to replicate analyses such as Elo Rating and Task Solve Rate, which help you further understand model performance.
```bash
python get_results.py
```
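For intuition, the Elo Rating used in such leaderboards is computed from pairwise "battles" between models. The sketch below shows a minimal pairwise Elo update; the function names, K-factor, and match data are hypothetical illustrations, not the repository's actual implementation in `get_results.py`.

```python
# Minimal sketch of pairwise Elo updates for model comparison.
# All names and parameters here are illustrative assumptions.

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings: dict, winner: str, loser: str, k: float = 32) -> None:
    """Update both ratings in place after one head-to-head comparison."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1 - e_win)
    ratings[loser] -= k * (1 - e_win)

# Hypothetical battles: model_a wins three of four.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for w, l in [("model_a", "model_b")] * 3 + [("model_b", "model_a")]:
    update_elo(ratings, w, l)
print(ratings["model_a"] > ratings["model_b"])  # prints: True
```

Because each update transfers the same number of points from loser to winner, the total rating mass is conserved; only relative ordering is meaningful.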
We share pre-generated code samples from LLMs we have [evaluated](https://huggingface.co/spaces/bigcode/bigcodebench-leaderboard):
* See the attachment of our [v0.1.5](https://github.com/bigcode-project/bigcodebench/releases/tag/v0.1.5). We include both `sanitized_samples.zip` and `sanitized_samples_calibrated.zip` for your convenience.
## 🐞 Known Issues
- [ ] Due to flakiness in the evaluation, execution results may vary slightly (~0.2%) between runs. We are working on improving the evaluation stability.