
unstable results of qwen-72b-instruct on IFEVAL? #476


Open
wenhuach21 opened this issue Mar 20, 2025 · 9 comments

wenhuach21 (Contributor) commented Mar 20, 2025:

The community has reported that we produce unstable IFEval results for Qwen2.5-72B-Instruct.
https://kaitchup.substack.com/p/a-comparison-of-5-quantization-methods

The interesting part is that all three recipes report satisfactory results.

@WeiweiZhang1 Could you do 5-10 runs with the default parameters and evaluate IFEval with the HF and vLLM backends respectively? 4 bits first, then 8 bits.

@wenhuach21 wenhuach21 changed the title unstable results of qwen-72b-instruct on IFEVAL unstable results of qwen-72b-instruct on IFEVAL? Mar 20, 2025
wenhuach21 (Contributor, Author) commented:

nsamples = 512
iterations = 500
model_dtype = float16
symmetric quantization
auto_gptq export
group size = 128
This produced bad quantization for the 4-bit and 8-bit versions, but worked well with 2-bit and a group size of 32.
Other hyperparameter values, as you suggested, performed well.
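For reference, here is a minimal sketch of how the recipe above could be expressed with the auto-round Python API. The argument names (`bits`, `group_size`, `sym`, `nsamples`, `iters`) follow the AutoRound documentation but may differ slightly between auto-round versions, and the output directory is a placeholder.

```python
# Sketch of the recipe above with the auto-round API.
# Argument names may differ between auto-round versions;
# the output directory is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,          # 4-bit weights (use 8 for the W8 run)
    group_size=128,  # group size 128
    sym=True,        # symmetric quantization
    nsamples=512,    # calibration samples
    iters=500,       # tuning iterations
)
autoround.quantize()
autoround.save_quantized("Qwen2.5-72B-Instruct-w4g128", format="auto_gptq")
```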

wenhuach21 (Contributor, Author) commented Mar 24, 2025:

I ran 4 experiments in one of my environments. Despite some randomness, all yielded reasonable results with W4G128.

Run 1:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8369 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7782 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7745 | ± 0.0180 |
| | | none | 0 | prompt_level_strict_acc | 0.6932 | ± 0.0198 |

Run 2:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8477 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7914 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7800 | ± 0.0178 |
| | | none | 0 | prompt_level_strict_acc | 0.6969 | ± 0.0198 |

Run 3:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8345 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7734 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7671 | ± 0.0182 |
| | | none | 0 | prompt_level_strict_acc | 0.6895 | ± 0.0199 |

Run 4:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8501 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7830 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7874 | ± 0.0176 |
| | | none | 0 | prompt_level_strict_acc | 0.6950 | ± 0.0198 |
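For context, runs like the above can be reproduced with the lm-evaluation-harness Python API; a minimal sketch using the HF backend, where the model path and batch size are placeholders:

```python
# Sketch of a leaderboard_ifeval run with lm-evaluation-harness (HF backend).
# The model path and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<path-to-quantized-model>,dtype=float16",
    tasks=["leaderboard_ifeval"],
    batch_size=8,
)
print(results["results"]["leaderboard_ifeval"])
```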

wenhuach21 (Contributor, Author) commented:

Another environment:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8477 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7854 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7763 | ± 0.0179 |
| | | none | 0 | prompt_level_strict_acc | 0.6932 | ± 0.0198 |

WeiweiZhang1 (Collaborator) commented:

My test results:
nsamples = 512
iterations = 500
model_dtype = float16
symmetric quantization
auto_gptq export
group size = 128

[Three screenshots of IFEval results were attached in the original comment.]

benjamin-marie commented:

That's very interesting and very good news!
Thank you for digging into this.

This is with the HF backend? I usually run vLLM since it is much faster. Maybe it makes a difference?
I'll double-check and make the corrections!

wenhuach21 (Contributor, Author) commented Mar 27, 2025:

> That's very interesting and very good news! Thank you for digging into this.
>
> This is with the HF backend? I usually run vLLM since it is much faster. Maybe it makes a difference? I'll double-check and make the corrections!

I have verified your int4 model with vllm. The accuracy is close to these results. I am evaluating your int8 model.
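For reference, the corresponding vLLM-backend evaluation can be sketched the same way with lm-evaluation-harness; the model path, tensor_parallel_size, and gpu_memory_utilization below are placeholders to adapt to your setup:

```python
# Sketch of the same leaderboard_ifeval run with the vLLM backend.
# Model path and parallelism/memory settings are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=<path-to-quantized-model>,"
        "tensor_parallel_size=4,gpu_memory_utilization=0.9"
    ),
    tasks=["leaderboard_ifeval"],
    batch_size="auto",
)
print(results["results"]["leaderboard_ifeval"])
```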

wenhuach21 (Contributor, Author) commented Mar 27, 2025:

@benjamin-marie For the int8 model, I could reproduce your result with the vLLM backend; the HF backend is fine. I have opened an issue in lm-evaluation-harness: EleutherAI/lm-evaluation-harness#2851

benjamin-marie commented:

I wonder whether the problem is with vLLM rather than with lm_eval. I'll do some more tests.

wenhuach21 (Contributor, Author) commented:

With the HF backend, torch 2.6 is fine; accuracy on torch 2.5 is low.
