
unstable results of qwen-72b-instruct on IFEVAL? #476


Open
wenhuach21 opened this issue Mar 20, 2025 · 9 comments

wenhuach21 (Contributor) commented Mar 20, 2025:

The community has reported that we produce unstable IFEval results for Qwen2.5-72B-Instruct.
https://kaitchup.substack.com/p/a-comparison-of-5-quantization-methods

The interesting part is that all three recipes report satisfactory results.

@WeiweiZhang1 Could you do 5-10 runs with the default parameters and evaluate IFEval with the HF and vLLM backends respectively? 4 bits first, then 8 bits.

@wenhuach21 wenhuach21 changed the title unstable results of qwen-72b-instruct on IFEVAL unstable results of qwen-72b-instruct on IFEVAL? Mar 20, 2025
wenhuach21 (Contributor, Author) commented:

nsamples = 512
iterations = 500
model_dtype = float16
symmetric quantization
auto_gptq export
group size = 128
This produced bad quantization for the 4-bit and 8-bit versions, but worked well with 2-bit and a group size of 32.
Other hyperparameter values, as you suggested, performed well.
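For reference, here is a minimal sketch of how the recipe above could be expressed with the auto-round Python API. The argument names (`bits`, `group_size`, `sym`, `nsamples`, `iters`) follow the AutoRound documentation but may differ slightly between auto-round versions, and the output directory is a placeholder.

```python
# Sketch of the recipe above with the auto-round API.
# Argument names may differ between auto-round versions;
# the output directory is a placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-72B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,          # 4-bit weights (use 8 for the W8 run)
    group_size=128,  # group size 128
    sym=True,        # symmetric quantization
    nsamples=512,    # calibration samples
    iters=500,       # tuning iterations
)
autoround.quantize()
autoround.save_quantized("Qwen2.5-72B-Instruct-w4g128", format="auto_gptq")
```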

wenhuach21 (Contributor, Author) commented Mar 24, 2025:

I ran 4 experiments in one of my environments. Despite some randomness, all yielded reasonable results with W4G128.

Run 1:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8369 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7782 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7745 | ± 0.0180 |
| | | none | 0 | prompt_level_strict_acc | 0.6932 | ± 0.0198 |

Run 2:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8477 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7914 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7800 | ± 0.0178 |
| | | none | 0 | prompt_level_strict_acc | 0.6969 | ± 0.0198 |

Run 3:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8345 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7734 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7671 | ± 0.0182 |
| | | none | 0 | prompt_level_strict_acc | 0.6895 | ± 0.0199 |

Run 4:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8501 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7830 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7874 | ± 0.0176 |
| | | none | 0 | prompt_level_strict_acc | 0.6950 | ± 0.0198 |
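For context, runs like the above can be reproduced with the lm-evaluation-harness Python API; a minimal sketch using the HF backend, where the model path and batch size are placeholders:

```python
# Sketch of a leaderboard_ifeval run with lm-evaluation-harness (HF backend).
# The model path and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=<path-to-quantized-model>,dtype=float16",
    tasks=["leaderboard_ifeval"],
    batch_size=8,
)
print(results["results"]["leaderboard_ifeval"])
```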

wenhuach21 (Contributor, Author) commented:

Another environment:

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
|---|---|---|---|---|---|---|
| leaderboard_ifeval | 3 | none | 0 | inst_level_loose_acc | 0.8477 | ± N/A |
| | | none | 0 | inst_level_strict_acc | 0.7854 | ± N/A |
| | | none | 0 | prompt_level_loose_acc | 0.7763 | ± 0.0179 |
| | | none | 0 | prompt_level_strict_acc | 0.6932 | ± 0.0198 |

WeiweiZhang1 (Collaborator) commented:

My test results:
nsamples = 512
iterations = 500
model_dtype = float16
symmetric quantization
auto_gptq export
group size = 128

[Three screenshots of IFEval results were attached in the original comment.]

benjamin-marie commented:

That's very interesting and very good news!
Thank you for digging into this.

This is with the HF backend? I usually run vLLM since it is much faster. Maybe it makes a difference?
I'll double-check and make the corrections!

wenhuach21 (Contributor, Author) commented Mar 27, 2025:

> That's very interesting and very good news! Thank you for digging into this.
>
> This is with the HF backend? I usually run vLLM since it is much faster. Maybe it makes a difference? I'll double-check and make the corrections!

I have verified your int4 model with vllm. The accuracy is close to these results. I am evaluating your int8 model.
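For reference, the corresponding vLLM-backend evaluation can be sketched the same way with lm-evaluation-harness; the model path, tensor_parallel_size, and gpu_memory_utilization below are placeholders to adapt to your setup:

```python
# Sketch of the same leaderboard_ifeval run with the vLLM backend.
# Model path and parallelism/memory settings are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=<path-to-quantized-model>,"
        "tensor_parallel_size=4,gpu_memory_utilization=0.9"
    ),
    tasks=["leaderboard_ifeval"],
    batch_size="auto",
)
print(results["results"]["leaderboard_ifeval"])
```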

wenhuach21 (Contributor, Author) commented Mar 27, 2025:

@benjamin-marie For the int8 model, I could reproduce your result with the vLLM backend; the HF backend is fine. I have opened an issue in lm-evaluation-harness: EleutherAI/lm-evaluation-harness#2851

benjamin-marie commented:

I wonder whether the problem is with vLLM rather than with lm_eval. I'll do some more tests.

wenhuach21 (Contributor, Author) commented:

With the HF backend, torch 2.6 is fine; accuracy on torch 2.5 is low.
