[CI] Add mteb testing to test the accuracy of the embedding model #17175
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Thanks for adding this, can you fix pre-commit?
In the end, delete benchmarks/eval/test_mteb.py, or change it into a more general testing script.
Hmm, I'm not sure we want to have benchmark/evals. For correctness checking in the CI, we should be able to just test 2-3 cases to keep it stable.
I tested more models, and most of them showed strong numerical stability (<1e-4), even better than I imagined. The score differences between different models are also quite noticeable (>1e-3). This makes mteb STS12 a great embedding model test set.
Thanks for your patience! |
Thanks for reviewing. By carefully studying the code, I indeed learned a lot of weird stuff. |
What should I do to make the merge go more smoothly? |
The failing doc build seems related to this PR, maybe because you changed the dependencies.
Need to manually merge #16859 first.
Head branch was pushed to by a user without write access
This pull request has merge conflicts that must be resolved before it can be merged.
Wouldn't this also remove the "correct" docs installation?
Normally there shouldn't be a library called docs, otherwise it would cause many problems. I first tried to see if it could pass the test; very sorry for using a lot of CI resources.
Let's try the CI again
QvQ |
Need to install mteb for model tests as well. |
Head branch was pushed to by a user without write access
https://buildkite.com/vllm/ci/builds/19198 Last check: are the errors for the V1 Test and Speculative decoding tests unrelated to this PR?
Yeah they are unrelated
I apologize for any inconvenience caused, hope this time it will pass. |
Summary:
If, during inference, seqlen > max_trained_positions, the model automatically switches from NomicBertRotaryEmbedding to NomicBertDynamicNTKRotaryEmbedding.
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/blob/main/modeling_hf_nomic_bert.py#L639
https://huggingface.co/nomic-ai/nomic-bert-2048/blob/main/modeling_hf_nomic_bert.py#L1413
It might lead to hard-to-detect bugs.
We ignore config.rotary_scaling_factor so that, for datasets shorter than max_trained_positions (2048), the results are consistent with SentenceTransformer.
Context extension uses vLLM-style rope_theta and rope_scaling.
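For illustration, a minimal sketch of the dynamic-NTK fallback described above (this is not the HF or vLLM implementation; `rotary_dim` and `scaling_factor` are assumed placeholders): when seqlen exceeds max_trained_positions, the RoPE base is rescaled instead of using the plain rotary embedding.

```python
# Hedged sketch of the dynamic-NTK fallback; values for rotary_dim and
# scaling_factor are illustrative assumptions, not the model's config.
import torch


def rope_inv_freq(base: float, rotary_dim: int) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a given base."""
    return 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))


def maybe_dynamic_ntk_inv_freq(
    seqlen: int,
    max_trained_positions: int = 2048,
    rotary_base: float = 10000.0,
    rotary_dim: int = 64,
    scaling_factor: float = 2.0,  # stand-in for config.rotary_scaling_factor
) -> torch.Tensor:
    """Return RoPE frequencies, rescaling the base only when the sequence
    is longer than the trained context (dynamic NTK)."""
    if seqlen <= max_trained_positions:
        return rope_inv_freq(rotary_base, rotary_dim)
    # Dynamic NTK: grow the base as a function of how far we exceed the
    # trained context, so low frequencies are stretched smoothly.
    base = rotary_base * (
        (scaling_factor * seqlen / max_trained_positions) - (scaling_factor - 1)
    ) ** (rotary_dim / (rotary_dim - 2))
    return rope_inv_freq(base, rotary_dim)
```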
Task selection
Although mteb v2 significantly speeds up the testing process, it still requires several hours to complete all tests.
Here we choose a small test task from mteb: STS12.
Running time on an RTX 4090 is approximately 26.95 s.
The score on this test set is strongly numerically stable (<1e-4) with respect to minor variations in the model implementation and tensor data types.
The score differences between different models are also quite noticeable (>1e-3).
This makes mteb STS12 a great embedding model test set.
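For reference, a hedged sketch of how such a score can be produced outside the CI, using the mteb package with a SentenceTransformer baseline (this is not the test script added in this PR, and the result-access pattern may differ between mteb versions):

```python
import mteb
from sentence_transformers import SentenceTransformer

# Example model taken from the tables below; any embedding model exposing an
# `encode` method should work here.
model = SentenceTransformer("intfloat/multilingual-e5-small")

tasks = mteb.get_tasks(tasks=["STS12"])   # the small, fast STS task chosen here
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=None)

# Result layout varies across mteb versions; recent ones return TaskResult objects.
for task_result in results:
    print(task_result.task_name, task_result.scores)
```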
Numerical stability
The difference is very subtle (<1e-4), at least on this test set.
Ten rounds:
The results of ten iterations seem to show that converting float32 to float16 yields better results than bfloat16 (vLLM defaults to converting float32 to float16).
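A small, hedged illustration of why that might be expected: float16 keeps 10 mantissa bits versus bfloat16's 7, so casting unit-norm fp32 embeddings to fp16 introduces roughly an order of magnitude less rounding error. The shapes and seed below are arbitrary.

```python
import torch

torch.manual_seed(0)
emb = torch.randn(1000, 384)                       # stand-in fp32 embeddings
emb = torch.nn.functional.normalize(emb, dim=-1)   # unit-norm, like typical sentence embeddings

# Compare the worst-case rounding error of each low-precision cast.
for dtype in (torch.float16, torch.bfloat16):
    err = (emb.to(dtype).float() - emb).abs().max().item()
    print(dtype, f"max abs cast error = {err:.2e}")
```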
More about numerical stability
Most models exhibit excellent numerical stability.
Slightly numerically unstable models:
fp16:
| intfloat/multilingual-e5-small | 0.7805425596252846 | -0.2749311085815237 | 0.006216913108536066 |
fp32:
| intfloat/multilingual-e5-small | 0.7805425596252846 | -1.6403316041024851e-06 | 7.53539269543218e-06 |
pooling_type="MEAN" + fp16 (default)
intfloat/multilingual-e5-large-instruct 0.8224491209469045 -0.28623335791513993 0.007169234312147499
pooling_type="MEAN" + fp32
intfloat/multilingual-e5-large-instruct 0.8224491209469045 -2.3497119421289625e-06 7.898194995699927e-06
fp16:
| jinaai/jina-embeddings-v3 | 0.7834129787836271 | -0.0709833671361465 | 0.004834963031278825 |
fp32:
| jinaai/jina-embeddings-v3 | 0.8243646209061513 | -3.119267999662778e-05 | 6.651161140301139e-06 |
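For anyone reproducing the fp32 rows above, a hedged sketch of overriding vLLM's default dtype for an embedding model (argument names follow recent vLLM releases; older versions may spell the task or the embed call differently, and the prompt is just an example e5-style query):

```python
from vllm import LLM

# Force float32 instead of vLLM's default float16 conversion of fp32 checkpoints.
llm = LLM(
    model="intfloat/multilingual-e5-small",  # example model from the tables above
    task="embed",
    dtype="float32",
)

outputs = llm.embed(["query: what is the capital of France?"])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```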