[CI] Add mteb testing to test the accuracy of the embedding model #17175
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀
Thanks for adding this, can you fix pre-commit?
In the end, delete benchmarks/eval/test_mteb.py, or change it into a more general testing script.
Hmm, I'm not sure we want to have benchmark/evals. For correctness checking in the CI, we should be able to just test 2-3 cases to keep it stable.
I tested more models, and most of them showed strong numerical stability (<1e-4), even better than I imagined. The score differences between different models are also quite noticeable (>1e-3). This makes mteb STS12 a great embedding model test set.
Thanks for your patience! |
Thanks for reviewing. By carefully studying the code, I indeed learned a lot of weird stuff. |
What should I do to make the merge go more smoothly? |
The failing doc build seems related to this PR, maybe because you changed the dependencies.
Need to manually merge #16859 first.
Head branch was pushed to by a user without write access
This pull request has merge conflicts that must be resolved before it can be merged.
Wouldn't this also remove the "correct" docs installation?
Normally there shouldn't be a library called docs, otherwise it would cause many problems. I first tried to see if it could pass the test; very sorry for using a lot of CI resources.
Let's try the CI again
QvQ |
Need to install mteb for model tests as well. |
Head branch was pushed to by a user without write access
https://buildkite.com/vllm/ci/builds/19198 Last check: are the errors for the V1 Test and Speculative decoding tests unrelated to this PR?
Yeah they are unrelated
I apologize for any inconvenience caused, hope this time it will pass. |
Summary:
If, during inference, seqlen > max_trained_positions, the model automatically switches from NomicBertRotaryEmbedding to NomicBertDynamicNTKRotaryEmbedding.
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long/blob/main/modeling_hf_nomic_bert.py#L639
https://huggingface.co/nomic-ai/nomic-bert-2048/blob/main/modeling_hf_nomic_bert.py#L1413
It might lead to hard-to-detect bugs.
We ignore config.rotary_scaling_factor so that, for datasets shorter than max_trained_positions (2048), the results are consistent with SentenceTransformer.
Context extension uses vLLM-style rope_theta and rope_scaling.
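For illustration, a minimal sketch of the dynamic-NTK fallback described above (this is not the HF or vLLM implementation; `rotary_dim` and `scaling_factor` are assumed placeholders): when seqlen exceeds max_trained_positions, the RoPE base is rescaled instead of using the plain rotary embedding.

```python
# Hedged sketch of the dynamic-NTK fallback; values for rotary_dim and
# scaling_factor are illustrative assumptions, not the model's config.
import torch


def rope_inv_freq(base: float, rotary_dim: int) -> torch.Tensor:
    """Standard RoPE inverse frequencies for a given base."""
    return 1.0 / (base ** (torch.arange(0, rotary_dim, 2).float() / rotary_dim))


def maybe_dynamic_ntk_inv_freq(
    seqlen: int,
    max_trained_positions: int = 2048,
    rotary_base: float = 10000.0,
    rotary_dim: int = 64,
    scaling_factor: float = 2.0,  # stand-in for config.rotary_scaling_factor
) -> torch.Tensor:
    """Return RoPE frequencies, rescaling the base only when the sequence
    is longer than the trained context (dynamic NTK)."""
    if seqlen <= max_trained_positions:
        return rope_inv_freq(rotary_base, rotary_dim)
    # Dynamic NTK: grow the base as a function of how far we exceed the
    # trained context, so low frequencies are stretched smoothly.
    base = rotary_base * (
        (scaling_factor * seqlen / max_trained_positions) - (scaling_factor - 1)
    ) ** (rotary_dim / (rotary_dim - 2))
    return rope_inv_freq(base, rotary_dim)
```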
Task selection
Although mteb v2 significantly speeds up the testing process, it still requires several hours to complete all tests.
Here we choose a small test task from mteb: STS12.
Running time on an RTX 4090 is approximately 26.95 s.
The score on this test set is strongly numerically stable (<1e-4) with respect to minor variations in the model implementation and tensor data types.
The score differences between different models are also quite noticeable (>1e-3).
This makes mteb STS12 a great embedding model test set.
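For reference, a hedged sketch of how such a score can be produced outside the CI, using the mteb package with a SentenceTransformer baseline (this is not the test script added in this PR, and the result-access pattern may differ between mteb versions):

```python
import mteb
from sentence_transformers import SentenceTransformer

# Example model taken from the tables below; any embedding model exposing an
# `encode` method should work here.
model = SentenceTransformer("intfloat/multilingual-e5-small")

tasks = mteb.get_tasks(tasks=["STS12"])   # the small, fast STS task chosen here
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder=None)

# Result layout varies across mteb versions; recent ones return TaskResult objects.
for task_result in results:
    print(task_result.task_name, task_result.scores)
```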
Numerical stability
The difference is very subtle (<1e-4), at least on this test set.
Ten rounds:
The results of ten iterations seem to show that converting float32 to float16 yields better results than bfloat16 (vLLM defaults to converting float32 to float16).
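A small, hedged illustration of why that might be expected: float16 keeps 10 mantissa bits versus bfloat16's 7, so casting unit-norm fp32 embeddings to fp16 introduces roughly an order of magnitude less rounding error. The shapes and seed below are arbitrary.

```python
import torch

torch.manual_seed(0)
emb = torch.randn(1000, 384)                       # stand-in fp32 embeddings
emb = torch.nn.functional.normalize(emb, dim=-1)   # unit-norm, like typical sentence embeddings

# Compare the worst-case rounding error of each low-precision cast.
for dtype in (torch.float16, torch.bfloat16):
    err = (emb.to(dtype).float() - emb).abs().max().item()
    print(dtype, f"max abs cast error = {err:.2e}")
```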
More about numerical stability
Most models exhibit excellent numerical stability.
Slightly numerically unstable models:
fp16:
| intfloat/multilingual-e5-small | 0.7805425596252846 | -0.2749311085815237 | 0.006216913108536066 |
fp32:
| intfloat/multilingual-e5-small | 0.7805425596252846 | -1.6403316041024851e-06 | 7.53539269543218e-06 |
pooling_type="MEAN" + fp16 (default)
intfloat/multilingual-e5-large-instruct 0.8224491209469045 -0.28623335791513993 0.007169234312147499
pooling_type="MEAN" + fp32
intfloat/multilingual-e5-large-instruct 0.8224491209469045 -2.3497119421289625e-06 7.898194995699927e-06
fp16:
| jinaai/jina-embeddings-v3 | 0.7834129787836271 | -0.0709833671361465 | 0.004834963031278825 |
fp32:
| jinaai/jina-embeddings-v3 | 0.8243646209061513 | -3.119267999662778e-05 | 6.651161140301139e-06 |
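For anyone reproducing the fp32 rows above, a hedged sketch of overriding vLLM's default dtype for an embedding model (argument names follow recent vLLM releases; older versions may spell the task or the embed call differently, and the prompt is just an example e5-style query):

```python
from vllm import LLM

# Force float32 instead of vLLM's default float16 conversion of fp32 checkpoints.
llm = LLM(
    model="intfloat/multilingual-e5-small",  # example model from the tables above
    task="embed",
    dtype="float32",
)

outputs = llm.embed(["query: what is the capital of France?"])
print(len(outputs[0].outputs.embedding))  # embedding dimensionality
```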