-
Notifications
You must be signed in to change notification settings - Fork 25.3k
Improve execution of terms queries over wildcard fields #128986
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Hi @iverase, I've created a changelog YAML for you. |
Pinging @elastic/es-search-relevance (Team:Search Relevance) |
@iverase Before merging this PR do we want to add "terms query on a wildcard field" operation in our rally benchmarks (for example, http_logs already has a wildcard field) to measure the improve performance? |
http_logs is using a https://github.com/elastic/rally-tracks/blob/master/http_logs/index.json |
@iverase For runtime schedule there is an indexed wildcard field. I wonder if we can add terms query on that. |
} | ||
} | ||
|
||
public void testTermsQueryDuel() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How long this test will run in number of terms is 8192? Is the performance acceptable to be included as a regular test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is very fast, the full suite takes a couple of seconds.
I will need to learn how to run this mode .... tracks are becoming so complex, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iverase Thanks, great optimization.
I am not sure about an approach to take middle toke for a term, but I trust your intuition on this.
- Would be nice to have a dedicated benchmark for terms query to see if got speedups.
In this case is easy to prove we got an improvement. If you run the test I have added with the old code, you will get an OOM. It will till be good to have benchmarks not to get regressions. |
💔 Backport failed
You can use sqren/backport to manually backport by running |
This commit implements TermsQuery(Collection<?> values, @nullable SearchExecutionContext context) in the WildCardFieldMapper to avoid memory pressure when building the query.
This commit implements TermsQuery(Collection<?> values, @nullable SearchExecutionContext context) in the WildCardFieldMapper to avoid memory pressure when building the query.
Currently, A term query over a wildcard field is executing creating a BinaryDvConfirmedQuery which contains an approximation query that might contain up to 10 terms queries under a boolean SHOULD plus an automaton. This is a pretty big query by itself. More over a terms query over a wildcard field creates one of those queries per term, so we can easily build a huge query that can use many GiB of heap. This PR implements
TermsQuery(Collection<?> values, @Nullable SearchExecutionContext context)
to avoid memory pressure.For small number of terms (< 16), it will approximate each term as we are doing now but it will create just a BinaryDvConfirmedQuery with a boolean query containing the approximation of each term.
For bigger number of terms, we swap to a TermsInSetQuery. We only need one token per term to make sure we are matching all the required documents. Ideally this token should be the one that makes the term most unique, although that's different to compute when generating the query.
Performance test shows a big improvement when most of the terms matches some documents in the index but it behaves (slightly) worst when none of the terms matches. All in all I think it is a good improvement that prevent smemory isues on such a queries.
fixes #128201