Improve execution of terms queries over wildcard fields #128826
Conversation
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @iverase, I've created a changelog YAML for you.
@iverase do we have any performance tests that exercise these code paths? This seems like it SHOULD give better overall resource utilization (less heap) at minimal expense. Could we confirm that?
@benwtrent I performed the following test. I indexed the request field from the rally track http_logs as a wildcard field in one index; this is around ~250 million documents. The content of that field is quite structured (URL-like). I then ran terms queries for different sizes of the terms list (1, 10, 100, 1000, 10000), of the shape sketched below.
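A hedged sketch of the query shape using the Java builders (the field name request comes from the description above; the exact request body was not shown, so the builder form and the sample value are assumptions):

```java
import java.util.List;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Sketch only: the benchmark used lists of 1 to 10000 terms.
List<String> terms = List.of("GET /images/logo.gif HTTP/1.0"); // assumed sample value

SearchSourceBuilder source = new SearchSourceBuilder()
    .query(QueryBuilders.termsQuery("request", terms))
    .trackTotalHits(true) // count every matching document, as described below
    .size(0);             // only the total hit count matters for the timing
```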
Note that I want to get the total number of hits ("track_total_hits": true) as I am interested in knowing how long it takes to match all documents. For the first case, I created the list from existing records in the index so it matches many documents. For the baseline I had to run with at least 10GB of heap, otherwise it would crash. For the new approach, memory is no longer an issue. The results look like this:
In this case, the current implementation quickly degrades and becomes unusable with hundreds of terms, while the new solution looks pretty stable. For the next case, I produced the list with random strings so nothing matches. The results look like this:
In this case the current solution is able to accelerate the queries, as they are probably rewritten to MatchNone. The proposed change still needs to scan most of the docs, so it is expected to be slower; it still looks OK for a big number of elements in the list. So all in all, I think the proposed approach removes all the edge cases of the current implementation. There will be slowdowns in some cases, but I think that is OK as those are not the main use cases for this type of field. One thing to think about is whether the TermInSetQuery makes sense, because running this test with a MatchAllDocsQuery I get better numbers, although not much better.
```java
if (tokens.isEmpty() == false) {
    // If there are tokens, we take the middle one to represent the term,
    // which is probably the most different one.
    tokenList.add(getMiddleToken(tokens));
}
```
Why the middle token? I would think that the first token is the most different?
Should we at least use several tokens (maybe first, middle, last)?
We use only one token because that is enough to make sure we match the documents we are interested in. Adding more tokens would only add unnecessary documents.
Because this is a wildcard field, I expect the entries to have some structure, like URLs, so they can be queried using * and ?.
I was thinking of URLs when deciding on the middle character. The beginning of URLs is normally the same, and in many cases the end is too, so what makes them unique is probably somewhere in the middle.
I am not sure how much this is helping; if it proves controversial we can move to a MatchAllDocsQuery and do a linear scan. That is probably still better than what we have today.
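For illustration, a minimal sketch of the middle-token heuristic (the signature and types are assumptions, not the PR's exact code):

```java
import java.util.List;
import org.apache.lucene.util.BytesRef;

static BytesRef getMiddleToken(List<BytesRef> tokens) {
    // For URL-like values the leading and trailing ngrams are often shared
    // across entries, so the middle ngram is the most likely to be distinctive.
    return tokens.get(tokens.size() / 2);
}
```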
@iverase Great ideas and great optimizations!
It looks like this PR introduces 2 optimizations for terms queries over wildcard fields.
Do you think it makes sense to break them into separate PRs? Another concern I have is about the pruning strategy and whether it makes the approximate query less useful.
The issue at the moment is that for a terms query we are generating one BinaryDvConfirmedAutomatonQuery per term, so you would need a terms query with 27k terms to hit it. If you run WildcardSearchIT on the current implementation, it will go OOM.
Sure, I just want to have it asap.
TermInSetQuery makes an OR over the provided terms, so in order to match all the documents we are interested in we only need one term each; adding more would only increase the number of documents we match unnecessarily.
If we started using a boolean MUST, we would end up with the same issue we have now, using tons of heap. So no, I don't think this is an option.
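A hedged sketch of how that approximation could be assembled with Lucene's TermInSetQuery, reusing getMiddleToken from the sketch above (the ngram subfield name field._ngram and the 3-gram tokenizer are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

// Assumed 3-gram tokenizer, for illustration only.
static List<BytesRef> ngrams(String term) {
    List<BytesRef> grams = new ArrayList<>();
    for (int i = 0; i + 3 <= term.length(); i++) {
        grams.add(new BytesRef(term.substring(i, i + 3)));
    }
    return grams;
}

static Query approximation(List<String> searchTerms) {
    List<BytesRef> representatives = new ArrayList<>();
    for (String term : searchTerms) {
        // One representative ngram per term keeps the OR small; more tokens
        // would only widen the set of documents matched by the approximation.
        representatives.add(getMiddleToken(ngrams(term)));
    }
    // A single OR over all representatives, instead of one automaton query per term.
    return new TermInSetQuery("field._ngram", representatives);
}
```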
@mayya-sharipova I opened #128920. This is mostly the same PR except I am not implementing the method
superseded by #128986
Currently, a term query over a wildcard field executes by creating a BinaryDvConfirmedAutomatonQuery, which contains an approximation query that might hold up to 10 term queries under a boolean SHOULD, plus an automaton. That is a pretty big query by itself. Moreover, a terms query over a wildcard field creates one of those queries per term, so we can easily build a huge query that can use many GiB of heap.

This PR addresses the issue by adding a BinaryDvConfirmedTermsQuery to execute terms queries, which uses binary search over the terms instead of an automaton. For terms queries with many terms, we can build just one of those queries: for the approximation we collect one ngram per term and run a TermInSetQuery, and each candidate is then confirmed against the sorted terms (sketched below). This approach should use just a fraction of the heap compared with the previous one and should behave better in adversarial cases.

fixes #128201
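For illustration, a hedged sketch of the confirmation step (names are assumed; BytesRef compares in unsigned byte order, so a plain binary search works):

```java
import java.util.Collections;
import java.util.List;
import org.apache.lucene.util.BytesRef;

// With the query's terms sorted once up front, each candidate document's
// binary doc value is confirmed by a binary search instead of an automaton run.
static boolean confirmed(BytesRef docValue, List<BytesRef> sortedTerms) {
    return Collections.binarySearch(sortedTerms, docValue) >= 0;
}
```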