
Improve execution of terms queries over wildcard fields #128826


Closed
wants to merge 2 commits

Conversation

iverase
Contributor

@iverase iverase commented Jun 3, 2025

Currently, a term query over a wildcard field executes by creating a BinaryDvConfirmedAutomatonQuery, which contains an approximation query that might hold up to 10 term queries under a boolean SHOULD plus an automaton. This is a pretty big query by itself. Moreover, a terms query over a wildcard field creates one of those queries per term, so we can easily build a huge query that uses many GiB of heap.

This PR addresses the issue by adding a BinaryDvConfirmedTermsQuery, which executes terms queries using binary search over the sorted terms instead of an automaton. For terms queries with many terms, we can build just one of those queries, and for the approximation we collect one ngram per term and run a TermInSetQuery. This approach should use just a fraction of the heap compared with the previous approach and should behave better in adversarial cases.

fixes #128201
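The binary-search verification described above can be sketched roughly as follows. This is a minimal illustration in plain Java, not the PR's actual Lucene code; the class and method names are hypothetical:

```java
import java.util.Arrays;

// Hypothetical sketch of the BinaryDvConfirmedTermsQuery idea: sort the query
// terms once up front, then confirm each candidate document's doc-value with a
// binary search instead of running it through a per-term automaton.
public class TermsVerifier {
    private final String[] sortedTerms;

    public TermsVerifier(String[] terms) {
        this.sortedTerms = terms.clone();
        Arrays.sort(this.sortedTerms); // O(n log n) once, instead of building n queries
    }

    // Confirm a candidate value surfaced by the approximation query.
    public boolean matches(String docValue) {
        return Arrays.binarySearch(sortedTerms, docValue) >= 0; // O(log n) per doc
    }
}
```

The key property is that memory is bounded by the sorted terms array itself, rather than by one automaton-backed query object per term.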

@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations label (Meta label for the Search Foundations team in Elasticsearch) Jun 3, 2025
@elasticsearchmachine
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Collaborator

Hi @iverase, I've created a changelog YAML for you.

@benwtrent
Member

@iverase do we have any performance tests that exercise these code paths? This seems like it SHOULD be better overall resource utilization (less heap), at minimal expense. Could we confirm that?

@iverase
Contributor Author

iverase commented Jun 4, 2025

@benwtrent I performed the following test.

I indexed the request field from the rally track http_logs as a wildcard field in one index. This is around ~250 million documents. The content of that field is quite structured looking like : "GET /french/images/nav_logo_sponsors.gif HTTP/1.1".

Then I performed the following queries for different sizes of the terms list (1, 10, 100, 1000, 10000):

GET logs_wildcard/_search
{
  "track_total_hits": true,
  "query": {
    "terms": {
      "request": [
         .....
      ]
    }
  }
}

Note that I want to get the total number of hits ("track_total_hits": true) as I am interested in knowing how long it takes to match all documents.

For the first case, I created the list using existing records in the index so it matches many documents. For the baseline I had to run it with at least 10GB of heap, otherwise it would crash. For the new approach, memory is not an issue any more. The results look like:

size     | baseline(ms)        | contender(ms)
1        | 7724                | 7362
10       | 19218               | 13093
100      | 190000              | 13590
1000     | 1349294             | 15490
10000    | too long            | 17446

In this case, the current implementation quickly degrades and becomes unusable with hundreds of terms, while the new solution stays pretty stable.

For the next case, I produced the list with random strings so nothing matches. The results look like:

size     | baseline(ms)        | contender(ms)
1        | 13                  | 12
10       | 38                  | 41
100      | 420                 | 12976
1000     | 1672                | 17631
10000    | 7676                | 14041

In this case the current solution is able to accelerate the queries, as they are probably rewritten to MatchNone. The proposed change still needs to scan most of the docs, so it is expected to be slower. It still looks ok for large numbers of elements in the list.

So all in all, I think the new approach removes all the edge cases of the current implementation. There will be slowdowns in some cases, but I think that is ok as those are not the main use cases for this type of field.

One thing to think about is whether the TermInSetQuery makes sense, because running this test with a MatchAllDocsQuery, I get better numbers, although not much better.

if (tokens.isEmpty() == false) {
    // If there are tokens, we take the middle one to represent the term
    // which is probably the most different one.
    tokenList.add(getMiddleToken(tokens));
Contributor

@mayya-sharipova mayya-sharipova Jun 4, 2025


Why the middle token? I would think that the first token is the most different.

Should we at least use several tokens (maybe first, middle, last)?

Contributor Author

@iverase iverase Jun 4, 2025


We use only one token because that is enough to make sure we match the documents we are interested in. Adding more tokens would only match additional, unnecessary documents.

Because this is a wildcard field, I expect the entries to have some structure, like URLs, so they can be queried using * and ? .

I was thinking of URLs when deciding on the middle token. The beginning of URLs is normally the same, and in many cases the end is too, so whatever makes them unique probably happens somewhere in the middle.

I am not sure how much this is helping; if it proves controversial we can move to a MatchAllDocsQuery and do a linear scan. It is probably still better than what we have today.
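The middle-token heuristic discussed in this thread could look roughly like this (a hypothetical standalone sketch; `ngrams` and the class name are made up for illustration, not the PR's exact implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: generate all ngrams of a term, then keep only the
// middle one as the term's representative in the approximation query, on the
// assumption that URL-like values tend to share prefixes and suffixes.
public class MiddleToken {
    static List<String> ngrams(String term, int size) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i + size <= term.length(); i++) {
            tokens.add(term.substring(i, i + size));
        }
        return tokens;
    }

    static String getMiddleToken(List<String> tokens) {
        return tokens.get(tokens.size() / 2);
    }
}
```

For example, the 3-grams of "abcdef" are abc, bcd, cde, def, and the middle one, cde, would stand in for the whole term in the OR-of-ngrams approximation.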

@mayya-sharipova
Contributor

mayya-sharipova commented Jun 4, 2025

@iverase Great ideas and great optimizations!
I have several questions:

  • What kind of query led to the expansion into 27K BinaryDvConfirmedAutomatonQuery instances? The issue doesn't provide an example. Was it a terms query where a user provided all these terms? Maybe we should not allow these types of queries at all, rather than accommodate them.

It looks like this PR introduces 2 optimizations for terms query:

  • make the approximation query smaller when there are many terms; prune the ngrams generated for a term to a single middle ngram
  • use binary search over terms instead of an automaton

Do you think it makes sense to break them into separate PRs?


Another concern I have is about pruning strategy and if it makes approximate query less useful:

  • do you think it makes sense to have more ngrams per term (a minimum of 3 comes to mind)?
  • TermInSetQuery considers all ngrams optional, while a boolean query with MUST clauses would be a better match.

@iverase
Contributor Author

iverase commented Jun 4, 2025

What kind of query led to the expansion into 27K BinaryDvConfirmedAutomatonQuery instances? The issue doesn't provide an example. Was it a terms query where a user provided all these terms? Maybe we should not allow these types of queries at all, rather than accommodate them.

The issue at the moment is that for a terms query we generate one BinaryDvConfirmedAutomatonQuery per term, so you would need a terms query with 27k terms. If you run WildcardSearchIT on the current implementation, it will go OOM.

Do you think it makes sense to break them into separate PRs?

Sure, I just want to have it asap.

do you think it makes sense to have more ngrams per term (a minimum of 3 comes to mind)?

TermInSetQuery makes an OR of the provided terms, so in order to match all the documents we are interested in, we only need one term; adding more will only increase the number of documents we match unnecessarily.

TermInSetQuery considers all ngrams optional, while a boolean query with MUST clauses would be a better match.

If we start using boolean MUST clauses, we end up with the same issue we have now: using tons of heap. So no, I don't think this is an option.

@iverase
Contributor Author

iverase commented Jun 4, 2025

@mayya-sharipova I opened #128920. It is mostly the same PR, except I am not implementing the method public Query termsQuery(Collection<?> values, @Nullable SearchExecutionContext context) and I am not adding the new test, as it would fail with an OOM.

@iverase
Contributor Author

iverase commented Jun 5, 2025

superseded by #128986

@iverase iverase closed this Jun 5, 2025
Labels
>bug, :Search Foundations/Search (Catch all for Search Foundations), Team:Search Foundations (Meta label for the Search Foundations team in Elasticsearch), v8.19.0, v9.1.0
Development

Successfully merging this pull request may close these issues.

Single query on the uri wildcard field caused OOM
4 participants