Improve execution of terms queries over wildcard fields #128826
Conversation
Pinging @elastic/es-search-foundations (Team:Search Foundations)
Hi @iverase, I've created a changelog YAML for you.
@iverase do we have any performance tests that exercise these code paths? This seems like it SHOULD give better overall resource utilization (less heap) at minimal expense. Could we confirm that?
@benwtrent I performed the following test. I indexed the request field from the rally track http_logs as a wildcard field in one index; this is around ~250 million documents. The content of that field is quite structured (URL-like). I then ran terms queries for different sizes of the terms list (1, 10, 100, 1000, 10000), of the shape sketched below.
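A hedged sketch of the query shape using the Java builders (the field name request comes from the description above; the exact request body was not shown, so the builder form and the sample value are assumptions):

```java
import java.util.List;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

// Sketch only: the benchmark used lists of 1 to 10000 terms.
List<String> terms = List.of("GET /images/logo.gif HTTP/1.0"); // assumed sample value

SearchSourceBuilder source = new SearchSourceBuilder()
    .query(QueryBuilders.termsQuery("request", terms))
    .trackTotalHits(true) // count every matching document, as described below
    .size(0);             // only the total hit count matters for the timing
```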
Note that I want to get the total number of hits ("track_total_hits": true) as I am interested in knowing how long it takes to match all documents. For the first case, I created the list from existing records in the index so it matches many documents. For the baseline I had to run with at least 10GB of heap, otherwise it would crash. For the new approach, memory is no longer an issue. The results look like this:
In this case, the current implementation quickly degrades and becomes unusable with hundreds of terms, while the new solution looks pretty stable. For the next case, I produced the list with random strings so nothing matches. The results look like this:
In this case the current solution is able to accelerate the queries, as they are probably rewritten to MatchNone. The proposed change still needs to scan most of the docs, so it is expected to be slower; it still looks OK for a big number of elements in the list. So all in all, I think the proposed approach removes all the edge cases of the current implementation. There will be slowdowns in some cases, but I think that is OK as those are not the main use cases for this type of field. One thing to think about is whether the TermInSetQuery makes sense, because running this test with a MatchAllDocsQuery I get better numbers, although not much better.
```java
if (tokens.isEmpty() == false) {
    // If there are tokens, we take the middle one to represent the term,
    // which is probably the most different one.
    tokenList.add(getMiddleToken(tokens));
}
```
Why the middle token? I would think that the first token is the most different?
Should we at least use several tokens (maybe first, middle, last)?
We use only one token because that is enough to make sure we match the documents we are interested in. Adding more tokens would only add unnecessary documents.
Because this is a wildcard field, I expect the entries to have some structure, like URLs, so they can be queried using * and ?.
I was thinking of URLs when deciding on the middle character. The beginning of URLs is normally the same, and in many cases the end is too, so what makes them unique is probably somewhere in the middle.
I am not sure how much this is helping; if it proves controversial we can move to a MatchAllDocsQuery and do a linear scan. That is probably still better than what we have today.
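For illustration, a minimal sketch of the middle-token heuristic (the signature and types are assumptions, not the PR's exact code):

```java
import java.util.List;
import org.apache.lucene.util.BytesRef;

static BytesRef getMiddleToken(List<BytesRef> tokens) {
    // For URL-like values the leading and trailing ngrams are often shared
    // across entries, so the middle ngram is the most likely to be distinctive.
    return tokens.get(tokens.size() / 2);
}
```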
@iverase Great ideas and great optimizations!
It looks like this PR introduces 2 optimizations for terms queries over wildcard fields.
Do you think it makes sense to break them into separate PRs? Another concern I have is about the pruning strategy and whether it makes the approximate query less useful.
The issue at the moment is that for a terms query we are generating one BinaryDvConfirmedAutomatonQuery per term, so you would need a terms query with 27k terms to hit it. If you run WildcardSearchIT on the current implementation, it will go OOM.
Sure, I just want to have it asap.
TermInSetQuery makes an OR over the provided terms, so in order to match all the documents we are interested in we only need one term each; adding more would only increase the number of documents we match unnecessarily.
If we started using a boolean MUST, we would end up with the same issue we have now, using tons of heap. So no, I don't think this is an option.
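A hedged sketch of how that approximation could be assembled with Lucene's TermInSetQuery, reusing getMiddleToken from the sketch above (the ngram subfield name field._ngram and the 3-gram tokenizer are hypothetical):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

// Assumed 3-gram tokenizer, for illustration only.
static List<BytesRef> ngrams(String term) {
    List<BytesRef> grams = new ArrayList<>();
    for (int i = 0; i + 3 <= term.length(); i++) {
        grams.add(new BytesRef(term.substring(i, i + 3)));
    }
    return grams;
}

static Query approximation(List<String> searchTerms) {
    List<BytesRef> representatives = new ArrayList<>();
    for (String term : searchTerms) {
        // One representative ngram per term keeps the OR small; more tokens
        // would only widen the set of documents matched by the approximation.
        representatives.add(getMiddleToken(ngrams(term)));
    }
    // A single OR over all representatives, instead of one automaton query per term.
    return new TermInSetQuery("field._ngram", representatives);
}
```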
@mayya-sharipova I opened #128920. This is mostly the same PR except I am not implementing the method
superseded by #128986
Currently, a term query over a wildcard field executes by creating a BinaryDvConfirmedAutomatonQuery, which contains an approximation query that might hold up to 10 term queries under a boolean SHOULD, plus an automaton. That is a pretty big query by itself. Moreover, a terms query over a wildcard field creates one of those queries per term, so we can easily build a huge query that can use many GiB of heap.

This PR addresses the issue by adding a BinaryDvConfirmedTermsQuery to execute terms queries, which uses binary search over the terms instead of an automaton. For terms queries with many terms, we can build just one of those queries: for the approximation we collect one ngram per term and run a TermInSetQuery, and each candidate is then confirmed against the sorted terms (sketched below). This approach should use just a fraction of the heap compared with the previous one and should behave better in adversarial cases.

fixes #128201
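For illustration, a hedged sketch of the confirmation step (names are assumed; BytesRef compares in unsigned byte order, so a plain binary search works):

```java
import java.util.Collections;
import java.util.List;
import org.apache.lucene.util.BytesRef;

// With the query's terms sorted once up front, each candidate document's
// binary doc value is confirmed by a binary search instead of an automaton run.
static boolean confirmed(BytesRef docValue, List<BytesRef> sortedTerms) {
    return Collections.binarySearch(sortedTerms, docValue) >= 0;
}
```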