Skip to content

Improve execution of terms queries over wildcard fields #128986

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 6, 2025

Conversation

iverase
Copy link
Contributor

@iverase iverase commented Jun 5, 2025

Currently, A term query over a wildcard field is executing creating a BinaryDvConfirmedQuery which contains an approximation query that might contain up to 10 terms queries under a boolean SHOULD plus an automaton. This is a pretty big query by itself. More over a terms query over a wildcard field creates one of those queries per term, so we can easily build a huge query that can use many GiB of heap. This PR implements TermsQuery(Collection<?> values, @Nullable SearchExecutionContext context) to avoid memory pressure.

For small number of terms (< 16), it will approximate each term as we are doing now but it will create just a BinaryDvConfirmedQuery with a boolean query containing the approximation of each term.

For bigger number of terms, we swap to a TermsInSetQuery. We only need one token per term to make sure we are matching all the required documents. Ideally this token should be the one that makes the term most unique, although that's different to compute when generating the query.

Performance test shows a big improvement when most of the terms matches some documents in the index but it behaves (slightly) worst when none of the terms matches. All in all I think it is a good improvement that prevent smemory isues on such a queries.

fixes #128201

@elasticsearchmachine
Copy link
Collaborator

Hi @iverase, I've created a changelog YAML for you.

@elasticsearchmachine elasticsearchmachine added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jun 5, 2025
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Jun 5, 2025

@iverase Before merging this PR do we want to add "terms query on a wildcard field" operation in our rally benchmarks (for example, http_logs already has a wildcard field) to measure the improve performance?

@iverase
Copy link
Contributor Author

iverase commented Jun 5, 2025

http_logs is using a match_only_text field instead of a wildcard field?

https://github.com/elastic/rally-tracks/blob/master/http_logs/index.json

@mayya-sharipova
Copy link
Contributor

mayya-sharipova commented Jun 5, 2025

@iverase For runtime schedule there is an indexed wildcard field. I wonder if we can add terms query on that.

}
}

public void testTermsQueryDuel() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How long this test will run in number of terms is 8192? Is the performance acceptable to be included as a regular test?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is very fast, the full suite takes a couple of seconds.

@iverase
Copy link
Contributor Author

iverase commented Jun 5, 2025

@iverase For runtime schedule there is an indexed wildcard field. I wonder if we can add terms query on that.

I will need to learn how to run this mode .... tracks are becoming so complex,

Copy link
Contributor

@mayya-sharipova mayya-sharipova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iverase Thanks, great optimization.

I am not sure about an approach to take middle toke for a term, but I trust your intuition on this.

  • Would be nice to have a dedicated benchmark for terms query to see if got speedups.

@iverase
Copy link
Contributor Author

iverase commented Jun 5, 2025

Would be nice to have a dedicated benchmark for terms query to see if got speedups.

In this case is easy to prove we got an improvement. If you run the test I have added with the old code, you will get an OOM. It will till be good to have benchmarks not to get regressions.

@iverase iverase added the auto-backport Automatically create backport pull requests when merged label Jun 6, 2025
@iverase iverase merged commit ba6987f into elastic:main Jun 6, 2025
18 checks passed
@iverase iverase deleted the wildCardTermsQuery branch June 6, 2025 12:18
@elasticsearchmachine
Copy link
Collaborator

💔 Backport failed

Status Branch Result
8.19 Commit could not be cherrypicked due to conflicts

You can use sqren/backport to manually backport by running backport --upstream elastic/elasticsearch --pr 128986

mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 9, 2025
This commit implements TermsQuery(Collection<?> values, @nullable SearchExecutionContext context) in the 
WildCardFieldMapper to avoid memory pressure when building the query.
valeriy42 pushed a commit to valeriy42/elasticsearch that referenced this pull request Jun 12, 2025
This commit implements TermsQuery(Collection<?> values, @nullable SearchExecutionContext context) in the 
WildCardFieldMapper to avoid memory pressure when building the query.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-backport Automatically create backport pull requests when merged >bug :Search Relevance/Search Catch all for Search Relevance Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch v8.19.0 v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Single query on the uri wildcard field caused OOM
3 participants