Skip to content

Update sparse_vector field mapping to include default setting for token pruning #126739

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

markjhoy
Copy link
Contributor

@markjhoy markjhoy commented Apr 12, 2025

Updates the SparseVectorFieldMapper type to include index options for pruning tokens and associated configuration values.

Before this update, token pruning for sparse vector types is only available via the query (see parameters for the sparse vector query ).

With this PR, by default, any new indices with a sparse_vector field type will by default have token pruning turned on.

Example:

{
  "properties": {
    "example_field": {
       "type": "sparse_vector",
        "index_options": {
          "prune": (boolean, default is `true`),
          "pruning_config": {
            "tokens_freq_ratio_threshold": (integer, range 1-100, default is 5),
            "tokens_weight_threshold": (double, range 0.0-1.0, default if 0.4)
          }
        }
     }
  }
}

@markjhoy markjhoy requested review from Mikep86 and leemthompo June 2, 2025 23:06
Copy link
Contributor

@Mikep86 Mikep86 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting closer with this, the integration tests add important test coverage. But we're still missing coverage of default (i.e. no index options) handling on old index versions and index option handling on old index versions.

Comment on lines 159 to 169
boolean shouldUseDefaultTokens = (testQueryShouldNotPrune == false && testHasIndexOptions == false);
TokenPruningConfig queryPruningConfig = overrideQueryPruningConfig ? new TokenPruningConfig(3f, 0.5f, true) : null;

SparseVectorQueryBuilder queryBuilder = new SparseVectorQueryBuilder(
SPARSE_VECTOR_FIELD,
shouldUseDefaultTokens ? SEARCH_WEIGHTED_TOKENS_WITH_DEFAULTS : SEARCH_WEIGHTED_TOKENS,
null,
null,
overrideQueryPruningConfig ? Boolean.TRUE : (testQueryShouldNotPrune ? false : null),
queryPruningConfig
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not following all the logic here, but I suspect that you can simplify some of these conditional checks. Usage of Boolean.TRUE is usually a good indicator of this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to test to have an explicit pruning config in the query (the "Boolean.TRUE"), vs, if the query should outright override prune: false, vs, just leaving it blank (null)

@markjhoy markjhoy requested a review from Mikep86 June 4, 2025 23:32
@markjhoy
Copy link
Contributor Author

markjhoy commented Jun 5, 2025

buildkite test this

@benwtrent benwtrent self-requested a review June 5, 2025 17:37
Copy link
Member

@kderusso kderusso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@markjhoy This is coming together nicely, and I think we're close. I've found a few items that we need to address, in addition to the concerns that @Mikep86 has flagged. But I think this is getting a lot closer, and your cleanup looks really good!

: (Optional, boolean) Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is true but the `pruning_config` is not specified, pruning will occur but default values will be used. Default: true.

`pruning_config` {applies_to}`stack: preview 9.1`
: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false, an exception will occur.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false, an exception will occur.
: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false but `pruning_config` is specified, an exception will occur.

@@ -24,6 +24,28 @@ PUT my-index
}
```

Also, with optional `index_options` for pruning:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add some clarification here, RE: why you might want to override token pruning?

underground: 0.053516876
is: 0.54600334

- match: { hits.total.value: 3 }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try validating score differences here as well? I know that can be tricky due to different shard counts, and since you demonstrate returned documents are different it isn't a deal breaker, but figured I'd note it here.

cats: 0.5
is: 0.04600334

- match: { hits.total.value: 0 }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test feels a little incomplete. It would be nice to get a test case that actually returns some hits, and then repeat the same test with pruning explicitly disabled, kind of like what you did above.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same feedback for this test

}

@ParametersFactory
public static Iterable<Object[]> parameters() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, but we probably could have kept this simpler by introducing more randomness in the test and trusting that if issues come up, they'll come up after running the tests repeatedly.

// if we're overriding the index pruning config in the query, always prune
// if not, and the query should _not_ prune, set prune=false,
// else, set to `null` to let the index options propagate
Boolean shouldPrune = overrideQueryPruningConfig ? Boolean.TRUE : (testQueryShouldNotPrune ? Boolean.FALSE : null);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another test coverage note, we're only ever overriding with prune: false and never a custom pruning config that's different from what is in the mappings. This should be taken care of in yaml tests and is probably fine, but noting for completeness.

@@ -124,7 +126,12 @@ public SparseVectorQueryBuilder(
public SparseVectorQueryBuilder(StreamInput in) throws IOException {
super(in);
this.fieldName = in.readString();
this.shouldPruneTokens = in.readBoolean();
if (in.getTransportVersion().isPatchFrom(SPARSE_VECTOR_FIELD_PRUNING_OPTIONS_8_19)
|| in.getTransportVersion().onOrAfter(TransportVersions.SPARSE_VECTOR_FIELD_PRUNING_OPTIONS)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Use a static import for SPARSE_VECTOR_FIELD_PRUNING_OPTIONS

? WeightedTokensUtils.queryBuilderWithPrunedTokens(fieldName, tokenPruningConfig, queryVectors, ft, context)
TokenPruningConfig pruningConfig = getTokenPruningConfigForQuery(ft, context);

return pruningConfig != null
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's a bug here. If you create an index with a pruning_config specified, and then send in a sparse_vector query with just prune: true this works as intended because it pulls the pruning configuration from your mappings. HOWEVER, if you create an index with no prune or pruning_config specified, but then send in a sparse_vector query with simply prune: true we will have no pruning configuration (we assume we use the defaults) and this will end up not pruning.

Comment on lines +355 to +357
if (context.searcher() == null) {
return null;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this means we're missing mocking in the tests, I don't know if this should be in non-test code.

@kderusso
Copy link
Member

kderusso commented Jun 5, 2025

One more note - @markjhoy when you're in a place to do so could you please apply the test-full-bwc label to this PR? Just out of an abundance of caution to make sure the full suite of BWC tests runs. Thanks!

Copy link
Member

@benwtrent benwtrent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code confuses me.

Is this adjusting the default behavior of new sparse vector fields to prune by default?

If so, we should make sure the default value for the index options reflects that. It seems weird to hide all that logic in the query.

Comment on lines +359 to +368
// if we are not on a supported index version, do not prune by default
// nor do we check the index options, so, we'll return a pruning config only if the query specifies it.
if (context.indexVersionCreated().onOrAfter(IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT) == false
&& context.indexVersionCreated()
.between(
IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X,
IndexVersions.UPGRADE_TO_LUCENE_10_0_0
) == false) {
return (shouldPruneTokens != null && shouldPruneTokens) ? tokenPruningConfig : null;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this is strictly necessary. Do we need to guard on index versions here?

Comment on lines +383 to +393
// if we're here, we should prune if set or by default
// if we don't have a pruning config, use the default
pruningConfigToUse = pruningConfigToUse == null
? new TokenPruningConfig(
TokenPruningConfig.DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD,
TokenPruningConfig.DEFAULT_TOKENS_WEIGHT_THRESHOLD,
false
)
: pruningConfigToUse;

return pruningConfigToUse;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should be in the mapper. Why aren't we defaulting specifically to true in the mapper?


private static final ConstructingObjectParser<IndexOptions, Void> INDEX_OPTIONS_PARSER = new ConstructingObjectParser<>(
SPARSE_VECTOR_INDEX_OPTIONS,
args -> new IndexOptions((Boolean) args[0], (TokenPruningConfig) args[1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know why we are using a nullable boolean here. Let's default appropriately and be clear about the configuration in the mapping as we are changing default behavior for users.

- match: { sparse_vector_pruning_test.mappings.properties.ml.properties.tokens.index_options.pruning_config.tokens_weight_threshold: 0.4 }

---
"Check sparse_vector token pruning index_options mappings defaults":
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the difference between this test and the last one. Both are testing "defaults".

I would expect the new default to be prune: true right? Why is it null?

Copy link
Member

@pquentin pquentin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have we considered adding YAML tests to help validating the modeling of this API change in the Elasticsearch specification?

@markjhoy
Copy link
Contributor Author

markjhoy commented Jun 6, 2025

closing this PR, but keeping it around as this will be replaced after refactoring is complete on this PR: #129020

@markjhoy markjhoy closed this Jun 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement :Search Foundations/Mapping Index mappings, including merging and defining field types :SearchOrg/Relevance Label for the Search (solution/org) Relevance team Team:Search - Relevance The Search organization Search Relevance team Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch Team:SearchOrg Meta label for the Search Org (Enterprise Search) v9.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants