Update sparse_vector field mapping to include default setting for token pruning #126739

markjhoy · 2025-04-12T01:05:31Z

Updates the SparseVectorFieldMapper type to include index options for pruning tokens and associated configuration values.

Before this update, token pruning for sparse vector types was only available via the query (see the parameters for the sparse vector query).

With this PR, any new index with a sparse_vector field will have token pruning turned on by default.

Example:

{
  "properties": {
    "example_field": {
      "type": "sparse_vector",
      "index_options": {
        "prune": (boolean, default is `true`),
        "pruning_config": {
          "tokens_freq_ratio_threshold": (integer, range 1-100, default is 5),
          "tokens_weight_threshold": (double, range 0.0-1.0, default is 0.4)
        }
      }
    }
  }
}
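As a rough illustration of the defaults and ranges listed in the example, the sketch below applies them in a small standalone class. `PruningConfigSketch` and its constructor are hypothetical names used only here; this is not the Elasticsearch `TokenPruningConfig` class, and the field types simply follow the description above.

```java
// Hypothetical stand-in for the pruning settings; not the Elasticsearch class.
final class PruningConfigSketch {
    // Defaults taken from the mapping example in the PR description.
    static final int DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD = 5;
    static final double DEFAULT_TOKENS_WEIGHT_THRESHOLD = 0.4;

    final int tokensFreqRatioThreshold;
    final double tokensWeightThreshold;

    // A null argument means "not specified in the mapping", so the default applies.
    PruningConfigSketch(Integer freqRatio, Double weight) {
        int f = freqRatio == null ? DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD : freqRatio;
        double w = weight == null ? DEFAULT_TOKENS_WEIGHT_THRESHOLD : weight;
        if (f < 1 || f > 100) {
            throw new IllegalArgumentException("[tokens_freq_ratio_threshold] must be between 1 and 100, got " + f);
        }
        // Written as a negated range check so that NaN is rejected as well.
        if ((w >= 0.0 && w <= 1.0) == false) {
            throw new IllegalArgumentException("[tokens_weight_threshold] must be between 0.0 and 1.0, got " + w);
        }
        this.tokensFreqRatioThreshold = f;
        this.tokensWeightThreshold = w;
    }
}
```

Calling `new PruningConfigSketch(null, null)` yields the documented defaults, while an out-of-range value is rejected up front rather than at query time.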

Mikep86

Getting closer with this; the integration tests add important coverage. But we're still missing coverage of two cases on old index versions: default handling (i.e. no index options) and index option handling.

server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java

server/src/test/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapperTests.java

...ernalClusterTest/java/org/elasticsearch/xpack/core/ml/search/SparseVectorIndexOptionsIT.java

Mikep86 · 2025-06-03T14:03:44Z

...ernalClusterTest/java/org/elasticsearch/xpack/core/ml/search/SparseVectorIndexOptionsIT.java

+        boolean shouldUseDefaultTokens = (testQueryShouldNotPrune == false && testHasIndexOptions == false);
+        TokenPruningConfig queryPruningConfig = overrideQueryPruningConfig ? new TokenPruningConfig(3f, 0.5f, true) : null;
+
+        SparseVectorQueryBuilder queryBuilder = new SparseVectorQueryBuilder(
+            SPARSE_VECTOR_FIELD,
+            shouldUseDefaultTokens ? SEARCH_WEIGHTED_TOKENS_WITH_DEFAULTS : SEARCH_WEIGHTED_TOKENS,
+            null,
+            null,
+            overrideQueryPruningConfig ? Boolean.TRUE : (testQueryShouldNotPrune ? false : null),
+            queryPruningConfig
+        );


I'm not following all the logic here, but I suspect that you can simplify some of these conditional checks. Usage of Boolean.TRUE is usually a good indicator of this.

This covers three cases: an explicit pruning config in the query (the `Boolean.TRUE`), the query outright overriding with `prune: false`, and leaving it blank (`null`) so the index options apply.
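The three cases in this exchange could be spelled out with early returns instead of a nested ternary. This is only a sketch of one possible simplification; `QueryPruneFlagSketch` and `queryPruneFlag` are hypothetical names, and the two parameters mirror the test flags in the diff above.

```java
// Hypothetical helper; mirrors the nested ternary from the diff above.
final class QueryPruneFlagSketch {
    /**
     * Returns the value to send as the query's prune flag:
     * TRUE  when the query supplies its own pruning config,
     * FALSE when the query must explicitly disable pruning,
     * null  when the query leaves it unset so the index options apply.
     */
    static Boolean queryPruneFlag(boolean overrideQueryPruningConfig, boolean testQueryShouldNotPrune) {
        if (overrideQueryPruningConfig) {
            return Boolean.TRUE;
        }
        if (testQueryShouldNotPrune) {
            return Boolean.FALSE;
        }
        return null;
    }
}
```

The behavior is the same as the original expression, including the precedence: an overriding query config wins over the "should not prune" flag.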

...core/src/test/java/org/elasticsearch/xpack/core/ml/search/SparseVectorQueryBuilderTests.java

.../src/test/java/org/elasticsearch/xpack/inference/highlight/SemanticTextHighlighterTests.java

...tests-with-security/src/test/resources/rest-api-spec/test/multi_cluster/50_sparse_vector.yml

server/src/main/java/org/elasticsearch/index/mapper/vectors/TokenPruningConfig.java

markjhoy · 2025-06-05T16:07:21Z

buildkite test this

kderusso

@markjhoy This is coming together nicely, and I think we're close. I've found a few items that we need to address, in addition to the concerns that @Mikep86 has flagged. But I think this is getting a lot closer, and your cleanup looks really good!

kderusso · 2025-06-05T13:13:48Z

docs/reference/elasticsearch/mapping-reference/sparse-vector.md

+:   (Optional, boolean) Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is true but the `pruning_config` is not specified, pruning will occur but default values will be used. Default: true.
+
+`pruning_config` {applies_to}`stack: preview 9.1`
+:   (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false, an exception will occur.


Suggested change

: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false, an exception will occur.

: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false but `pruning_config` is specified, an exception will occur.
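The rule in the suggested wording (an exception only when `prune` is false and a `pruning_config` is also given) might be checked along these lines. The class and method names are hypothetical, and the exception type and message are assumptions rather than what the mapper actually throws.

```java
// Hypothetical validation helper; not the mapper's actual code.
final class IndexOptionsValidationSketch {
    /**
     * @param prune         the parsed "prune" value, or null if it was not specified
     * @param pruningConfig the parsed "pruning_config" object, or null if it was not specified
     */
    static void validate(Boolean prune, Object pruningConfig) {
        // Only the combination "prune: false" plus an explicit pruning_config is rejected.
        if (Boolean.FALSE.equals(prune) && pruningConfig != null) {
            throw new IllegalArgumentException("[pruning_config] should only be set if [prune] is true");
        }
    }
}
```

Under this reading, `prune: false` on its own is valid, which is the distinction the suggested change makes explicit.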

kderusso · 2025-06-05T13:54:21Z

docs/reference/elasticsearch/mapping-reference/sparse-vector.md

@@ -24,6 +24,28 @@ PUT my-index
 }
 ```

+Also, with optional `index_options` for pruning:


Maybe add some clarification here, RE: why you might want to override token pruning?

kderusso · 2025-06-05T14:28:23Z

...tests-with-security/src/test/resources/rest-api-spec/test/multi_cluster/50_sparse_vector.yml

+                underground: 0.053516876
+                is: 0.54600334
+
+  - match: { hits.total.value: 3 }


Did you try validating score differences here as well? I know that can be tricky due to different shard counts, and since you demonstrate returned documents are different it isn't a deal breaker, but figured I'd note it here.

kderusso · 2025-06-05T14:30:06Z

...tests-with-security/src/test/resources/rest-api-spec/test/multi_cluster/50_sparse_vector.yml

+                cats: 0.5
+                is: 0.04600334
+
+  - match: { hits.total.value: 0 }


This test feels a little incomplete. It would be nice to get a test case that actually returns some hits, and then repeat the same test with pruning explicitly disabled, kind of like what you did above.

kderusso · 2025-06-05T14:30:34Z

...ests-with-security/src/test/resources/rest-api-spec/test/remote_cluster/50_sparse_vector.yml

Same feedback for this test

kderusso · 2025-06-05T17:18:48Z

...ernalClusterTest/java/org/elasticsearch/xpack/core/ml/search/SparseVectorIndexOptionsIT.java

+    }
+
+    @ParametersFactory
+    public static Iterable<Object[]> parameters() throws Exception {


I think this is fine, but we probably could have kept this simpler by introducing more randomness in the test and trusting that if issues come up, they'll come up after running the tests repeatedly.

kderusso · 2025-06-05T17:21:45Z

...ernalClusterTest/java/org/elasticsearch/xpack/core/ml/search/SparseVectorIndexOptionsIT.java

+        // if we're overriding the index pruning config in the query, always prune
+        // if not, and the query should _not_ prune, set prune=false,
+        // else, set to `null` to let the index options propagate
+        Boolean shouldPrune = overrideQueryPruningConfig ? Boolean.TRUE : (testQueryShouldNotPrune ? Boolean.FALSE : null);


Another test coverage note, we're only ever overriding with prune: false and never a custom pruning config that's different from what is in the mappings. This should be taken care of in yaml tests and is probably fine, but noting for completeness.

kderusso · 2025-06-05T17:24:15Z

...ugin/core/src/main/java/org/elasticsearch/xpack/core/ml/search/SparseVectorQueryBuilder.java

@@ -124,7 +126,12 @@ public SparseVectorQueryBuilder(
    public SparseVectorQueryBuilder(StreamInput in) throws IOException {
        super(in);
        this.fieldName = in.readString();
-        this.shouldPruneTokens = in.readBoolean();
+        if (in.getTransportVersion().isPatchFrom(SPARSE_VECTOR_FIELD_PRUNING_OPTIONS_8_19)
+            || in.getTransportVersion().onOrAfter(TransportVersions.SPARSE_VECTOR_FIELD_PRUNING_OPTIONS)) {


Nitpick: Use a static import for SPARSE_VECTOR_FIELD_PRUNING_OPTIONS

kderusso · 2025-06-05T17:31:50Z

...ugin/core/src/main/java/org/elasticsearch/xpack/core/ml/search/SparseVectorQueryBuilder.java

-            ? WeightedTokensUtils.queryBuilderWithPrunedTokens(fieldName, tokenPruningConfig, queryVectors, ft, context)
+        TokenPruningConfig pruningConfig = getTokenPruningConfigForQuery(ft, context);
+
+        return pruningConfig != null


I think there's a bug here. If you create an index with a pruning_config specified, and then send in a sparse_vector query with just prune: true this works as intended because it pulls the pruning configuration from your mappings. HOWEVER, if you create an index with no prune or pruning_config specified, but then send in a sparse_vector query with simply prune: true we will have no pruning configuration (we assume we use the defaults) and this will end up not pruning.
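One way to express the behavior this comment asks for is a resolution step that always falls back to the default thresholds once pruning is on. The sketch below is an assumption about the intended precedence (query settings first, then index options, then defaults), not the PR's implementation; `PruningResolutionSketch`, `Config`, and `resolve` are hypothetical names.

```java
// Hypothetical resolution logic; precedence is an assumption, not the PR's code.
final class PruningResolutionSketch {
    record Config(float tokensFreqRatioThreshold, float tokensWeightThreshold) {}

    // Default thresholds as stated in the PR description.
    static final Config DEFAULTS = new Config(5f, 0.4f);

    /** Returns the config to prune with, or null when no pruning should happen. */
    static Config resolve(Boolean queryPrune, Config queryConfig, Boolean indexPrune, Config indexConfig) {
        // The query's explicit flag wins; otherwise the index option applies,
        // and an unspecified index option is treated as prune=true.
        boolean prune = queryPrune != null ? queryPrune : (indexPrune == null || indexPrune);
        if (prune == false) {
            return null;
        }
        if (queryConfig != null) {
            return queryConfig;
        }
        if (indexConfig != null) {
            return indexConfig;
        }
        // Pruning was requested but nothing configured it: use defaults instead of skipping pruning.
        return DEFAULTS;
    }
}
```

With this shape, a query sending only `prune: true` against an index with no pruning options resolves to `DEFAULTS` rather than to `null`.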

kderusso · 2025-06-05T17:34:14Z

...ugin/core/src/main/java/org/elasticsearch/xpack/core/ml/search/SparseVectorQueryBuilder.java

+        if (context.searcher() == null) {
+            return null;
+        }


I feel like this means we're missing mocking in the tests, I don't know if this should be in non-test code.

kderusso · 2025-06-05T17:47:50Z

One more note - @markjhoy when you're in a place to do so could you please apply the test-full-bwc label to this PR? Just out of an abundance of caution to make sure the full suite of BWC tests runs. Thanks!

benwtrent

The code confuses me.

Is this adjusting the default behavior of new sparse vector fields to prune by default?

If so, we should make sure the default value for the index options reflects that. It seems weird to hide all that logic in the query.

benwtrent · 2025-06-05T17:56:01Z

...ugin/core/src/main/java/org/elasticsearch/xpack/core/ml/search/SparseVectorQueryBuilder.java

+        // if we are not on a supported index version, do not prune by default
+        // nor do we check the index options, so, we'll return a pruning config only if the query specifies it.
+        if (context.indexVersionCreated().onOrAfter(IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT) == false
+            && context.indexVersionCreated()
+                .between(
+                    IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X,
+                    IndexVersions.UPGRADE_TO_LUCENE_10_0_0
+                ) == false) {
+            return (shouldPruneTokens != null && shouldPruneTokens) ? tokenPruningConfig : null;
+        }


I am not sure this is strictly necessary. Do we need to guard on index versions here?
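For readers trying to follow the double negative in the guard above, it can be restated as a positive "is supported" predicate. The sketch uses plain integers in place of `IndexVersion`; the numeric ids are placeholders chosen only to preserve ordering, and it assumes `between` is lower-inclusive and upper-exclusive.

```java
// Hypothetical restatement of the guard; ids are placeholders, not real index versions.
final class IndexVersionGateSketch {
    static final int PRUNING_SUPPORT_BACKPORT_8_X = 100;
    static final int UPGRADE_TO_LUCENE_10_0_0 = 200;
    static final int PRUNING_SUPPORT = 300;

    /** True when an index created at this version honors pruning index options. */
    static boolean supportsPruningIndexOptions(int indexVersionCreated) {
        boolean onOrAfterSupport = indexVersionCreated >= PRUNING_SUPPORT;
        // Assumed semantics of between(lower, upper): lower inclusive, upper exclusive.
        boolean inBackportRange = indexVersionCreated >= PRUNING_SUPPORT_BACKPORT_8_X
            && indexVersionCreated < UPGRADE_TO_LUCENE_10_0_0;
        return onOrAfterSupport || inBackportRange;
    }
}
```

The original condition is the negation of this predicate: the early return runs only when the index version is outside both ranges.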

benwtrent · 2025-06-05T18:02:30Z

...ugin/core/src/main/java/org/elasticsearch/xpack/core/ml/search/SparseVectorQueryBuilder.java

+        // if we're here, we should prune if set or by default
+        // if we don't have a pruning config, use the default
+        pruningConfigToUse = pruningConfigToUse == null
+            ? new TokenPruningConfig(
+                TokenPruningConfig.DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD,
+                TokenPruningConfig.DEFAULT_TOKENS_WEIGHT_THRESHOLD,
+                false
+            )
+            : pruningConfigToUse;
+
+        return pruningConfigToUse;


This logic should be in the mapper. Why aren't we defaulting specifically to true in the mapper?

benwtrent · 2025-06-05T18:03:12Z

server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java

+
+    private static final ConstructingObjectParser<IndexOptions, Void> INDEX_OPTIONS_PARSER = new ConstructingObjectParser<>(
+        SPARSE_VECTOR_INDEX_OPTIONS,
+        args -> new IndexOptions((Boolean) args[0], (TokenPruningConfig) args[1])


I don't know why we are using a nullable boolean here. Let's default appropriately and be clear about the configuration in the mapping as we are changing default behavior for users.
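A minimal sketch of what this comment seems to suggest: accept a nullable value from the parser, but resolve it to a concrete default when the options object is built, so the rest of the code never sees `null`. `IndexOptionsSketch` is a hypothetical class, not the `IndexOptions` type from the diff.

```java
// Hypothetical options holder; resolves the default once, at construction time.
final class IndexOptionsSketch {
    // New default behavior described in the PR: prune unless told otherwise.
    static final boolean DEFAULT_PRUNE = true;

    final boolean prune;

    // parsedPrune is null when the mapping did not specify "prune".
    IndexOptionsSketch(Boolean parsedPrune) {
        this.prune = parsedPrune == null ? DEFAULT_PRUNE : parsedPrune;
    }
}
```

Keeping the field a primitive `boolean` means callers such as the query builder can read the effective setting directly instead of re-deriving the default.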

benwtrent · 2025-06-05T18:03:39Z

...tests-with-security/src/test/resources/rest-api-spec/test/multi_cluster/50_sparse_vector.yml

+  - match: { sparse_vector_pruning_test.mappings.properties.ml.properties.tokens.index_options.pruning_config.tokens_weight_threshold: 0.4 }
+
+---
+"Check sparse_vector token pruning index_options mappings defaults":


I don't understand the difference between this test and the last one. Both are testing "defaults".

I would expect the new default to be prune: true right? Why is it null?

pquentin

Have we considered adding YAML tests to help validating the modeling of this API change in the Elasticsearch specification?

markjhoy · 2025-06-06T13:10:08Z

Closing this PR but keeping it around; it will be replaced once the refactoring in #129020 is complete.

Initial checkin - needs tests

e02cd3a

elasticsearchmachine added the v9.1.0 label Apr 12, 2025

markjhoy added 2 commits April 11, 2025 21:34

Missing s in IndexVersions

e24ab76

add changelog and docs for index_options

f39b78a

github-actions bot deployed to docs-preview April 15, 2025 00:50 View deployment

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

eeebfd8

github-actions bot deployed to docs-preview April 15, 2025 00:56 View deployment

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

51aab0c

github-actions bot deployed to docs-preview April 21, 2025 13:29 View deployment

correct index version

983ddf1

github-actions bot deployed to docs-preview April 21, 2025 14:06 View deployment

update tests

9545a0c

github-actions bot deployed to docs-preview April 21, 2025 17:53 View deployment

Complete tests for SparseVectorFieldMapper

19fe72d

github-actions bot deployed to docs-preview April 22, 2025 14:22 View deployment

[CI] Auto commit changes from spotless

58f9909

github-actions bot deployed to docs-preview April 22, 2025 14:32 View deployment

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

5f8e7b9

github-actions bot deployed to docs-preview April 25, 2025 12:34 View deployment

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

d7d27ba

github-actions bot deployed to docs-preview April 25, 2025 12:47 View deployment

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

d342656

github-actions bot deployed to docs-preview April 25, 2025 18:51 View deployment

fix docs

96096ba

github-actions bot deployed to docs-preview April 25, 2025 19:01 View deployment

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

eed88c6

github-actions bot deployed to docs-preview April 25, 2025 20:11 View deployment

fix lint

6a6052a

github-actions bot deployed to docs-preview April 25, 2025 21:03 View deployment

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

f977ea8

elasticsearchmachine and others added 5 commits June 2, 2025 16:01

[CI] Auto commit changes from spotless

fe2e267

cleanups; add query pruning override random test

9953513

[CI] Auto commit changes from spotless

925173c

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

d7b064b

check for supported index version index_options

04a597e

markjhoy requested review from Mikep86 and leemthompo June 2, 2025 23:06

elasticsearchmachine and others added 2 commits June 2, 2025 23:12

[CI] Auto commit changes from spotless

bbcd309

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

8ba9aef

Mikep86 reviewed Jun 3, 2025

markjhoy and others added 5 commits June 4, 2025 09:07

cleanups and refactoring

c4c65b9

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

9b7327b

[CI] Auto commit changes from spotless

88fc1f4

clean SparseVectorQueryBuilderTests

c0704df

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

368f8be

markjhoy mentioned this pull request Jun 4, 2025

Mark Token Pruning for Sparse Vector as GA #128854

Open

markjhoy added 3 commits June 4, 2025 13:17

fix failing test

a43d2e3

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

c755cc0

clean fix integration / rest tests

7a41e0f

markjhoy requested a review from Mikep86 June 4, 2025 23:32

markjhoy added 2 commits June 5, 2025 09:14

fix yaml default pruning tests

c02a647

Merge branch 'main' into markjhoy/default_token_pruning_sparse_vector

811ca1a

benwtrent self-requested a review June 5, 2025 17:37

kderusso reviewed Jun 5, 2025

benwtrent reviewed Jun 5, 2025

pquentin reviewed Jun 6, 2025

markjhoy closed this Jun 6, 2025
