-
Notifications
You must be signed in to change notification settings - Fork 25.2k
Update sparse_vector field mapping to include default setting for token pruning #126739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update sparse_vector field mapping to include default setting for token pruning #126739
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting closer with this, the integration tests add important test coverage. But we're still missing coverage of default (i.e. no index options) handling on old index versions and index option handling on old index versions.
server/src/main/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapper.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/index/mapper/vectors/SparseVectorFieldMapperTests.java
Show resolved
Hide resolved
...ernalClusterTest/java/org/elasticsearch/xpack/core/ml/search/SparseVectorIndexOptionsIT.java
Show resolved
Hide resolved
...ernalClusterTest/java/org/elasticsearch/xpack/core/ml/search/SparseVectorIndexOptionsIT.java
Outdated
Show resolved
Hide resolved
boolean shouldUseDefaultTokens = (testQueryShouldNotPrune == false && testHasIndexOptions == false); | ||
TokenPruningConfig queryPruningConfig = overrideQueryPruningConfig ? new TokenPruningConfig(3f, 0.5f, true) : null; | ||
|
||
SparseVectorQueryBuilder queryBuilder = new SparseVectorQueryBuilder( | ||
SPARSE_VECTOR_FIELD, | ||
shouldUseDefaultTokens ? SEARCH_WEIGHTED_TOKENS_WITH_DEFAULTS : SEARCH_WEIGHTED_TOKENS, | ||
null, | ||
null, | ||
overrideQueryPruningConfig ? Boolean.TRUE : (testQueryShouldNotPrune ? false : null), | ||
queryPruningConfig | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not following all the logic here, but I suspect that you can simplify some of these conditional checks. Usage of Boolean.TRUE
is usually a good indicator of this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to test to have an explicit pruning config in the query (the "Boolean.TRUE"), vs, if the query should outright override prune: false
, vs, just leaving it blank (null
)
...core/src/test/java/org/elasticsearch/xpack/core/ml/search/SparseVectorQueryBuilderTests.java
Outdated
Show resolved
Hide resolved
.../src/test/java/org/elasticsearch/xpack/inference/highlight/SemanticTextHighlighterTests.java
Outdated
Show resolved
Hide resolved
...tests-with-security/src/test/resources/rest-api-spec/test/multi_cluster/50_sparse_vector.yml
Outdated
Show resolved
Hide resolved
...tests-with-security/src/test/resources/rest-api-spec/test/multi_cluster/50_sparse_vector.yml
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/index/mapper/vectors/TokenPruningConfig.java
Outdated
Show resolved
Hide resolved
buildkite test this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
: (Optional, boolean) Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is true but the `pruning_config` is not specified, pruning will occur but default values will be used. Default: true. | ||
|
||
`pruning_config` {applies_to}`stack: preview 9.1` | ||
: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false, an exception will occur. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false, an exception will occur. | |
: (Optional, object) Optional pruning configuration. If enabled, this will omit non-significant tokens from the query in order to improve query performance. This is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to false but `pruning_config` is specified, an exception will occur. |
@@ -24,6 +24,28 @@ PUT my-index | |||
} | |||
``` | |||
|
|||
Also, with optional `index_options` for pruning: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add some clarification here, RE: why you might want to override token pruning?
underground: 0.053516876 | ||
is: 0.54600334 | ||
|
||
- match: { hits.total.value: 3 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you try validating score differences here as well? I know that can be tricky due to different shard counts, and since you demonstrate returned documents are different it isn't a deal breaker, but figured I'd note it here.
cats: 0.5 | ||
is: 0.04600334 | ||
|
||
- match: { hits.total.value: 0 } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test feels a little incomplete. It would be nice to get a test case that actually returns some hits, and then repeat the same test with pruning explicitly disabled, kind of like what you did above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same feedback for this test
} | ||
|
||
@ParametersFactory | ||
public static Iterable<Object[]> parameters() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine, but we probably could have kept this simpler by introducing more randomness in the test and trusting that if issues come up, they'll come up after running the tests repeatedly.
// if we're overriding the index pruning config in the query, always prune | ||
// if not, and the query should _not_ prune, set prune=false, | ||
// else, set to `null` to let the index options propagate | ||
Boolean shouldPrune = overrideQueryPruningConfig ? Boolean.TRUE : (testQueryShouldNotPrune ? Boolean.FALSE : null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another test coverage note, we're only ever overriding with prune: false
and never a custom pruning config that's different from what is in the mappings. This should be taken care of in yaml tests and is probably fine, but noting for completeness.
@@ -124,7 +126,12 @@ public SparseVectorQueryBuilder( | |||
public SparseVectorQueryBuilder(StreamInput in) throws IOException { | |||
super(in); | |||
this.fieldName = in.readString(); | |||
this.shouldPruneTokens = in.readBoolean(); | |||
if (in.getTransportVersion().isPatchFrom(SPARSE_VECTOR_FIELD_PRUNING_OPTIONS_8_19) | |||
|| in.getTransportVersion().onOrAfter(TransportVersions.SPARSE_VECTOR_FIELD_PRUNING_OPTIONS)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick: Use a static import for SPARSE_VECTOR_FIELD_PRUNING_OPTIONS
? WeightedTokensUtils.queryBuilderWithPrunedTokens(fieldName, tokenPruningConfig, queryVectors, ft, context) | ||
TokenPruningConfig pruningConfig = getTokenPruningConfigForQuery(ft, context); | ||
|
||
return pruningConfig != null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there's a bug here. If you create an index with a pruning_config
specified, and then send in a sparse_vector
query with just prune: true
this works as intended because it pulls the pruning configuration from your mappings. HOWEVER, if you create an index with no prune
or pruning_config
specified, but then send in a sparse_vector
query with simply prune: true
we will have no pruning configuration (we assume we use the defaults) and this will end up not pruning.
if (context.searcher() == null) { | ||
return null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this means we're missing mocking in the tests, I don't know if this should be in non-test code.
One more note - @markjhoy when you're in a place to do so could you please apply the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code confuses me.
Is this adjusting the default behavior of new sparse vector fields to prune by default?
If so, we should make sure the default value for the index options reflects that. It seems weird to hide all that logic in the query.
// if we are not on a supported index version, do not prune by default | ||
// nor do we check the index options, so, we'll return a pruning config only if the query specifies it. | ||
if (context.indexVersionCreated().onOrAfter(IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT) == false | ||
&& context.indexVersionCreated() | ||
.between( | ||
IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT_BACKPORT_8_X, | ||
IndexVersions.UPGRADE_TO_LUCENE_10_0_0 | ||
) == false) { | ||
return (shouldPruneTokens != null && shouldPruneTokens) ? tokenPruningConfig : null; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this is strictly necessary. Do we need to guard on index versions here?
// if we're here, we should prune if set or by default | ||
// if we don't have a pruning config, use the default | ||
pruningConfigToUse = pruningConfigToUse == null | ||
? new TokenPruningConfig( | ||
TokenPruningConfig.DEFAULT_TOKENS_FREQ_RATIO_THRESHOLD, | ||
TokenPruningConfig.DEFAULT_TOKENS_WEIGHT_THRESHOLD, | ||
false | ||
) | ||
: pruningConfigToUse; | ||
|
||
return pruningConfigToUse; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic should be in the mapper. Why aren't we defaulting specifically to true
in the mapper?
|
||
private static final ConstructingObjectParser<IndexOptions, Void> INDEX_OPTIONS_PARSER = new ConstructingObjectParser<>( | ||
SPARSE_VECTOR_INDEX_OPTIONS, | ||
args -> new IndexOptions((Boolean) args[0], (TokenPruningConfig) args[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know why we are using a nullable boolean here. Let's default appropriately and be clear about the configuration in the mapping as we are changing default behavior for users.
- match: { sparse_vector_pruning_test.mappings.properties.ml.properties.tokens.index_options.pruning_config.tokens_weight_threshold: 0.4 } | ||
|
||
--- | ||
"Check sparse_vector token pruning index_options mappings defaults": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the difference between this test and the last one. Both are testing "defaults".
I would expect the new default to be prune: true
right? Why is it null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have we considered adding YAML tests to help validating the modeling of this API change in the Elasticsearch specification?
closing this PR, but keeping it around as this will be replaced after refactoring is complete on this PR: #129020 |
Updates the
SparseVectorFieldMapper
type to include index options for pruning tokens and associated configuration values.Before this update, token pruning for sparse vector types is only available via the query (see parameters for the sparse vector query ).
With this PR, by default, any new indices with a
sparse_vector
field type will by default have token pruning turned on.Example: