Update sparse_vector field mapping to include default setting for token pruning #126739
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
pr: 126739 | ||
summary: Update `sparse_vector` field mapping to include default setting for token | ||
pruning | ||
area: Mapping | ||
type: enhancement | ||
issues: [] |
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -24,6 +24,28 @@ PUT my-index | |||||
} | ||||||
``` | ||||||
|
||||||
Also, with optional `index_options` for pruning: | ||||||
Maybe add some clarification here, RE: why you might want to override token pruning? |
||||||
|
||||||
```console | ||||||
PUT my-index | ||||||
{ | ||||||
"mappings": { | ||||||
"properties": { | ||||||
"text.tokens": { | ||||||
"type": "sparse_vector", | ||||||
"index_options": { | ||||||
"prune": true, | ||||||
"pruning_config": { | ||||||
"tokens_freq_ratio_threshold": 5, | ||||||
"tokens_weight_threshold": 0.4 | ||||||
} | ||||||
} | ||||||
} | ||||||
} | ||||||
} | ||||||
} | ||||||
``` | ||||||
|
||||||
See [semantic search with ELSER](docs-content://solutions/search/semantic-search/semantic-search-elser-ingest-pipelines.md) for a complete example on adding documents to a `sparse_vector` mapped field using ELSER. | ||||||
|
||||||
## Parameters for `sparse_vector` fields [sparse-vectors-params] | ||||||
|
@@ -36,6 +58,28 @@ The following parameters are accepted by `sparse_vector` fields: | |||||
* Exclude the field from [_source](/reference/elasticsearch/rest-apis/retrieve-selected-fields.md#source-filtering). | ||||||
* Use [synthetic `_source`](/reference/elasticsearch/mapping-reference/mapping-source-field.md#synthetic-source). | ||||||
|
||||||
index_options | ||||||
: (Optional, object) You can set index options for your `sparse_vector` field to determine whether tokens should be pruned, and to configure the token pruning parameters. If pruning options are not set in your `sparse_vector` query, Elasticsearch will use the default options configured for the field, if any. | ||||||
|
||||||
Parameters for `index_options` are: | ||||||
|
||||||
`prune` {applies_to}`stack: preview 9.1` | ||||||
: (Optional, boolean) Whether to perform pruning, omitting the non-significant tokens from the query to improve query performance. If `prune` is `true` but `pruning_config` is not specified, pruning will occur using default values. Default: `true`. | ||||||
|
||||||
`pruning_config` {applies_to}`stack: preview 9.1` | ||||||
: (Optional, object) Pruning configuration. If enabled, this omits non-significant tokens from the query to improve query performance. It is only used if `prune` is set to `true`. If `prune` is set to `true` but `pruning_config` is not specified, default values will be used. If `prune` is set to `false` but `pruning_config` is specified, an exception will occur. | ||||||
|
||||||
Parameters for `pruning_config` include: | ||||||
|
||||||
`tokens_freq_ratio_threshold` {applies_to}`stack: preview 9.1` | ||||||
: (Optional, integer) Tokens whose frequency is more than `tokens_freq_ratio_threshold` times the average frequency of all tokens in the specified field are considered outliers and pruned. This value must be between 1 and 100. Default: `5`. | ||||||
|
||||||
`tokens_weight_threshold` {applies_to}`stack: preview 9.1` | ||||||
: (Optional, float) Tokens whose weight is less than `tokens_weight_threshold` are considered insignificant and pruned. This value must be between 0 and 1. Default: `0.4`. | ||||||
|
||||||
::::{note} | ||||||
The default values for `tokens_freq_ratio_threshold` and `tokens_weight_threshold` were chosen because they produced the best results in tests using ELSERv2. | ||||||
:::: | ||||||
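
To make the two thresholds concrete, here is a minimal Java sketch of the conditions exactly as the parameter descriptions word them. It is not the Elasticsearch implementation: the class and method names are hypothetical, the inputs are assumed to be precomputed, and how Elasticsearch combines the two conditions when deciding to prune a token is not covered here.

```java
// Illustrative sketch only; names are hypothetical and not part of Elasticsearch.
final class PruningThresholdSketch {

    // True when the token's frequency is more than ratioThreshold times the
    // average token frequency in the field (the "outlier" condition).
    static boolean isFrequencyOutlier(double tokenFreq, double averageFreq, int ratioThreshold) {
        return tokenFreq > ratioThreshold * averageFreq;
    }

    // True when the token's weight is below the weight threshold
    // (the "insignificant" condition).
    static boolean isLowWeight(float weight, float weightThreshold) {
        return weight < weightThreshold;
    }

    public static void main(String[] args) {
        // Documented defaults: tokens_freq_ratio_threshold = 5, tokens_weight_threshold = 0.4
        System.out.println(isFrequencyOutlier(60.0, 10.0, 5)); // 60 > 5 * 10, prints true
        System.out.println(isLowWeight(0.3f, 0.4f));           // 0.3 < 0.4, prints true
    }
}
```

With the default values, a token that appears six times as often as the average token satisfies the first condition, and a token with weight 0.3 satisfies the second.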
|
||||||
|
||||||
## Multi-value sparse vectors [index-multi-value-sparse-vectors] | ||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,6 +22,9 @@ | |
import org.apache.lucene.util.BytesRef; | ||
import org.elasticsearch.common.logging.DeprecationCategory; | ||
import org.elasticsearch.common.lucene.Lucene; | ||
import org.elasticsearch.common.xcontent.support.XContentMapValues; | ||
import org.elasticsearch.core.Nullable; | ||
import org.elasticsearch.features.NodeFeature; | ||
import org.elasticsearch.index.IndexVersion; | ||
import org.elasticsearch.index.IndexVersions; | ||
import org.elasticsearch.index.analysis.NamedAnalyzer; | ||
|
@@ -31,24 +34,35 @@ | |
import org.elasticsearch.index.mapper.FieldMapper; | ||
import org.elasticsearch.index.mapper.MappedFieldType; | ||
import org.elasticsearch.index.mapper.MapperBuilderContext; | ||
import org.elasticsearch.index.mapper.MappingParserContext; | ||
import org.elasticsearch.index.mapper.SourceLoader; | ||
import org.elasticsearch.index.mapper.SourceValueFetcher; | ||
import org.elasticsearch.index.mapper.TextSearchInfo; | ||
import org.elasticsearch.index.mapper.ValueFetcher; | ||
import org.elasticsearch.index.query.SearchExecutionContext; | ||
import org.elasticsearch.search.fetch.StoredFieldsSpec; | ||
import org.elasticsearch.search.lookup.Source; | ||
import org.elasticsearch.xcontent.ConstructingObjectParser; | ||
import org.elasticsearch.xcontent.DeprecationHandler; | ||
import org.elasticsearch.xcontent.NamedXContentRegistry; | ||
import org.elasticsearch.xcontent.ParseField; | ||
import org.elasticsearch.xcontent.ToXContent; | ||
import org.elasticsearch.xcontent.XContentBuilder; | ||
import org.elasticsearch.xcontent.XContentParser; | ||
import org.elasticsearch.xcontent.XContentParser.Token; | ||
import org.elasticsearch.xcontent.XContentType; | ||
import org.elasticsearch.xcontent.support.MapXContentParser; | ||
|
||
import java.io.IOException; | ||
import java.io.UncheckedIOException; | ||
import java.util.LinkedHashMap; | ||
import java.util.List; | ||
import java.util.Map; | ||
import java.util.Objects; | ||
import java.util.stream.Stream; | ||
|
||
import static org.elasticsearch.index.query.AbstractQueryBuilder.DEFAULT_BOOST; | ||
import static org.elasticsearch.xcontent.ConstructingObjectParser.optionalConstructorArg; | ||
|
||
/** | ||
* A {@link FieldMapper} that exposes Lucene's {@link FeatureField} as a sparse | ||
|
@@ -57,6 +71,7 @@ | |
public class SparseVectorFieldMapper extends FieldMapper { | ||
|
||
public static final String CONTENT_TYPE = "sparse_vector"; | ||
public static final String SPARSE_VECTOR_INDEX_OPTIONS = "index_options"; | ||
|
||
static final String ERROR_MESSAGE_7X = "[sparse_vector] field type in old 7.x indices is allowed to " | ||
+ "contain [sparse_vector] fields, but they cannot be indexed or searched."; | ||
|
@@ -66,13 +81,26 @@ public class SparseVectorFieldMapper extends FieldMapper { | |
static final IndexVersion NEW_SPARSE_VECTOR_INDEX_VERSION = IndexVersions.NEW_SPARSE_VECTOR; | ||
static final IndexVersion SPARSE_VECTOR_IN_FIELD_NAMES_INDEX_VERSION = IndexVersions.SPARSE_VECTOR_IN_FIELD_NAMES_SUPPORT; | ||
|
||
static final IndexVersion SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_VERSION = IndexVersions.SPARSE_VECTOR_PRUNING_INDEX_OPTIONS_SUPPORT; | ||
|
||
public static final NodeFeature SPARSE_VECTOR_INDEX_OPTIONS_FEATURE = new NodeFeature("sparse_vector.index_options_supported"); | ||
|
||
private static SparseVectorFieldMapper toType(FieldMapper in) { | ||
return (SparseVectorFieldMapper) in; | ||
} | ||
|
||
public static class Builder extends FieldMapper.Builder { | ||
private final Parameter<Boolean> stored = Parameter.storeParam(m -> toType(m).fieldType().isStored(), false); | ||
private final Parameter<Map<String, String>> meta = Parameter.metaParam(); | ||
private final Parameter<IndexOptions> indexOptions = new Parameter<>( | ||
SPARSE_VECTOR_INDEX_OPTIONS, | ||
true, | ||
() -> null, | ||
(n, c, o) -> parseIndexOptions(c, o), | ||
m -> toType(m).fieldType().indexOptions, | ||
XContentBuilder::field, | ||
Objects::toString | ||
).acceptsNull(); | ||
|
||
public Builder(String name) { | ||
super(name); | ||
|
@@ -85,19 +113,54 @@ public Builder setStored(boolean value) { | |
|
||
@Override | ||
protected Parameter<?>[] getParameters() { | ||
- return new Parameter<?>[] { stored, meta }; | ||
+ return new Parameter<?>[] { stored, meta, indexOptions }; | ||
} | ||
|
||
@Override | ||
public SparseVectorFieldMapper build(MapperBuilderContext context) { | ||
return new SparseVectorFieldMapper( | ||
leafName(), | ||
- new SparseVectorFieldType(context.buildFullName(leafName()), stored.getValue(), meta.getValue()), | ||
+ new SparseVectorFieldType(context.buildFullName(leafName()), stored.getValue(), meta.getValue(), indexOptions.getValue()), | ||
builderParams(this, context) | ||
); | ||
} | ||
} | ||
|
||
public IndexOptions getIndexOptions() { | ||
return fieldType().getIndexOptions(); | ||
} | ||
|
||
private static final ConstructingObjectParser<IndexOptions, Void> INDEX_OPTIONS_PARSER = new ConstructingObjectParser<>( | ||
SPARSE_VECTOR_INDEX_OPTIONS, | ||
args -> new IndexOptions((Boolean) args[0], (TokenPruningConfig) args[1]) | ||
I don't know why we are using a nullable boolean here. Let's default appropriately and be clear about the configuration in the mapping as we are changing default behavior for users. |
||
); | ||
|
||
static { | ||
INDEX_OPTIONS_PARSER.declareBoolean(optionalConstructorArg(), IndexOptions.PRUNE_FIELD_NAME); | ||
INDEX_OPTIONS_PARSER.declareObject(optionalConstructorArg(), TokenPruningConfig.PARSER, IndexOptions.PRUNING_CONFIG_FIELD_NAME); | ||
} | ||
|
||
private static SparseVectorFieldMapper.IndexOptions parseIndexOptions(MappingParserContext context, Object propNode) { | ||
if (propNode == null) { | ||
return null; | ||
} | ||
|
||
Map<String, Object> indexOptionsMap = XContentMapValues.nodeMapValue(propNode, SPARSE_VECTOR_INDEX_OPTIONS); | ||
|
||
XContentParser parser = new MapXContentParser( | ||
NamedXContentRegistry.EMPTY, | ||
DeprecationHandler.IGNORE_DEPRECATIONS, | ||
indexOptionsMap, | ||
XContentType.JSON | ||
); | ||
|
||
try { | ||
return INDEX_OPTIONS_PARSER.parse(parser, null); | ||
} catch (IOException e) { | ||
throw new UncheckedIOException(e); | ||
} | ||
} | ||
|
||
public static final TypeParser PARSER = new TypeParser((n, c) -> { | ||
if (c.indexVersionCreated().before(PREVIOUS_SPARSE_VECTOR_INDEX_VERSION)) { | ||
deprecationLogger.warn(DeprecationCategory.MAPPINGS, "sparse_vector", ERROR_MESSAGE_7X); | ||
|
@@ -109,9 +172,24 @@ public SparseVectorFieldMapper build(MapperBuilderContext context) { | |
}, notInMultiFields(CONTENT_TYPE)); | ||
|
||
public static final class SparseVectorFieldType extends MappedFieldType { | ||
private final IndexOptions indexOptions; | ||
|
||
public SparseVectorFieldType(String name, boolean isStored, Map<String, String> meta) { | ||
this(name, isStored, meta, null); | ||
} | ||
|
||
public SparseVectorFieldType( | ||
String name, | ||
boolean isStored, | ||
Map<String, String> meta, | ||
@Nullable SparseVectorFieldMapper.IndexOptions indexOptions | ||
) { | ||
super(name, true, isStored, false, TextSearchInfo.SIMPLE_MATCH_ONLY, meta); | ||
this.indexOptions = indexOptions; | ||
} | ||
|
||
public IndexOptions getIndexOptions() { | ||
return indexOptions; | ||
} | ||
|
||
@Override | ||
|
@@ -364,4 +442,70 @@ public void reset() { | |
} | ||
} | ||
|
||
public static class IndexOptions implements ToXContent { | ||
public static final ParseField PRUNE_FIELD_NAME = new ParseField("prune"); | ||
public static final ParseField PRUNING_CONFIG_FIELD_NAME = new ParseField("pruning_config"); | ||
|
||
final Boolean prune; | ||
final TokenPruningConfig pruningConfig; | ||
|
||
IndexOptions(@Nullable Boolean prune, @Nullable TokenPruningConfig pruningConfig) { | ||
if (pruningConfig != null && (prune == null || prune == false)) { | ||
throw new IllegalArgumentException( | ||
"[" | ||
+ SPARSE_VECTOR_INDEX_OPTIONS | ||
+ "] field [" | ||
+ PRUNING_CONFIG_FIELD_NAME.getPreferredName() | ||
+ "] should only be set if [" | ||
+ PRUNE_FIELD_NAME.getPreferredName() | ||
+ "] is set to true" | ||
); | ||
} | ||
|
||
this.prune = prune; | ||
this.pruningConfig = pruningConfig; | ||
} | ||
|
||
public Boolean getPrune() { | ||
return prune; | ||
} | ||
|
||
public TokenPruningConfig getPruningConfig() { | ||
return pruningConfig; | ||
} | ||
|
||
@Override | ||
public final boolean equals(Object other) { | ||
if (other == this) { | ||
return true; | ||
} | ||
|
||
if (other == null || getClass() != other.getClass()) { | ||
return false; | ||
} | ||
|
||
IndexOptions otherAsIndexOptions = (IndexOptions) other; | ||
return Objects.equals(prune, otherAsIndexOptions.prune) && Objects.equals(pruningConfig, otherAsIndexOptions.pruningConfig); | ||
} | ||
|
||
@Override | ||
public final int hashCode() { | ||
return Objects.hash(prune, pruningConfig); | ||
} | ||
|
||
@Override | ||
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException { | ||
builder.startObject(); | ||
Question here - if both |
||
|
||
if (prune != null) { | ||
builder.field(PRUNE_FIELD_NAME.getPreferredName(), prune); | ||
} | ||
if (pruningConfig != null) { | ||
builder.field(PRUNING_CONFIG_FIELD_NAME.getPreferredName(), pruningConfig); | ||
} | ||
|
||
builder.endObject(); | ||
return builder; | ||
} | ||
} | ||
} |
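
The validation rule in the `IndexOptions` constructor can be tried in isolation. The following is a minimal, self-contained sketch that assumes simplified stand-in types: `IndexOptionsSketch`, `Options`, `PruningConfig` and `isAccepted` are hypothetical names, not the classes from this change. It mirrors only the constructor check: a pruning config is rejected unless `prune` is explicitly `true`, so in this revision a `null` `prune` is rejected as well.

```java
// Illustrative sketch only; simplified stand-ins for IndexOptions and TokenPruningConfig.
final class IndexOptionsSketch {

    // Hypothetical stand-in holding just the two documented thresholds.
    record PruningConfig(int tokensFreqRatioThreshold, float tokensWeightThreshold) {}

    record Options(Boolean prune, PruningConfig pruningConfig) {
        Options {
            // Same condition as the constructor in the diff: a config without prune == true is invalid.
            if (pruningConfig != null && (prune == null || prune == false)) {
                throw new IllegalArgumentException(
                    "[index_options] field [pruning_config] should only be set if [prune] is set to true"
                );
            }
        }
    }

    // Returns true when the combination passes validation, false when it is rejected.
    static boolean isAccepted(Boolean prune, PruningConfig config) {
        try {
            new Options(prune, config);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        PruningConfig config = new PruningConfig(5, 0.4f);
        System.out.println(isAccepted(true, config));  // prints true
        System.out.println(isAccepted(false, config)); // prints false
        System.out.println(isAccepted(null, config));  // prints false
        System.out.println(isAccepted(null, null));    // prints true
    }
}
```

The third case is the one the review comment about the nullable boolean touches on: with `prune` left unset, supplying a `pruning_config` fails even though the documentation lists `true` as the default.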
Just a reminder, that we'll have to open a PR for 8.19 to update the appropriate asciidoc files as well 👍