|
| 1 | +--- |
| 2 | +id: inverted.md |
| 3 | +title: "INVERTED" |
| 4 | +summary: "The INVERTED index in Milvus is designed to accelerate filter queries on both scalar fields and structured JSON fields. By mapping terms to the documents or records that contain them, inverted indexes greatly improve query performance compared to brute-force searches." |
| 5 | +--- |
| 6 | + |
| 7 | +# INVERTED |
| 8 | + |
| 9 | +The `INVERTED` index in Milvus is designed to accelerate filter queries on both scalar fields and structured JSON fields. By mapping terms to the documents or records that contain them, inverted indexes greatly improve query performance compared to brute-force searches. |
| 10 | + |
| 11 | +## Overview |
| 12 | + |
| 13 | +Powered by [Tantivy](https://github.com/quickwit-oss/tantivy), Milvus implements inverted indexing to accelerate filter queries, especially for textual data. Here’s how it works: |
| 14 | + |
| 15 | +1. **Tokenize the Data**: Milvus takes your raw data—in this example, two sentences: |
| 16 | + |
| 17 | + - **"Milvus is a cloud-native vector database."** |
| 18 | + |
| 19 | + - **"Milvus is very good at performance."** |
| 20 | + |
| 21 | + and breaks them into unique words (e.g., *Milvus*, *is*, *cloud-native*, *vector*, *database*, *very*, *good*, *at*, *performance*). |
| 22 | + |
| 23 | +1. **Build the Term Dictionary**: These unique words are stored in a sorted list called the **Term Dictionary**. This dictionary lets Milvus quickly check if a word exists and locate its position in the index. |
| 24 | + |
| 25 | +1. **Create the Inverted List**: For each word in the Term Dictionary, Milvus keeps an **Inverted List** showing which documents contain that word. For instance, **"Milvus"** appears in both sentences, so its inverted list points to both document IDs. |
| 26 | + |
| 27 | + |
| 28 | + |
| 29 | +Because the dictionary is sorted, term-based filtering can be handled efficiently. Instead of scanning all documents, Milvus just looks up the term in the dictionary and retrieves its inverted list—significantly speeding up searches and filters on large datasets. |
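
The three steps above can be sketched in plain Python. This is a toy model for illustration only, not how Milvus or Tantivy implement the index: the whitespace tokenizer, the `docs` mapping, and the `lookup` helper are all assumptions introduced here.

```python
from bisect import bisect_left

# Two example documents, keyed by a hypothetical document ID.
docs = {
    0: "Milvus is a cloud-native vector database.",
    1: "Milvus is very good at performance.",
}


def tokenize(text):
    # Naive whitespace tokenizer (an assumption); strips trailing periods.
    return [word.strip(".") for word in text.split()]


# Inverted lists: term -> sorted list of IDs of the documents containing it.
inverted_lists = {}
for doc_id, text in docs.items():
    for term in set(tokenize(text)):
        inverted_lists.setdefault(term, []).append(doc_id)
for ids in inverted_lists.values():
    ids.sort()

# Term dictionary: the unique terms, kept in sorted order.
term_dictionary = sorted(inverted_lists)


def lookup(term):
    # Binary-search the sorted dictionary, then return the term's inverted list.
    pos = bisect_left(term_dictionary, term)
    if pos < len(term_dictionary) and term_dictionary[pos] == term:
        return inverted_lists[term]
    return []


print(lookup("Milvus"))   # [0, 1]
print(lookup("vector"))   # [0]
print(lookup("missing"))  # []
```

The lookup touches only the dictionary and one inverted list, so its cost does not grow with the number of documents that lack the term.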
| 30 | + |
| 31 | +## Index a regular scalar field |
| 32 | + |
| 33 | +For scalar fields like **BOOL**, **INT8**, **INT16**, **INT32**, **INT64**, **FLOAT**, **DOUBLE**, **VARCHAR**, and **ARRAY**, creating an inverted index is straightforward. Use the `create_index()` method with the `index_type` parameter set to `"INVERTED"`. |
| 34 | + |
| 35 | +```python |
| 36 | +from pymilvus import MilvusClient |
| 37 | + |
| 38 | +client = MilvusClient( |
| 39 | +    uri="http://localhost:19530", |
| 40 | +) |
| 41 | + |
| 42 | +index_params = client.prepare_index_params()  # Prepare an empty IndexParams object; index settings are added below |
| 43 | +index_params.add_index( |
| 44 | +    field_name="scalar_field_1",  # Name of the scalar field to be indexed |
| 45 | +    index_type="INVERTED",  # Type of index to be created |
| 46 | +    index_name="inverted_index"  # Name of the index to be created |
| 47 | +) |
| 48 | + |
| 49 | +client.create_index( |
| 50 | +    collection_name="my_collection",  # Specify the collection name |
| 51 | +    index_params=index_params |
| 52 | +) |
| 53 | +``` |
| 54 | + |
| 55 | +## Index a JSON field |
| 56 | + |
| 57 | +Milvus extends its indexing capabilities to JSON fields, allowing you to efficiently filter on nested or structured data stored within a single column. Unlike scalar fields, when indexing a JSON field you must provide two additional parameters: |
| 58 | + |
| 59 | +- **`json_path`**: Specifies the nested key to index. |
| 60 | + |
| 61 | +- **`json_cast_type`**: Defines the data type (e.g., `"varchar"`, `"double"`, or `"bool"`) to which the extracted JSON value will be cast. |
| 62 | + |
| 63 | +For example, consider a JSON field named `metadata` with the following structure: |
| 64 | + |
| 65 | +```json |
| 66 | +{ |
| 67 | +  "metadata": { |
| 68 | +    "product_info": { |
| 69 | +      "category": "electronics", |
| 70 | +      "brand": "BrandA" |
| 71 | +    }, |
| 72 | +    "price": 99.99, |
| 73 | +    "in_stock": true, |
| 74 | +    "tags": ["summer_sale", "clearance"] |
| 75 | +  } |
| 76 | +} |
| 77 | +``` |
| 78 | + |
| 79 | +To create inverted indexes on specific JSON paths, you can use the following approach: |
| 80 | + |
| 81 | +```python |
| 82 | +index_params = client.prepare_index_params() |
| 83 | + |
| 84 | +# Example 1: Index the 'category' key inside 'product_info' as a string. |
| 85 | +index_params.add_index( |
| 86 | +    field_name="metadata",  # JSON field name |
| 87 | +    index_type="INVERTED",  # Specify the inverted index type |
| 88 | +    index_name="json_index_1",  # Custom name for this JSON index |
| 89 | +    params={ |
| 90 | +        "json_path": "metadata[\"product_info\"][\"category\"]",  # Path to the 'category' key |
| 91 | +        "json_cast_type": "varchar"  # Cast the value as a string |
| 92 | +    } |
| 93 | +) |
| 94 | + |
| 95 | +# Example 2: Index the 'price' key as a numeric type (double). |
| 96 | +index_params.add_index( |
| 97 | +    field_name="metadata",  # JSON field name |
| 98 | +    index_type="INVERTED",  # Specify the inverted index type |
| 99 | +    index_name="json_index_2",  # Custom name for this JSON index |
| 100 | +    params={ |
| 101 | +        "json_path": "metadata[\"price\"]",  # Path to the 'price' key |
| 102 | +        "json_cast_type": "double"  # Cast the value as a double |
| 103 | +    } |
| 104 | +) |
| 105 | + |
| 106 | +``` |
| 107 | + |
| 108 | +<table> |
| 109 | + <tr> |
| 110 | + <th><p>Parameter</p></th> |
| 111 | + <th><p>Description</p></th> |
| 112 | + <th><p>Example Value</p></th> |
| 113 | + </tr> |
| 114 | + <tr> |
| 115 | + <td><p><code>field_name</code></p></td> |
| 116 | + <td><p>Name of the JSON field in your schema.</p></td> |
| 117 | + <td><p><code>"metadata"</code></p></td> |
| 118 | + </tr> |
| 119 | + <tr> |
| 120 | + <td><p><code>index_type</code></p></td> |
| 121 | + <td><p>Index type to create; currently only <code>INVERTED</code> is supported for JSON path indexing.</p></td> |
| 122 | + <td><p><code>"INVERTED"</code></p></td> |
| 123 | + </tr> |
| 124 | + <tr> |
| 125 | + <td><p><code>index_name</code></p></td> |
| 126 | + <td><p>(Optional) A custom index name. Specify different names if you create multiple indexes on the same JSON field.</p></td> |
| 127 | + <td><p><code>"json_index_1"</code></p></td> |
| 128 | + </tr> |
| 129 | + <tr> |
| 130 | + <td><p><code>params.json_path</code></p></td> |
| 131 | + <td><p>Specifies which JSON path to index. You can target nested keys, array positions, or both (e.g., <code>metadata["product_info"]["category"]</code> or <code>metadata["tags"][0]</code>). |
| 132 | + If the path is missing or the array element does not exist for a particular row, that row is simply skipped during indexing, and no error is thrown.</p></td> |
| 133 | + <td><p><code>"metadata[\"product_info\"][\"category\"]"</code></p></td> |
| 134 | + </tr> |
| 135 | + <tr> |
| 136 | + <td><p><code>params.json_cast_type</code></p></td> |
| 137 | + <td><p>Data type that Milvus will cast the extracted JSON values to when building the index. Valid values:</p> |
| 138 | +<ul> |
| 139 | +<li><p><code>"bool"</code> or <code>"BOOL"</code></p></li> |
| 140 | +<li><p><code>"double"</code> or <code>"DOUBLE"</code></p></li> |
| 141 | +<li><p><code>"varchar"</code> or <code>"VARCHAR"</code></p> |
| 142 | +<p><strong>Note</strong>: For integer values, Milvus internally uses double for the index. Large integers above 2^53 lose precision. If the cast fails (due to type mismatch), no error is thrown, and that row’s value is not indexed.</p></li> |
| 143 | +</ul></td> |
| 144 | + <td><p><code>"varchar"</code></p></td> |
| 145 | + </tr> |
| 146 | +</table> |
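
The skip-on-failure behavior described for `json_path` and `json_cast_type` can be modeled with a short, pure-Python sketch. It is a toy under stated assumptions, not Milvus code: the helpers `extract`, `cast`, and `build_index` are hypothetical, and the strict type check in `cast` is one plausible reading of "cast fails due to type mismatch".

```python
def extract(row, path):
    # Follow a list of keys or array positions; return None if any step is missing.
    value = row
    for key in path:
        try:
            value = value[key]
        except (KeyError, IndexError, TypeError):
            return None
    return value


def cast(value, cast_type):
    # Return the cast value, or None when the JSON type does not match (assumed rule).
    if cast_type == "varchar":
        return value if isinstance(value, str) else None
    if cast_type == "bool":
        return value if isinstance(value, bool) else None
    if cast_type == "double":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
        return None
    raise ValueError(f"unsupported cast type: {cast_type}")


def build_index(rows, path, cast_type):
    # Map each successfully cast value to the IDs of the rows that hold it.
    index = {}
    for row_id, row in enumerate(rows):
        value = cast(extract(row, path), cast_type)
        if value is None:
            continue  # missing path or failed cast: the row is skipped, no error
        index.setdefault(value, []).append(row_id)
    return index


rows = [
    {"price": 99.99, "product_info": {"category": "electronics"}},
    {"price": "n/a", "product_info": {"category": "books"}},  # price is a string
    {"product_info": {}},                                      # both paths missing
]

price_index = build_index(rows, ["price"], "double")
category_index = build_index(rows, ["product_info", "category"], "varchar")
print(price_index)     # {99.99: [0]}
print(category_index)  # {'electronics': [0], 'books': [1]}
```

Rows 1 and 2 are absent from `price_index` without any error being raised, which is why inconsistent source data can silently reduce index coverage.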
| 147 | + |
| 148 | +## Considerations on JSON indexing |
| 149 | + |
| 150 | +- **Filtering logic**: |
| 151 | + |
| 152 | + - If you **create a double-type index** (`json_cast_type="double"`), only numeric-type filter conditions can use the index. If the filter compares a double index to a non-numeric condition, Milvus falls back to brute force search. |
| 153 | + |
| 154 | + - If you **create a varchar-type index** (`json_cast_type="varchar"`), only string-type filter conditions can use the index. Otherwise, Milvus falls back to brute force. |
| 155 | + |
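| 156 | + - If you **create a bool-type index** (`json_cast_type="bool"`), the same rule applies: only boolean filter conditions can use the index. Otherwise, Milvus falls back to brute force. |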
| 157 | + |
| 158 | +- **Term expressions**: |
| 159 | + |
| 160 | + - You can use `json["field"] in [value1, value2, …]`. However, the index works only for scalar values stored under that path. If `json["field"]` is an array, the query falls back to brute force (array-type indexing is not yet supported). |
| 161 | + |
| 162 | +- **Numeric precision**: |
| 163 | + |
| 164 | + - Internally, Milvus indexes all numeric fields as doubles. If a numeric value exceeds $2^{53}$, it loses precision, and queries on those out-of-range values may not match exactly. |
| 165 | + |
| 166 | +- **Data integrity**: |
| 167 | + |
| 168 | + - Milvus does not parse or transform JSON keys beyond your specified casting. If the source data is inconsistent (for example, some rows store a string for key `"k"` while others store a number), some rows will not be indexed. |
| 169 | + |
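
The numeric-precision limit noted above is a property of IEEE 754 doubles and can be checked directly in Python, whose `float` is a double. The helper `survives_double_cast` is a hypothetical name used only for this sketch; it shows the representation limit and makes no claim about how Milvus handles such values internally.

```python
LIMIT = 2 ** 53


def survives_double_cast(n):
    # True if the integer is unchanged after a round trip through a double.
    return int(float(n)) == n


print(survives_double_cast(LIMIT))        # True
print(survives_double_cast(LIMIT + 1))    # False
print(float(LIMIT + 1) == float(LIMIT))   # True: two distinct integers, one double
```

Because `LIMIT` and `LIMIT + 1` map to the same double, an equality filter on one of them cannot be distinguished from a filter on the other once both are stored as doubles.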