
Commit c5aa10c: "address comments"
1 parent d445ee3

3 files changed: +50 -33 lines

docs/concepts/mlops/opensearch.md (+1 -1)

@@ -1,5 +1,5 @@
 Hopsworks includes OpenSearch as a multi-tenant service in projects.
 OpenSearch provides vector database capabilities through its k-NN plugin, which supports the FAISS and nmslib embedding indexes.
-Through Hopsworks, OpenSearch also provides enterprise capabilities, including authentication and access control to indexes (an index can be private to a Hopsworks project), filtering, scalability, high availability, and disaster recovery support.
+Through Hopsworks, OpenSearch also provides enterprise capabilities, including authentication and access control to indexes (an index can be private to a Hopsworks project), filtering, scalability, high availability, and disaster recovery support. To learn how OpenSearch powers vector similarity search in Hopsworks, see [this guide](../../user_guides/fs/vector_similarity_search.md).

 <img src="../../../assets/images/concepts/mlops/opensearch-knn.svg">

docs/user_guides/fs/index.md (+1)

@@ -5,5 +5,6 @@ This section serves to provide guides and examples for the common usage of abstr
 - [Storage Connectors](storage_connector/index.md)
 - [Feature Groups](feature_group/index.md)
 - [Feature Views](feature_view/index.md)
+- [Vector Similarity Search](vector_similarity_search.md)
 - [Compute Engines](compute_engines.md)
 - [Integrations](../integrations/index.md)
docs/user_guides/fs/vector_similarity_search.md (+48 -32)

@@ -1,48 +1,50 @@
 ---
-description: Users guide about how to use Hopsworks for vector similarity search
+description: User guide for how to use vector similarity search in Hopsworks
 ---

 # Introduction
-Vector similarity search is a robust technique enabling the retrieval of similar items based on their embeddings or representations. Its applications range across various domains, from recommendation systems to image similarity and beyond. In Hopsworks, this is facilitated through a vector database, such as Opensearch, which efficiently stores and retrieves relevant embeddings. In this guide, we'll walk you through the process of using Hopsworks for vector similarity search step by step.
+Vector similarity search is a technique for retrieving similar items based on their vector embeddings or representations. Its applications range across various domains, from recommendation systems to image similarity and beyond. In Hopsworks, vector similarity search is enabled by extending an online feature group with approximate nearest neighbor search capabilities through a vector database, such as OpenSearch. This guide provides a detailed walkthrough of how to use Hopsworks for vector similarity search.

-# Ingesting Data into the Vector Database
-Hopsworks provides a user-friendly API for writing data to both online and offline feature stores. The example below illustrates the straightforward process of ingesting data into both the vector database and the offline feature store using a single insert method.
-Currently, Hopsworks supports Opensearch as a vector database.
+# Extending Feature Groups with Similarity Search
+In Hopsworks, each vector embedding in a feature group is stored in an index within the backing vector database. By default, vector embeddings are stored in the default index of the project (created for every project in Hopsworks), but you can optionally create a new index for a feature group. Creating a separate index per feature group is particularly useful for large volumes of data, and it ensures that when the feature group is deleted, its associated index is also removed. For feature groups that use the default project index, the index is only removed when the project is deleted, not when the feature group is deleted. If a feature group defines more than one vector embedding, the index stores all of them.

-First, define the index and embedding features in the vector database. The project index will be used if no index name is provided.
+In the following example, we explicitly define an index for the feature group:

 ```aidl
 from hsfs import embedding

-# Define the index
-emb = embedding.EmbeddingIndex(index_name=None)
+# Optionally specify the index in the vector database
+emb = embedding.EmbeddingIndex(index_name="news_fg")
 ```
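If you prefer the project's default index instead (as described above), a minimal sketch, assuming the same `hsfs.embedding` module and the dataframe `df` used in the next step:

```python
from hsfs import embedding

# Omit index_name to use the project's default index; all vector embeddings
# defined for this feature group are then stored in the project-wide index.
emb_default = embedding.EmbeddingIndex()
emb_default.add_embedding("embedding_heading", len(df["embedding_heading"][0]))
```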

-Then, add one or more embedding features to the index. Name and dimension of the embedding features are required for identifying which features should be indexed for k-nearest neighbor (KNN) search. Optionally, you can specify the [similarity function](https://github.com/logicalclocks/feature-store-api/blob/master/python/hsfs/embedding.py#L101).
+Then, add one or more embedding features to the index. The name and dimension of the embedding features are required to identify which features should be indexed for k-nearest neighbor (KNN) search. In this example, we get the dimension of the embedding by taking the length of the value of the `embedding_heading` column in the first row of the dataframe `df`. Optionally, you can specify the [similarity function](TODO: add link).
 ```aidl
-# Define the embedding feature
+# Add embedding feature to the index
 emb.add_embedding("embedding_heading", len(df["embedding_heading"][0]))
 ```
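If a feature group defines more than one vector embedding (as noted earlier), each one is added to the same index. A minimal sketch, where the `embedding_body` column is hypothetical:

```python
# A second embedding column in the same feature group is indexed alongside the first.
emb.add_embedding("embedding_body", len(df["embedding_body"][0]))
```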

-Next, create a feature group with the `embedding_index` and ingest data to the feature group. When the `embedding_index` is provided, the vector database is used as online feature store. That is, all the features in the feature group are stored **exclusively** in the vector database. The advantage of storing all features in the vector database is that it enables similarity search, and filtering for all feature values.
+Next, you create a feature group with the `embedding_index` and ingest data into the feature group. When the `embedding_index` is provided, the vector database is used as the online feature store. That is, all the features in the feature group are stored **exclusively** in the vector database. The advantage of storing all features in the vector database is that it enables similarity search and filtering on all feature values.

 ```aidl
+# Create a feature group with the embedding index
 news_fg = fs.get_or_create_feature_group(
     name=f"news_fg",
-    embedding_index=emb,  # Specify the embedding index
-    primary_key=["id1"],
+    embedding_index=emb,  # Provide the embedding index created above
+    primary_key=["news_id"],
     version=version,
-    online_enabled=True,
-    topic_name=f"news_fg_{version}_onlinefs"
+    online_enabled=True
 )

-# Ingest data into both the vector database and the offline feature store
-news_fg.insert(df, write_options={"start_offline_backfill": True})
+# Write a DataFrame to the feature group, including the offline store and the ANN index (in the vector database)
+news_fg.insert(df)
 ```

-# Querying Similar Embeddings
-The read API is designed for ease of use, enabling developers to seamlessly integrate similarity search into their applications. To retrieve features from the vector database, you only need to provide the target embedding as a search query using [`find_neighbors`](https://github.com/logicalclocks/feature-store-api/blob/master/python/hsfs/feature_group.py#L2141). It is also possible to filter features saved in the vector database.
+# Similarity Search for Feature Groups using Vector Embeddings
+You provide a vector embedding as a parameter to the search query using [`find_neighbors`](https://docs.hopsworks.ai/feature-store-api/{{{ hopsworks_version }}}/generated/api/feature_group_api/#find_neighbors), and it returns the rows in the online feature group that have vector embedding values most similar to the provided vector embedding.

+It is also possible to filter rows by specifying a filter on any of the features in the feature group. The filter is pushed down to the vector database to improve query performance.
+
+In the first code snippet below, `find_neighbors` returns the 3 rows in `news_fg` that have the closest `news_description` values to the provided `news_description`. In the second code snippet below, we only return news articles with a `newstype` of `sports`.
 ```aidl
 # Search neighbor embedding with k=3
 news_fg.find_neighbors(model.encode(news_description), k=3)
@@ -51,46 +53,60 @@ news_fg.find_neighbors(model.encode(news_description), k=3)
 news_fg.find_neighbors(model.encode(news_description), k=3, filter=news_fg.newstype == "sports")
 ```

-To retrieve features at a specific time in the past from the offline database for analysis, you can utilize the offline read API to perform time travel.
-
+To analyze feature values at specific points in time, you can utilize time travel functionality:
 ```aidl
 # Time travel and read from the offline feature store
 news_fg.as_of(time_in_past).read()
 ```

-## Second Phase Reranking
+# Querying Similar Embeddings with Additional Features

-In some ML applications, second phase reranking of the top k items fetched by first phase filtering is common where extra features are required from other sources after fetching the k nearest items. In practice, it means that an extra step is needed to fetch the features from other feature groups in the online feature store. Hopsworks provides yet another simple read API for this purpose. Users can create a feature view by joining multiple feature groups and fetch all the required features by calling fv.find_neighbors. In the example below, view_cnt from another feature group is also returned to the result.
+You can also use similarity search for vector embedding features in feature views.
+In the code snippet below, we create a feature view by selecting features from the earlier `news_fg` and a new feature group `view_fg`. If you include a feature group with vector embedding features in a feature view, *regardless* of whether the vector embedding features are selected, you can call `find_neighbors` on the feature view, and it will return rows containing all the feature values in the feature view. In the example below, a list of `heading` and `view_cnt` values is returned for the news articles that are closest to the provided `news_description`.

 ```aidl
 view_fg = fs.get_or_create_feature_group(
     name="view_fg",
-    primary_key=["id1"],
+    primary_key=["news_id"],
     version=version,
-    online_enabled=True,
-    topic_name=f"view_fg_{version}_onlinefs"
+    online_enabled=True
 )

 fv = fs.get_or_create_feature_view(
-    "news_cnt", version=version,
-    query=news_fg.select(["date", "heading", "newstype"]).join(view_fg.select(["view_cnt"])))
+    "news_view", version=version,
+    query=news_fg.select(["heading"]).join(view_fg.select(["view_cnt"]))
+)

 fv.find_neighbors(model.encode(news_description), k=5)
 ```

+Note that you can use similarity search from the feature view only if the feature group you are querying with `find_neighbors` has all the primary keys of the other feature groups. In the example above, you are querying against the feature group `news_fg`, which has the vector embedding features, and it also has the feature `news_id`, which is the primary key of the feature group `view_fg`. But if `page_fg` is used as illustrated below, `find_neighbors` will fail to return any features because the primary key `page_id` does not exist in `news_fg`.
+
+<p align="center">
+<figure>
+<img src="../../../assets/images/guides/similarity_search/find_neighbors.png" alt="find neighbors">
+<figcaption>Cases where find_neighbors does not work</figcaption>
+</figure>
+</p>
+
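A minimal sketch of the failing case described above; the feature group `page_fg`, its feature `page_view_cnt`, and the feature view name are hypothetical:

```python
# page_fg's primary key ("page_id") is not a feature of news_fg, so find_neighbors
# on this feature view cannot resolve the join and fails to return any features.
page_fv = fs.get_or_create_feature_view(
    "news_page_view", version=version,
    query=news_fg.select(["heading"]).join(page_fg.select(["page_view_cnt"]))
)
page_fv.find_neighbors(model.encode(news_description), k=5)
```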
 It is also possible to get back the feature vector by providing the primary keys, but this is not recommended, as explained in the next section. The client fetches the feature vector from the vector store for `news_fg` and from the online store for `view_fg`.
 ```aidl
-fv.get_feature_vectors({"id1": 1})
+fv.get_feature_vector({"news_id": 1})
 ```

 # Best Practices
-1. Choose the appropriate online feature stores
+1. Choose the Appropriate Online Feature Store

 There are 2 types of online feature stores in Hopsworks: the online store (RonDB) and the vector store (OpenSearch). The online store is designed for retrieving feature vectors efficiently with low latency. The vector store is designed for finding similar embeddings efficiently. If similarity search is not required, using the online store is recommended for low-latency retrieval of feature values, including embeddings.
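To illustrate the recommendation, a minimal sketch of a feature group that does not need similarity search and is therefore served from the online store; the feature group, feature view, and dataframe names are hypothetical:

```python
# No embedding_index is attached, so the features are stored in the online store
# (RonDB), which is optimized for low-latency primary-key lookups.
profile_fg = fs.get_or_create_feature_group(
    name="user_profile_fg",
    primary_key=["user_id"],
    version=version,
    online_enabled=True,
)
profile_fg.insert(profile_df)

profile_fv = fs.get_or_create_feature_view(
    "user_profile_view", version=version,
    query=profile_fg.select_all()
)
# Low-latency retrieval by primary key, served from RonDB.
profile_fv.get_feature_vector({"user_id": 1})
```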

-2. Choose the features to store in vector store
+# Performance Considerations for Feature Groups with Embeddings
+1. Choose the Features to Store in the Vector Store

 While it is possible to update feature values in the vector store, updating feature values in the online store is more efficient. If you have features that are updated frequently and are not required for filtering, consider storing them separately in a different feature group. As shown in the previous example, `view_cnt` is updated frequently and stored separately; you can then get all the required features by using a feature view, as sketched below.

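A minimal sketch of this pattern, reusing `view_fg` and the feature view `fv` from the example above (the dataframe `view_counts_df`, with columns `news_id` and `view_cnt`, is hypothetical):

```python
# view_cnt changes frequently, so it lives in view_fg (served from the online store)
# rather than in news_fg (served from the vector store). Inserting rows with
# existing primary keys updates the values in the online store.
view_fg.insert(view_counts_df)

# Similarity search on the feature view returns the fresh view_cnt values,
# joined from the online store, together with the features from news_fg.
fv.find_neighbors(model.encode(news_description), k=5)
```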
+2. Use a New Index per Feature Group
+
+Create a new index per feature group to optimize retrieval performance.

 # Next step
-Explore the [notebook example](https://github.com/kennethmhc/news-search-knn-demo/blob/main/news-search-knn-demo.ipynb), demonstrating how to use Hopsworks for implementing a news search application. You can search for news using natural language in the application, powered by the Hopsworks vector database.
+Explore the [news search example](https://github.com/logicalclocks/hopsworks-tutorials/blob/master/api_examples/hsfs/knn_search/news-search-knn.ipynb), which demonstrates how to use Hopsworks to implement a news search application where you can search for news using natural language. Additionally, you can see querying similar embeddings with additional features applied in this [news ranking example](https://github.com/logicalclocks/hopsworks-tutorials/blob/master/api_examples/hsfs/knn_search/news-search-rank-view.ipynb).
