| 1 | +--- |
| 2 | +description: User guide on how to use Hopsworks for vector similarity search |
| 3 | +--- |
| 4 | + |
| 5 | +# Introduction |
| 6 | +Vector similarity search retrieves items that are similar to a query based on their embeddings (vector representations). Its applications range from recommendation systems to image similarity search. In Hopsworks, this is facilitated through a vector database, such as OpenSearch, which stores embeddings and efficiently retrieves the most similar ones. This guide walks you through using Hopsworks for vector similarity search step by step. |
| 7 | + |
| 8 | +# Ingesting Data into the Vector Database |
| 9 | +Hopsworks provides a user-friendly API for writing data to both online and offline feature stores. The example below illustrates the straightforward process of ingesting data into both the vector database and the offline feature store using a single insert method. |
| 10 | +Currently, Hopsworks supports OpenSearch as a vector database. |
| 11 | + |
| 12 | +First, define the index and embedding features in the vector database. The project index will be used if no index name is provided. |
| 13 | + |
| 14 | +```python |
| 15 | +from hsfs import embedding |
| 16 | + |
| 17 | +# Define the index; with index_name=None the project index is used |
| 18 | +emb = embedding.EmbeddingIndex(index_name=None) |
| 19 | +``` |
| 20 | + |
| 21 | +Then, add one or more embedding features to the index. The name and dimension of each embedding feature are required to identify which features should be indexed for k-nearest neighbor (KNN) search. Optionally, you can specify the [similarity function](https://github.com/logicalclocks/feature-store-api/blob/master/python/hsfs/embedding.py#L101). |
| 22 | +```python |
| 23 | +# Define the embedding feature |
| 24 | +emb.add_embedding("embedding_heading", len(df["embedding_heading"][0])) |
| 25 | +``` |
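
To make the role of the similarity function concrete, here is a minimal pure-Python sketch of three measures commonly used for KNN search in general: Euclidean (L2) distance, dot product, and cosine similarity. The function names are illustrative and are not part of the `hsfs` API; the linked source file defines which options Hopsworks actually accepts.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dot_product(a, b):
    """Dot product: larger means more similar; sensitive to vector length."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: ignores vector length."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # points the same way as query, but twice as long
orthogonal = [0.0, 1.0]      # at a right angle to query

print(l2_distance(query, same_direction))
print(cosine_similarity(query, same_direction))
print(cosine_similarity(query, orthogonal))
```

The choice matters because the measures can rank the same candidates differently: cosine similarity treats `same_direction` as a perfect match, while L2 distance penalizes its different length.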
| 26 | + |
| 27 | +Next, create a feature group with the `embedding_index` and ingest data into the feature group. When the `embedding_index` is provided, the vector database is used as the online feature store. That is, all the features in the feature group are stored **exclusively** in the vector database. The advantage of storing all features in the vector database is that it enables both similarity search and filtering on all feature values. |
| 28 | + |
| 29 | +```python |
| 30 | +news_fg = fs.get_or_create_feature_group( |
| 31 | +    name="news_fg", |
| 32 | +    embedding_index=emb, # Specify the embedding index |
| 33 | +    primary_key=["id1"], |
| 34 | +    version=version, |
| 35 | +    online_enabled=True, |
| 36 | +    topic_name=f"news_fg_{version}_onlinefs" |
| 37 | +) |
| 38 | + |
| 39 | +# Ingest data into both the vector database and the offline feature store |
| 40 | +news_fg.insert(df, write_options={"start_offline_backfill": True}) |
| 41 | +``` |
| 42 | + |
| 43 | +# Querying Similar Embeddings |
| 44 | +The read API is designed to make it easy to integrate similarity search into applications. To retrieve features from the vector database, you only need to provide the target embedding as a search query using [`find_neighbors`](https://github.com/logicalclocks/feature-store-api/blob/master/python/hsfs/feature_group.py#L2141). You can also filter on feature values stored in the vector database. |
| 45 | + |
| 46 | +```python |
| 47 | +# `model` is assumed to be any embedding model exposing an encode() method |
| 48 | +# Search for the k=3 nearest neighbor embeddings |
| 49 | +news_fg.find_neighbors(model.encode(news_description), k=3) |
| 50 | + |
| 51 | +# Filter and search |
| 52 | +news_fg.find_neighbors(model.encode(news_description), k=3, filter=news_fg.newstype == "sports") |
| 53 | +``` |
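
Conceptually, a filtered neighbor search keeps only the rows that match the filter and returns the k rows whose embeddings are closest to the query. The sketch below shows that idea as a brute-force scan in pure Python. It is an illustration only: `find_neighbors_bruteforce`, the sample rows, and the use of L2 distance are assumptions made for this example, and a vector database typically relies on an index rather than scanning every row.

```python
import math


def find_neighbors_bruteforce(rows, query, k, predicate=None):
    """Return up to k (distance, row) pairs closest to query by L2 distance."""
    # Keep only rows that satisfy the optional filter.
    candidates = [r for r in rows if predicate is None or predicate(r)]
    # Score every remaining row against the query embedding.
    scored = [(math.dist(r["embedding"], query), r) for r in candidates]
    # Closest first, then truncate to k results.
    scored.sort(key=lambda pair: pair[0])
    return scored[:k]


# Hypothetical sample data with 2-dimensional embeddings
rows = [
    {"id1": 1, "newstype": "sports", "embedding": [0.0, 0.0]},
    {"id1": 2, "newstype": "politics", "embedding": [0.1, 0.0]},
    {"id1": 3, "newstype": "sports", "embedding": [1.0, 1.0]},
    {"id1": 4, "newstype": "sports", "embedding": [3.0, 3.0]},
]
query = [0.0, 0.1]

top = find_neighbors_bruteforce(rows, query, k=2)
sports = find_neighbors_bruteforce(
    rows, query, k=2, predicate=lambda r: r["newstype"] == "sports"
)
print([row["id1"] for _, row in top])
print([row["id1"] for _, row in sports])
```

With the filter applied, the politics row is excluded even though it is the second-closest match, so the next closest sports row takes its place.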
| 53 | + |
| 54 | +To retrieve features as of a specific time in the past from the offline feature store for analysis, you can use the offline read API to perform time travel. |
| 55 | + |
| 56 | +```python |
| 57 | +# Time travel and read from the offline feature store |
| 58 | +news_fg.as_of(time_in_past).read() |
| 59 | +``` |
| 60 | + |
| 61 | +## Second Phase Reranking |
| 62 | + |
| 63 | +In some ML applications, the top k items fetched by first phase filtering are reranked in a second phase, which often requires extra features from other sources. In practice, this means an extra step is needed to fetch those features from other feature groups in the online feature store. Hopsworks provides a read API for this purpose: create a feature view by joining multiple feature groups, then fetch all the required features by calling `fv.find_neighbors`. In the example below, `view_cnt` from another feature group is also returned in the result. |
| 64 | + |
| 65 | +```python |
| 66 | +view_fg = fs.get_or_create_feature_group( |
| 67 | +    name="view_fg", |
| 68 | +    primary_key=["id1"], |
| 69 | +    version=version, |
| 70 | +    online_enabled=True, |
| 71 | +    topic_name=f"view_fg_{version}_onlinefs" |
| 72 | +) |
| 73 | + |
| 74 | +fv = fs.get_or_create_feature_view( |
| 75 | +    "news_cnt", version=version, |
| 76 | +    query=news_fg.select(["date", "heading", "newstype"]).join(view_fg.select(["view_cnt"]))) |
| 77 | + |
| 78 | +fv.find_neighbors(model.encode(news_description), k=5) |
| 79 | +``` |
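
Once the joined features are available, the second phase itself is ordinary application code. The sketch below shows one possible reranking step in pure Python: it attaches a `view_cnt` value to each first-phase hit and sorts by it. The helper `rerank_by_view_count`, the sample data, and the choice to rank by view count alone are assumptions for illustration, not part of the Hopsworks API.

```python
def rerank_by_view_count(neighbors, view_counts):
    """Attach view_cnt to each first-phase hit and sort by it, highest first."""
    # Hits without a known view count default to 0.
    enriched = [
        dict(hit, view_cnt=view_counts.get(hit["id1"], 0)) for hit in neighbors
    ]
    return sorted(enriched, key=lambda hit: hit["view_cnt"], reverse=True)


# Hypothetical first-phase result, ordered by embedding similarity
neighbors = [
    {"id1": 1, "heading": "A"},
    {"id1": 3, "heading": "B"},
    {"id1": 7, "heading": "C"},
]
# Hypothetical extra feature fetched from another feature group
view_counts = {1: 10, 3: 250, 7: 42}

reranked = rerank_by_view_count(neighbors, view_counts)
print([hit["id1"] for hit in reranked])
```

In a real application the ranking score would usually combine the similarity score with several such features, often through a trained ranking model.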
| 80 | + |
| 81 | +It is also possible to retrieve a feature vector by providing the primary keys, but this is not recommended, as explained in the next section. In this case, the client fetches the feature values of `news_fg` from the vector store and those of `view_fg` from the online store. |
| 82 | +```python |
| 83 | +# A single primary key entry is passed as a dict to get_feature_vector |
| 84 | +fv.get_feature_vector({"id1": 1}) |
| 85 | +``` |
| 85 | + |
| 86 | +# Best Practices |
| 87 | +1. Choose the appropriate online feature store |
| 88 | + |
| 89 | +There are two types of online feature stores in Hopsworks: the online store (RonDB) and the vector store (OpenSearch). The online store is designed for retrieving feature vectors with low latency. The vector store is designed for finding similar embeddings efficiently. If similarity search is not required, the online store is recommended for low-latency retrieval of feature values, including embeddings. |
| 90 | + |
| 91 | +2. Choose the features to store in the vector store |
| 92 | + |
| 93 | +While it is possible to update feature values in the vector store, updating feature values in the online store is more efficient. If you have features that are updated frequently and are not required for filtering, consider storing them separately in a different feature group. As shown in the previous example, `view_cnt` is updated frequently and stored separately. You can then get all the required features by using a feature view. |
| 94 | + |
| 95 | +# Next Step |
| 96 | +Explore the [notebook example](https://github.com/kennethmhc/news-search-knn-demo/blob/main/news-search-knn-demo.ipynb), which demonstrates how to use Hopsworks to implement a news search application. In the application, you can search for news using natural language, powered by the Hopsworks vector database. |