[FSTORE-1238] Add guide for similarity search #349

kennethmhc · 2024-02-14T10:55:04Z

No description provided.

robzor92 · 2024-02-14T14:07:45Z

docs/user_guides/fs/vector_similarity_search.md

+While it is possible to update feature value in vector store, updating feature value in online store is more efficient. If you have features which are frequently being updated and do not require for filtering, consider storing them separately in a different feature group. As shown in the previous example, `view_cnt` is updated frequently and stored separately. You can then get all the required features by using feature view.
+
+# Next step
+Explore the [notebook example](https://github.com/kennethmhc/news-search-knn-demo/blob/main/news-search-knn-demo.ipynb), demonstrating how to use Hopsworks for implementing a news search application. You can search for news using natural language in the application, powered by the Hopsworks vector database.


This should be a link to a tutorial in hopsworks-tutorials and not a personal repo. I believe Maksym is working on one judging by this jira: https://hopsworks.atlassian.net/browse/FSTORE-1239

Can you add a header for this section (from line 93) - like
"Performance considerations for Feature Groups with Embeddings"
?

Also, this line is only correct if news_fg is the label FG:
"You can then get all the required features by using feature view."
So, I would omit it until we get our story straight around embeddings in feature views.

The feature view does return all the features, as long as it has the primary key of the other feature groups.

docs/user_guides/fs/vector_similarity_search.md

jimdowling

We need to get our story straight around embeddings in feature views.
I don't think the API is clear right now - maybe docs can fix it, but we should discuss.

jimdowling · 2024-02-14T13:47:50Z

docs/user_guides/fs/vector_similarity_search.md

@@ -0,0 +1,96 @@
+---
+description: Users guide about how to use Hopsworks for vector similarity search 


User guide for how to use vector similarity search in Hopsworks

jimdowling · 2024-02-14T13:48:59Z

docs/user_guides/fs/vector_similarity_search.md

+---
+
+# Introduction
+Vector similarity search is a robust technique enabling the retrieval of similar items based on their embeddings or representations. Its applications range across various domains, from recommendation systems to image similarity and beyond. In Hopsworks, this is facilitated through a vector database, such as Opensearch, which efficiently stores and retrieves relevant embeddings. In this guide, we'll walk you through the process of using Hopsworks for vector similarity search step by step.


Let's use the term "vector embeddings" rather than "embeddings" in the document.
The community doesn't use the term "embeddings" by itself so much any more.

I would prefer to describe here how vector embeddings are part of feature groups.

In Hopsworks, a vector embedding extends an online feature group with approximate nearest neighbor search capability.

jimdowling · 2024-02-14T13:58:28Z

docs/user_guides/fs/vector_similarity_search.md

+# Introduction
+Vector similarity search is a robust technique enabling the retrieval of similar items based on their embeddings or representations. Its applications range across various domains, from recommendation systems to image similarity and beyond. In Hopsworks, this is facilitated through a vector database, such as Opensearch, which efficiently stores and retrieves relevant embeddings. In this guide, we'll walk you through the process of using Hopsworks for vector similarity search step by step.
+
+# Ingesting Data into the Vector Database


Extending Feature Groups with Similarity Search

jimdowling · 2024-02-14T13:59:03Z

docs/user_guides/fs/vector_similarity_search.md

+Vector similarity search is a robust technique enabling the retrieval of similar items based on their embeddings or representations. Its applications range across various domains, from recommendation systems to image similarity and beyond. In Hopsworks, this is facilitated through a vector database, such as Opensearch, which efficiently stores and retrieves relevant embeddings. In this guide, we'll walk you through the process of using Hopsworks for vector similarity search step by step.
+
+# Ingesting Data into the Vector Database
+Hopsworks provides a user-friendly API for writing data to both online and offline feature stores. The example below illustrates the straightforward process of ingesting data into both the vector database and the offline feature store using a single insert method. 


This 'adds no value or new info', so remove:
"Hopsworks provides a user-friendly API for writing data to both online and offline feature stores."

From a user perspective, they are not writing to the vector DB. They are writing to a FG. The FG is now indexed for ANN. That is how we should describe it.

"of ingesting data into both the vector database and the offline feature store using a single insert method. " ->
of ingesting data into a feature group (online and offline) and indexing it for approximate nearest neighbor (ANN) search.

jimdowling · 2024-02-14T14:02:52Z

docs/user_guides/fs/vector_similarity_search.md

+
+# Ingesting Data into the Vector Database
+Hopsworks provides a user-friendly API for writing data to both online and offline feature stores. The example below illustrates the straightforward process of ingesting data into both the vector database and the offline feature store using a single insert method. 
+Currently, Hopsworks supports Opensearch as a vector database. 


Currently, feature groups support an ANN index with Opensearch as the vector database.

What about FAISS or nmslib?
Can I choose which ANN index to use?
We should document it here.

It is not configurable right now. I can include the default setting.

Method: hnsw Engine: nmslib
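
For reference, a default of `hnsw` with the `nmslib` engine corresponds roughly to an OpenSearch k-NN field mapping like the one sketched here. This is only an illustration: the field name `embedding_heading` comes from the guide, while the dimension (384) and the `space_type` (`l2`) are assumptions, and the mapping that Hopsworks actually creates may differ.

```python
# Sketch of an OpenSearch k-NN index body with method=hnsw and engine=nmslib.
# The dimension and space_type values are assumptions for illustration only.
knn_index_body = {
    "settings": {"index": {"knn": True}},
    "mappings": {
        "properties": {
            "embedding_heading": {
                "type": "knn_vector",
                "dimension": 384,  # assumed embedding size
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "l2",  # assumed similarity function
                },
            }
        }
    },
}
```

Showing something along these lines in the guide would tell users which defaults they get, even though the index is not configurable yet.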

jimdowling · 2024-02-14T21:02:47Z

docs/user_guides/fs/vector_similarity_search.md

+
+## Second Phase Reranking
+
+In some ML applications, second phase reranking of the top k items fetched by first phase filtering is common where extra features are required from other sources after fetching the k nearest items. In practice, it means that an extra step is needed to fetch the features from other feature groups in the online feature store. Hopsworks provides yet another simple read API for this purpose. Users can create a feature view by joining multiple feature groups and fetch all the required features by calling fv.find_neighbors. In the example below, view_cnt from another feature group is also returned to the result.


We are not teaching about recommender systems. We are telling them they can use similarity search in feature views. I would remove this text:
"In some ML applications, second phase reranking of the top k items fetched by first phase filtering is common where extra features are required from other sources after fetching the k nearest items. In practice, it means that an extra step is needed to fetch the features from other feature groups in the online feature store. Hopsworks provides yet another simple read API for this purpose.Users can create a feature view by joining multiple feature groups and fetch all the required features by calling fv.find_neighbors. In the example below, view_cnt from another feature group is also returned to the result."

You can also use similarity search for vector embedding features in feature views.
In the code snippet below, we create a feature view by selecting features from the earlier news_fg and a new feature group view_fg. If you include a vector embedding feature from a feature group in a feature view, you can call find_neighbors on the feature view, and it will return rows containing all of the feature values in the feature view.

The purpose is to say: "You can also use find_neighbors in a feature view". Second phase reranking is just a motivating example. It is just one sentence.

jimdowling · 2024-02-14T21:11:33Z

docs/user_guides/fs/vector_similarity_search.md

+
+fv.find_neighbors(model.encode(news_description), k=5)
+```
+


I think we will have to discuss this. The semantics of using similarity search in feature views is problematic, as feature groups with vector embeddings are not stored in the online store.

I assume you can only return the features from the vector embedding feature group - not from the other feature groups that make up the feature view.

I think we probably should leave out this section on using similarity search in feature views until we figure out exactly what the semantics of it are and should be.

> I assume you can only return the features from the vector embedding feature group - not from the other feature groups that make up the feature view.

No, fv.find_neighbors returns features from the other feature groups as well; view_fg is stored in RonDB.
The purpose of fv.find_neighbors is to:

1. Select features from the embedding feature group. Some features may be stored only for filtering purposes and are not part of the model features.

2. Save users' effort by combining two steps into a single method: (1) getting similar items in the feature group, and then (2) getting the feature vector from the feature view.
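
To make the two-step argument concrete, here is a minimal in-memory sketch of the behaviour described in this reply. It does not call Hopsworks: the dictionaries stand in for `news_fg` and `view_fg`, the data is made up, and the helper names (`find_neighbors_in_fg`, `get_feature_vector`, `find_neighbors_in_fv`) are hypothetical.

```python
# In-memory stand-ins (hypothetical data) for the two feature groups.
news_fg = {  # embedding feature group, keyed by news_id
    1: {"heading": "Budget passes", "embedding": [0.0, 1.0]},
    2: {"heading": "Cup final tonight", "embedding": [1.0, 0.0]},
    3: {"heading": "Tax reform vote", "embedding": [0.1, 0.9]},
}
view_fg = {1: {"view_cnt": 120}, 2: {"view_cnt": 45}, 3: {"view_cnt": 300}}


def find_neighbors_in_fg(query, k):
    """Step 1: return the primary keys of the k nearest rows (squared L2)."""
    def dist(key):
        return sum((q - e) ** 2 for q, e in zip(query, news_fg[key]["embedding"]))
    return sorted(news_fg, key=dist)[:k]


def get_feature_vector(news_id):
    """Step 2: join the features of both groups on the primary key."""
    return {
        "news_id": news_id,
        "heading": news_fg[news_id]["heading"],
        **view_fg[news_id],
    }


def find_neighbors_in_fv(query, k):
    """Both steps combined, as the reply describes for fv.find_neighbors."""
    return [get_feature_vector(key) for key in find_neighbors_in_fg(query, k)]


results = find_neighbors_in_fv([0.0, 1.0], k=2)
```

In this sketch `view_cnt` comes back alongside the nearest rows because the second step looks it up by primary key, which is the semantics the reply claims for the real method.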

jimdowling · 2024-02-15T07:03:49Z

docs/user_guides/fs/vector_similarity_search.md

+While it is possible to update feature value in vector store, updating feature value in online store is more efficient. If you have features which are frequently being updated and do not require for filtering, consider storing them separately in a different feature group. As shown in the previous example, `view_cnt` is updated frequently and stored separately. You can then get all the required features by using feature view.
+
+# Next step
+Explore the [notebook example](https://github.com/kennethmhc/news-search-knn-demo/blob/main/news-search-knn-demo.ipynb), demonstrating how to use Hopsworks for implementing a news search application. You can search for news using natural language in the application, powered by the Hopsworks vector database.


Can you add a header for this section (from line 93) - like
"Performance considerations for Feature Groups with Embeddings"
?

jimdowling · 2024-02-15T07:04:49Z

docs/user_guides/fs/vector_similarity_search.md

+While it is possible to update feature value in vector store, updating feature value in online store is more efficient. If you have features which are frequently being updated and do not require for filtering, consider storing them separately in a different feature group. As shown in the previous example, `view_cnt` is updated frequently and stored separately. You can then get all the required features by using feature view.
+
+# Next step
+Explore the [notebook example](https://github.com/kennethmhc/news-search-knn-demo/blob/main/news-search-knn-demo.ipynb), demonstrating how to use Hopsworks for implementing a news search application. You can search for news using natural language in the application, powered by the Hopsworks vector database.


Also, this line is only correct if news_fg is the label FG:
"You can then get all the required features by using feature view."
So, I would omit it until we get our story straight around embeddings in feature views.

docs/user_guides/fs/vector_similarity_search.md

jimdowling

Just a few small changes

jimdowling · 2024-03-13T08:07:32Z

docs/user_guides/fs/vector_similarity_search.md

+---
+
+# Introduction
+Vector similarity search is a technique enabling the retrieval of similar items based on their vector embeddings or representations. Its applications range across various domains, from recommendation systems to image similarity and beyond. In Hopsworks, vector similarity search is enabled by extending an online feature group with approximate nearest neighbor search capabilities through a vector database, such as Opensearch. This guide provides a detailed walkthrough on how to leverage Hopsworks for vector similarity search.


Are we consistent in calling it "vector similarity search" in all the docs?

Then I would add -
Vector similarity search (also called similarity search)

Or just change it to "vector similarity search" everywhere in the docs.

jimdowling · 2024-03-13T08:10:32Z

docs/user_guides/fs/vector_similarity_search.md

+emb = embedding.EmbeddingIndex(index_name="news_fg")
+```
+
+Then, add one or more embedding features to the index. Name and dimension of the embedding features are required for identifying which features should be indexed for k-nearest neighbor (KNN) search. In this example, we get the dimension of the embedding by taking the length of the value of the `embedding_heading` column in the first row of the dataframe `df`. Optionally, you can specify the similarity function.


This is not clear:
"Optionally, you can specify the similarity function."
You should add a couple of examples.
Optionally, you can specify the similarity function, for example, cosine or ...
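
As an illustration of why the choice matters, the sketch below implements three commonly used similarity functions in plain Python: L2 distance, cosine similarity, and dot product. Which of these Hopsworks exposes, and under what names, is not confirmed in this thread, so treat the selection as an assumption.

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: ignores vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dot_product(a, b):
    """Inner product: sensitive to both direction and length."""
    return sum(x * y for x, y in zip(a, b))


query = [1.0, 0.0]
same_direction = [2.0, 0.0]  # same direction as query, twice the length
orthogonal = [0.0, 1.0]

cos_same = cosine_similarity(query, same_direction)
l2_same = l2_distance(query, same_direction)
dot_orth = dot_product(query, orthogonal)
```

Cosine similarity treats `same_direction` as a perfect match for `query`, while L2 distance still reports a gap between them, so the two functions can rank neighbors differently when the vector embeddings are not normalized.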

jimdowling · 2024-03-13T08:11:04Z

docs/user_guides/fs/vector_similarity_search.md

+emb.add_embedding("embedding_heading", len(df["embedding_heading"][0]))
+```
+
+Next, you create a feature group with the `embedding_index` and ingest data to the feature group. When the `embedding_index` is provided, the vector database is used as online feature store. That is, all the features in the feature group are stored **exclusively** in the vector database. The advantage of storing all features in the vector database is that it enables similarity search, and filtering for all feature values.


filtering => push-down filtering
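
The distinction is worth spelling out in the guide. A minimal brute-force sketch of the difference, assuming made-up rows and a hypothetical `category` column (a real ANN index such as HNSW does not scan every row like this):

```python
def l2(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


# Hypothetical rows; column names are for illustration only.
rows = [
    {"news_id": 1, "category": "sports", "embedding": [0.0, 0.1]},
    {"news_id": 2, "category": "sports", "embedding": [0.1, 0.0]},
    {"news_id": 3, "category": "tech", "embedding": [0.9, 0.9]},
    {"news_id": 4, "category": "tech", "embedding": [1.0, 1.0]},
]


def post_filter(query, k, category):
    """Take the k nearest rows first, then drop the ones that do not match."""
    nearest = sorted(rows, key=lambda r: l2(query, r["embedding"]))[:k]
    return [r["news_id"] for r in nearest if r["category"] == category]


def push_down_filter(query, k, category):
    """Apply the filter as part of the search, then take the k nearest."""
    candidates = [r for r in rows if r["category"] == category]
    nearest = sorted(candidates, key=lambda r: l2(query, r["embedding"]))[:k]
    return [r["news_id"] for r in nearest]


query = [0.0, 0.0]
post = post_filter(query, k=2, category="tech")
pushed = push_down_filter(query, k=2, category="tech")
```

Here the two nearest rows are both `sports`, so filtering after the search returns nothing, while pushing the filter down still returns two `tech` rows. That is the benefit the reworded sentence should convey.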

* add guide * address comments * fix style * combine best practices and performance considerations * add image * address comment

robzor92 requested changes Feb 14, 2024


jimdowling requested changes Feb 15, 2024


kennethmhc added 5 commits March 7, 2024 15:42

add guide

2fb4fb7

address comments

24866f4

fix style

85704c3

combine best practices and performance considerations

5d9d79b

add image

3503d74

kennethmhc force-pushed the FSTORE-1238 branch from d20522d to 3503d74 Compare March 7, 2024 14:43

robzor92 approved these changes Mar 7, 2024


jimdowling approved these changes Mar 13, 2024


address comment

0f5ea9d

kennethmhc merged commit cba0f8a into logicalclocks:main Mar 13, 2024
1 check passed

SirOibaf pushed a commit that referenced this pull request Mar 17, 2024

[FSTORE-1238] Add guide for similarity search (#349)

d8acaac

* add guide * address comments * fix style * combine best practices and performance considerations * add image * address comment



