Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MB-60269 - Merge path fixes #202

Merged
merged 4 commits into from
Jan 5, 2024
Merged

MB-60269 - Merge path fixes #202

merged 4 commits into from
Jan 5, 2024

Conversation

metonymic-smokey
Copy link
Member

@metonymic-smokey metonymic-smokey commented Jan 4, 2024

This PR addresses 2 merge path issues -

A. IDs need to be added to a vector index in the same order as the vectors to maintain the integrity of the mapping between vectors and IDs which is important for both search and reconstruction.
When reconstructing indexes, the current approach uses the keys of a map to decide the order of the final vector IDs. However, since the Keys() function provides no guarantee of ordering of the returned list, the vectors and IDs are not guaranteed to be inserted in the same order.

B. Currently, merging flat indexes involves adding all existing flat indexes into an existing index.

  1. This indexes duplicate vectors - since a hash is unique to a vector, 2 segments with the same vectors in an index will have the same hash. merge_from() adds duplicate vectors to the index.
  2. This does not take into account updated doc IDs since those aren't updated in the constituent vector indexes being merged.
    This PR uses the reconstruction method for flat indexes too and rebuilds the flat index with updated data and IDs.

Testing -
Performed local testing with the SIFT Small dataset(10k vectors) with k = 2/3 for the following cases:

  1. recall - saw an increase in the number of common results with FTS and FAISS groundtruth vectors.
  2. results before and after a rebuild - remained the same.
  3. results when cloning an index - remained the same for both.

Picked the dataset since this has a reasonable recall with the current index configuration.

@metonymic-smokey metonymic-smokey changed the title Reconstruct flat indexes for merging Reconstruct flat indexes when merging Jan 4, 2024
@metonymic-smokey metonymic-smokey changed the title Reconstruct flat indexes when merging Merge path fixes Jan 4, 2024
@metonymic-smokey
Copy link
Member Author

metonymic-smokey commented Jan 4, 2024

TODO Open go-faiss PR to clean up MergeFrom() now, or let it remain in case we use it later?

@metonymic-smokey metonymic-smokey changed the title Merge path fixes MB-60269 - Merge path fixes Jan 4, 2024
Copy link
Member

@abhinavdangeti abhinavdangeti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@metonymic-smokey like we spoke, would you add some backing tests or results at the least on how things looked with this change.

@blevesearch/collaborators must take another look.

@abhinavdangeti abhinavdangeti merged commit 8c668c4 into master Jan 5, 2024
6 checks passed
@abhinavdangeti abhinavdangeti deleted the m2 branch January 5, 2024 17:45
abhinavdangeti added a commit to blevesearch/bleve that referenced this pull request Jan 5, 2024
To include:
* 8c668c4 Aditi Ahuja | MB-60269 - Merge path fixes (blevesearch/zapx#202)
abhinavdangeti added a commit to blevesearch/bleve that referenced this pull request Jan 5, 2024
To include:
* 8c668c4 Aditi Ahuja | MB-60269 - Merge path fixes
(blevesearch/zapx#202)
moshaad7 pushed a commit that referenced this pull request Sep 12, 2024
* reconstruction for all index types

* correct ordering of vector IDs

* remove unused code

* fixed commentary
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants