Vector test tools #128934

Merged Jun 6, 2025 (10 commits)

Conversation

@benwtrent (Member) commented Jun 4, 2025

This adds some testing tools for verifying vector recall and latency directly, without having to spin up an entire ES node and run a Rally track.

It's pretty bare-bones and takes inspiration from lucene-util, but I wanted access to our own formats and tooling to make our lives easier.

Here is an example config file. It builds the initial index, runs queries at num_candidates: 50, then runs them again at num_candidates: 100 (without reindexing, and reusing the cached nearest neighbors).

[{
  "doc_vectors" : "path",
  "query_vectors" : "path",
  "num_docs" : 10000,
  "num_queries" : 10,
  "index_type" : "hnsw",
  "num_candidates" : 50,
  "k" : 10,
  "hnsw_m" : 16,
  "hnsw_ef_construction" : 200,
  "index_threads" : 4,
  "reindex" : true,
  "force_merge" : false,
  "vector_space" : "maximum_inner_product",
  "dimensions" : 768
},
{
  "doc_vectors" : "path",
  "query_vectors" : "path",
  "num_docs" : 10000,
  "num_queries" : 10,
  "index_type" : "hnsw",
  "num_candidates" : 100,
  "k" : 10,
  "hnsw_m" : 16,
  "hnsw_ef_construction" : 200,
  "vector_space" : "maximum_inner_product",
  "dimensions" : 768
}
]

To execute:

./gradlew :qa:vector:checkVec --args="/Path/to/knn_tester_config.json"

Running ./gradlew :qa:vector:checkVecHelp prints some guidance on how to use it, including a way to run it via java directly (useful to bypass gradlew guff).
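
For readers unfamiliar with the recall figure mentioned above, here is a minimal sketch of recall@k: the fraction of the exact top-k neighbors that the approximate search also returned. The class and method names are hypothetical and this is not taken from the tool's source; the tester may compute recall differently.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; not the tester's actual implementation.
final class RecallSketch {

    /**
     * Fraction of the exact (brute-force) top-k doc ids that also appear in the
     * approximate result. Assumes each array holds distinct doc ids.
     */
    static double recallAtK(int[] exactTopK, int[] approxTopK) {
        if (exactTopK.length == 0) {
            return 0.0;
        }
        Set<Integer> truth = new HashSet<>();
        for (int id : exactTopK) {
            truth.add(id);
        }
        int hits = 0;
        for (int id : approxTopK) {
            if (truth.contains(id)) {
                hits++;
            }
        }
        return (double) hits / exactTopK.length;
    }

    public static void main(String[] args) {
        int[] exact = { 1, 2, 3, 4 };
        int[] approx = { 1, 2, 9, 8 };
        System.out.println(recallAtK(exact, approx));
    }
}
```

Because the exact neighbors depend only on the vectors and k, they can be computed once and reused across configurations that differ only in num_candidates, which is presumably what caching the nearest neighbors buys in the second run.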

@benwtrent marked this pull request as ready for review June 5, 2025 13:56
@elasticsearchmachine added the Team:Delivery and Team:Search Relevance labels Jun 5, 2025
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search-relevance (Team:Search Relevance)

@elasticsearchmachine (Collaborator)

Pinging @elastic/es-delivery (Team:Delivery)

@john-wagster (Contributor) left a comment

I reviewed this quickly and didn't have any immediate concerns. Agree this would be super valuable. +1 to get it in and iterate / refine.

if (Files.exists(indexPath)) {
    logger.debug("KnnIndexer: existing index at %s", indexPath);
} else {
    Files.createDirectory(indexPath);
Contributor

Maybe we should change this to Files.createDirectories to ensure that all parent directories would be created if needed? (I was met with a NoSuchFileException when I first ran this)
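
A minimal sketch of the difference, assuming a nested index path whose parent does not exist yet. The class and helper names are hypothetical; only `Files.createDirectory` and `Files.createDirectories` are real JDK calls.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;

// Hypothetical illustration only; not the code in this PR.
final class IndexDirSketch {

    /** Creates the directory plus any missing parents; does nothing if it already exists. */
    static Path ensureIndexDir(Path indexPath) {
        try {
            return Files.createDirectories(indexPath);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns true if Files.createDirectory fails because the parent directory is missing. */
    static boolean createDirectoryFailsWithoutParent(Path indexPath) {
        try {
            Files.createDirectory(indexPath);
            return false;
        } catch (NoSuchFileException e) {
            return true;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Runs both calls against a fresh temp directory and reports what happened. */
    static boolean[] demo() {
        try {
            Path root = Files.createTempDirectory("knn-index-sketch");
            Path nested = root.resolve("missing-parent").resolve("index");
            boolean plainCreateFailed = createDirectoryFailsWithoutParent(nested);
            boolean createdWithParents = Files.isDirectory(ensureIndexDir(nested));
            boolean secondCallSucceeds = Files.isDirectory(ensureIndexDir(nested));
            return new boolean[] { plainCreateFailed, createdWithParents, secondCallSucceeds };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        boolean[] r = demo();
        System.out.println("createDirectory failed without parent: " + r[0]);
        System.out.println("createDirectories created the path: " + r[1]);
        System.out.println("createDirectories on an existing dir succeeded: " + r[2]);
    }
}
```

Since `createDirectories` is a no-op for an existing directory, it could also make the surrounding `Files.exists` check unnecessary, apart from the debug log line.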

cmdLineArgs.vectorSpace(),
cmdLineArgs.numDocs()
);
if (cmdLineArgs.reindex()) {
Contributor

minor: We could update this to check whether the directory already exists. If, for example, I fail to specify "reindex=true" for an unseen vec file, I am met with an exception when trying to open the dir.

Member Author

seems sane, adding it
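
One possible shape for that check, as a sketch only: build the index when reindexing was requested or when no index directory exists yet. The class and method names are hypothetical, and the change that actually landed may differ (for example, it could fail with a clearer message instead of indexing implicitly).

```java
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration only; the actual fix in this PR may look different.
final class ReindexGuardSketch {

    /**
     * Decides whether the index should be built before searching:
     * always when reindex was requested, and also when there is no
     * existing index directory to open.
     */
    static boolean shouldIndex(boolean reindexRequested, Path indexPath) {
        return reindexRequested || Files.isDirectory(indexPath) == false;
    }

    public static void main(String[] args) {
        Path missing = Path.of("no-such-knn-index-dir-for-sketch");
        System.out.println(shouldIndex(false, missing));
        System.out.println(shouldIndex(false, Path.of(".")));
    }
}
```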

Integer.toString(result.numDocs),
Long.toString(result.indexTimeMS),
Long.toString(result.forceMergeTimeMS),
Integer.toString(result.numSegments),
Contributor

Minor again:

index_type  num_docs  index_time(ms)  force_merge_time(ms)  num_segments  latency(ms)      QPS  recall  visited
----------  --------  --------------  --------------------  ------------  -----------  -------  ------  -------  
hnsw           10000             421                     0             4         0.40  2500.00    1.00   174.00
hnsw           10000               0                     0             0         0.50  2000.00    1.00   174.00

The num_docs and num_segments could follow previous values if reindex=false
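
A rough sketch of what that suggestion could look like, assuming a hypothetical `Row` record that holds only the two columns in question. This is not the tester's result type, and the suggestion was not adopted in this PR.

```java
// Hypothetical illustration only; Row is a stand-in, not the tester's result class.
final class CarryForwardSketch {

    /** Only the two columns discussed in the comment. */
    record Row(int numDocs, int numSegments) {}

    /**
     * Picks the values to print for the current run: its own values if it
     * reindexed (or there is no earlier run), otherwise the previous run's values.
     */
    static Row displayRow(Row current, Row previous, boolean reindexed) {
        if (reindexed || previous == null) {
            return current;
        }
        return new Row(previous.numDocs(), previous.numSegments());
    }

    public static void main(String[] args) {
        Row first = new Row(10000, 4);
        Row second = new Row(0, 0); // what the second run reports when it skips indexing
        System.out.println(displayRow(second, first, false));
    }
}
```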

Member Author

we can add whatever we want. Keeping it simple.

// printout an example configuration formatted file and indicate that it is required
System.out.println("Usage: java -cp <your-classpath> org.elasticsearch.test.knn.KnnIndexTester <config-file>");
System.out.println("Where <config-file> is a JSON file containing one or more configurations for the KNN index tester.");
System.out.println("An example configuration object: ");
Contributor

Minor: we could have just a sample of what the doc/query files look like

@pmpailis (Contributor) left a comment

Really awesome ❤️ Some minor comments only but mainly for next iterations. Nothing blocking.

@benwtrent added the auto-merge-without-approval label Jun 6, 2025
@carlosdelest (Member) left a comment

So nice! 🥳

@elasticsearchmachine merged commit 155c0da into elastic:main Jun 6, 2025
18 checks passed
@benwtrent deleted the vector-test-tools branch June 6, 2025 16:07
mridula-s109 pushed a commit to mridula-s109/elasticsearch that referenced this pull request Jun 9, 2025
Labels: auto-merge-without-approval, :Delivery/Build, >non-issue, :Search Relevance/Vectors, Team:Delivery, Team:Search Relevance, v9.1.0

5 participants