Description
We can ingest data from a file with the CLI. Currently, this produces splits whose number of docs is generally far below 10M, which is our usual optimal number of docs per split. This happens because we commit a split every 60 seconds (the default commit_timeout_secs). These small splits then trigger merges, which lowers the indexing speed.
Ideally, we would produce splits with 10M docs directly. This can be done by setting a high commit_timeout_secs by default.
I suggest setting the commit timeout to 3600 seconds to avoid merges.
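To make the reasoning concrete, here is a rough back-of-the-envelope sketch. The throughput figure (`ASSUMED_DOCS_PER_SEC`) and both helper functions are hypothetical and only for illustration; real ingestion rates depend on the dataset and the machine. It assumes the commit timeout is the only thing that closes a split.

```python
import math

TARGET_DOCS_PER_SPLIT = 10_000_000  # optimal split size quoted in this issue


def docs_per_split(docs_per_sec: float, commit_timeout_secs: int) -> int:
    """Docs accumulated in one split when only the commit timeout closes it."""
    return int(docs_per_sec * commit_timeout_secs)


def min_timeout_for_target(docs_per_sec: float,
                           target_docs: int = TARGET_DOCS_PER_SPLIT) -> int:
    """Smallest whole-second timeout that lets a split reach target_docs."""
    return math.ceil(target_docs / docs_per_sec)


# Hypothetical ingestion rate, NOT a measured number.
ASSUMED_DOCS_PER_SEC = 20_000

# With the default 60 s timeout, splits stay well below the 10M target.
print(docs_per_split(ASSUMED_DOCS_PER_SEC, 60))        # 1200000

# Timeout needed before a split can reach 10M docs at this rate.
print(min_timeout_for_target(ASSUMED_DOCS_PER_SEC))    # 500
```

At this assumed rate, a 3600 second timeout leaves ample headroom, so the timeout would no longer be what limits split size. The sketch does not model any other limit that may close a split earlier.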