This benchmark evaluates chunking (text segmentation) strategies within a RAG pipeline, taking token-level retrieved contexts into account. It is based on Chroma's report "Evaluating Chunking Strategies for Retrieval"; we use `ragas` for evaluation instead of their package.
For each text corpus in the evaluation dataset, we have pairs of questions and reference contexts. Given a question, the task is to compare the retrieved context against the reference context. We evaluate different chunking strategies by the quality of the retrieved context, keeping all other parts of the RAG pipeline the same.
Datasets can be found in the `datasets` directory.

- `chroma` is taken directly from the original report.
- `corpora_public` is data that we gathered from wikitexts and openwebtext. We generated the questions using a methodology similar to the one outlined in the report above.
Note
The latest results and visualizations are live at https://subnet.chunking.com/benchmarks
- Chunking
  - Each chunker chunked each corpus individually; the resulting chunks were then embedded and stored in Pinecone for later use (a sketch of this indexing step is shown below).
  - Each chunker had its own namespace to ensure that chunks from different chunkers were not mixed up.

Note
If a chunker could not chunk a corpus properly, we exclude that corpus from the evaluation for the corresponding chunker.
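The snippet below is a minimal sketch of how this indexing step could look with the Pinecone and OpenAI Python clients. The index name `chunking-benchmark`, the helper name `index_chunks`, and the use of `text-embedding-3-small` for chunk embeddings are assumptions for illustration; the actual benchmark script may differ.

```python
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("chunking-benchmark")  # assumed index name


def index_chunks(chunker_name: str, corpus_id: str, chunks: list[str]) -> None:
    """Embed the chunks of one corpus and upsert them under the chunker's namespace."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=chunks,
    )
    vectors = [
        {
            "id": f"{corpus_id}-{i}",
            "values": item.embedding,
            "metadata": {"text": chunk, "corpus": corpus_id},
        }
        for i, (chunk, item) in enumerate(zip(chunks, response.data))
    ]
    # A separate namespace per chunker keeps chunks from different strategies apart.
    index.upsert(vectors=vectors, namespace=chunker_name)
```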
- Retrieval considering the context token limit (done for each question, using a combination of a vector database and a reranker; see the sketch below):
  - Query the vector database, using the question's embedding, for the top `rerank_buffer_size=400` chunks for the given question.
  - Pass all `rerank_buffer_size` chunks to a reranker (Cohere reranker v2).
  - At each `token_limit` from `[50, 150, 250]`:
    - Aggregate the top reranked chunks up to `token_limit` tokens. We used the aggregated text as the retrieved context for the question at that `token_limit`.
  - Thus, we used Cohere's reranker with the same parameters for all chunking methods in this benchmark.
Note
The cached retrieval results are available in the `cached_retrieval/` directory.
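Below is a minimal sketch of this retrieval step, assuming OpenAI embeddings for the query, the Pinecone and Cohere Python clients, and `tiktoken` for counting tokens against each `token_limit`. The index name, tokenizer choice, and function names are illustrative assumptions rather than the benchmark's actual code.

```python
import os

import cohere
import tiktoken
from openai import OpenAI
from pinecone import Pinecone

RERANK_BUFFER_SIZE = 400
TOKEN_LIMITS = [50, 150, 250]

openai_client = OpenAI()
co = cohere.Client(os.environ["COHERE_API_KEY"])
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("chunking-benchmark")
enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer for the token budget


def retrieve_contexts(question: str, namespace: str) -> dict[int, str]:
    """Return the aggregated retrieved context for each token_limit."""
    # 1. Embed the question and fetch the top rerank_buffer_size candidate chunks.
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    matches = index.query(
        vector=q_emb,
        top_k=RERANK_BUFFER_SIZE,
        namespace=namespace,
        include_metadata=True,
    ).matches
    docs = [m.metadata["text"] for m in matches]

    # 2. Rerank all candidates with Cohere's v2 reranker.
    reranked = co.rerank(model="rerank-english-v2.0", query=question, documents=docs)
    ordered = [docs[r.index] for r in reranked.results]

    # 3. For each token_limit, aggregate top reranked chunks until the budget is reached.
    contexts: dict[int, str] = {}
    for limit in TOKEN_LIMITS:
        picked, used = [], 0
        for chunk in ordered:
            n_tokens = len(enc.encode(chunk))
            if used + n_tokens > limit:
                break
            picked.append(chunk)
            used += n_tokens
        contexts[limit] = "\n".join(picked)
    return contexts
```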
- Retrieval Evaluation (Benchmark):
  - Metrics: We use `ragas`'s context precision and context recall, configured with `{llm: "gpt-4o-mini", embedding: "text-embedding-3-small"}` (see the sketch after this list).
  - For each `token_limit`, we applied the metrics to triplets of (question, reference context, retrieved context from above).
  - Repeating for all `token_limit` values, we record the metrics along with the `token_limit` and the average chunk size in tokens (`avg_chunk_size`) for all chunkers.
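A minimal sketch of the metric computation with `ragas` is shown below. Exact import paths and dataset column names vary between `ragas` versions (this follows the older `question` / `contexts` / `ground_truth` schema), so treat it as illustrative rather than the benchmark's exact code.

```python
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import context_precision, context_recall


def score_token_limit(rows: list[dict]):
    """rows: one dict per question with keys question, reference_context, retrieved_context."""
    ds = Dataset.from_dict({
        "question": [r["question"] for r in rows],
        # Retrieved context aggregated at this token_limit, as a list of strings.
        "contexts": [[r["retrieved_context"]] for r in rows],
        "ground_truth": [r["reference_context"] for r in rows],
    })
    return evaluate(
        ds,
        metrics=[context_precision, context_recall],
        llm=ChatOpenAI(model="gpt-4o-mini"),
        embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    )
```

The returned result holds the averaged context precision and context recall scores, which we record together with the `token_limit` and `avg_chunk_size` for each chunker.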
Set up credentials in the `.env` file.

```bash
cp .env.example .env
# then fill in the credentials
```
Set up the environment from the provided `pyproject.toml` file using `uv`.

```bash
uv sync
source .venv/bin/activate
```
Example benchmark run on the 2025-01-16 `corpora_public` dataset, using the Cohere reranker.

```bash
python3 chunking_benchmark/chroma_benchmark.py --run_indexing --run_reranking --to_rerank --run_eval --chunk_dir results/25-01-16-corpora_public/chunks --path_questions datasets/corpora_public/questions.csv --rerank_method "cohere";
```
More options:

```bash
python3 chunking_benchmark/chroma_benchmark.py -h
```