This benchmark evaluates chunking (text segmentation) strategies within a RAG pipeline, taking token-level retrieved contexts into account. It is based on Chroma's report "Evaluating Chunking Strategies for Retrieval"; we use `ragas` for evaluation instead of their package.
For each text corpus in the evaluation dataset, we have pairs of questions and reference contexts. Given a question, the task is to compare the retrieved context against the reference context. We evaluate different chunking strategies by the quality of the retrieved context, keeping all other parts of the RAG pipeline the same.
Datasets can be found in the `datasets` directory.

- `chroma` is taken directly from the original report.
- `corpora_public` is data that we gathered from wikitexts and openwebtext. We generated the questions using a methodology similar to the one outlined in the report above.
Note
The latest results and visualizations are live at https://subnet.chunking.com/benchmarks
- Chunking
  - Each chunker chunked each corpus individually; the resulting chunks were then embedded and stored in Pinecone for later use (a sketch of this indexing step is shown below).
  - Each chunker had its own namespace to ensure that chunks from different chunkers were not mixed up.

Note
If a chunker could not chunk a corpus properly, we exclude that corpus from the evaluation for the corresponding chunker.
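The snippet below is a minimal sketch of how this indexing step could look with the Pinecone and OpenAI Python clients. The index name `chunking-benchmark`, the helper name `index_chunks`, and the use of `text-embedding-3-small` for chunk embeddings are assumptions for illustration; the actual benchmark script may differ.

```python
import os

from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index = pc.Index("chunking-benchmark")  # assumed index name


def index_chunks(chunker_name: str, corpus_id: str, chunks: list[str]) -> None:
    """Embed the chunks of one corpus and upsert them under the chunker's namespace."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=chunks,
    )
    vectors = [
        {
            "id": f"{corpus_id}-{i}",
            "values": item.embedding,
            "metadata": {"text": chunk, "corpus": corpus_id},
        }
        for i, (chunk, item) in enumerate(zip(chunks, response.data))
    ]
    # A separate namespace per chunker keeps chunks from different strategies apart.
    index.upsert(vectors=vectors, namespace=chunker_name)
```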
- Retrieval considering the context token limit (done for each question, using a combination of a vector database and a reranker; see the sketch below):
  - Query the vector database, using the question's embedding, for the top `rerank_buffer_size=400` chunks for the given question.
  - Pass all `rerank_buffer_size` chunks to a reranker (Cohere reranker v2).
  - At each `token_limit` from `[50, 150, 250]`:
    - Aggregate the top reranked chunks up to `token_limit` tokens. We used the aggregated text as the retrieved context for the question at that `token_limit`.
  - Thus, we used Cohere's reranker with the same parameters for all chunking methods in this benchmark.
Note
The cached retrieval results are available in the `cached_retrieval/` directory.
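Below is a minimal sketch of this retrieval step, assuming OpenAI embeddings for the query, the Pinecone and Cohere Python clients, and `tiktoken` for counting tokens against each `token_limit`. The index name, tokenizer choice, and function names are illustrative assumptions rather than the benchmark's actual code.

```python
import os

import cohere
import tiktoken
from openai import OpenAI
from pinecone import Pinecone

RERANK_BUFFER_SIZE = 400
TOKEN_LIMITS = [50, 150, 250]

openai_client = OpenAI()
co = cohere.Client(os.environ["COHERE_API_KEY"])
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("chunking-benchmark")
enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer for the token budget


def retrieve_contexts(question: str, namespace: str) -> dict[int, str]:
    """Return the aggregated retrieved context for each token_limit."""
    # 1. Embed the question and fetch the top rerank_buffer_size candidate chunks.
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question
    ).data[0].embedding
    matches = index.query(
        vector=q_emb,
        top_k=RERANK_BUFFER_SIZE,
        namespace=namespace,
        include_metadata=True,
    ).matches
    docs = [m.metadata["text"] for m in matches]

    # 2. Rerank all candidates with Cohere's v2 reranker.
    reranked = co.rerank(model="rerank-english-v2.0", query=question, documents=docs)
    ordered = [docs[r.index] for r in reranked.results]

    # 3. For each token_limit, aggregate top reranked chunks until the budget is reached.
    contexts: dict[int, str] = {}
    for limit in TOKEN_LIMITS:
        picked, used = [], 0
        for chunk in ordered:
            n_tokens = len(enc.encode(chunk))
            if used + n_tokens > limit:
                break
            picked.append(chunk)
            used += n_tokens
        contexts[limit] = "\n".join(picked)
    return contexts
```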
- Retrieval Evaluation (Benchmark):
  - Metrics: We use `ragas`'s context precision and context recall, configured with `{llm: "gpt-4o-mini", embedding: "text-embedding-3-small"}` (see the sketch after this list).
  - For each `token_limit`, we applied the metrics to triplets of (question, reference context, retrieved context from above).
  - Repeating for all `token_limit` values, we record the metrics along with the `token_limit` and the average chunk size in tokens (`avg_chunk_size`) for all chunkers.
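A minimal sketch of the metric computation with `ragas` is shown below. Exact import paths and dataset column names vary between `ragas` versions (this follows the older `question` / `contexts` / `ground_truth` schema), so treat it as illustrative rather than the benchmark's exact code.

```python
from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import context_precision, context_recall


def score_token_limit(rows: list[dict]):
    """rows: one dict per question with keys question, reference_context, retrieved_context."""
    ds = Dataset.from_dict({
        "question": [r["question"] for r in rows],
        # Retrieved context aggregated at this token_limit, as a list of strings.
        "contexts": [[r["retrieved_context"]] for r in rows],
        "ground_truth": [r["reference_context"] for r in rows],
    })
    return evaluate(
        ds,
        metrics=[context_precision, context_recall],
        llm=ChatOpenAI(model="gpt-4o-mini"),
        embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    )
```

The returned result holds the averaged context precision and context recall scores, which we record together with the `token_limit` and `avg_chunk_size` for each chunker.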
Set up credentials in the `.env` file.

```bash
cp .env.example .env
# then fill in the credentials
```
Set up the environment from the provided `pyproject.toml` file using `uv`.

```bash
uv sync
source .venv/bin/activate
```
Example benchmark run on the 2025-01-16 `corpora_public` dataset, using the Cohere reranker.

```bash
python3 chunking_benchmark/chroma_benchmark.py --run_indexing --run_reranking --to_rerank --run_eval --chunk_dir results/25-01-16-corpora_public/chunks --path_questions datasets/corpora_public/questions.csv --rerank_method "cohere";
```
More options:

```bash
python3 chunking_benchmark/chroma_benchmark.py -h
```