RAG Text Chunking Benchmark

This benchmark evaluates chunking (text segmentation) strategies within a RAG pipeline, taking token-level retrieved contexts into account. It is based on the work "Evaluating Chunking Strategies for Retrieval" by Chroma; we use ragas for evaluation instead of their package.

For each text corpus in the evaluation dataset, we have pairs of questions and reference contexts. Given a question, the task is to compare the retrieved context against the reference context. We evaluate different chunking strategies by the quality of the retrieved context, keeping all other parts of the RAG pipeline the same.

Datasets can be found in the datasets directory.

  • chroma is taken directly from the original report

  • corpora_public is data that we gathered from wikitexts and openwebtext. We generated the questions using a methodology similar to the one outlined in the report above (see the loading sketch below).
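As a minimal illustration, the question / reference-context pairs can be loaded as follows. The CSV path and the column names are assumptions for illustration only; check the dataset files for the actual schema.

# Sketch: load question / reference-context pairs from a dataset CSV.
# Path and column names ("question", "reference_context") are assumed.
import csv

def load_pairs(path: str = "datasets/corpora_public/questions.csv"):
    with open(path, newline="", encoding="utf-8") as f:
        return [(row["question"], row["reference_context"])
                for row in csv.DictReader(f)]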

Note

The latest results and visualizations are live at https://subnet.chunking.com/benchmarks

Methodology

  1. Chunking
    • Each chunker chunked each corpus individually; the resulting chunks were then embedded and stored in Pinecone for later use (see the sketch after the note below).
    • Each chunker's chunks were written to a separate Pinecone namespace so that chunks from different chunkers were never mixed.

Note

If a chunker could not chunk a corpus properly, we exclude that corpus from the evaluation for that chunker.
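A minimal sketch of this indexing step, assuming the current Pinecone and OpenAI Python clients; the index name, embedding model, and per-chunk metadata are placeholders for illustration, not the benchmark's exact configuration.

# Sketch: embed a chunker's chunks and upsert them into a per-chunker namespace.
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
index = Pinecone().Index("chunking-benchmark")  # hypothetical index name

def index_chunks(chunker_name: str, corpus_id: str, chunks: list[str]) -> None:
    # Embed every chunk, then upsert into a namespace dedicated to this chunker
    # so chunks produced by different chunkers never mix.
    embeddings = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed embedding model
        input=chunks,
    ).data
    index.upsert(
        vectors=[
            {
                "id": f"{corpus_id}-{i}",
                "values": emb.embedding,
                "metadata": {"text": chunk, "corpus": corpus_id},
            }
            for i, (chunk, emb) in enumerate(zip(chunks, embeddings))
        ],
        namespace=chunker_name,
    )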

  2. Retrieval considering a context token limit (done for each question, using a combination of a vector database and a reranker; see the sketch after the note below):
    • Query the vector database with the question embedding for the top rerank_buffer_size=400 chunks.
    • Pass all rerank_buffer_size chunks to a reranker (Cohere reranker v2).
    • At each token_limit in [50, 150, 250]:
      • Aggregate the top reranked chunks up to token_limit tokens and use the aggregated text as the retrieved context for the question at that token_limit.
    • We used Cohere's reranker with the same parameters for all chunking methods in this benchmark.

Note

The cached retrieval results are available in the cached_retrieval/ directory
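A hedged sketch of the retrieval step for one question, combining a Pinecone query, Cohere reranking, and greedy aggregation up to token_limit. The embedding model, the Cohere model name, and the tokenizer are assumptions for illustration.

import cohere
import tiktoken
from openai import OpenAI
from pinecone import Pinecone

co = cohere.Client()
openai_client = OpenAI()
index = Pinecone().Index("chunking-benchmark")  # hypothetical index name
enc = tiktoken.get_encoding("cl100k_base")      # assumed tokenizer

def retrieve(question: str, chunker_name: str, token_limit: int,
             rerank_buffer_size: int = 400) -> str:
    # 1. Vector search over this chunker's namespace.
    q_emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=question).data[0].embedding
    hits = index.query(vector=q_emb, top_k=rerank_buffer_size,
                       include_metadata=True, namespace=chunker_name)
    docs = [m.metadata["text"] for m in hits.matches]

    # 2. Rerank all retrieved chunks with Cohere.
    reranked = co.rerank(model="rerank-english-v2.0", query=question,
                         documents=docs, top_n=len(docs))

    # 3. Greedily aggregate the top reranked chunks up to token_limit tokens.
    context, used = [], 0
    for r in reranked.results:
        n_tokens = len(enc.encode(docs[r.index]))
        if used + n_tokens > token_limit:
            break
        context.append(docs[r.index])
        used += n_tokens
    return "\n\n".join(context)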

  3. Retrieval Evaluation (Benchmark):
    • Metrics: we use ragas's context precision and context recall, configured with {llm: "gpt-4o-mini", embedding: "text-embedding-3-small"}.
    • For each token_limit, we apply the metrics to triplets of (question, reference context, retrieved context from above).
    • Repeating this for every token_limit, we record the metrics along with token_limit and the average chunk size in tokens (avg_chunk_size) for each chunker (see the sketch below).
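A minimal sketch of the scoring step with ragas. The ragas API differs between versions; this follows the legacy question / contexts / ground_truth schema and may need adjustment for the version pinned in pyproject.toml.

from datasets import Dataset
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from ragas import evaluate
from ragas.metrics import context_precision, context_recall

def score(questions, retrieved_contexts, reference_contexts):
    # One row per question: the retrieved context goes in "contexts",
    # the reference context in "ground_truth".
    ds = Dataset.from_dict({
        "question": questions,
        "contexts": [[ctx] for ctx in retrieved_contexts],
        "ground_truth": reference_contexts,
    })
    return evaluate(
        ds,
        metrics=[context_precision, context_recall],
        llm=ChatOpenAI(model="gpt-4o-mini"),
        embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    )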

[Diagram explaining the chunking benchmark]

Reproducibility

Set up

Set up credentials in the .env file.

cp .env.example .env
# then fill in the credentials

Set up the environment from the provided pyproject.toml file using uv.

uv sync
source .venv/bin/activate

Usage

Example benchmark run on the 2025-01-16 corpora_public dataset, using the Cohere reranker.

python3 chunking_benchmark/chroma_benchmark.py --run_indexing --run_reranking --to_rerank --run_eval --chunk_dir results/25-01-16-corpora_public/chunks --path_questions datasets/corpora_public/questions.csv --rerank_method "cohere";

For more options:

python3 chunking_benchmark/chroma_benchmark.py -h
