ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems

Table of Contents: Installation | Requirements | Quick Start | Citation

ARES is a framework for evaluating Retrieval-Augmented Generation (RAG) systems. It combines synthetic data generation with fine-tuned classifiers to assess context relevance, answer faithfulness, and answer relevance, minimizing the need for extensive human annotations. ARES employs synthetic query generation and Prediction-Powered Inference (PPI), providing accurate evaluations with statistical confidence.

💬 Mini Q&A

What does ARES assess in RAG models?

ARES evaluates Retrieval-Augmented Generation (RAG) systems on three criteria: context relevance, answer faithfulness, and answer relevance.

How does ARES automate the evaluation process?

ARES minimizes the need for human labeling by using fine-tuned classifiers and synthetically generated data. Its PPI component, Prediction-Powered Inference, refines the evaluations by accounting for model response variability and provides statistical confidence in the results.
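
To make the PPI idea concrete, here is a minimal sketch of the generic prediction-powered estimator for a mean, such as the fraction of relevant contexts. It is not ARES's internal code: the function name and arguments are hypothetical, and labels and judge predictions are assumed to be 0/1 values. The judge's average on the large unlabeled set is corrected by the judge's average error on the small human-labeled set, and both sources of variance feed the confidence interval.

```python
import math
from statistics import NormalDist, fmean, variance


def ppi_mean_interval(labeled_truth, labeled_pred, unlabeled_pred, alpha=0.05):
    """Return (estimate, lower, upper) for the true mean label.

    labeled_truth  : human labels on the small annotated set
    labeled_pred   : judge predictions on that same annotated set
    unlabeled_pred : judge predictions on the large unlabeled set
    """
    if len(labeled_truth) != len(labeled_pred):
        raise ValueError("labeled_truth and labeled_pred must align")
    n, big_n = len(labeled_truth), len(unlabeled_pred)

    # Rectifier: how far the judge is off, on average, where we know the truth.
    errors = [y - f for y, f in zip(labeled_truth, labeled_pred)]
    estimate = fmean(unlabeled_pred) + fmean(errors)

    # Both the judge's spread and the rectifier's spread widen the interval.
    std_err = math.sqrt(variance(unlabeled_pred) / big_n + variance(errors) / n)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, estimate - z * std_err, estimate + z * std_err


est, low, high = ppi_mean_interval(
    labeled_truth=[1, 0, 1, 1],
    labeled_pred=[1, 0, 0, 1],
    unlabeled_pred=[1, 1, 0, 1, 0, 1, 1, 0],
)
print(f"estimate={est:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only four labeled examples the interval is very wide, which illustrates why a larger human-annotated set tightens the result.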

Can ARES handle my custom RAG model?

Yes, ARES is a model-agnostic tool that enables you to generate synthetic queries and answers from your documents. With ARES, you can evaluate these generated queries and answers from your RAG model.

⚙️ Installation

To install ARES, run the following command:

pip install ares-ai

Optional: Initialize your OpenAI or TogetherAI API key with the following commands:

export OPENAI_API_KEY=<your key here>
export TOGETHER_API_KEY=<your key here>
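
A quick way to confirm that the keys are visible to Python before running the quick start is sketched below. The helper name is hypothetical and not part of ARES; it only inspects the two environment variables named above.

```python
import os


def missing_api_keys(env=None, names=("OPENAI_API_KEY", "TOGETHER_API_KEY")):
    """Return the names of the API key variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("Both API keys are set.")
```

Both keys are optional, so a missing one is only a problem if the step you run calls that provider.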

📝 Requirements

To implement ARES for scoring your RAG system and comparing to other RAG configurations, you need three components:

A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g., context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples, but several hundred examples are ideal.
A set of few-shot examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system
A much larger set of unlabeled query-document-answer triples outputted by your RAG system for scoring
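
The size requirement for the validation set can be checked before running anything expensive. The sketch below is a hypothetical helper, not part of ARES; it assumes the validation set is a tab-separated file with a header row and a label column such as Context_Relevance_Label, as in the quick-start files.

```python
import csv
import io

MIN_VALIDATION_EXAMPLES = 50  # lower bound stated in the requirements above


def count_labeled_rows(tsv_file, label_column):
    """Count rows in an open TSV file that have a non-empty label."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    if label_column not in (reader.fieldnames or []):
        raise KeyError(f"column {label_column!r} not found in header")
    return sum(1 for row in reader if (row[label_column] or "").strip())


# Tiny in-memory example; the second row has no label and is not counted.
sample = io.StringIO("Query\tContext_Relevance_Label\nq1\t1\nq2\t\nq3\t0\n")
n = count_labeled_rows(sample, "Context_Relevance_Label")
print(n, "labeled rows;", "enough" if n >= MIN_VALIDATION_EXAMPLES else "too few")

# With a real file (path assumed to exist locally):
# with open("nq_labeled_output.tsv", newline="", encoding="utf-8") as f:
#     print(count_labeled_rows(f, "Context_Relevance_Label"))
```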

🚀 Quick Start - #1

To get started with ARES, you'll need to set up your configuration. Below is an example of a configuration for ARES!

Copy-paste each step to see ARES in action!

Run the following to get the files for the quick start! They include a few-shot prompt file, a labeled dataset, and an unlabeled dataset.

wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets/nq_few_shot_prompt_v1.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_unlabeled_output.tsv

Note: You can run the following code to get the full NQ dataset!

from ares import ARES
ares = ARES() 
ares.KILT_dataset("nq")

# Fetches NQ datasets with ratios including 0.5, 0.6, 0.7, etc.
# For purposes of our quick start guide, we rename nq_ratio_0.5 to nq_labeled_output and nq_ratio_0.6 to nq_unlabeled_output.

Step 1) Run the following to see GPT 3.5's accuracy on the NQ unlabeled dataset!

from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_v1.tsv",
    
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv", 

    "model_choice" : "gpt-3.5-turbo-0125"
} 

ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

Step 2) Run the following to see ARES's synthetic generation in action!

from ares import ARES

synth_config = { 
    "document_filepaths": "nq_labeled_output.tsv",
    "few_shot_prompt_filename": "nq_few_shot_prompt_v1.tsv",
    "synthetic_queries_filename": "data/output/synthetic_queries_1.tsv",
    "documents_sampled": 10000
}

ares_module = ARES(synthetic_query_generator=synth_config)
results = ares_module.generate_synthetic_data()
print(results)

Step 3) Run the following to see ARES's training classifier in action!

from ares import ARES

classifier_config = {
    # Point this at the file written in Step 2 ("synthetic_queries_filename")
    "training_dataset": "output/synthetic_queries_1.tsv", 
    "validation_set": "nq_labeled_output.tsv", 
    "label_column": "Answer_Relevance_Label", 
    "num_epochs": 10, 
    "patience_value": 3, 
    "learning_rate": 5e-6
}

ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)
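
The num_epochs and patience_value settings above read like a standard early-stopping setup: train for at most num_epochs, but stop once the validation loss has failed to improve for patience_value epochs in a row. That interpretation is an assumption about ARES, and the function below is a hypothetical, generic sketch of the technique rather than ARES code.

```python
def epochs_run_with_patience(val_losses, patience):
    """Return how many epochs run before early stopping triggers.

    val_losses : validation loss after each epoch, in order
    patience   : consecutive non-improving epochs tolerated before stopping
    """
    best = float("inf")
    stale = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(val_losses)


# Loss stops improving after epoch 3, so with patience 3 training ends at epoch 6.
print(epochs_run_with_patience([0.9, 0.7, 0.6, 0.65, 0.61, 0.62, 0.5], patience=3))
```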

Step 4) Run the following to see ARES's PPI in action!

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_labeled_output.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_v1.tsv",
    "checkpoints": ["/checkpoints/microsoft-deberta-v3-large/output-synthetic_queries_1.tsv/5e-06_1_True_Context_Relevance_Label_ratio_0.6_reformatted_full_articles_False_validation_with_negatives_428380.pt"], # CHANGE THIS
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_unlabeled_output.tsv", 
}

# Pass the PPI config defined above (not the classifier config from Step 3)
ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

For more details, refer to our documentation.

🚀 Quick Start - #2

To get started with ARES's PPI, you'll need to set up your configuration. Below is an example of a configuration for ARES!

Just copy-paste and you'll be good to go!

Step 1) Download necessary datasets

wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets/nq_few_shot_prompt_v1.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_unlabeled_output.tsv

Step 2) Run the following to retrieve the UES/IDP scores with GPT-3.5!

from ares import ARES

ues_idp_config = {
    # Dataset for in-domain prompts
    "in_domain_prompts_dataset": "nq_few_shot_prompt_v1.tsv",
    
    # Dataset for unlabeled evaluation
    "unlabeled_evaluation_set": "nq_unlabeled_output.tsv", 

    "model_choice" : "gpt-3.5-turbo-0125"
} 

ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}

Step 3) Run the following to retrieve ARES's PPI scores with GPT-3.5!

Results Replication

We include synthetic datasets for key experimental results in synthetic_datasets. The few-shot prompts used for generation and evaluation are included in datasets. We also include instructions for fine-tuning LLM judges in the paper itself. Please reach out to jonsaadfalcon@stanford.edu or manihani@stanford.edu if you have any further questions.

Citation

To cite our work, please use the following BibTeX:

@misc{saadfalcon2023ares,
      title={ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems}, 
      author={Jon Saad-Falcon and Omar Khattab and Christopher Potts and Matei Zaharia},
      year={2023},
      eprint={2311.09476},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Appendix

Machine requirements and setup when not using OpenAI API

Machine requirements

At least ~100 GB of available disk space
GPU
- Should work: A100 (e.g. Standard_NC24ads_A100_v4 on Azure)
- Does not work:
  - Tested on 2023-12-17 with both Standard_NC6s_v3 and Standard_NC12s_v3, and ran into this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.77 GiB total capacity; 15.12 GiB already allocated; 95.44 MiB free; 15.12 GiB reserved in total by PyTorch)
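
The disk-space requirement can be checked up front with the standard library. This is a minimal sketch with hypothetical helper names; the 100 GB threshold comes from the requirement above, and the path to check is an assumption (use the disk where the ARES checkpoints and datasets will live).

```python
import shutil

REQUIRED_FREE_GB = 100  # threshold taken from the machine requirements above


def free_disk_gb(path="."):
    """Free space, in gigabytes, on the filesystem that contains `path`."""
    return shutil.disk_usage(path).free / 1e9


def has_enough_disk(path=".", required_gb=REQUIRED_FREE_GB):
    """True if the filesystem containing `path` has at least `required_gb` free."""
    return free_disk_gb(path) >= required_gb


print(f"{free_disk_gb('.'):.1f} GB free; enough for ARES: {has_enough_disk('.')}")
```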

Machine setup

For example, on an Azure VM running Linux (Ubuntu 20.04), you will need to do the following:

Install conda
- First set of commands (can copy-paste multiple lines)
  - wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  - chmod +x Miniconda3-latest-Linux-x86_64.sh
  - ./Miniconda3-latest-Linux-x86_64.sh -b
- Second set of commands (can copy-paste multiple lines)
  - export PATH="$HOME/miniconda3/bin:$PATH"
  - conda init
Install gcc
- sudo apt-get -y update
- sudo apt-get -y upgrade
- sudo apt-get -y install build-essential
- sudo apt-get -y install libpcre3-dev
Install NVIDIA drivers
- sudo apt install ubuntu-drivers-common -y
- sudo ubuntu-drivers autoinstall
- sudo reboot
- SSH in again and confirm the installation was successful by running nvidia-smi
cd to ARES folder and follow the rest of the README
