Table of Contents: Installation | Requirements | Quick Start | Citation
ARES is a framework for evaluating Retrieval-Augmented Generation (RAG) models. Its automated process combines synthetic data generation with fine-tuned classifiers to assess context relevance, answer faithfulness, and answer relevance, minimizing the need for extensive human annotations. ARES employs synthetic query generation and Prediction-Powered Inference (PPI), providing accurate evaluations with statistical confidence.
What does ARES assess in RAG models?
ARES conducts a comprehensive evaluation of Retrieval-Augmented Generation (RAG) models, assessing the systems for context relevance, answer faithfulness, and answer relevance. This thorough assessment ensures a complete understanding of the performance of the RAG system.
How does ARES automate the evaluation process?
ARES minimizes the need for human labeling by leveraging fine-tuned classifiers and synthetically generated data. Its PPI component, Prediction-Powered Inference, refines evaluations by accounting for model response variability and provides statistical confidence in the results.
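To make the idea behind prediction-powered inference concrete, here is a minimal sketch of the standard PPI estimator for a mean (for example, the fraction of relevant contexts). It is not ARES's internal code: the function name `ppi_mean_ci` and the toy inputs are invented for illustration. The classifier's average prediction on the large unlabeled set is corrected by its average error on the small human-labeled set, and the confidence interval reflects the sampling noise of both terms.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def ppi_mean_ci(labeled_truth, labeled_preds, unlabeled_preds, alpha=0.05):
    """Prediction-powered estimate and (1 - alpha) confidence interval for a mean.

    labeled_truth   : human labels on the small annotated set (e.g. 0/1)
    labeled_preds   : classifier predictions on that same annotated set
    unlabeled_preds : classifier predictions on the large unlabeled set
    """
    if len(labeled_truth) != len(labeled_preds):
        raise ValueError("labeled_truth and labeled_preds must align")
    # Rectifier: how far off the classifier is, on average, where the truth is known.
    errors = [y - f for y, f in zip(labeled_truth, labeled_preds)]
    estimate = mean(unlabeled_preds) + mean(errors)
    # Both the unlabeled predictions and the rectifier contribute sampling noise.
    std_err = sqrt(variance(unlabeled_preds) / len(unlabeled_preds)
                   + variance(errors) / len(errors))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


# Toy data (hypothetical): the classifier over-predicts "relevant" on 2 of 10 labeled items.
truth = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
preds = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0]
unlabeled = [1] * 70 + [0] * 30
estimate, (low, high) = ppi_mean_ci(truth, preds, unlabeled)
print(f"PPI estimate: {estimate:.3f}, 95% CI: ({low:.3f}, {high:.3f})")
```

With only ten labeled examples the interval is wide. Adding human-labeled examples shrinks the rectifier's variance and tightens the interval.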
Can ARES handle my custom RAG model?
Yes, ARES is a model-agnostic tool that enables you to generate synthetic queries and answers from your documents. With ARES, you can evaluate these generated queries and answers from your RAG model.
To install ARES, run the following command:
pip install ares-ai
Optional: Initialize your OpenAI or TogetherAI API key with the following commands:
export OPENAI_API_KEY=<your key here>
export TOGETHER_API_KEY=<your key here>
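Before calling a hosted model, it can help to confirm that the keys are visible to Python. The helper below is a small sketch that is not part of ARES (the name `configured_api_keys` is made up); it only inspects the two environment variables exported above.

```python
import os


def configured_api_keys(env=None):
    """Return the names of the supported API-key variables that are set and non-empty."""
    env = os.environ if env is None else env
    names = ("OPENAI_API_KEY", "TOGETHER_API_KEY")
    return [name for name in names if env.get(name, "").strip()]


found = configured_api_keys()
if found:
    print("API keys configured:", ", ".join(found))
else:
    print("No API key found; export OPENAI_API_KEY or TOGETHER_API_KEY first.")
```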
To implement ARES for scoring your RAG system and comparing it to other RAG configurations, you need three components:
- A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g. context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples but several hundred examples is ideal.
- A set of few-shot examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system
- A much larger set of unlabeled query-document-answer triples outputted by your RAG system for scoring
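As a sanity check on the first component, the sketch below counts labeled rows in a TSV validation set and enforces the 50-example minimum. It is an illustration, not an ARES utility: the function name is hypothetical, and the default text column names (`Query`, `Document`, `Answer`) are assumptions, so adjust them to match your file.

```python
import csv

MIN_LABELED_EXAMPLES = 50  # lower bound stated above; several hundred is better


def check_validation_set(path, label_column, text_columns=("Query", "Document", "Answer")):
    """Check a TSV validation set and return the number of rows that carry a label.

    The default column names are assumptions for illustration only.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        header = reader.fieldnames or []
        missing = [c for c in (*text_columns, label_column) if c not in header]
        if missing:
            raise ValueError(f"{path} is missing columns: {missing}")
        # Count only rows whose label cell is non-empty.
        labeled = sum(1 for row in reader if (row[label_column] or "").strip())
    if labeled < MIN_LABELED_EXAMPLES:
        raise ValueError(
            f"{path} has {labeled} labeled rows; need at least {MIN_LABELED_EXAMPLES}"
        )
    return labeled
```

For example, `check_validation_set("my_labeled_set.tsv", "Context_Relevance_Label")` would return the number of labeled rows, or raise a `ValueError` describing what is missing (the file name here is a placeholder).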
To get started with ARES, you'll need to set up your configuration. Below is an example of a configuration for ARES!
Copy-paste each step to see ARES in action!
Run the following to get the files for the quick start! They include a few-shot prompt file, a labeled dataset, and an unlabeled dataset!
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets/nq_few_shot_prompt_v1.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_unlabeled_output.tsv
Note: You can run the following commands to get the full NQ dataset!
from ares import ARES
ares = ARES()
ares.KILT_dataset("nq")
# Fetches NQ datasets with ratios including 0.5, 0.6, 0.7, etc.
# For purposes of our quick start guide, we rename nq_ratio_0.5 to nq_labeled_output and nq_ratio_0.6 to nq_unlabeled_output.
Step 1) Run the following to see GPT 3.5's accuracy on the NQ unlabeled dataset!
from ares import ARES
ues_idp_config = {
# Dataset for in-domain prompts
"in_domain_prompts_dataset": "nq_few_shot_prompt_v1.tsv",
# Dataset for unlabeled evaluation
"unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
"model_choice" : "gpt-3.5-turbo-0125"
}
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
Step 2) Run the following to see ARES's synthetic generation in action!
from ares import ARES
synth_config = {
"document_filepaths": "nq_labeled_output.tsv",
"few_shot_prompt_filename": "nq_few_shot_prompt_v1.tsv",
"synthetic_queries_filename": "data/output/synthetic_queries_1.tsv",
"documents_sampled": 10000
}
ares_module = ARES(synthetic_query_generator=synth_config)
results = ares_module.generate_synthetic_data()
print(results)
Step 3) Run the following to see ARES's training classifier in action!
from ares import ARES
classifier_config = {
"training_dataset": "output/synthetic_queries_1.tsv",
"validation_set": "nq_labeled_output.tsv",
"label_column": "Answer_Relevance_Label",
"num_epochs": 10,
"patience_value": 3,
"learning_rate": 5e-6
}
ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)
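The `num_epochs` and `patience_value` settings above suggest patience-based early stopping. Assuming `patience_value` has its usual meaning (stop once the validation metric has failed to improve for that many consecutive epochs), the general mechanism looks like the sketch below. This is a generic illustration with a made-up loss curve, not ARES's training loop.

```python
def epochs_run(validation_losses, num_epochs=10, patience_value=3):
    """Return (epochs actually run, best epoch) under patience-based early stopping.

    validation_losses: one loss per epoch, standing in for real training.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses[:num_epochs], start=1):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience_value:
                return epoch, best_epoch  # stop early
    return min(len(validation_losses), num_epochs), best_epoch


# Hypothetical loss curve: improves for three epochs, then plateaus.
losses = [0.90, 0.70, 0.60, 0.62, 0.61, 0.63, 0.64, 0.65, 0.66, 0.67]
stopped_at, best = epochs_run(losses)
print(f"stopped after epoch {stopped_at}, best epoch was {best}")
```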
Step 4) Run the following to see ARES's PPI in action!
from ares import ARES
ppi_config = {
"evaluation_datasets": ['nq_labeled_output.tsv'],
"few_shot_examples_filepath": "nq_few_shot_prompt_v1.tsv",
"checkpoints": ["/checkpoints/microsoft-deberta-v3-large/output-synthetic_queries_1.tsv/5e-06_1_True_Context_Relevance_Label_ratio_0.6_reformatted_full_articles_False_validation_with_negatives_428380.pt"], # CHANGE THIS
"labels": ["Context_Relevance_Label"],
"gold_label_path": "nq_unlabeled_output.tsv",
}
ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
For more details, refer to our documentation.
To get started with ARES's PPI, you'll need to set up your configuration. Below is an example of a configuration for ARES!
Just copy-paste and you'll be good to go!
Step 1) Download necessary datasets
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets/nq_few_shot_prompt_v1.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_unlabeled_output.tsv
Step 2) Run the following to retrieve the UES/IDP scores with GPT-3.5!
from ares import ARES
ues_idp_config = {
# Dataset for in-domain prompts
"in_domain_prompts_dataset": "nq_few_shot_prompt_v1.tsv",
# Dataset for unlabeled evaluation
"unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
"model_choice" : "gpt-3.5-turbo-0125"
}
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
Step 3) Run the following to retrieve ARES's PPI scores with GPT-3.5!
We include synthetic datasets for key experimental results in synthetic_datasets. The few-shot prompts used for generation and evaluation are included in datasets. We also include instructions for fine-tuning LLM judges in the paper itself. Please reach out to jonsaadfalcon@stanford.edu or manihani@stanford.edu if you have any further questions.
To cite our work, please use the following BibTeX:
@misc{saadfalcon2023ares,
title={ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems},
author={Jon Saad-Falcon and Omar Khattab and Christopher Potts and Matei Zaharia},
year={2023},
eprint={2311.09476},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Machine requirements
- Over ~100 GB of available disk space
- GPU
  - Should work: A100 (e.g. Standard_NC24ads_A100_v4 on Azure)
  - Does not work:
    - Tested on 2023-12-17 with both Standard_NC6s_v3 and Standard_NC12s_v3, and ran into this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.77 GiB total capacity; 15.12 GiB already allocated; 95.44 MiB free; 15.12 GiB reserved in total by PyTorch)
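A quick way to compare a machine against these requirements is sketched below. It is not part of ARES and the helper names are made up. It uses the standard library to read free disk space and, if `nvidia-smi` is on the PATH, asks it for each GPU's total memory.

```python
import shutil
import subprocess

REQUIRED_FREE_GB = 100  # disk space listed under machine requirements


def free_disk_gb(path="."):
    """Free space, in GB (10**9 bytes), on the filesystem containing `path`."""
    return shutil.disk_usage(path).free / 1e9


def gpu_memory_mib():
    """Total memory per GPU in MiB as reported by nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    return [int(line) for line in result.stdout.splitlines() if line.strip().isdigit()]


disk = free_disk_gb()
status = "ok" if disk > REQUIRED_FREE_GB else "below the ~100 GB guideline"
print(f"free disk: {disk:.0f} GB ({status})")
gpus = gpu_memory_mib()
print("GPU memory (MiB):", gpus if gpus else "no NVIDIA GPU detected")
```

For reference, the VMs that failed above reported 15.77 GiB of total GPU memory, so a card in that range is likely too small.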
Machine setup
For example, on an Azure VM running Linux (Ubuntu 20.04), you will need to do the following:
- Install conda
- First set of commands (can copy-paste multiple lines)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
chmod +x Miniconda3-latest-Linux-x86_64.sh
./Miniconda3-latest-Linux-x86_64.sh -b
- Second set of commands (can copy-paste multiple lines)
export PATH="$HOME/miniconda3/bin:$PATH"
conda init
- Install gcc
sudo apt-get -y update
sudo apt-get -y upgrade
sudo apt-get -y install build-essential
sudo apt-get -y install libpcre3-dev
- Install NVIDIA drivers
sudo apt install ubuntu-drivers-common -y
sudo ubuntu-drivers autoinstall
sudo reboot
- SSH in again and confirm the installation was successful by running nvidia-smi
- cd to the ARES folder and follow the rest of the README