Table of Contents: Installation | Requirements | Quick Start | Citation
ARES is a framework for evaluating Retrieval-Augmented Generation (RAG) models. Its automated process combines synthetic data generation with fine-tuned classifiers to assess context relevance, answer faithfulness, and answer relevance, minimizing the need for extensive human annotations. ARES employs synthetic query generation and Prediction-Powered Inference (PPI), providing accurate evaluations with statistical confidence.
What does ARES assess in RAG models?
ARES evaluates Retrieval-Augmented Generation (RAG) systems along three dimensions: context relevance, answer faithfulness, and answer relevance. Together, these scores cover both the retrieval and the generation side of a RAG system.
How does ARES automate the evaluation process?
ARES minimizes the need for human labeling by using fine-tuned classifiers and synthetically generated data. Its PPI component, Prediction-Powered Inference, refines the evaluations by accounting for variability in model responses and provides statistical confidence in the results.
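To make the idea behind prediction-powered inference more concrete, here is a minimal sketch of a PPI-style mean estimate with a normal-approximation confidence interval. This is not ARES's implementation: the function name `ppi_mean_ci` and the toy numbers are illustrative assumptions. The classifier's average prediction on the large unlabeled set is corrected by the average error it makes on the small human-labeled set.

```python
import math
from statistics import NormalDist, mean, variance


def ppi_mean_ci(preds_unlabeled, preds_labeled, gold_labeled, alpha=0.05):
    """Hypothetical helper: PPI-style estimate of a mean label with a (1 - alpha) CI."""
    if len(preds_labeled) != len(gold_labeled):
        raise ValueError("preds_labeled and gold_labeled must have the same length")
    n_unlabeled, n_labeled = len(preds_unlabeled), len(gold_labeled)
    # Rectifier: how far the judge's predictions are from the human labels
    errors = [gold - pred for gold, pred in zip(gold_labeled, preds_labeled)]
    estimate = mean(preds_unlabeled) + mean(errors)
    # Uncertainty comes from both the unlabeled predictions and the labeled errors
    std_err = math.sqrt(
        variance(preds_unlabeled) / n_unlabeled + variance(errors) / n_labeled
    )
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return estimate, (estimate - z * std_err, estimate + z * std_err)


# Toy data (made up for illustration): 1 = relevant, 0 = not relevant
judge_on_unlabeled = [1, 1, 0, 1, 0, 1, 1, 0]
judge_on_labeled = [1, 0, 1, 1]
human_labels = [1, 0, 0, 1]
estimate, (low, high) = ppi_mean_ci(judge_on_unlabeled, judge_on_labeled, human_labels)
print(f"estimate={estimate:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

With only a handful of labeled examples the interval is wide; it narrows as the human-labeled set grows.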
Can ARES handle my custom RAG model?
Yes, ARES is a model-agnostic tool that enables you to generate synthetic queries and answers from your documents. With ARES, you can evaluate these generated queries and answers from your RAG model.
To install ARES, run the following command:
pip install ares-ai
Optional: Initialize your OpenAI or TogetherAI API key with the following commands:
export OPENAI_API_KEY=<your key here>
export TOGETHER_API_KEY=<your key here>
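If you want to confirm from Python that the keys are visible before running anything, a small check like the following may help. The helper name `missing_api_keys` is made up for this sketch; only the two environment variable names come from the commands above.

```python
import os


def missing_api_keys(env=None, names=("OPENAI_API_KEY", "TOGETHER_API_KEY")):
    """Return the names of API key variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_api_keys()
if missing:
    print("Not set:", ", ".join(missing))
else:
    print("Both API keys are set.")
```

Since the keys are optional, a missing key is only a problem if you plan to use that provider.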
To implement ARES for scoring your RAG system and comparing to other RAG configurations, you need three components:
- A human preference validation set of annotated query, document, and answer triples for the evaluation criteria (e.g. context relevance, answer faithfulness, and/or answer relevance). There should be at least 50 examples, but several hundred examples are ideal.
- A set of few-shot examples for scoring context relevance, answer faithfulness, and/or answer relevance in your system
- A much larger set of unlabeled query-document-answer triples outputted by your RAG system for scoring
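As a quick sanity check on the first component, the sketch below counts how many rows of a tab-separated validation set carry a label and compares that against the 50-example minimum. It assumes a TSV with a header row; the helper names and the `Query`/`Document`/`Answer` column names in the sample are illustrative assumptions, while `Context_Relevance_Label` is the label column used in the configs in this README.

```python
import csv


def count_labeled_rows(lines, label_column):
    """Count rows with a non-empty label_column in tab-separated lines (or an open file)."""
    reader = csv.DictReader(lines, delimiter="\t")
    if label_column not in (reader.fieldnames or []):
        raise KeyError(f"{label_column!r} is not a column in this file")
    return sum(1 for row in reader if (row[label_column] or "").strip())


def check_validation_set(lines, label_column, minimum=50):
    """Return (count, ok), where ok means at least `minimum` annotated examples."""
    count = count_labeled_rows(lines, label_column)
    return count, count >= minimum


# In-memory sample (made up); for a real file, pass
# open("nq_labeled_output.tsv", newline="", encoding="utf-8") instead.
sample = [
    "Query\tDocument\tAnswer\tContext_Relevance_Label\n",
    "q1\td1\ta1\t1\n",
    "q2\td2\ta2\t\n",
    "q3\td3\ta3\t0\n",
]
print(check_validation_set(sample, "Context_Relevance_Label"))
```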
To get started with ARES, you'll need to set up your configuration. Below is an example of a configuration for ARES!
Copy-paste each step to see ARES in action!
Run the following to get the files for the quick start! They include a few-shot prompt file, a labeled dataset, and an unlabeled dataset.
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets/nq_few_shot_prompt_v1.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_unlabeled_output.tsv
Note: You can run the following code to get the full NQ dataset!
from ares import ARES
ares = ARES()
ares.KILT_dataset("nq")
# Fetches NQ datasets with ratios including 0.5, 0.6, 0.7, etc.
# For purposes of our quick start guide, we rename nq_ratio_0.5 to nq_labeled_output and nq_ratio_0.6 to nq_unlabeled_output.
Step 1) Run the following to see GPT 3.5's accuracy on the NQ unlabeled dataset!
from ares import ARES
ues_idp_config = {
# Dataset for in-domain prompts
"in_domain_prompts_dataset": "nq_few_shot_prompt_v1.tsv",
# Dataset for unlabeled evaluation
"unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
"model_choice" : "gpt-3.5-turbo-0125"
}
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
Step 2) Run the following to see ARES's synthetic generation in action!
from ares import ARES
synth_config = {
"document_filepaths": "nq_labeled_output.tsv",
"few_shot_prompt_filename": "nq_few_shot_prompt_v1.tsv",
"synthetic_queries_filename": "output/synthetic_queries_1.tsv",
"documents_sampled": 10000
}
ares_module = ARES(synthetic_query_generator=synth_config)
results = ares_module.generate_synthetic_data()
print(results)
Step 3) Run the following to see ARES's training classifier in action!
from ares import ARES
classifier_config = {
"training_dataset": "output/synthetic_queries_1.tsv",
"validation_set": "nq_labeled_output.tsv",
"label_column": "Answer_Relevance_Label",
"num_epochs": 10,
"patience_value": 3,
"learning_rate": 5e-6
}
ares = ARES(classifier_model=classifier_config)
results = ares.train_classifier()
print(results)
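The `num_epochs` and `patience_value` settings suggest patience-based early stopping. The README does not spell out the exact semantics, so the sketch below only illustrates the common convention (stop once the validation loss has failed to improve for `patience_value` consecutive epochs). ARES's training loop may monitor a different metric or behave differently; the function name and loss values are made up.

```python
def epochs_run(validation_losses, patience_value=3):
    """Return how many epochs run before patience-based early stopping triggers."""
    best_loss = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses, start=1):
        if loss < best_loss:
            # New best validation loss: reset the patience counter
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience_value:
                return epoch
    # Patience never ran out: all scheduled epochs were used
    return len(validation_losses)


# Ten scheduled epochs (num_epochs=10) with made-up validation losses
losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.70, 0.60, 0.50, 0.40, 0.30]
print(epochs_run(losses, patience_value=3))
```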
Step 4) Run the following to see ARES's PPI in action!
from ares import ARES
ppi_config = {
"evaluation_datasets": ['nq_unlabeled_output.tsv'],
"few_shot_examples_filepath": "nq_few_shot_prompt_v1.tsv",
"checkpoints": ["/checkpoints/microsoft-deberta-v3-large/output-synthetic_queries_1.tsv/5e-06_1_True_Context_Relevance_Label_ratio_0.6_reformatted_full_articles_False_validation_with_negatives_428380.pt"], # CHANGE THIS
"labels": ["Context_Relevance_Label"],
"gold_label_path": "nq_labeled_output.tsv",
}
ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
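Because the checkpoint path has to be changed to match your own training run, a small helper for locating the newest checkpoint can save some searching. This is a sketch under assumptions: it supposes checkpoints are saved as `.pt` files somewhere under a `checkpoints` directory and that the file name contains the label, as in the example path above. The helper name `newest_checkpoint` is hypothetical.

```python
from pathlib import Path


def newest_checkpoint(checkpoint_dir="checkpoints", label="Context_Relevance_Label"):
    """Return the most recently modified .pt file whose name mentions `label`, or None."""
    root = Path(checkpoint_dir)
    if not root.is_dir():
        return None
    candidates = [path for path in root.rglob("*.pt") if label in path.name]
    return max(candidates, key=lambda path: path.stat().st_mtime, default=None)


# Prints None if no matching checkpoint exists yet
print(newest_checkpoint())
```

The returned path can then be placed in the "checkpoints" list of the PPI configuration.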
For more details, refer to our documentation.
To get started with ARES's PPI, you'll need to set up your configuration. Below is an example of a configuration for ARES!
Just copy-paste and you'll be good to go!
Step 1) Download necessary datasets
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets/nq_few_shot_prompt_v1.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_labeled_output.tsv
wget https://raw.githubusercontent.com/stanford-futuredata/ARES/new-dev/data/datasets_v2/nq/nq_unlabeled_output.tsv
Step 2) Run the following to retrieve the UES/IDP scores with GPT-3.5!
from ares import ARES
ues_idp_config = {
# Dataset for in-domain prompts
"in_domain_prompts_dataset": "nq_few_shot_prompt_v1.tsv",
# Dataset for unlabeled evaluation
"unlabeled_evaluation_set": "nq_unlabeled_output.tsv",
"model_choice" : "gpt-3.5-turbo-0125"
}
ares = ARES(ues_idp=ues_idp_config)
results = ares.ues_idp()
print(results)
# {'Context Relevance Scores': [Score], 'Answer Faithfulness Scores': [Score], 'Answer Relevance Scores': [Score]}
Step 3) Run the following to retrieve ARES's PPI scores with GPT-3.5!
We include synthetic datasets for key experimental results in synthetic_datasets. The few-shot prompts used for generation and evaluation are included in datasets. We also include instructions for fine-tuning LLM judges in the paper itself. Please reach out to jonsaadfalcon@stanford.edu or manihani@stanford.edu if you have any further questions.
To cite our work, please use the following BibTeX:
@misc{saadfalcon2023ares,
title={ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems},
author={Jon Saad-Falcon and Omar Khattab and Christopher Potts and Matei Zaharia},
year={2023},
eprint={2311.09476},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Machine requirements
- Over ~100 GB of available disk space
- GPU
  - Should work: A100 (e.g. Standard_NC24ads_A100_v4 on Azure)
  - Does not work:
    - Tested on 2023-12-17 with both Standard_NC6s_v3 and Standard_NC12s_v3, and ran into this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 160.00 MiB (GPU 0; 15.77 GiB total capacity; 15.12 GiB already allocated; 95.44 MiB free; 15.12 GiB reserved in total by PyTorch)
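To see how a machine compares against these requirements, a rough check such as the one below may be useful. It is only a sketch: the helper names are made up, and no minimum GPU memory is asserted because the README only reports that the ~16 GiB GPUs above ran out of memory while an A100 should work. The `nvidia-smi` query options used here are standard, but the tool must be installed for the GPU part to report anything.

```python
import shutil
import subprocess


def free_disk_gb(path="."):
    """Free disk space at `path` in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def parse_gpu_memory_mib(nvidia_smi_output):
    """Parse one total-memory value (MiB) per GPU, skipping non-numeric lines."""
    values = (line.strip() for line in nvidia_smi_output.splitlines())
    return [int(value) for value in values if value.isdigit()]


def gpu_memory_mib():
    """Query total memory per GPU; returns [] if nvidia-smi is unavailable or fails."""
    try:
        output = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return parse_gpu_memory_mib(output)


print(f"Free disk: {free_disk_gb():.0f} GB (over ~100 GB is recommended above)")
print("GPU memory (MiB):", gpu_memory_mib() or "no NVIDIA GPU detected")
```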
Machine setup
For example, on an Azure VM running Linux (Ubuntu 20.04), you will need to do the following:
- Install conda
  - First set of commands (can copy-paste multiple lines)
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
    chmod +x Miniconda3-latest-Linux-x86_64.sh
    ./Miniconda3-latest-Linux-x86_64.sh -b
  - Second set of commands (can copy-paste multiple lines)
    export PATH="$HOME/miniconda3/bin:$PATH"
    conda init
- Install gcc
  sudo apt-get -y update
  sudo apt-get -y upgrade
  sudo apt-get -y install build-essential
  sudo apt-get -y install libpcre3-dev
- Install NVIDIA drivers
  sudo apt install ubuntu-drivers-common -y
  sudo ubuntu-drivers autoinstall
  sudo reboot
- SSH in again and confirm the installation was successful by running nvidia-smi
- cd to the ARES folder and follow the rest of the README