|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Evaluate using Azure OpenAI Graders with Azure AI Foundry SDK\n", |
| 8 | + "\n", |
| 9 | + "## Objective\n", |
| 10 | + "\n", |
| 11 | + "This tutorial offers a step-by-step guide for evaluating large language models using Azure OpenAI Graders and their model outputs.\n", |
| 12 | + "In Azure AI Foundry SDK, we are now supporting four new AOAI Graders:\n", |
| 13 | + "- Model Labeler: Uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with explanations for why each label was chosen.\n", |
| 14 | + "- String Check: Compares input text to a reference value, checking for exact or partial matches with optional case insensitivity. Useful for flexible text validations and pattern matching.\n", |
| 15 | + "- Text Similarity: Evaluates how closely input text matches a reference value using similarity metrics like`fuzzy_match`, `BLEU`, `ROUGE`, or `METEOR`. Useful for assessing text quality or semantic closeness.\n", |
| 16 | + "- General Grader: Advanced users have the capability to import or define a custom grader and integrate it into the AOAI general grader. This allows for evaluations to be performed based on specific areas of interest aside from the existing AOAI graders. \n", |
| 17 | + "\n", |
| 18 | + "This tutorial uses the following AI services:\n", |
| 19 | + "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n", |
| 20 | + "\n", |
| 21 | + "## Time\n", |
| 22 | + "\n", |
| 23 | + "You should expect to spend about 15 minutes running this notebook.\n", |
| 24 | + "\n", |
| 25 | + "## Before you begin\n", |
| 26 | + "\n", |
| 27 | + "### Installation\n", |
| 28 | + "\n", |
| 29 | + "Install the following packages requried to execute this notebook." |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "code", |
| 34 | + "execution_count": null, |
| 35 | + "metadata": {}, |
| 36 | + "outputs": [], |
| 37 | + "source": [ |
| 38 | + "%pip install azure.ai.projects\n", |
| 39 | + "%pip install azure_ai_evaluation" |
| 40 | + ] |
| 41 | + }, |
| 42 | + { |
| 43 | + "cell_type": "markdown", |
| 44 | + "metadata": {}, |
| 45 | + "source": [ |
| 46 | + "### Parameters and imports" |
| 47 | + ] |
| 48 | + }, |
| 49 | + { |
| 50 | + "cell_type": "code", |
| 51 | + "execution_count": null, |
| 52 | + "metadata": {}, |
| 53 | + "outputs": [], |
| 54 | + "source": [ |
| 55 | + "import os\n", |
| 56 | + "from dotenv import load_dotenv\n", |
| 57 | + "\n", |
| 58 | + "load_dotenv()\n", |
| 59 | + "from azure.ai.evaluation import (\n", |
| 60 | + " AzureOpenAIModelConfiguration,\n", |
| 61 | + " AzureOpenAILabelGrader,\n", |
| 62 | + " AzureOpenAIStringCheckGrader,\n", |
| 63 | + " AzureOpenAITextSimilarityGrader,\n", |
| 64 | + " AzureOpenAIGrader,\n", |
| 65 | + " evaluate,\n", |
| 66 | + ")\n", |
| 67 | + "from openai.types.eval_string_check_grader import EvalStringCheckGrader" |
| 68 | + ] |
| 69 | + }, |
| 70 | + { |
| 71 | + "cell_type": "markdown", |
| 72 | + "metadata": {}, |
| 73 | + "source": [ |
| 74 | + "### Environment variables\n", |
| 75 | + "\n", |
| 76 | + "Set these environment variables with your own values:\n", |
| 77 | + "1) **AZURE_OPENAI_ENDPOINT** - Azure Open AI Endpoint to be used for evaluation.\n", |
| 78 | + "2) **AZURE_OPENAI_API_KEY** - Azure Open AI Key to be used for evaluation.\n", |
| 79 | + "3) **AZURE_OPENAI_API_VERSION** - Azure Open AI Api version to be used for evaluation.\n", |
| 80 | + "4) **MODEL_DEPLOYMENT_NAME** - Model deployment to be used for evaluation\n", |
| 81 | + "5) **AZURE_SUBSCRIPTION_ID** - Azure Subscription Id of Azure AI Project\n", |
| 82 | + "6) **PROJECT_NAME** - Azure AI Project Name\n", |
| 83 | + "7) **RESOURCE_GROUP_NAME** - Azure AI Project Resource Group Name" |
| 84 | + ] |
| 85 | + }, |
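|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "The next cell is an optional sanity check (a small sketch) that the variables listed above are set before they are read below; the last three are only needed if you later upload results to your Foundry project." |
|  | + ] |
|  | + }, |
|  | + { |
|  | + "cell_type": "code", |
|  | + "execution_count": null, |
|  | + "metadata": {}, |
|  | + "outputs": [], |
|  | + "source": [ |
|  | + "# Optional sanity check: report any of the variables listed above that are not set.\n", |
|  | + "required_vars = [\n", |
|  | + "    \"AZURE_OPENAI_ENDPOINT\",\n", |
|  | + "    \"AZURE_OPENAI_API_KEY\",\n", |
|  | + "    \"AZURE_OPENAI_API_VERSION\",\n", |
|  | + "    \"MODEL_DEPLOYMENT_NAME\",\n", |
|  | + "    \"AZURE_SUBSCRIPTION_ID\",\n", |
|  | + "    \"PROJECT_NAME\",\n", |
|  | + "    \"RESOURCE_GROUP_NAME\",\n", |
|  | + "]\n", |
|  | + "missing = [name for name in required_vars if not os.environ.get(name)]\n", |
|  | + "if missing:\n", |
|  | + "    print(f\"Missing environment variables: {missing}\")\n", |
|  | + "else:\n", |
|  | + "    print(\"All expected environment variables are set.\")" |
|  | + ] |
|  | + }, |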
| 86 | + { |
| 87 | + "cell_type": "code", |
| 88 | + "execution_count": null, |
| 89 | + "metadata": {}, |
| 90 | + "outputs": [], |
| 91 | + "source": [ |
| 92 | + "model_config = AzureOpenAIModelConfiguration(\n", |
| 93 | + " azure_endpoint=os.environ[\"AZURE_OPENAI_ENDPOINT\"],\n", |
| 94 | + " api_key=os.environ[\"AZURE_OPENAI_API_KEY\"],\n", |
| 95 | + " api_version=os.environ[\"AZURE_OPENAI_API_VERSION\"],\n", |
| 96 | + " azure_deployment=os.environ[\"MODEL_DEPLOYMENT_NAME\"],\n", |
| 97 | + ")\n", |
| 98 | + "\n", |
| 99 | + "project = {\n", |
| 100 | + " \"subscription_id\": os.environ[\"AZURE_SUBSCRIPTION_ID\"],\n", |
| 101 | + " \"project_name\": os.environ[\"PROJECT_NAME\"],\n", |
| 102 | + " \"resource_group_name\": os.environ[\"RESOURCE_GROUP_NAME\"],\n", |
| 103 | + "}" |
| 104 | + ] |
| 105 | + }, |
| 106 | + { |
| 107 | + "cell_type": "markdown", |
| 108 | + "metadata": {}, |
| 109 | + "source": [ |
| 110 | + "### Data\n", |
| 111 | + "To run the evaluation against your own data, replace this data file with your sample data" |
| 112 | + ] |
| 113 | + }, |
| 114 | + { |
| 115 | + "cell_type": "code", |
| 116 | + "execution_count": null, |
| 117 | + "metadata": {}, |
| 118 | + "outputs": [], |
| 119 | + "source": [ |
| 120 | + "fname = \"data.jsonl\"" |
| 121 | + ] |
| 122 | + }, |
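|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "The graders below reference `{{item.query}}`, `{{item.response}}`, and `{{item.ground_truth}}`, so each line of `data.jsonl` is assumed to contain those three fields. The next cell is an optional sketch that writes a small placeholder `data.jsonl` in that shape if the file is not already present; skip it if you have your own data." |
|  | + ] |
|  | + }, |
|  | + { |
|  | + "cell_type": "code", |
|  | + "execution_count": null, |
|  | + "metadata": {}, |
|  | + "outputs": [], |
|  | + "source": [ |
|  | + "# Optional: write a tiny placeholder data.jsonl with the columns the graders below expect.\n", |
|  | + "# The rows are illustrative only, not real evaluation data.\n", |
|  | + "import json\n", |
|  | + "\n", |
|  | + "sample_rows = [\n", |
|  | + "    {\n", |
|  | + "        \"query\": \"What is the capital of France?\",\n", |
|  | + "        \"response\": \"The capital of France is Paris.\",\n", |
|  | + "        \"ground_truth\": \"Paris\",\n", |
|  | + "    },\n", |
|  | + "]\n", |
|  | + "\n", |
|  | + "if not os.path.exists(fname):\n", |
|  | + "    with open(fname, \"w\") as f:\n", |
|  | + "        for row in sample_rows:\n", |
|  | + "            f.write(json.dumps(row) + \"\\n\")" |
|  | + ] |
|  | + }, |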
| 123 | + { |
| 124 | + "cell_type": "markdown", |
| 125 | + "metadata": {}, |
| 126 | + "source": [ |
| 127 | + "## Create grader objects\n", |
| 128 | + "\n", |
| 129 | + "Before executing the evaluation, the grader objects needs to be defined. Aside from `model_config` and customizable grader `name`, each graders have few unique parameters that are required for set up.\n", |
| 130 | + "\n", |
| 131 | + "### Model Labeler\n", |
| 132 | + "\n", |
| 133 | + "This grader uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with explanations for why each label was chosen. To correctly set up, the following unique parameters are required:\n", |
| 134 | + "- input: Identifies the column in `data.jsonl` that the evaluator will use for classification. It also defines the custom prompt that instructs the model on how to perform the classification.\n", |
| 135 | + "- labels: Lists all possible labels that the evaluator can assign to the input data based on the custom prompt.\n", |
| 136 | + "- passing_labels: Specifies which of the defined labels are considered successful or acceptable outcomes.\n", |
| 137 | + "- model: Indicates the model that will be used to classify the input data according to the custom prompt." |
| 138 | + ] |
| 139 | + }, |
| 140 | + { |
| 141 | + "cell_type": "code", |
| 142 | + "execution_count": null, |
| 143 | + "metadata": {}, |
| 144 | + "outputs": [], |
| 145 | + "source": [ |
| 146 | + "# Determine if the response column contains texts that are too short, just right, or too long\n", |
| 147 | + "label_grader = AzureOpenAILabelGrader(\n", |
| 148 | + " model_config=model_config,\n", |
| 149 | + " input=[\n", |
| 150 | + " {\"content\": \"{{item.response}}\", \"role\": \"user\"},\n", |
| 151 | + " {\n", |
| 152 | + " \"content\": \"Any text including space that's more than 600 characters are too long, less than 500 characters are too short; 500 to 600 characters are just right.\",\n", |
| 153 | + " \"role\": \"user\",\n", |
| 154 | + " \"type\": \"message\",\n", |
| 155 | + " },\n", |
| 156 | + " ],\n", |
| 157 | + " labels=[\"too short\", \"just right\", \"too long\"],\n", |
| 158 | + " passing_labels=[\"just right\"],\n", |
| 159 | + " model=\"gpt-4o\",\n", |
| 160 | + " name=\"label\",\n", |
| 161 | + ")" |
| 162 | + ] |
| 163 | + }, |
| 164 | + { |
| 165 | + "cell_type": "markdown", |
| 166 | + "metadata": {}, |
| 167 | + "source": [ |
| 168 | + "### Text Similarity\n", |
| 169 | + "\n", |
| 170 | + "This grader evaluates how closely input text matches a reference value using similarity metrics like`fuzzy_match`, `BLEU`, `ROUGE`, or `METEOR`. Useful for assessing text quality or semantic closeness. To correctly set up, the following unique parameters are required:\n", |
| 171 | + "- evaluation_metric: Specifies the similarity metric to be used for evaluation.\n", |
| 172 | + "- input: Specifies the column in `data.jsonl` that that contains the text to be evaluated.\n", |
| 173 | + "- pass_threshold: Defines the minimum similarity score required for an output to be considered a passing result in the evaluation.\n", |
| 174 | + "- reference: specifies the column in `data.jsonl` that contains the reference text against which the input will be compared." |
| 175 | + ] |
| 176 | + }, |
| 177 | + { |
| 178 | + "cell_type": "code", |
| 179 | + "execution_count": null, |
| 180 | + "metadata": {}, |
| 181 | + "outputs": [], |
| 182 | + "source": [ |
| 183 | + "# Pass if response column and ground_truth column similarity score >= 0.5 using \"fuzzy_match\"\n", |
| 184 | + "sim_grader = AzureOpenAITextSimilarityGrader(\n", |
| 185 | + " model_config=model_config,\n", |
| 186 | + " evaluation_metric=\"fuzzy_match\", # support evaluation metrics including: \"fuzzy_match\", \"bleu\", \"gleu\", \"meteor\", \"rouge_1\", \"rouge_2\", \"rouge_3\", \"rouge_4\", \"rouge_5\", \"rouge_l\", \"cosine\".\n", |
| 187 | + " input=\"{{item.response}}\",\n", |
| 188 | + " name=\"similarity\",\n", |
| 189 | + " pass_threshold=0.5,\n", |
| 190 | + " reference=\"{{item.ground_truth}}\",\n", |
| 191 | + ")" |
| 192 | + ] |
| 193 | + }, |
| 194 | + { |
| 195 | + "cell_type": "markdown", |
| 196 | + "metadata": {}, |
| 197 | + "source": [ |
| 198 | + "### String Check\n", |
| 199 | + "\n", |
| 200 | + "This grader compares input text to a reference value, checking for exact or partial matches with optional case insensitivity. Useful for flexible text validations and pattern matching. To correctly set up, the following unique parameters are required:\n", |
| 201 | + "- input: Specifies the column in `data.jsonl` that that contains the text to be evaluated.\n", |
| 202 | + "- operation: Defines the operation type of this grader object.\n", |
| 203 | + "- reference: specifies the reference value against which the input will be compared." |
| 204 | + ] |
| 205 | + }, |
| 206 | + { |
| 207 | + "cell_type": "code", |
| 208 | + "execution_count": null, |
| 209 | + "metadata": {}, |
| 210 | + "outputs": [], |
| 211 | + "source": [ |
| 212 | + "# Pass if the query column contains \"What is\"\n", |
| 213 | + "string_grader = AzureOpenAIStringCheckGrader(\n", |
| 214 | + " model_config=model_config,\n", |
| 215 | + " input=\"{{item.query}}\",\n", |
| 216 | + " name=\"Contains What is\",\n", |
| 217 | + " operation=\"like\", # \"eq\" for equal, \"ne\" for not equal, \"like\" for contain, \"ilike\" for case insensitive contain\n", |
| 218 | + " reference=\"What is\",\n", |
| 219 | + ")" |
| 220 | + ] |
| 221 | + }, |
| 222 | + { |
| 223 | + "cell_type": "markdown", |
| 224 | + "metadata": {}, |
| 225 | + "source": [ |
| 226 | + "### General Grader\n", |
| 227 | + "\n", |
| 228 | + "This grader enables advanced users to import or define a custom grader and integrate it into the AOAI general grader. This allows for evaluations to be performed based on specific areas of interest aside from the existing AOAI graders. To correctly set up, a separate grader object needs to be created and defined. The defined grader can be passed into the general grader using parameter `grader_config`." |
| 229 | + ] |
| 230 | + }, |
| 231 | + { |
| 232 | + "cell_type": "code", |
| 233 | + "execution_count": null, |
| 234 | + "metadata": {}, |
| 235 | + "outputs": [], |
| 236 | + "source": [ |
| 237 | + "# Define an string check grader config directly using the OAI SDK\n", |
| 238 | + "oai_string_check_grader = EvalStringCheckGrader(\n", |
| 239 | + " input=\"{{item.query}}\", name=\"contains hello\", operation=\"like\", reference=\"hello\", type=\"string_check\"\n", |
| 240 | + ")\n", |
| 241 | + "# Plug that into the general grader\n", |
| 242 | + "general_grader = AzureOpenAIGrader(model_config=model_config, grader_config=oai_string_check_grader)" |
| 243 | + ] |
| 244 | + }, |
| 245 | + { |
| 246 | + "cell_type": "markdown", |
| 247 | + "metadata": {}, |
| 248 | + "source": [ |
| 249 | + "## Evaluation\n", |
| 250 | + "\n", |
| 251 | + "Once all the grader objects have been correctly set up, we can evaluate the test dataset against the graders using the `evaluate` method. Optionally, you may add `azure_ai_project=project` in the evaluate call to upload the evaluation result to your Foundry project." |
| 252 | + ] |
| 253 | + }, |
| 254 | + { |
| 255 | + "cell_type": "code", |
| 256 | + "execution_count": null, |
| 257 | + "metadata": {}, |
| 258 | + "outputs": [], |
| 259 | + "source": [ |
| 260 | + "evaluation = evaluate(\n", |
| 261 | + " data=fname,\n", |
| 262 | + " evaluators={\n", |
| 263 | + " \"label\": label_grader,\n", |
| 264 | + " \"general\": general_grader,\n", |
| 265 | + " \"string\": string_grader,\n", |
| 266 | + " \"similarity\": sim_grader,\n", |
| 267 | + " },\n", |
| 268 | + " # azure_ai_project=project\n", |
| 269 | + ")\n", |
| 270 | + "evaluation" |
| 271 | + ] |
| 272 | + }, |
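|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "### Inspect results\n", |
|  | + "\n", |
|  | + "A minimal sketch for looking at the output, assuming the dictionary returned by `evaluate` exposes aggregate scores under `metrics` and per-row grader outputs under `rows`; adjust the keys if your SDK version returns a different shape." |
|  | + ] |
|  | + }, |
|  | + { |
|  | + "cell_type": "code", |
|  | + "execution_count": null, |
|  | + "metadata": {}, |
|  | + "outputs": [], |
|  | + "source": [ |
|  | + "# Assumes the result dict exposes \"metrics\" (aggregate scores) and \"rows\" (per-row results).\n", |
|  | + "for metric, value in evaluation.get(\"metrics\", {}).items():\n", |
|  | + "    print(f\"{metric}: {value}\")\n", |
|  | + "\n", |
|  | + "print(f\"Evaluated rows: {len(evaluation.get('rows', []))}\")" |
|  | + ] |
|  | + } |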
| 273 | + ], |
| 274 | + "metadata": { |
| 275 | + "kernelspec": { |
| 276 | + "display_name": "Python 3", |
| 277 | + "language": "python", |
| 278 | + "name": "python3" |
| 279 | + }, |
| 280 | + "language_info": { |
| 281 | + "codemirror_mode": { |
| 282 | + "name": "ipython", |
| 283 | + "version": 3 |
| 284 | + }, |
| 285 | + "file_extension": ".py", |
| 286 | + "mimetype": "text/x-python", |
| 287 | + "name": "python", |
| 288 | + "nbconvert_exporter": "python", |
| 289 | + "pygments_lexer": "ipython3" |
| 290 | + } |
| 291 | + }, |
| 292 | + "nbformat": 4, |
| 293 | + "nbformat_minor": 2 |
| 294 | +} |