|
| 1 | +{ |
| 2 | + "cells": [ |
| 3 | + { |
| 4 | + "cell_type": "markdown", |
| 5 | + "metadata": {}, |
| 6 | + "source": [ |
| 7 | + "# Evaluate using Azure OpenAI Graders with Azure AI Foundry SDK\n", |
| 8 | + "\n", |
| 9 | + "## Objective\n", |
| 10 | + "\n", |
| 11 | + "This tutorial offers a step-by-step guide for evaluating large language models using Azure OpenAI Graders and their model outputs.\n", |
| 12 | + "In Azure AI Foundry SDK, we are now supporting four new AOAI Graders:\n", |
| 13 | + "- Model Labeler: Uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with explanations for why each label was chosen.\n", |
| 14 | + "- String Check: Compares input text to a reference value, checking for exact or partial matches with optional case insensitivity. Useful for flexible text validations and pattern matching.\n", |
| 15 | + "- Text Similarity: Evaluates how closely input text matches a reference value using similarity metrics like`fuzzy_match`, `BLEU`, `ROUGE`, or `METEOR`. Useful for assessing text quality or semantic closeness.\n", |
| 16 | + "- General Grader: Advanced users have the capability to import or define a custom grader and integrate it into the AOAI general grader. This allows for evaluations to be performed based on specific areas of interest aside from the existing AOAI graders. \n", |
| 17 | + "\n", |
| 18 | + "This tutorial uses the following AI services:\n", |
| 19 | + "- [azure-ai-evaluation](https://learn.microsoft.com/en-us/azure/ai-studio/how-to/develop/evaluate-sdk)\n", |
| 20 | + "\n", |
| 21 | + "## Time\n", |
| 22 | + "\n", |
| 23 | + "You should expect to spend about 15 minutes running this notebook.\n", |
| 24 | + "\n", |
| 25 | + "## Before you begin\n", |
| 26 | + "\n", |
| 27 | + "### Installation\n", |
| 28 | + "\n", |
| 29 | + "Install the following packages requried to execute this notebook." |
| 30 | + ] |
| 31 | + }, |
| 32 | + { |
| 33 | + "cell_type": "code", |
| 34 | + "execution_count": null, |
| 35 | + "metadata": {}, |
| 36 | + "outputs": [], |
| 37 | + "source": [ |
| 38 | + "%pip install azure.ai.projects\n", |
| 39 | + "%pip install azure_ai_evaluation" |
| 40 | + ] |
| 41 | + }, |
| 42 | + { |
| 43 | + "cell_type": "markdown", |
| 44 | + "metadata": {}, |
| 45 | + "source": [ |
| 46 | + "### Parameters and imports" |
| 47 | + ] |
| 48 | + }, |
| 49 | + { |
| 50 | + "cell_type": "code", |
| 51 | + "execution_count": null, |
| 52 | + "metadata": {}, |
| 53 | + "outputs": [], |
| 54 | + "source": [ |
| 55 | + "import os\n", |
| 56 | + "from dotenv import load_dotenv\n", |
| 57 | + "\n", |
| 58 | + "load_dotenv()\n", |
| 59 | + "from azure.ai.evaluation import (\n", |
| 60 | + " AzureOpenAIModelConfiguration,\n", |
| 61 | + " AzureOpenAILabelGrader,\n", |
| 62 | + " AzureOpenAIStringCheckGrader,\n", |
| 63 | + " AzureOpenAITextSimilarityGrader,\n", |
| 64 | + " AzureOpenAIGrader,\n", |
| 65 | + " evaluate,\n", |
| 66 | + ")\n", |
| 67 | + "from openai.types.eval_string_check_grader import EvalStringCheckGrader" |
| 68 | + ] |
| 69 | + }, |
| 70 | + { |
| 71 | + "cell_type": "markdown", |
| 72 | + "metadata": {}, |
| 73 | + "source": [ |
| 74 | + "### Environment variables\n", |
| 75 | + "\n", |
| 76 | + "Set these environment variables with your own values:\n", |
| 77 | + "1) **AZURE_OPENAI_ENDPOINT** - Azure Open AI Endpoint to be used for evaluation.\n", |
| 78 | + "2) **AZURE_OPENAI_API_KEY** - Azure Open AI Key to be used for evaluation.\n", |
| 79 | + "3) **AZURE_OPENAI_API_VERSION** - Azure Open AI Api version to be used for evaluation.\n", |
| 80 | + "4) **MODEL_DEPLOYMENT_NAME** - Model deployment to be used for evaluation\n", |
| 81 | + "5) **AZURE_SUBSCRIPTION_ID** - Azure Subscription Id of Azure AI Project\n", |
| 82 | + "6) **PROJECT_NAME** - Azure AI Project Name\n", |
| 83 | + "7) **RESOURCE_GROUP_NAME** - Azure AI Project Resource Group Name" |
| 84 | + ] |
| 85 | + }, |
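|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "The next cell is an optional sanity check (a small sketch) that the variables listed above are set before they are read below; the last three are only needed if you later upload results to your Foundry project." |
|  | + ] |
|  | + }, |
|  | + { |
|  | + "cell_type": "code", |
|  | + "execution_count": null, |
|  | + "metadata": {}, |
|  | + "outputs": [], |
|  | + "source": [ |
|  | + "# Optional sanity check: report any of the variables listed above that are not set.\n", |
|  | + "required_vars = [\n", |
|  | + "    \"AZURE_OPENAI_ENDPOINT\",\n", |
|  | + "    \"AZURE_OPENAI_API_KEY\",\n", |
|  | + "    \"AZURE_OPENAI_API_VERSION\",\n", |
|  | + "    \"MODEL_DEPLOYMENT_NAME\",\n", |
|  | + "    \"AZURE_SUBSCRIPTION_ID\",\n", |
|  | + "    \"PROJECT_NAME\",\n", |
|  | + "    \"RESOURCE_GROUP_NAME\",\n", |
|  | + "]\n", |
|  | + "missing = [name for name in required_vars if not os.environ.get(name)]\n", |
|  | + "if missing:\n", |
|  | + "    print(f\"Missing environment variables: {missing}\")\n", |
|  | + "else:\n", |
|  | + "    print(\"All expected environment variables are set.\")" |
|  | + ] |
|  | + }, |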
| 86 | + { |
| 87 | + "cell_type": "code", |
| 88 | + "execution_count": null, |
| 89 | + "metadata": {}, |
| 90 | + "outputs": [], |
| 91 | + "source": [ |
| 92 | + "model_config = AzureOpenAIModelConfiguration(\n", |
| 93 | + " azure_endpoint=os.environ[\"AZURE_OPENAI_ENDPOINT\"],\n", |
| 94 | + " api_key=os.environ[\"AZURE_OPENAI_API_KEY\"],\n", |
| 95 | + " api_version=os.environ[\"AZURE_OPENAI_API_VERSION\"],\n", |
| 96 | + " azure_deployment=os.environ[\"MODEL_DEPLOYMENT_NAME\"],\n", |
| 97 | + ")\n", |
| 98 | + "\n", |
| 99 | + "project = {\n", |
| 100 | + " \"subscription_id\": os.environ[\"AZURE_SUBSCRIPTION_ID\"],\n", |
| 101 | + " \"project_name\": os.environ[\"PROJECT_NAME\"],\n", |
| 102 | + " \"resource_group_name\": os.environ[\"RESOURCE_GROUP_NAME\"],\n", |
| 103 | + "}" |
| 104 | + ] |
| 105 | + }, |
| 106 | + { |
| 107 | + "cell_type": "markdown", |
| 108 | + "metadata": {}, |
| 109 | + "source": [ |
| 110 | + "### Data\n", |
| 111 | + "To run the evaluation against your own data, replace this data file with your sample data" |
| 112 | + ] |
| 113 | + }, |
| 114 | + { |
| 115 | + "cell_type": "code", |
| 116 | + "execution_count": null, |
| 117 | + "metadata": {}, |
| 118 | + "outputs": [], |
| 119 | + "source": [ |
| 120 | + "fname = \"data.jsonl\"" |
| 121 | + ] |
| 122 | + }, |
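|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "The graders below reference `{{item.query}}`, `{{item.response}}`, and `{{item.ground_truth}}`, so each line of `data.jsonl` is assumed to contain those three fields. The next cell is an optional sketch that writes a small placeholder `data.jsonl` in that shape if the file is not already present; skip it if you have your own data." |
|  | + ] |
|  | + }, |
|  | + { |
|  | + "cell_type": "code", |
|  | + "execution_count": null, |
|  | + "metadata": {}, |
|  | + "outputs": [], |
|  | + "source": [ |
|  | + "# Optional: write a tiny placeholder data.jsonl with the columns the graders below expect.\n", |
|  | + "# The rows are illustrative only, not real evaluation data.\n", |
|  | + "import json\n", |
|  | + "\n", |
|  | + "sample_rows = [\n", |
|  | + "    {\n", |
|  | + "        \"query\": \"What is the capital of France?\",\n", |
|  | + "        \"response\": \"The capital of France is Paris.\",\n", |
|  | + "        \"ground_truth\": \"Paris\",\n", |
|  | + "    },\n", |
|  | + "]\n", |
|  | + "\n", |
|  | + "if not os.path.exists(fname):\n", |
|  | + "    with open(fname, \"w\") as f:\n", |
|  | + "        for row in sample_rows:\n", |
|  | + "            f.write(json.dumps(row) + \"\\n\")" |
|  | + ] |
|  | + }, |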
| 123 | + { |
| 124 | + "cell_type": "markdown", |
| 125 | + "metadata": {}, |
| 126 | + "source": [ |
| 127 | + "## Create grader objects\n", |
| 128 | + "\n", |
| 129 | + "Before executing the evaluation, the grader objects needs to be defined. Aside from `model_config` and customizable grader `name`, each graders have few unique parameters that are required for set up.\n", |
| 130 | + "\n", |
| 131 | + "### Model Labeler\n", |
| 132 | + "\n", |
| 133 | + "This grader uses your custom prompt to instruct a model to classify outputs based on labels you define. It returns structured results with explanations for why each label was chosen. To correctly set up, the following unique parameters are required:\n", |
| 134 | + "- input: Identifies the column in `data.jsonl` that the evaluator will use for classification. It also defines the custom prompt that instructs the model on how to perform the classification.\n", |
| 135 | + "- labels: Lists all possible labels that the evaluator can assign to the input data based on the custom prompt.\n", |
| 136 | + "- passing_labels: Specifies which of the defined labels are considered successful or acceptable outcomes.\n", |
| 137 | + "- model: Indicates the model that will be used to classify the input data according to the custom prompt." |
| 138 | + ] |
| 139 | + }, |
| 140 | + { |
| 141 | + "cell_type": "code", |
| 142 | + "execution_count": null, |
| 143 | + "metadata": {}, |
| 144 | + "outputs": [], |
| 145 | + "source": [ |
| 146 | + "# Determine if the response column contains texts that are too short, just right, or too long\n", |
| 147 | + "label_grader = AzureOpenAILabelGrader(\n", |
| 148 | + " model_config=model_config,\n", |
| 149 | + " input=[\n", |
| 150 | + " {\"content\": \"{{item.response}}\", \"role\": \"user\"},\n", |
| 151 | + " {\n", |
| 152 | + " \"content\": \"Any text including space that's more than 600 characters are too long, less than 500 characters are too short; 500 to 600 characters are just right.\",\n", |
| 153 | + " \"role\": \"user\",\n", |
| 154 | + " \"type\": \"message\",\n", |
| 155 | + " },\n", |
| 156 | + " ],\n", |
| 157 | + " labels=[\"too short\", \"just right\", \"too long\"],\n", |
| 158 | + " passing_labels=[\"just right\"],\n", |
| 159 | + " model=\"gpt-4o\",\n", |
| 160 | + " name=\"label\",\n", |
| 161 | + ")" |
| 162 | + ] |
| 163 | + }, |
| 164 | + { |
| 165 | + "cell_type": "markdown", |
| 166 | + "metadata": {}, |
| 167 | + "source": [ |
| 168 | + "### Text Similarity\n", |
| 169 | + "\n", |
| 170 | + "This grader evaluates how closely input text matches a reference value using similarity metrics like`fuzzy_match`, `BLEU`, `ROUGE`, or `METEOR`. Useful for assessing text quality or semantic closeness. To correctly set up, the following unique parameters are required:\n", |
| 171 | + "- evaluation_metric: Specifies the similarity metric to be used for evaluation.\n", |
| 172 | + "- input: Specifies the column in `data.jsonl` that that contains the text to be evaluated.\n", |
| 173 | + "- pass_threshold: Defines the minimum similarity score required for an output to be considered a passing result in the evaluation.\n", |
| 174 | + "- reference: specifies the column in `data.jsonl` that contains the reference text against which the input will be compared." |
| 175 | + ] |
| 176 | + }, |
| 177 | + { |
| 178 | + "cell_type": "code", |
| 179 | + "execution_count": null, |
| 180 | + "metadata": {}, |
| 181 | + "outputs": [], |
| 182 | + "source": [ |
| 183 | + "# Pass if response column and ground_truth column similarity score >= 0.5 using \"fuzzy_match\"\n", |
| 184 | + "sim_grader = AzureOpenAITextSimilarityGrader(\n", |
| 185 | + " model_config=model_config,\n", |
| 186 | + " evaluation_metric=\"fuzzy_match\", # support evaluation metrics including: \"fuzzy_match\", \"bleu\", \"gleu\", \"meteor\", \"rouge_1\", \"rouge_2\", \"rouge_3\", \"rouge_4\", \"rouge_5\", \"rouge_l\", \"cosine\".\n", |
| 187 | + " input=\"{{item.response}}\",\n", |
| 188 | + " name=\"similarity\",\n", |
| 189 | + " pass_threshold=0.5,\n", |
| 190 | + " reference=\"{{item.ground_truth}}\",\n", |
| 191 | + ")" |
| 192 | + ] |
| 193 | + }, |
| 194 | + { |
| 195 | + "cell_type": "markdown", |
| 196 | + "metadata": {}, |
| 197 | + "source": [ |
| 198 | + "### String Check\n", |
| 199 | + "\n", |
| 200 | + "This grader compares input text to a reference value, checking for exact or partial matches with optional case insensitivity. Useful for flexible text validations and pattern matching. To correctly set up, the following unique parameters are required:\n", |
| 201 | + "- input: Specifies the column in `data.jsonl` that that contains the text to be evaluated.\n", |
| 202 | + "- operation: Defines the operation type of this grader object.\n", |
| 203 | + "- reference: specifies the reference value against which the input will be compared." |
| 204 | + ] |
| 205 | + }, |
| 206 | + { |
| 207 | + "cell_type": "code", |
| 208 | + "execution_count": null, |
| 209 | + "metadata": {}, |
| 210 | + "outputs": [], |
| 211 | + "source": [ |
| 212 | + "# Pass if the query column contains \"What is\"\n", |
| 213 | + "string_grader = AzureOpenAIStringCheckGrader(\n", |
| 214 | + " model_config=model_config,\n", |
| 215 | + " input=\"{{item.query}}\",\n", |
| 216 | + " name=\"Contains What is\",\n", |
| 217 | + " operation=\"like\", # \"eq\" for equal, \"ne\" for not equal, \"like\" for contain, \"ilike\" for case insensitive contain\n", |
| 218 | + " reference=\"What is\",\n", |
| 219 | + ")" |
| 220 | + ] |
| 221 | + }, |
| 222 | + { |
| 223 | + "cell_type": "markdown", |
| 224 | + "metadata": {}, |
| 225 | + "source": [ |
| 226 | + "### General Grader\n", |
| 227 | + "\n", |
| 228 | + "This grader enables advanced users to import or define a custom grader and integrate it into the AOAI general grader. This allows for evaluations to be performed based on specific areas of interest aside from the existing AOAI graders. To correctly set up, a separate grader object needs to be created and defined. The defined grader can be passed into the general grader using parameter `grader_config`." |
| 229 | + ] |
| 230 | + }, |
| 231 | + { |
| 232 | + "cell_type": "code", |
| 233 | + "execution_count": null, |
| 234 | + "metadata": {}, |
| 235 | + "outputs": [], |
| 236 | + "source": [ |
| 237 | + "# Define an string check grader config directly using the OAI SDK\n", |
| 238 | + "oai_string_check_grader = EvalStringCheckGrader(\n", |
| 239 | + " input=\"{{item.query}}\", name=\"contains hello\", operation=\"like\", reference=\"hello\", type=\"string_check\"\n", |
| 240 | + ")\n", |
| 241 | + "# Plug that into the general grader\n", |
| 242 | + "general_grader = AzureOpenAIGrader(model_config=model_config, grader_config=oai_string_check_grader)" |
| 243 | + ] |
| 244 | + }, |
| 245 | + { |
| 246 | + "cell_type": "markdown", |
| 247 | + "metadata": {}, |
| 248 | + "source": [ |
| 249 | + "## Evaluation\n", |
| 250 | + "\n", |
| 251 | + "Once all the grader objects have been correctly set up, we can evaluate the test dataset against the graders using the `evaluate` method. Optionally, you may add `azure_ai_project=project` in the evaluate call to upload the evaluation result to your Foundry project." |
| 252 | + ] |
| 253 | + }, |
| 254 | + { |
| 255 | + "cell_type": "code", |
| 256 | + "execution_count": null, |
| 257 | + "metadata": {}, |
| 258 | + "outputs": [], |
| 259 | + "source": [ |
| 260 | + "evaluation = evaluate(\n", |
| 261 | + " data=fname,\n", |
| 262 | + " evaluators={\n", |
| 263 | + " \"label\": label_grader,\n", |
| 264 | + " \"general\": general_grader,\n", |
| 265 | + " \"string\": string_grader,\n", |
| 266 | + " \"similarity\": sim_grader,\n", |
| 267 | + " },\n", |
| 268 | + " # azure_ai_project=project\n", |
| 269 | + ")\n", |
| 270 | + "evaluation" |
| 271 | + ] |
| 272 | + }, |
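|  | + { |
|  | + "cell_type": "markdown", |
|  | + "metadata": {}, |
|  | + "source": [ |
|  | + "### Inspect results\n", |
|  | + "\n", |
|  | + "A minimal sketch for looking at the output, assuming the dictionary returned by `evaluate` exposes aggregate scores under `metrics` and per-row grader outputs under `rows`; adjust the keys if your SDK version returns a different shape." |
|  | + ] |
|  | + }, |
|  | + { |
|  | + "cell_type": "code", |
|  | + "execution_count": null, |
|  | + "metadata": {}, |
|  | + "outputs": [], |
|  | + "source": [ |
|  | + "# Assumes the result dict exposes \"metrics\" (aggregate scores) and \"rows\" (per-row results).\n", |
|  | + "for metric, value in evaluation.get(\"metrics\", {}).items():\n", |
|  | + "    print(f\"{metric}: {value}\")\n", |
|  | + "\n", |
|  | + "print(f\"Evaluated rows: {len(evaluation.get('rows', []))}\")" |
|  | + ] |
|  | + } |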
| 273 | + ], |
| 274 | + "metadata": { |
| 275 | + "kernelspec": { |
| 276 | + "display_name": "Python 3", |
| 277 | + "language": "python", |
| 278 | + "name": "python3" |
| 279 | + }, |
| 280 | + "language_info": { |
| 281 | + "codemirror_mode": { |
| 282 | + "name": "ipython", |
| 283 | + "version": 3 |
| 284 | + }, |
| 285 | + "file_extension": ".py", |
| 286 | + "mimetype": "text/x-python", |
| 287 | + "name": "python", |
| 288 | + "nbconvert_exporter": "python", |
| 289 | + "pygments_lexer": "ipython3" |
| 290 | + } |
| 291 | + }, |
| 292 | + "nbformat": 4, |
| 293 | + "nbformat_minor": 2 |
| 294 | +} |