Commit e077570

changliu2 and singankit authored

added samples for agent eval (#223)

* added samples for agent eval
* minor update on notebook comment
* Updating notebooks for agent evaluators
* Fixing agent overall notebook
* Fixing linting issues
* Fixing issues
* Fixing ruff issues
* Fixing nb clean issues
* Removing old data file
* Running pre-commit hooks

Co-authored-by: Ankit Singhal <anksing@microsoft.com>

1 parent 300e5bc commit e077570

8 files changed: +1926 -0 lines changed
@@ -0,0 +1,357 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Intent Resolution Evaluator\n",
"\n",
"## Objective\n",
"This sample demonstrates how to use the intent resolution evaluator on agent data. The supported input formats include:\n",
"- simple data such as strings;\n",
"- user-agent conversations in the form of a list of agent messages.\n",
"\n",
"## Time\n",
"\n",
"You should expect to spend about 20 minutes running this notebook.\n",
"\n",
"## Before you begin\n",
"For quality evaluation, you need to deploy a `gpt` model supporting JSON mode. We recommend `gpt-4o` or `gpt-4o-mini` for their strong reasoning capabilities.\n",
"\n",
"### Prerequisite\n",
"```bash\n",
"pip install azure-ai-projects azure-identity azure-ai-evaluation\n",
"```\n",
"Set these environment variables with your own values:\n",
"1) **PROJECT_CONNECTION_STRING** - The project connection string, as found in the overview page of your Azure AI Foundry project.\n",
"2) **MODEL_DEPLOYMENT_NAME** - The deployment name of the model for this AI-assisted evaluator, as found under the \"Name\" column in the \"Models + endpoints\" tab in your Azure AI Foundry project.\n",
"3) **AZURE_OPENAI_ENDPOINT** - The Azure OpenAI endpoint to be used for evaluation.\n",
"4) **AZURE_OPENAI_API_KEY** - The Azure OpenAI key to be used for evaluation.\n",
"5) **AZURE_OPENAI_API_VERSION** - The Azure OpenAI API version to be used for evaluation.\n",
"6) **AZURE_SUBSCRIPTION_ID** - The Azure subscription ID of the Azure AI project.\n",
"7) **PROJECT_NAME** - The Azure AI project name.\n",
"8) **RESOURCE_GROUP_NAME** - The Azure AI project resource group name.\n"
]
},
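{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an optional sanity check (not part of the original sample), the cell below verifies that the required environment variables listed above are set before running the evaluators, failing fast with a clear message if any are missing.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"# Optional sanity check: fail fast if any required environment variable is missing or empty.\n",
"required_vars = [\n",
"    \"PROJECT_CONNECTION_STRING\",\n",
"    \"MODEL_DEPLOYMENT_NAME\",\n",
"    \"AZURE_OPENAI_ENDPOINT\",\n",
"    \"AZURE_OPENAI_API_KEY\",\n",
"    \"AZURE_OPENAI_API_VERSION\",\n",
"    \"AZURE_SUBSCRIPTION_ID\",\n",
"    \"PROJECT_NAME\",\n",
"    \"RESOURCE_GROUP_NAME\",\n",
"]\n",
"missing = [name for name in required_vars if not os.environ.get(name)]\n",
"if missing:\n",
"    raise OSError(f\"Missing environment variables: {', '.join(missing)}\")\n",
"print(\"All required environment variables are set.\")"
]
},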
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The Intent Resolution evaluator measures how well an agent has identified and resolved the user intent.\n",
"The scoring is on a 1-5 integer scale and is as follows:\n",
"\n",
" - Score 1: Response completely unrelated to user intent\n",
" - Score 2: Response minimally relates to user intent\n",
" - Score 3: Response partially addresses the user intent but lacks complete details\n",
" - Score 4: Response addresses the user intent with moderate accuracy but has minor inaccuracies or omissions\n",
" - Score 5: Response directly addresses the user intent and fully resolves it\n",
"\n",
"The evaluation requires the following inputs:\n",
"\n",
" - Query: The user query. Either a string with a user request, or a list of messages with previous requests from the user and responses from the assistant, potentially including a system message.\n",
" - Response: The response to be evaluated. Either a string or a message with the response from the agent to the last user query.\n",
"\n",
"There is a third optional parameter:\n",
" - ToolDefinitions: The list of tool definitions the agent can call. This may be useful for the evaluator to better assess whether the right tool was called to resolve a given intent."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Initialize Intent Resolution Evaluator\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"from azure.ai.evaluation import AzureOpenAIModelConfiguration\n",
"from azure.ai.evaluation import IntentResolutionEvaluator\n",
"from pprint import pprint\n",
"\n",
"model_config = AzureOpenAIModelConfiguration(\n",
"    azure_endpoint=os.environ[\"AZURE_OPENAI_ENDPOINT\"],\n",
"    api_key=os.environ[\"AZURE_OPENAI_API_KEY\"],\n",
"    api_version=os.environ[\"AZURE_OPENAI_API_VERSION\"],\n",
"    azure_deployment=os.environ[\"MODEL_DEPLOYMENT_NAME\"],\n",
")\n",
"\n",
"intent_resolution_evaluator = IntentResolutionEvaluator(model_config)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Samples"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluating query and response as strings"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Success example: the intent is identified and understood, and the response correctly resolves the user intent.\n",
"result = intent_resolution_evaluator(\n",
"    query=\"What are the opening hours of the Eiffel Tower?\",\n",
"    response=\"Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM.\",\n",
")\n",
"pprint(result)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Failure example: even though the intent is correctly identified, the response does not resolve the user intent.\n",
"result = intent_resolution_evaluator(\n",
"    query=\"What are the opening hours of the Eiffel Tower?\",\n",
"    response=\"Please check the official website for the up-to-date information on Eiffel Tower opening hours.\",\n",
")\n",
"pprint(result)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Evaluating query and response as a list of messages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"query = [\n",
"    {\"role\": \"system\", \"content\": \"You are a friendly and helpful customer service agent.\"},\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:14:20Z\",\n",
"        \"role\": \"user\",\n",
"        \"content\": [\n",
"            {\n",
"                \"type\": \"text\",\n",
"                \"text\": \"Hi, I need help with the last 2 orders on my account #888. Could you please update me on their status?\",\n",
"            }\n",
"        ],\n",
"    },\n",
"]\n",
"\n",
"response = [\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:14:30Z\",\n",
"        \"run_id\": \"0\",\n",
"        \"role\": \"assistant\",\n",
"        \"content\": [{\"type\": \"text\", \"text\": \"Hello! Let me quickly look up your account details.\"}],\n",
"    },\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:14:35Z\",\n",
"        \"run_id\": \"0\",\n",
"        \"role\": \"assistant\",\n",
"        \"content\": [\n",
"            {\n",
"                \"type\": \"tool_call\",\n",
"                \"tool_call_id\": \"tool_call_20250310_001\",\n",
"                \"name\": \"get_orders\",\n",
"                \"arguments\": {\"account_number\": \"888\"},\n",
"            }\n",
"        ],\n",
"    },\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:14:40Z\",\n",
"        \"run_id\": \"0\",\n",
"        \"tool_call_id\": \"tool_call_20250310_001\",\n",
"        \"role\": \"tool\",\n",
"        \"content\": [{\"type\": \"tool_result\", \"tool_result\": '[{ \"order_id\": \"123\" }, { \"order_id\": \"124\" }]'}],\n",
"    },\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:14:45Z\",\n",
"        \"run_id\": \"0\",\n",
"        \"role\": \"assistant\",\n",
"        \"content\": [\n",
"            {\n",
"                \"type\": \"text\",\n",
"                \"text\": \"Thanks for your patience. I see two orders on your account. Let me fetch the details for both.\",\n",
"            }\n",
"        ],\n",
"    },\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:14:50Z\",\n",
"        \"run_id\": \"0\",\n",
"        \"role\": \"assistant\",\n",
"        \"content\": [\n",
"            {\n",
"                \"type\": \"tool_call\",\n",
"                \"tool_call_id\": \"tool_call_20250310_002\",\n",
"                \"name\": \"get_order\",\n",
"                \"arguments\": {\"order_id\": \"123\"},\n",
"            },\n",
"            {\n",
"                \"type\": \"tool_call\",\n",
"                \"tool_call_id\": \"tool_call_20250310_003\",\n",
"                \"name\": \"get_order\",\n",
"                \"arguments\": {\"order_id\": \"124\"},\n",
"            },\n",
"        ],\n",
"    },\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:14:55Z\",\n",
"        \"run_id\": \"0\",\n",
"        \"tool_call_id\": \"tool_call_20250310_002\",\n",
"        \"role\": \"tool\",\n",
"        \"content\": [\n",
"            {\n",
"                \"type\": \"tool_result\",\n",
"                \"tool_result\": '{ \"order\": { \"id\": \"123\", \"status\": \"shipped\", \"delivery_date\": \"2025-03-15\" } }',\n",
"            }\n",
"        ],\n",
"    },\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:15:00Z\",\n",
"        \"run_id\": \"0\",\n",
"        \"tool_call_id\": \"tool_call_20250310_003\",\n",
"        \"role\": \"tool\",\n",
"        \"content\": [\n",
"            {\n",
"                \"type\": \"tool_result\",\n",
"                \"tool_result\": '{ \"order\": { \"id\": \"124\", \"status\": \"delayed\", \"expected_delivery\": \"2025-03-20\" } }',\n",
"            }\n",
"        ],\n",
"    },\n",
"    {\n",
"        \"createdAt\": \"2025-03-14T06:15:05Z\",\n",
"        \"run_id\": \"0\",\n",
"        \"role\": \"assistant\",\n",
"        \"content\": [\n",
"            {\n",
"                \"type\": \"text\",\n",
"                \"text\": \"The order with ID 123 has been shipped and is expected to be delivered on March 15, 2025. However, the order with ID 124 is delayed and should now arrive by March 20, 2025. Is there anything else I can help you with?\",\n",
"            }\n",
"        ],\n",
"    },\n",
"]\n",
"\n",
249+
"# please note that the tool definitions are not strictly required, and that some of the tools below are not used in the example above and that is ok.\n",
250+
"# if context length is a concern you can remove the unused tool definitions or even the tool definitions altogether as the impact to the intent resolution evaluation is usual minimal.\n",
251+
"tool_definitions = [\n",
252+
" {\n",
253+
" \"name\": \"get_orders\",\n",
254+
" \"description\": \"Get the list of orders for a given account number.\",\n",
255+
" \"parameters\": {\n",
256+
" \"type\": \"object\",\n",
257+
" \"properties\": {\n",
258+
" \"account_number\": {\"type\": \"string\", \"description\": \"The account number to get the orders for.\"}\n",
259+
" },\n",
260+
" },\n",
261+
" },\n",
262+
" {\n",
263+
" \"name\": \"get_order\",\n",
264+
" \"description\": \"Get the details of a specific order.\",\n",
265+
" \"parameters\": {\n",
266+
" \"type\": \"object\",\n",
267+
" \"properties\": {\"order_id\": {\"type\": \"string\", \"description\": \"The order ID to get the details for.\"}},\n",
268+
" },\n",
269+
" },\n",
270+
" {\n",
271+
" \"name\": \"initiate_return\",\n",
272+
" \"description\": \"Initiate the return process for an order.\",\n",
273+
" \"parameters\": {\n",
274+
" \"type\": \"object\",\n",
275+
" \"properties\": {\"order_id\": {\"type\": \"string\", \"description\": \"The order ID for the return process.\"}},\n",
276+
" },\n",
277+
" },\n",
278+
" {\n",
279+
" \"name\": \"update_shipping_address\",\n",
280+
" \"description\": \"Update the shipping address for a given account.\",\n",
281+
" \"parameters\": {\n",
282+
" \"type\": \"object\",\n",
283+
" \"properties\": {\n",
284+
" \"account_number\": {\"type\": \"string\", \"description\": \"The account number to update.\"},\n",
285+
" \"new_address\": {\"type\": \"string\", \"description\": \"The new shipping address.\"},\n",
286+
" },\n",
287+
" },\n",
288+
" },\n",
289+
"]\n",
290+
"\n",
291+
"result = intent_resolution_evaluator(\n",
292+
" query=query,\n",
293+
" response=response,\n",
294+
" tool_definitions=tool_definitions,\n",
295+
")\n",
296+
"pprint(result)"
297+
]
298+
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Batch evaluate and visualize results on Azure AI Foundry\n",
"Use batch evaluation to leverage asynchronous evaluation on a dataset.\n",
"\n",
"Optionally, you can go to the AI Foundry URL for rich Azure AI Foundry data visualization. You can inspect the evaluation scores and reasoning to quickly identify bugs and issues in your agent to fix and improve. Make sure to authenticate to Azure using `az login` in your terminal before running this cell.\n"
]
},
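{
"cell_type": "markdown",
"metadata": {},
"source": [
"For orientation, here is a sketch of what one line of the JSONL dataset might look like, assuming columns that mirror the evaluator inputs shown above (the exact schema produced by the agent thread converter may differ):\n",
"```json\n",
"{\"query\": \"What are the opening hours of the Eiffel Tower?\", \"response\": \"Opening hours of the Eiffel Tower are 9:00 AM to 11:00 PM.\"}\n",
"```\n"
]
},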
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from azure.ai.evaluation import evaluate\n",
"\n",
"# This sample file contains the evaluation data in JSONL format, where each line is a run from the agent.\n",
"# It was saved using the agent thread and converter.\n",
"file_name = \"evaluation_data.jsonl\"\n",
"\n",
"response = evaluate(\n",
"    data=file_name,\n",
"    evaluation_name=\"Intent Resolution Evaluation\",\n",
"    evaluators={\n",
"        \"intent_resolution\": intent_resolution_evaluator,\n",
"    },\n",
"    azure_ai_project={\n",
"        \"subscription_id\": os.environ[\"AZURE_SUBSCRIPTION_ID\"],\n",
"        \"project_name\": os.environ[\"PROJECT_NAME\"],\n",
"        \"resource_group_name\": os.environ[\"RESOURCE_GROUP_NAME\"],\n",
"    },\n",
")\n",
"pprint(f'AI Foundry URL: {response.get(\"studio_url\")}')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "test_agent_eval_sample",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
