|
14 | 14 | "\n",
|
15 | 15 | "### Explanation of Document Retrieval Metrics \n",
|
16 | 16 | "The metrics that will be generated in the output of the evaluator include:\n",
|
17 |    | - "* NDCG (Normalized Discounted Cumulative Gain) calculated for the top 3 documents retrieved from a search query. NDCG measures how well a document ranking compares to an ideal document ranking given a list of ground-truth documents.\n",
18 |    | - "* XDCG calculated for the top 3 documents retrieved from a search query. XDCG measures how objectively good are the top K documents, discounted by their position in the list.\n",
19 |    | - "* Fidelity calculated over all documents retrieved from a search query. Fidelity measures how objectively good are all of the documents retrieved compared with all known good documents in the underlying data store.\n",
20 |    | - "* Top 1 relevance, which is the top relevance score for a given set of retrieved documents.\n",
21 |    | - "* Top 3 max relevance, which is the maximum relevance score among the top 3 documents for a given set of retrieved documents.\n",
22 |    | - "* Holes and holes ratio, which measure the number of retrieved documents for which a ground truth label is missing, and the proportion of this count within the total number of retrieved documents, respectively.\n",
   | 17 | + "\n",
   | 18 | + "| Metric | Category | Description |\n",
   | 19 | + "|--------|----------|-------------|\n",
   | 20 | + "| Fidelity | Search Fidelity | How well the top n retrieved chunks reflect the content for a given query: the number of good documents returned out of the total number of known good documents in the dataset |\n",
   | 21 | + "| NDCG | Search NDCG | How close the ranking is to an ideal order, where all relevant items appear at the top of the list |\n",
   | 22 | + "| XDCG | Search XDCG | How objectively good the top-k documents are, regardless of how other documents in the index are scored |\n",
   | 23 | + "| Max Relevance N | Search Max Relevance | The maximum relevance score in the top-k chunks |\n",
   | 24 | + "| Holes | Search Label Sanity | The number of retrieved documents with missing query relevance judgments (ground truth) |\n",
23 | 25 | "\n",
|
24 | 26 | "It's important to note that some metrics, particularly NDCG, XDCG, and Fidelity, are sensitive to holes. Ideally the count of holes for a given evaluation should be zero; otherwise, the results for these metrics may not be accurate. It is recommended to iteratively check results against the current known ground truth and fill holes to improve the accuracy of the evaluation metrics. This process is not covered explicitly in the sample, but it is important to mention. A minimal sketch of how these metrics can be computed follows this cell."
|
25 | 27 | ]
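To make the definitions above concrete, here is a minimal sketch of how metrics like these can be computed from a ranked list of retrieved document IDs and a dictionary of ground-truth relevance judgments. This is not the evaluator's actual implementation: the function names, the `rel / log2(rank + 1)` gain form, the `good_threshold` cutoff, and the sample `labels`/`retrieved` data are all illustrative assumptions, and XDCG is omitted because its exact gain formula varies by implementation (it follows the same position-discounted pattern without normalizing against other index documents).

```python
import math
from typing import Mapping, Sequence, Tuple

def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain using the common rel / log2(rank + 1) form."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(retrieved: Sequence[str], labels: Mapping[str, float], k: int = 3) -> float:
    """Compare the actual top-k ranking to the ideal ranking of all labeled docs."""
    gains = [labels.get(doc, 0.0) for doc in retrieved[:k]]  # holes count as 0
    ideal = sorted(labels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def fidelity(retrieved: Sequence[str], labels: Mapping[str, float],
             good_threshold: float = 1.0) -> float:
    """Recall-like: fraction of all known good documents that were retrieved."""
    good = {doc for doc, rel in labels.items() if rel >= good_threshold}
    return len(good.intersection(retrieved)) / len(good) if good else 0.0

def top_k_max_relevance(retrieved: Sequence[str], labels: Mapping[str, float],
                        k: int = 3) -> float:
    """Maximum ground-truth relevance among the top-k retrieved documents."""
    return max((labels.get(doc, 0.0) for doc in retrieved[:k]), default=0.0)

def holes(retrieved: Sequence[str], labels: Mapping[str, float]) -> Tuple[int, float]:
    """Count of retrieved docs lacking a ground-truth judgment, and their ratio."""
    missing = sum(1 for doc in retrieved if doc not in labels)
    return missing, (missing / len(retrieved) if retrieved else 0.0)

# Hypothetical judgments and retrieval result; "doc9" has no judgment -> a hole.
labels = {"doc1": 2.0, "doc2": 1.0, "doc3": 0.0, "doc4": 2.0}
retrieved = ["doc2", "doc1", "doc9"]

print(ndcg_at_k(retrieved, labels))          # ranking quality vs. ideal order
print(fidelity(retrieved, labels))           # 2 of the 3 known good docs returned
print(top_k_max_relevance(retrieved, labels))  # best judgment in the top 3
print(holes(retrieved, labels))              # (1, 0.333...)
```

Note how the sketch treats unlabeled documents as relevance 0: this is exactly why holes distort NDCG and Fidelity, since a genuinely good but unjudged document is scored as if it were bad.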
|
|