docs/modules/ROOT/pages/lm-eval-tutorial.adoc
Lines changed: 161 additions & 4 deletions
@@ -209,10 +209,12 @@ Specify the task using the Unitxt recipe format:
209
209
* `systemPrompt`: Use `name` to specify a Unitxt catalog system prompt or `ref` to refer to a custom prompt:
210
210
** `name`: Specify a Unitxt system prompt from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.system_prompts.__dir__.html++[Unitxt catalog]. Use the system prompt's ID as the value.
211
211
** `ref`: Specify the reference name of a custom system prompt as defined in the `custom` section below
212
-
* `task` (optional): Specify a Unitxt task from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.cards.__dir__.html++[Unitxt catalog]. Use the task's ID as the value.
213
-
A Unitxt card has a pre-defined task. Only specify a value for this if you want to run different task.
214
-
* `metrics` (optional): Specify a list of Unitx metrics from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.metrics.__dir__.html++[Unitxt catalog]. Use the metric's ID as the value.
215
-
A Unitxt task has a set of pre-defined metrics. Only specify a set of metrics if you need different metrics.
212
+
* `task` (optional): Specify a Unitxt task by `name` or `ref`. A Unitxt card has a pre-defined task, so specify a value only if you want to run a different task.
213
+
** `name`: Specify a Unitxt task from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.cards.__dir__.html++[Unitxt catalog]. Use the task's ID as the value.
214
+
** `ref`: Specify the reference name of a custom task as defined in the `custom` section below.
215
+
* `metrics` (optional): Specify a list of Unitxt metrics by `name` or `ref`. A Unitxt task has a set of pre-defined metrics. Only specify a set of metrics if you need different metrics.
216
+
** `name`: Specify a metric from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.metrics.__dir__.html++[Unitxt catalog]. Use the metric's ID as the value.
217
+
** `ref`: Specify the reference name of a custom metric as defined in the `custom` section below.
216
218
* `format` (optional): Specify a Unitxt format from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.formats.__dir__.html++[Unitxt catalog]. Use the format's ID as the value.
217
219
* `loaderLimit` (optional): Specifies the maximum number of instances per stream to be returned from the loader (used to reduce loading time in large datasets).
218
220
* `numDemos` (optional): Number of few-shot examples to use.
@@ -239,6 +241,16 @@ Specify the task using the Unitxt recipe format:
239
241
** `value`: A string for a custom Unitxt system prompt.
240
242
The documentation link:https://www.unitxt.ai/en/latest/docs/adding_format.html#formats[here]
241
243
provides an overview of the different components that make up a prompt format, including the system prompt.
244
+
* `tasks`: Define custom tasks to use, each with a `name` and `value` field:
245
+
** `name`: The name of this custom task, which is referenced in the `task.ref` field of a task recipe.
246
+
** `value`: A JSON string for a custom Unitxt task.
247
+
Use the documentation link:https://www.unitxt.ai/en/latest/docs/adding_task.html[here]
248
+
to compose a custom task, then use the documentation link:https://www.unitxt.ai/en/latest/docs/saving_and_loading_from_catalog.html[here] to store it as a JSON file and use the JSON content as the value of this field.
249
+
* `metrics`: Define custom metrics to use, each with a `name` and `value` field:
250
+
** `name`: The name of this custom metric, which is referenced in the `metrics.ref` field of a task recipe.
251
+
** `value`: A JSON string for a custom Unitxt metric.
252
+
Use the documentation link:https://www.unitxt.ai/en/latest/docs/adding_metric.html[here]
253
+
to compose a custom metric, then use the documentation link:https://www.unitxt.ai/en/latest/docs/saving_and_loading_from_catalog.html[here] to store it as a JSON file and use the JSON content as the value of this field.
242
254
243
255
244
256
|`numFewShot`
@@ -623,6 +635,151 @@ Inside the custom card, it uses the HuggingFace dataset loader:
623
635
624
636
You can use other link:https://www.unitxt.ai/en/latest/unitxt.loaders.html#module-unitxt.loaders[loaders] and use the `volumes` and `volumeMounts` to mount the dataset from persistent volumes. For example, if you use link:https://www.unitxt.ai/en/latest/unitxt.loaders.html#unitxt.loaders.LoadCSV[LoadCSV], you need to mount the files to the container and make the dataset accessible for the evaluation process.
625
637
638
+
=== LLM-as-a-Judge Evaluation
639
+
640
+
Some LLM-as-a-Judge (LLMaaJ) evaluations are possible using custom Unitxt LLMaaJ metrics. See the link:https://www.unitxt.ai/en/latest/docs/llm_as_judge.html#llm-as-a-judge-metrics-guide[Unitxt metrics guide] for information on how to define custom metrics.
641
+
642
+
An example of a custom card and metric for LLMaaJ evaluation is given below.
"instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n",
734
+
"input_format": "[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
Several pre-existing link:https://www.unitxt.ai/en/latest/catalog/catalog.metrics.llm_as_judge.__dir__.html[LLMaaJ metrics] are also available in the Unitxt catalog.
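
As a rough sketch of how the two options fit together, the fragment below shows a task recipe that selects one LLMaaJ metric from the Unitxt catalog by `name` and one custom metric by `ref`. It assumes the `taskList` layout described in this tutorial. The card ID, the metric IDs, and the reference name `my_llmaaj_metric` are placeholders chosen for illustration, not real catalog entries, and the other recipe fields are omitted.

[source,yaml]
----
# Fragment of an LMEvalJob spec (illustrative only; every ID is a placeholder)
taskList:
  taskRecipes:
    - card:
        name: "<card-id>"                   # placeholder: a Unitxt card ID
      metrics:
        # Option 1: a pre-existing LLMaaJ metric, selected by its catalog ID
        - name: "<catalog-llmaaj-metric-id>"
        # Option 2: a custom metric, selected by its reference name
        - ref: my_llmaaj_metric             # hypothetical reference name
  custom:
    metrics:
      - name: my_llmaaj_metric              # must match the `ref` above
        value: |
          <JSON content of the saved custom metric>
----

A recipe would normally use only one of the two options. The `ref` entry resolves only if a custom metric with the same `name` is defined under `custom.metrics`.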
781
+
782
+
626
783
=== Using PVCs as storage
627
784
628
785
To use a PVC as storage for the `LMEvalJob` results, two modes are currently supported: managed PVCs and existing PVCs.