
Commit 3f0e145

Add LLMaaJ info to lm-eval tutorial (#59)
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
1 parent ffdf0aa commit 3f0e145

File tree

1 file changed: +161 -4 lines


docs/modules/ROOT/pages/lm-eval-tutorial.adoc

Lines changed: 161 additions & 4 deletions

@@ -209,10 +209,12 @@ Specify the task using the Unitxt recipe format:
* `systemPrompt`: Use `name` to specify a Unitxt catalog system prompt or `ref` to refer to a custom prompt:
** `name`: Specify a Unitxt system prompt from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.system_prompts.__dir__.html++[Unitxt catalog]. Use the system prompt's ID as the value.
** `ref`: Specify the reference name of a custom system prompt as defined in the `custom` section below.
-* `task` (optional): Specify a Unitxt task from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.cards.__dir__.html++[Unitxt catalog]. Use the task's ID as the value.
-A Unitxt card has a pre-defined task. Only specify a value for this if you want to run different task.
-* `metrics` (optional): Specify a list of Unitx metrics from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.metrics.__dir__.html++[Unitxt catalog]. Use the metric's ID as the value.
-A Unitxt task has a set of pre-defined metrics. Only specify a set of metrics if you need different metrics.
+* `task` (optional): Specify a Unitxt task by `name` or `ref`. A Unitxt card has a pre-defined task, so only set this field if you want to run a different task.
+** `name`: Specify a Unitxt task from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.cards.__dir__.html++[Unitxt catalog]. Use the task's ID as the value.
+** `ref`: Specify the reference name of a custom task as defined in the `custom` section below.
+* `metrics` (optional): Specify a list of Unitxt metrics by `name` or `ref`. A Unitxt task has a set of pre-defined metrics, so only specify metrics if you need different ones.
+** `name`: Specify a metric from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.metrics.__dir__.html++[Unitxt catalog]. Use the metric's ID as the value.
+** `ref`: Specify the reference name of a custom metric as defined in the `custom` section below.
* `format` (optional): Specify a Unitxt format from the link:++https://www.unitxt.ai/en/latest/catalog/catalog.formats.__dir__.html++[Unitxt catalog]. Use the format's ID as the value.
* `loaderLimit` (optional): Specifies the maximum number of instances per stream to be returned from the loader (used to reduce loading time in large datasets).
* `numDemos` (optional): Number of few-shot examples to be used.
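
To make the `name`/`ref` distinction above concrete, here is a minimal, illustrative `taskRecipes` sketch. It is not from this commit: the catalog card and metric IDs are example values, and `my_custom_task`/`my_custom_metric` are hypothetical names that would have to be defined in the `custom` section described below.

[source,yaml]
----
taskList:
  taskRecipes:
    - card:
        name: cards.wnli              # a card from the Unitxt catalog (example value)
      task:
        ref: my_custom_task           # hypothetical custom task defined under `custom.tasks`
      metrics:
        - name: metrics.accuracy      # catalog metric, referenced by its ID (example value)
        - ref: my_custom_metric       # hypothetical custom metric defined under `custom.metrics`
----

In short, `name` points at an existing catalog entry by its ID, while `ref` points at an artifact you define yourself.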

@@ -239,6 +241,16 @@ Specify the task using the Unitxt recipe format:
** `value`: A string for a custom Unitxt system prompt.
The documentation link:https://www.unitxt.ai/en/latest/docs/adding_format.html#formats[here]
provides an overview of the different components that make up a prompt format, including the system prompt.
+* `tasks`: Define custom tasks to use, each with a `name` and `value` field:
+** `name`: The name of this custom task, referenced in the `task.ref` field of a task recipe.
+** `value`: A JSON string for a custom Unitxt task.
+Use the documentation link:https://www.unitxt.ai/en/latest/docs/adding_task.html[here]
+to compose a custom task, then use the documentation link:https://www.unitxt.ai/en/latest/docs/saving_and_loading_from_catalog.html[here] to store it as a JSON file and use the JSON content as the value of this field.
+* `metrics`: Define custom metrics to use, each with a `name` and `value` field:
+** `name`: The name of this custom metric, referenced in the `metrics.ref` field of a task recipe.
+** `value`: A JSON string for a custom Unitxt metric.
+Use the documentation link:https://www.unitxt.ai/en/latest/docs/adding_metric.html[here]
+to compose a custom metric, then use the documentation link:https://www.unitxt.ai/en/latest/docs/saving_and_loading_from_catalog.html[here] to store it as a JSON file and use the JSON content as the value of this field.

|`numFewShot`
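
As a skeleton of how these `custom` entries pair with the `ref` fields of a task recipe (names and JSON bodies below are placeholders; the LLM-as-a-Judge example later on this page shows complete values):

[source,yaml]
----
taskList:
  taskRecipes:
    - card:
        name: cards.wnli                  # placeholder card
      task:
        ref: my_custom_task
      metrics:
        - ref: my_custom_metric
  custom:
    tasks:
      - name: my_custom_task
        value: |
          { "__type__": "task", ... }     # JSON composed per the Unitxt docs linked above
    metrics:
      - name: my_custom_metric
        value: |
          { "__type__": "...", ... }      # JSON for the custom metric
----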

@@ -623,6 +635,151 @@ Inside the custom card, it uses the HuggingFace dataset loader:

You can use other link:https://www.unitxt.ai/en/latest/unitxt.loaders.html#module-unitxt.loaders[loaders] together with the `volumes` and `volumeMounts` fields to mount the dataset from persistent volumes. For example, if you use link:https://www.unitxt.ai/en/latest/unitxt.loaders.html#unitxt.loaders.LoadCSV[LoadCSV], you need to mount the files into the container and make the dataset accessible to the evaluation process.

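For instance, a dataset on an existing PVC could be mounted into the evaluation container roughly like this. This is a sketch only; the volume name, mount path, and claim name are placeholders.

[source,yaml]
----
spec:
  pod:
    container:
      volumeMounts:
        - name: my-dataset                  # placeholder volume name
          mountPath: /opt/app-root/src/mydata
    volumes:
      - name: my-dataset
        persistentVolumeClaim:
          claimName: my-dataset-pvc         # placeholder PVC holding the CSV files
----

The `LoadCSV` loader in the custom card would then reference file paths under the chosen mount path.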

=== LLM-as-a-Judge Evaluation

Certain LLM-as-a-Judge (LLMaaJ) evaluations are possible by using custom Unitxt LLMaaJ metrics. See the link:https://www.unitxt.ai/en/latest/docs/llm_as_judge.html#llm-as-a-judge-metrics-guide[Unitxt metrics guide] for information on how to define custom metrics.

The example below defines a custom card, template, task, and LLMaaJ metric, and wires them together in a task recipe:
[source,yaml]
----
apiVersion: trustyai.opendatahub.io/v1alpha1
kind: LMEvalJob
metadata:
  name: custom-llmaaj-metric
spec:
  allowOnline: true
  allowCodeExecution: true
  model: hf
  modelArgs:
    - name: pretrained
      value: google/flan-t5-small
  taskList:
    taskRecipes:
      - card:
          custom: |
            {
              "__type__": "task_card",
              "loader": {
                "__type__": "load_hf",
                "path": "OfirArviv/mt_bench_single_score_gpt4_judgement",
                "split": "train"
              },
              "preprocess_steps": [
                {
                  "__type__": "rename_splits",
                  "mapper": {
                    "train": "test"
                  }
                },
                {
                  "__type__": "filter_by_condition",
                  "values": {
                    "turn": 1
                  },
                  "condition": "eq"
                },
                {
                  "__type__": "filter_by_condition",
                  "values": {
                    "reference": "[]"
                  },
                  "condition": "eq"
                },
                {
                  "__type__": "rename",
                  "field_to_field": {
                    "model_input": "question",
                    "score": "rating",
                    "category": "group",
                    "model_output": "answer"
                  }
                },
                {
                  "__type__": "literal_eval",
                  "field": "question"
                },
                {
                  "__type__": "copy",
                  "field": "question/0",
                  "to_field": "question"
                },
                {
                  "__type__": "literal_eval",
                  "field": "answer"
                },
                {
                  "__type__": "copy",
                  "field": "answer/0",
                  "to_field": "answer"
                }
              ],
              "task": "tasks.response_assessment.rating.single_turn",
              "templates": [
                "templates.response_assessment.rating.mt_bench_single_turn"
              ]
            }
        template:
          ref: response_assessment.rating.mt_bench_single_turn
        format: formats.models.mistral.instruction
        metrics:
          - ref: llmaaj_metric
    custom:
      templates:
        - name: response_assessment.rating.mt_bench_single_turn
          value: |
            {
              "__type__": "input_output_template",
              "instruction": "Please act as an impartial judge and evaluate the quality of the response provided by an AI assistant to the user question displayed below. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of the response. Begin your evaluation by providing a short explanation. Be as objective as possible. After providing your explanation, you must rate the response on a scale of 1 to 10 by strictly following this format: \"[[rating]]\", for example: \"Rating: [[5]]\".\n\n",
              "input_format": "[Question]\n{question}\n\n[The Start of Assistant's Answer]\n{answer}\n[The End of Assistant's Answer]",
              "output_format": "[[{rating}]]",
              "postprocessors": [
                "processors.extract_mt_bench_rating_judgment"
              ]
            }
      tasks:
        - name: response_assessment.rating.single_turn
          value: |
            {
              "__type__": "task",
              "input_fields": {
                "question": "str",
                "answer": "str"
              },
              "outputs": {
                "rating": "float"
              },
              "metrics": [
                "metrics.spearman"
              ]
            }
      metrics:
        - name: llmaaj_metric
          value: |
            {
              "__type__": "llm_as_judge",
              "inference_model": {
                "__type__": "hf_pipeline_based_inference_engine",
                "model_name": "mistralai/Mistral-7B-Instruct-v0.2",
                "max_new_tokens": 256,
                "use_fp16": true
              },
              "template": "templates.response_assessment.rating.mt_bench_single_turn",
              "task": "rating.single_turn",
              "format": "formats.models.mistral.instruction",
              "main_score": "mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn"
            }
  logSamples: true
  pod:
    container:
      env:
        - name: HF_TOKEN
          value: <HF_TOKEN>
----

There are also a handful of pre-existing link:https://www.unitxt.ai/en/latest/catalog/catalog.metrics.llm_as_judge.__dir__.html[LLMaaJ metrics] available in the Unitxt catalog.
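
For example, instead of defining `llmaaj_metric` yourself, a task recipe could point at a catalog judge metric by ID. This is a sketch only, and the metric ID below is illustrative; browse the linked catalog page for the judges that are actually available.

[source,yaml]
----
taskList:
  taskRecipes:
    - card:
        custom: |
          { ... }    # a task card as in the example above
      metrics:
        # illustrative catalog ID; see the linked catalog page for available judge metrics
        - name: metrics.llm_as_judge.rating.mistral_7b_instruct_v0_2_huggingface_template_mt_bench_single_turn
----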

=== Using PVCs as storage

To use a PVC as storage for the `LMEvalJob` results, there are currently two supported modes: managed PVCs and existing PVCs.
