
Commit 57dfbcc

Merge pull request #60 from christinaexyou/add-lmeval-lls-tutorial
Add LMEval LLS tutorial
2 parents 3f0e145 + 9178d4f commit 57dfbcc

3 files changed: 187 additions, 0 deletions

docs/modules/ROOT/nav.adoc

Lines changed: 1 addition, 0 deletions

@@ -15,6 +15,7 @@
 ** xref:lm-eval-tutorial.adoc[]
 *** xref:lm-eval-tutorial-toxicity.adoc[Toxicity Measurement]
 ** xref:gorch-tutorial.adoc[]
+** xref:lm-eval-lls-tutorial.adoc[]
 * Components
 ** xref:trustyai-service.adoc[]
 ** xref:trustyai-operator.adoc[]
docs/modules/ROOT/pages/lm-eval-lls-tutorial.adoc

Lines changed: 185 additions, 0 deletions

@@ -0,0 +1,185 @@
= Getting Started with LMEval Llama Stack External Eval Provider
:description: Learn how to evaluate your language model using the LMEval Llama Stack External Eval Provider.
:keywords: LMEval, Llama Stack, model evaluation

== Prerequisites

* Admin access to an OpenShift cluster
* The TrustyAI operator installed in your OpenShift cluster
* KServe set to Raw Deployment mode
* A language model deployed on vLLM Serving Runtime in your OpenShift cluster
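
A quick way to sanity-check these prerequisites from a terminal is sketched below. It assumes you are already logged in with `oc`; the grep pattern is illustrative rather than an exact resource name, and the operator namespace may differ in your installation:

[source,bash]
----
# Confirm you are logged in and can reach the cluster
oc whoami

# Look for the TrustyAI operator pod (namespace and labels vary by install)
oc get pods --all-namespaces | grep -i trustyai

# Check that your model's InferenceService is ready in the current namespace
oc get inferenceservice
----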

== Overview

This tutorial demonstrates how to evaluate a language model using the LMEval Llama Stack External Eval Provider. You will learn how to:

* Configure a Llama Stack server to use the LMEval Eval provider
* Register a benchmark dataset
* Run a benchmark evaluation job on a language model

== Usage

Create and activate a virtual environment:

[source,bash]
----
python3 -m venv .venv
source .venv/bin/activate
----

Install the LMEval Llama Stack External Eval Provider from PyPI:

[source,bash]
----
pip install llama-stack-provider-lmeval
----

== Configuring the Llama Stack Server

Set the `VLLM_URL` and `TRUSTYAI_LM_EVAL_NAMESPACE` environment variables in your terminal. The `VLLM_URL` value should be the `v1/completions` endpoint of your model route, and the `TRUSTYAI_LM_EVAL_NAMESPACE` value should be the namespace where your model is deployed. For example:

[source,bash]
----
export VLLM_URL=https://$(oc get $(oc get ksvc -o name | grep predictor) --template={{.status.url}})/v1/completions

export TRUSTYAI_LM_EVAL_NAMESPACE=$(oc project | cut -d '"' -f2)
----

Download the `providers.d` directory and the `run.yaml` file:

[source,bash]
----
curl --create-dirs --output providers.d/remote/eval/trustyai_lmeval.yaml https://raw.githubusercontent.com/trustyai-explainability/llama-stack-provider-lmeval/refs/heads/main/providers.d/remote/eval/trustyai_lmeval.yaml

curl --create-dirs --output run.yaml https://raw.githubusercontent.com/trustyai-explainability/llama-stack-provider-lmeval/refs/heads/main/run.yaml
----

Start the Llama Stack server in a virtual environment:

[source,bash]
----
llama stack run run.yaml --image-type venv
----

This starts a Llama Stack server, which listens on port 8321 by default.
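
Before moving on, you can optionally confirm that the server is reachable. The check below is a minimal sketch that assumes the default port and that the server exposes its health route at `/v1/health`:

[source,bash]
----
# Expect an HTTP 200 response with a small JSON status payload if the server is up
curl -s http://localhost:8321/v1/health
----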

== Running an Evaluation

With the Llama Stack server running, create a Python script or Jupyter notebook to interact with the server and run an evaluation.

Import the necessary libraries and modules:

[source,python]
----
import os
import subprocess

import logging

import time
import pprint
----

Instantiate the Llama Stack Python client to interact with the running Llama Stack server:

[source,python]
----
BASE_URL = "http://localhost:8321"

def create_http_client():
    from llama_stack_client import LlamaStackClient
    return LlamaStackClient(base_url=BASE_URL)

client = create_http_client()
----

Check the current list of available benchmarks:

[source,python]
----
benchmarks = client.benchmarks.list()

pprint.pprint(f"Available benchmarks: {benchmarks}")
----

Register ARC-Easy, a dataset of grade-school-level, multiple-choice science questions, as a benchmark:

[source,python]
----
client.benchmarks.register(
    benchmark_id="trustyai_lmeval::arc_easy",
    dataset_id="trustyai_lmeval::arc_easy",
    scoring_functions=["string"],
    provider_benchmark_id="string",
    provider_id="trustyai_lmeval",
    metadata={
        "tokenizer": "google/flan-t5-small",
        "tokenized_requests": False,
    }
)
----

[NOTE]
LMEval comes with 100+ out-of-the-box datasets for evaluation, so feel free to experiment.
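
For example, registering a different LMEval task might look like the sketch below. The `hellaswag` task name is used purely to illustrate the `trustyai_lmeval::<task>` naming pattern; confirm the exact task identifier in the LMEval task list before using it:

[source,python]
----
# Illustrative registration of another LMEval task, following the same pattern as above
client.benchmarks.register(
    benchmark_id="trustyai_lmeval::hellaswag",
    dataset_id="trustyai_lmeval::hellaswag",
    scoring_functions=["string"],
    provider_benchmark_id="string",
    provider_id="trustyai_lmeval",
    metadata={"tokenized_requests": False},
)
----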

Verify that the benchmark has been registered successfully:

[source,python]
----
benchmarks = client.benchmarks.list()

pprint.pprint(f"Available benchmarks: {benchmarks}")
----

Run a benchmark evaluation on your model:

[source,python]
----
job = client.eval.run_eval(
    benchmark_id="trustyai_lmeval::arc_easy",
    benchmark_config={
        "eval_candidate": {
            "type": "model",
            "model": "phi-3",
            "provider_id": "trustyai_lmeval",
            "sampling_params": {
                "temperature": 0.7,
                "top_p": 0.9,
                "max_tokens": 256
            },
        },
        "num_examples": 1000,
    },
)

print(f"Starting job '{job.job_id}'")
----

[NOTE]
The `eval_candidate` section specifies the model to be evaluated, in this case "phi-3". Replace it with the name of your deployed model.
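
If you are unsure which identifier your deployed model is registered under, one way to check, assuming the models API is enabled in your Llama Stack distribution's `run.yaml`, is to list the models the server knows about:

[source,python]
----
# Print the models currently registered with the Llama Stack server
pprint.pprint(client.models.list())
----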

Monitor the status of the evaluation job. The job runs asynchronously, so you can check its status periodically:

[source,python]
----
def get_job_status(job_id, benchmark_id):
    return client.eval.jobs.status(job_id=job_id, benchmark_id=benchmark_id)

while True:
    job = get_job_status(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy")
    print(job)

    if job.status in ['failed', 'completed']:
        print(f"Job ended with status: {job.status}")
        break

    time.sleep(20)
----

Once the job status reports back as `completed`, get the results of the evaluation job:

[source,python]
----
pprint.pprint(client.eval.jobs.retrieve(job_id=job.job_id, benchmark_id="trustyai_lmeval::arc_easy").scores)
----

== Additional Resources

* This tutorial provides a high-level overview of how to use the LMEval Llama Stack External Eval Provider to evaluate language models. For a full end-to-end demo with explanations and output, please refer to https://github.com/trustyai-explainability/llama-stack-provider-lmeval/tree/main/demos[the official demos].

* If you have any questions or improvements to contribute, please feel free to open an issue or a pull request on https://github.com/trustyai-explainability/llama-stack-provider-lmeval[the project's GitHub repository].

docs/modules/ROOT/pages/main.adoc

Lines changed: 1 addition, 0 deletions

@@ -20,6 +20,7 @@ TrustyAI is a default component of https://opendatahub.io/[Open Data Hub] and ht
 * xref:python-trustyai.adoc[Python TrustyAI], a Python library allowing the usage of TrustyAI's toolkit from Jupyter notebooks
 * xref:component-kserve-explainer.adoc[KServe explainer], a TrustyAI side-car that integrates with KServe's built-in explainability features.
 * xref:component-lm-eval.adoc[LM-Eval], generative text model benchmark and evaluation service, leveraging lm-evaluation-harness and Unitxt
+* xref:component-gorch.adoc[Guardrails], generative text model guardrailing service, leveraging fms-guardrails-orchestrator