Commit 42ab3d1

Added JSL-Med Benchmark (#1617)
* Added MedS benchmark
* added JSL-MedS benchmark

1 parent 6c8a516 · commit 42ab3d1

30 files changed: +1,543 −14 lines changed

docs/_posts/Cabir40/2024-07-12-jsl_meds_q16_v1_en.md

Lines changed: 63 additions & 13 deletions

@@ -13,12 +13,12 @@ supported: true
 annotator: LLMLoader
 article_header:
   type: cover
-use_language_switcher: "Python-Scala-Java"
-
-deploy:
-  sagemaker_link: https://aws.amazon.com/marketplace/pp/prodview-yrajldynampw4
-  snowflake_link: https://app.snowflake.com/marketplace/listing/GZTYZ4386LJ68/john-snow-labs-medical-text-summarization-and-qa
-  databricks_link:
+use_language_switcher: "Python-Scala-Java"
+
+deploy:
+  sagemaker_link: https://aws.amazon.com/marketplace/pp/prodview-yrajldynampw4
+  snowflake_link: https://app.snowflake.com/marketplace/listing/GZTYZ4386LJ68/john-snow-labs-medical-text-summarization-and-qa
+  databricks_link:

 ---

@@ -38,13 +38,13 @@ This LLM model is trained to perform Summarization and Q&A based on a given cont
 [Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/models/jsl_meds_q16_v1_en_5.4.0_3.0_1720040078717.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
 [Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/models/jsl_meds_q16_v1_en_5.4.0_3.0_1720040078717.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

-{% if page.deploy %}
-## Available as Private API Endpoint
-
-{:.tac}
-{% include display_platform_information.html %}
-{% endif %}
-
+{% if page.deploy %}
+## Available as Private API Endpoint
+
+{:.tac}
+{% include display_platform_information.html %}
+{% endif %}
+
 ## How to use


@@ -116,3 +116,53 @@ val response = llmLoader.generate(prompt)



+## Benchmarking
+
+We have generated a total of 400 questions, 100 from each category. These questions were labeled and reviewed by 3 physician annotators. `%` indicates the preference rate
+
+```bash
+## Overall
+| Model      | Factuality % | Clinical Relevancy % | Conciseness % |
+|------------|--------------|----------------------|---------------|
+| JSL-MedS   | 0.24         | 0.25                 | 0.38          |
+| GPT4o      | 0.19         | 0.26                 | 0.27          |
+| Neutral    | 0.43         | 0.36                 | 0.18          |
+| None       | 0.14         | 0.13                 | 0.17          |
+| Total      | 1.00         | 1.00                 | 1.00          |
+
+## Summary
+| Model      | Factuality % | Clinical Relevancy % | Conciseness % |
+|------------|--------------|----------------------|---------------|
+| JSL-MedS   | 0.47         | 0.48                 | 0.42          |
+| GPT4o      | 0.25         | 0.25                 | 0.25          |
+| Neutral    | 0.22         | 0.22                 | 0.25          |
+| None       | 0.07         | 0.05                 | 0.08          |
+| Total      | 1.00         | 1.00                 | 1.00          |
+
+## QA
+| Model      | Factuality % | Clinical Relevancy % | Conciseness % |
+|------------|--------------|----------------------|---------------|
+| JSL-MedS   | 0.35         | 0.36                 | 0.42          |
+| GPT4o      | 0.24         | 0.24                 | 0.29          |
+| Neutral    | 0.33         | 0.33                 | 0.18          |
+| None       | 0.09         | 0.07                 | 0.11          |
+| Total      | 1.00         | 1.00                 | 1.00          |
+
+## BioMedical
+| Model      | Factuality % | Clinical Relevancy % | Conciseness % |
+|------------|--------------|----------------------|---------------|
+| JSL-MedS   | 0.33         | 0.24                 | 0.57          |
+| GPT4o      | 0.12         | 0.08                 | 0.16          |
+| Neutral    | 0.45         | 0.57                 | 0.16          |
+| None       | 0.10         | 0.10                 | 0.10          |
+| Total      | 1.00         | 1.00                 | 1.00          |
+
+## OpenEnded
+| Model      | Factuality % | Clinical Relevancy % | Conciseness % |
+|------------|--------------|----------------------|---------------|
+| JSL-MedS   | 0.35         | 0.30                 | 0.39          |
+| GPT4o      | 0.30         | 0.33                 | 0.41          |
+| Neutral    | 0.19         | 0.20                 | 0.02          |
+| None       | 0.17         | 0.17                 | 0.19          |
+| Total      | 1.00         | 1.00                 | 1.00          |
+```
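
Since every column's Total row is 1.00, the preference rates in these tables read as each option's share of annotator preferences for a given criterion. Below is a minimal sketch of such a tally, assuming one preference label per question per criterion; the vote list and the tallying code are illustrative assumptions, not the benchmark's actual scoring pipeline or data.

```python
# Minimal sketch of a preference-rate tally (assumed workflow, not the
# benchmark's actual scoring code). Each vote names the option preferred
# for one question on one criterion: "JSL-MedS", "GPT4o", "Neutral", or "None".
from collections import Counter

votes = [  # illustrative sample, not real benchmark data
    "JSL-MedS", "Neutral", "GPT4o", "JSL-MedS", "None",
    "Neutral", "JSL-MedS", "Neutral", "GPT4o", "Neutral",
]

counts = Counter(votes)
total = len(votes)

# Preference rate = votes for an option / total votes; the rates sum to 1.00,
# matching the "Total" row in the tables above.
for option in ("JSL-MedS", "GPT4o", "Neutral", "None"):
    print(f"{option:<10} {counts[option] / total:.2f}")
```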

docs/_posts/Cabir40/2024-07-12-jsl_meds_q4_v1_en.md

Lines changed: 50 additions & 0 deletions

@@ -103,3 +103,53 @@ val response = llmLoader.generate(prompt)

Appends the same `## Benchmarking` section and preference-rate tables (Overall, Summary, QA, BioMedical, OpenEnded) shown above for jsl_meds_q16_v1_en.md.

docs/_posts/Cabir40/2024-07-12-jsl_meds_q8_v1_en.md

Lines changed: 50 additions & 0 deletions

@@ -103,3 +103,53 @@ val response = llmLoader.generate(prompt)

Appends the same `## Benchmarking` section and preference-rate tables (Overall, Summary, QA, BioMedical, OpenEnded) shown above for jsl_meds_q16_v1_en.md.

docs/_posts/Cabir40/2024-07-12-jsl_medsner_zs_q16_v1_en.md

Lines changed: 50 additions & 0 deletions

@@ -127,3 +127,53 @@ val response = llmLoader.generate(prompt)

Appends the same `## Benchmarking` section and preference-rate tables (Overall, Summary, QA, BioMedical, OpenEnded) shown above for jsl_meds_q16_v1_en.md.

docs/_posts/Cabir40/2024-07-12-jsl_medsner_zs_q4_v1_en.md

Lines changed: 50 additions & 0 deletions

@@ -127,3 +127,53 @@ val response = llmLoader.generate(prompt)

Appends the same `## Benchmarking` section and preference-rate tables (Overall, Summary, QA, BioMedical, OpenEnded) shown above for jsl_meds_q16_v1_en.md.

docs/_posts/Cabir40/2024-07-12-jsl_medsner_zs_q8_v1_en.md

Lines changed: 50 additions & 0 deletions

@@ -127,3 +127,53 @@ val response = llmLoader.generate(prompt)

Appends the same `## Benchmarking` section and preference-rate tables (Overall, Summary, QA, BioMedical, OpenEnded) shown above for jsl_meds_q16_v1_en.md.
