
Commit d2281bb

upload new benchmark table (#1669)
* upload new benchmark table
* upload llm benchmark table
* update typo
* Update benchmark.md
1 parent c873925 commit d2281bb

2 files changed, +54 -1 lines changed

docs/en/benchmark.md

Lines changed: 28 additions & 0 deletions
@@ -659,6 +659,34 @@ deid_pipeline = Pipeline().setStages([
PS: The reason pipelines with the same stages have different costs is the number of layers in the NER models and the hardcoded regexes in Deidentification.

- ZeroShot Deidentification Pipelines Speed Comparison

- **[clinical_deidentification](https://nlp.johnsnowlabs.com/2024/03/27/clinical_deidentification_en.html)**: 2 NER, 1 clinical embedding, 13 Rule-based NER, 3 chunk merger, 1 Deidentification

- **[clinical_deidentification_zeroshot_medium](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_medium_en.html)**: 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

- **[clinical_deidentification_docwise_medium_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_medium_wip_en.html)**: 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification

- **[clinical_deidentification_zeroshot_large](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_large_en.html)**: 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

- **[clinical_deidentification_docwise_large_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_large_wip_en.html)**: 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification
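Each of the pipelines above can be loaded by name. The snippet below is a minimal sketch, assuming a licensed Spark NLP for Healthcare (sparknlp_jsl) session is already started; the sample note text is invented for illustration.

```python
from sparknlp.pretrained import PretrainedPipeline

# Swap the name for any pipeline above, e.g.
# "clinical_deidentification_zeroshot_medium" or
# "clinical_deidentification_zeroshot_large".
deid = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

# Hypothetical sample note, for illustration only.
text = "Dr. John Smith saw Mary Brown on 2024-03-27 at Boston General Hospital."
print(deid.annotate(text))
```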
- CPU Testing:

{:.table-model-big.db}
| partition | clinical deidentification | clinical deidentification <br> zeroshot_medium | clinical deidentification <br> docwise_medium_wip | clinical deidentification <br> zeroshot_large | clinical deidentification <br> docwise_large_wip |
|-----------|---------------------------|-------------------------------------------|----------------------------------------------|------------------------------------------|---------------------------------------------|
| 4 | 295.8 | 520.8 | 862.7 | 1537.9 | 1832.4 |
| 8 | 195.0 | 345.6 | 577.0 | 1013.9 | 1228.3 |
| 16 | 133.3 | 227.2 | 401.8 | 666.2 | 835.2 |
| 32 | 109.5 | 160.9 | 305.3 | 456.9 | 614.7 |
| 64 | 92.0 | 166.8 | 291.5 | 465.0 | 584.9 |
| 100 | 79.3 | 174.1 | 274.8 | 495.3 | 587.8 |
| 1000 | 56.3 | 181.4 | 270.7 | 502.4 | 556.4 |
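The `partition` column is the number of Spark partitions the input DataFrame was split into before `transform()`. A rough sketch of how one such run might be timed, assuming an active `spark` session and the `deid` pipeline from the previous snippet (the noop sink forces full execution without writing output):

```python
import time

# Hypothetical input: one clinical note per row in a "text" column,
# repartitioned to match one of the rows in the table above.
df = spark.createDataFrame([(text,)] * 1000, ["text"]).repartition(32)

start = time.time()
deid.transform(df).write.mode("overwrite").format("noop").save()
print(f"32 partitions: {time.time() - start:.1f} s")
```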
</div><div class="h3-box" markdown="1">

### Deidentification Pipelines Cost Benchmarks

docs/en/benchmark_llm.md

Lines changed: 26 additions & 1 deletion
@@ -10,6 +10,31 @@ show_nav: true
sidebar:
nav: sparknlp-healthcare
---

<div class="h3-box" markdown="1">

## Medical Benchmarks

### Benchmarking

{:.table-model-big.db}
| Model | Average | MedMCQA | MedQA | MMLU <br>anatomy | MMLU<br>clinical<br>knowledge | MMLU<br>college<br>biology | MMLU<br>college<br>medicine | MMLU<br>medical<br>genetics | MMLU<br>professional<br>medicine | PubMedQA |
|-----------------|---------|---------|--------|------------------|-------------------------------|----------------------------|------------------------------|------------------------------|-----------------------------------|----------|
| jsl_medm_q4_v3 | 0.6884 | 0.6421 | 0.6889 | 0.7333 | 0.834 | 0.8681 | 0.7514 | 0.9 | 0.8493 | 0.782 |
| jsl_medm_q8_v3 | 0.6947 | 0.6416 | 0.707 | 0.7556 | 0.8377 | 0.9097 | 0.7688 | 0.9 | 0.8713 | 0.79 |
| jsl_medm_q16_v3 | 0.6964 | 0.6436 | 0.7117 | 0.7481 | 0.8453 | 0.9028 | 0.7688 | 0.87 | 0.8676 | 0.794 |
| jsl_meds_q4_v3 | 0.5522 | 0.5104 | 0.48 | 0.6444 | 0.7472 | 0.8333 | 0.6532 | 0.68 | 0.6691 | 0.752 |
| jsl_meds_q8_v3 | 0.5727 | 0.53 | 0.4933 | 0.6593 | 0.7623 | 0.8681 | 0.6301 | 0.76 | 0.7647 | 0.762 |
| jsl_meds_q16_v3 | 0.5793 | 0.5482 | 0.4839 | 0.637 | 0.7585 | 0.8403 | 0.6532 | 0.77 | 0.7022 | 0.766 |
</div><div class="h3-box" markdown="1">
### Benchmark Summary

We evaluated six John Snow Labs LLMs across nine task categories: MedMCQA, MedQA, MMLU Anatomy, MMLU Clinical Knowledge, MMLU College Biology, MMLU College Medicine, MMLU Medical Genetics, MMLU Professional Medicine, and PubMedQA.

Each model's performance was measured by accuracy, reflecting how well it handled medical reasoning, clinical knowledge, and biomedical question answering.
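Accuracy here is presumably exact match against the answer keys: the fraction of questions answered correctly. A minimal sketch of that arithmetic, with invented prediction and gold lists standing in for real model output:

```python
# Hypothetical predictions vs. gold answer keys for one benchmark split.
predictions = ["B", "A", "D", "C", "A"]
gold        = ["B", "C", "D", "C", "A"]

correct = sum(p == g for p, g in zip(predictions, gold))
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.4f}")  # 4 of 5 correct -> 0.8000
```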
</div><div class="h3-box" markdown="1">

<div class="h3-box" markdown="1">

@@ -204,4 +229,4 @@ GPT4o demonstrates strength in Clinical Relevance, especially in Biomedical and

Neutral and "None" ratings across categories highlight areas for further optimization for both models.
This analysis underscores the strengths of JSL-MedM in producing concise and factual outputs, while GPT4o shows a stronger contextual understanding in certain specialized tasks.

</div>
