
Commit d2281bb

upload new benchmark table (#1669)
* upload new benchmark table
* upload llm benchmark table
* update typo
* Update benchmark.md
1 parent c873925 commit d2281bb

2 files changed, +54 -1 lines changed

docs/en/benchmark.md

Lines changed: 28 additions & 0 deletions
@@ -659,6 +659,34 @@ deid_pipeline = Pipeline().setStages([
PS: The reason pipelines with the same stages have different costs is the number of layers in the NER models and the hardcoded regexes in Deidentification.

- ZeroShot Deidentification Pipelines Speed Comparison

- **[clinical_deidentification](https://nlp.johnsnowlabs.com/2024/03/27/clinical_deidentification_en.html)**: 2 NER, 1 clinical embedding, 13 Rule-based NER, 3 chunk merger, 1 Deidentification

- **[clinical_deidentification_zeroshot_medium](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_medium_en.html)**: 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

- **[clinical_deidentification_docwise_medium_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_medium_wip_en.html)**: 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification

- **[clinical_deidentification_zeroshot_large](https://nlp.johnsnowlabs.com/2024/12/04/clinical_deidentification_zeroshot_large_en.html)**: 1 ZeroShotNER, 18 Rule-based NER, 2 chunk merger

- **[clinical_deidentification_docwise_large_wip](https://nlp.johnsnowlabs.com/2024/12/03/clinical_deidentification_docwise_large_wip_en.html)**: 1 ZeroShotNER, 4 NER, 1 clinical embedding, 18 Rule-based NER, 3 chunk merger, 1 Deidentification
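Each of the pipelines above can be loaded by name. The snippet below is a minimal sketch, assuming a licensed Spark NLP for Healthcare (sparknlp_jsl) session is already started; the sample note text is invented for illustration.

```python
from sparknlp.pretrained import PretrainedPipeline

# Swap the name for any pipeline above, e.g.
# "clinical_deidentification_zeroshot_medium" or
# "clinical_deidentification_zeroshot_large".
deid = PretrainedPipeline("clinical_deidentification", "en", "clinical/models")

# Hypothetical sample note, for illustration only.
text = "Dr. John Smith saw Mary Brown on 2024-03-27 at Boston General Hospital."
print(deid.annotate(text))
```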
- CPU Testing:

{:.table-model-big.db}
| partition | clinical deidentification | clinical deidentification <br> zeroshot_medium | clinical deidentification <br> docwise_medium_wip | clinical deidentification <br> zeroshot_large | clinical deidentification <br> docwise_large_wip |
|-----------|---------------------------|-------------------------------------------|----------------------------------------------|------------------------------------------|---------------------------------------------|
| 4 | 295.8 | 520.8 | 862.7 | 1537.9 | 1832.4 |
| 8 | 195.0 | 345.6 | 577.0 | 1013.9 | 1228.3 |
| 16 | 133.3 | 227.2 | 401.8 | 666.2 | 835.2 |
| 32 | 109.5 | 160.9 | 305.3 | 456.9 | 614.7 |
| 64 | 92.0 | 166.8 | 291.5 | 465.0 | 584.9 |
| 100 | 79.3 | 174.1 | 274.8 | 495.3 | 587.8 |
| 1000 | 56.3 | 181.4 | 270.7 | 502.4 | 556.4 |
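The `partition` column is the number of Spark partitions the input DataFrame was split into before `transform()`. A rough sketch of how one such run might be timed, assuming an active `spark` session and the `deid` pipeline from the previous snippet (the noop sink forces full execution without writing output):

```python
import time

# Hypothetical input: one clinical note per row in a "text" column,
# repartitioned to match one of the rows in the table above.
df = spark.createDataFrame([(text,)] * 1000, ["text"]).repartition(32)

start = time.time()
deid.transform(df).write.mode("overwrite").format("noop").save()
print(f"32 partitions: {time.time() - start:.1f} s")
```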
</div><div class="h3-box" markdown="1">

### Deidentification Pipelines Cost Benchmarks

docs/en/benchmark_llm.md

Lines changed: 26 additions & 1 deletion
@@ -10,6 +10,31 @@ show_nav: true
sidebar:
nav: sparknlp-healthcare
---

<div class="h3-box" markdown="1">

## Medical Benchmarks

### Benchmarking

{:.table-model-big.db}
| Model | Average | MedMCQA | MedQA | MMLU <br>anatomy | MMLU<br>clinical<br>knowledge | MMLU<br>college<br>biology | MMLU<br>college<br>medicine | MMLU<br>medical<br>genetics | MMLU<br>professional<br>medicine | PubMedQA |
|-----------------|---------|---------|--------|------------------|-------------------------------|----------------------------|------------------------------|------------------------------|-----------------------------------|----------|
| jsl_medm_q4_v3 | 0.6884 | 0.6421 | 0.6889 | 0.7333 | 0.834 | 0.8681 | 0.7514 | 0.9 | 0.8493 | 0.782 |
| jsl_medm_q8_v3 | 0.6947 | 0.6416 | 0.707 | 0.7556 | 0.8377 | 0.9097 | 0.7688 | 0.9 | 0.8713 | 0.79 |
| jsl_medm_q16_v3 | 0.6964 | 0.6436 | 0.7117 | 0.7481 | 0.8453 | 0.9028 | 0.7688 | 0.87 | 0.8676 | 0.794 |
| jsl_meds_q4_v3 | 0.5522 | 0.5104 | 0.48 | 0.6444 | 0.7472 | 0.8333 | 0.6532 | 0.68 | 0.6691 | 0.752 |
| jsl_meds_q8_v3 | 0.5727 | 0.53 | 0.4933 | 0.6593 | 0.7623 | 0.8681 | 0.6301 | 0.76 | 0.7647 | 0.762 |
| jsl_meds_q16_v3 | 0.5793 | 0.5482 | 0.4839 | 0.637 | 0.7585 | 0.8403 | 0.6532 | 0.77 | 0.7022 | 0.766 |
</div><div class="h3-box" markdown="1">
### Benchmark Summary

We evaluated six John Snow Labs LLMs across nine task categories: MedMCQA, MedQA, MMLU Anatomy, MMLU Clinical Knowledge, MMLU College Biology, MMLU College Medicine, MMLU Medical Genetics, MMLU Professional Medicine, and PubMedQA.

Each model's performance was measured by accuracy, reflecting how well it handled medical reasoning, clinical knowledge, and biomedical question answering.
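Accuracy here is presumably exact match against the answer keys: the fraction of questions answered correctly. A minimal sketch of that arithmetic, with invented prediction and gold lists standing in for real model output:

```python
# Hypothetical predictions vs. gold answer keys for one benchmark split.
predictions = ["B", "A", "D", "C", "A"]
gold        = ["B", "C", "D", "C", "A"]

correct = sum(p == g for p, g in zip(predictions, gold))
accuracy = correct / len(gold)
print(f"accuracy = {accuracy:.4f}")  # 4 of 5 correct -> 0.8000
```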
</div><div class="h3-box" markdown="1">

<div class="h3-box" markdown="1">

@@ -204,4 +229,4 @@ GPT4o demonstrates strength in Clinical Relevance, especially in Biomedical and

Neutral and "None" ratings across categories highlight areas for further optimization for both models.
This analysis underscores the strengths of JSL-MedM in producing concise and factual outputs, while GPT4o shows a stronger contextual understanding in certain specialized tasks.

</div>
