
Commit 4721c35

Pdf name plus documentation and model card (#1816)
* Added Docs For Name Plus PDF Pipelines
* Updated File Names
* Added Signature Pipelines
* ocr - update in release notes pic

---------

Co-authored-by: albertoandreottiATgmail <albertoandreotti@gmail.com>
1 parent 7db00df commit 4721c35

7 files changed: +324 -2 lines changed

docs/_posts/nogifeet/2025-05-09-pdf_deid_multi_model_context_pipeline.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 layout: model
-title: PDF Deidentification Multi Model
+title: PDF Deidentification Multi Model Context
 author: John Snow Labs
 name: pdf_deid_multi_model_context_pipeline
 date: 2025-05-09

docs/_posts/nogifeet/2025-05-09-pdf_obfuscation_multi_model_context_pipeline.md

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 ---
 layout: model
-title: PDF Obfuscation Multi Model
+title: PDF Obfuscation Multi Model Context
 author: John Snow Labs
 name: pdf_obfuscation_multi_model_context_pipeline
 date: 2025-05-09
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+---
+layout: model
+title: PDF Deidentification Multilingual Name Plus
+author: John Snow Labs
+name: pdf_deid_multilingual_name_plus
+date: 2025-05-17
+tags: [en, licensed]
+task: De-identification
+language: en
+edition: Healthcare NLP 6.0.0
+spark_version: 3.2
+supported: true
+annotator: PipelineModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This pipeline can be used to mask PHI in PDFs. Masked entities include 'HOSPITAL', 'NAME', 'PATIENT', 'ID', 'MEDICALRECORD', 'IDNUM', 'COUNTRY', 'LOCATION', 'STREET', 'STATE', 'ZIP', 'CONTACT', 'PHONE', 'DATE'.
+The output is a PDF document similar to the input, but with black bounding boxes drawn over the targeted entities.
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/pdf_deid_multilingual_name_plus_en_6.0.0_3.0_1747131526000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/pdf_deid_multilingual_name_plus_en_6.0.0_3.0_1747131526000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+from sparknlp.pretrained import PretrainedPipeline
+deid_pipeline = PretrainedPipeline("pdf_deid_multilingual_name_plus", "en", "clinical/ocr")
+```
+
+</div>
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|pdf_deid_multilingual_name_plus|
+|Type:|pipeline|
+|Compatibility:|Healthcare NLP 6.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Language:|en|
+|Size:|3.8 GB|
+
+## Included Models
+
+- PdfToImage
+- ImageToText
+- DocumentAssembler
+- SentenceDetectorDLModel
+- RegexTokenizer
+- PretrainedZeroShotNER
+- NerConverter
+- WordEmbeddingsModel
+- MedicalNerModel
+- NerConverter
+- XLMRobertaEmbeddings
+- MedicalNerModel
+- NerConverter
+- ContextualParser
+- ChunkConverter
+- Merge
+- DeIdentification
+- NerOutputCleaner
+- PositionFinder
+- ImageDrawRegions
+- ImageToPdf
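
The model card above only loads the pipeline. As a rough sketch of how it might then be applied to a folder of PDFs, the snippet below reads the files with Spark's built-in `binaryFile` source and calls `transform`. It assumes `spark` is an already started, licensed session; the helper `redacted_name`, its `_deid` suffix, and the function `deidentify_pdfs` are hypothetical names introduced for illustration, and the name of the output column holding the masked PDF is not stated in the card, so it is left for the reader to inspect.

```python
from pathlib import Path


def redacted_name(pdf_path: str, suffix: str = "_deid") -> str:
    """Build an output file name for a processed PDF (hypothetical naming scheme)."""
    p = Path(pdf_path)
    return f"{p.stem}{suffix}{p.suffix}"


def deidentify_pdfs(spark, input_glob: str):
    """Apply the pretrained pipeline to every PDF matched by input_glob.

    The Spark NLP import is deferred so that redacted_name can be used
    on a machine without the licensed libraries installed.
    """
    from sparknlp.pretrained import PretrainedPipeline

    # Spark's binaryFile source yields path, modificationTime, length and content columns.
    pdfs = spark.read.format("binaryFile").load(input_glob)

    deid_pipeline = PretrainedPipeline(
        "pdf_deid_multilingual_name_plus", "en", "clinical/ocr"
    )

    # Assumption: the final ImageToPdf stage emits the masked document as a
    # binary column; check result.columns for its actual name before saving.
    return deid_pipeline.transform(pdfs)
```

A call such as `deidentify_pdfs(spark, "data/pdfs/*.pdf")` would return a DataFrame with one row per input file; `redacted_name("scans/report.pdf")` gives `report_deid.pdf` as a possible output name.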
Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
+---
+layout: model
+title: PDF Obfuscate Multilingual Name Plus
+author: John Snow Labs
+name: pdf_obfuscate_multilingual_name_plus
+date: 2025-05-17
+tags: [en, licensed]
+task: De-identification
+language: en
+edition: Healthcare NLP 6.0.0
+spark_version: 3.2
+supported: true
+annotator: PipelineModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This pipeline can be used to obfuscate PHI in PDFs. Targeted entities include 'HOSPITAL', 'NAME', 'PATIENT', 'ID', 'MEDICALRECORD', 'IDNUM', 'COUNTRY', 'LOCATION', 'STREET', 'STATE', 'ZIP', 'CONTACT', 'PHONE', 'DATE'.
+The output is a PDF document similar to the input, but with fake replacement text drawn over the targeted entities.
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/pdf_obfuscate_multilingual_name_plus_en_6.0.0_3.0_1747131526000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/pdf_obfuscate_multilingual_name_plus_en_6.0.0_3.0_1747131526000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+from sparknlp.pretrained import PretrainedPipeline
+deid_pipeline = PretrainedPipeline("pdf_obfuscate_multilingual_name_plus", "en", "clinical/ocr")
+```
+
+</div>
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|pdf_obfuscate_multilingual_name_plus|
+|Type:|pipeline|
+|Compatibility:|Healthcare NLP 6.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Language:|en|
+|Size:|3.8 GB|
+
+## Included Models
+
+- PdfToImage
+- ImageToText
+- DocumentAssembler
+- SentenceDetectorDLModel
+- RegexTokenizer
+- PretrainedZeroShotNER
+- NerConverter
+- WordEmbeddingsModel
+- MedicalNerModel
+- NerConverter
+- XLMRobertaEmbeddings
+- MedicalNerModel
+- NerConverter
+- ContextualParser
+- ChunkConverter
+- Merge
+- DeIdentification
+- NerOutputCleaner
+- PositionFinder
+- ImageDrawRegions
+- ImageToPdf
Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+---
+layout: model
+title: PDF Deidentification Multi Model Context Signature Aware
+author: John Snow Labs
+name: pdf_deid_multi_model_context_signature_aware_pipeline
+date: 2025-05-23
+tags: [en, licensed]
+task: De-identification
+language: en
+edition: Healthcare NLP 6.0.0
+spark_version: 3.2
+supported: true
+annotator: PipelineModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This pipeline can be used to mask PHI in PDFs. Masked entities include 'AGE', 'CITY', 'COUNTRY', 'DATE', 'DOCTOR', 'EMAIL', 'HOSPITAL', 'IDNUM', 'ORGANIZATION', 'PATIENT', 'PHONE', 'PROFESSION', 'STATE', 'STREET', 'USERNAME', 'ZIP'.
+The output is a PDF document similar to the input, but with black bounding boxes drawn over the targeted entities; signatures are also removed.
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/pdf_deid_multi_model_context_signature_aware_pipeline_en_6.0.0_3.0_1747909126000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/pdf_deid_multi_model_context_signature_aware_pipeline_en_6.0.0_3.0_1747909126000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+from sparknlp.pretrained import PretrainedPipeline
+deid_pipeline = PretrainedPipeline("pdf_deid_multi_model_context_signature_aware_pipeline", "en", "clinical/ocr")
+```
+
+</div>
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|pdf_deid_multi_model_context_signature_aware_pipeline|
+|Type:|pipeline|
+|Compatibility:|Healthcare NLP 6.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Language:|en|
+|Size:|4.7 GB|
+
+## Included Models
+
+- PdfToImage
+- ImageToText
+- DocumentAssembler
+- SentenceDetectorDLModel
+- Regex
+- WordEmbeddingsModel
+- MedicalNerModel
+- NerConverter
+- ContextualParserModel
+- ContextualParserModel
+- ContextualParserModel
+- ContextualParserModel
+- ContextualParserModel
+- ContextualParserModel
+- EntityExtractor
+- ContextualParserModel
+- RegexMatcher
+- ContextualParserModel
+- ContextualParserModel
+- ContextualParserModel
+- ContextualParserModel
+- RegexMatcher
+- ChunkMergeModel
+- ChunkMergeModel
+- XLMRobertaEmbeddings
+- MedicalNerModel
+- NerConverter
+- PretrainedZeroShotNER
+- NerConverter
+- PretrainedZeroShotNER
+- NerConverter
+- ChunkMergeModel
+- PositionFinder
+- ImageDrawRegions
+- HW_Signature_Detector
+- ImageDrawRegions
+- ImageToPdf
Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
+---
+layout: model
+title: PDF Deidentification Multilingual Name Plus Signature Aware
+author: John Snow Labs
+name: pdf_deid_multilingual_name_plus_signature_aware
+date: 2025-05-17
+tags: [en, licensed]
+task: De-identification
+language: en
+edition: Healthcare NLP 6.0.0
+spark_version: 3.2
+supported: true
+annotator: PipelineModel
+article_header:
+  type: cover
+use_language_switcher: "Python-Scala-Java"
+---
+
+## Description
+
+This pipeline can be used to mask PHI in PDFs. Masked entities include 'HOSPITAL', 'NAME', 'PATIENT', 'ID', 'MEDICALRECORD', 'IDNUM', 'COUNTRY', 'LOCATION', 'STREET', 'STATE', 'ZIP', 'CONTACT', 'PHONE', 'DATE'.
+The output is a PDF document similar to the input, but with black bounding boxes drawn over the targeted entities; signatures are also removed.
+
+{:.btn-box}
+<button class="button button-orange" disabled>Live Demo</button>
+<button class="button button-orange" disabled>Open in Colab</button>
+[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/clinical/ocr/pdf_deid_multilingual_name_plus_signature_aware_en_6.0.0_3.0_1747909126000.zip){:.button.button-orange.button-orange-trans.arr.button-icon.hidden}
+[Copy S3 URI](s3://auxdata.johnsnowlabs.com/clinical/ocr/pdf_deid_multilingual_name_plus_signature_aware_en_6.0.0_3.0_1747909126000.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}
+
+## How to use
+
+<div class="tabs-box" markdown="1">
+{% include programmingLanguageSelectScalaPythonNLU.html %}
+```python
+from sparknlp.pretrained import PretrainedPipeline
+deid_pipeline = PretrainedPipeline("pdf_deid_multilingual_name_plus_signature_aware", "en", "clinical/ocr")
+```
+
+</div>
+
+{:.model-param}
+## Model Information
+
+{:.table-model}
+|---|---|
+|Model Name:|pdf_deid_multilingual_name_plus_signature_aware|
+|Type:|pipeline|
+|Compatibility:|Healthcare NLP 6.0.0+|
+|License:|Licensed|
+|Edition:|Official|
+|Language:|en|
+|Size:|4.0 GB|
+
+## Included Models
+
+- PdfToImage
+- ImageToText
+- DocumentAssembler
+- SentenceDetectorDLModel
+- RegexTokenizer
+- PretrainedZeroShotNER
+- NerConverter
+- WordEmbeddingsModel
+- MedicalNerModel
+- NerConverter
+- XLMRobertaEmbeddings
+- MedicalNerModel
+- NerConverter
+- ContextualParser
+- ChunkConverter
+- Merge
+- DeIdentification
+- NerOutputCleaner
+- PositionFinder
+- ImageDrawRegions
+- HW_Signature_Detector
+- ImageDrawRegions
+- ImageToPdf
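
To see what the signature-aware variant adds over `pdf_deid_multilingual_name_plus`, the two "Included Models" lists from this commit can be compared as multisets. The sketch below simply copies the stage names from the two model cards; the variable names are illustrative, and the comparison says nothing about how the stages are configured internally.

```python
from collections import Counter

# Stage names copied from the "Included Models" list of pdf_deid_multilingual_name_plus.
base_stages = [
    "PdfToImage", "ImageToText", "DocumentAssembler", "SentenceDetectorDLModel",
    "RegexTokenizer", "PretrainedZeroShotNER", "NerConverter", "WordEmbeddingsModel",
    "MedicalNerModel", "NerConverter", "XLMRobertaEmbeddings", "MedicalNerModel",
    "NerConverter", "ContextualParser", "ChunkConverter", "Merge", "DeIdentification",
    "NerOutputCleaner", "PositionFinder", "ImageDrawRegions", "ImageToPdf",
]

# Stage names copied from pdf_deid_multilingual_name_plus_signature_aware.
signature_aware_stages = [
    "PdfToImage", "ImageToText", "DocumentAssembler", "SentenceDetectorDLModel",
    "RegexTokenizer", "PretrainedZeroShotNER", "NerConverter", "WordEmbeddingsModel",
    "MedicalNerModel", "NerConverter", "XLMRobertaEmbeddings", "MedicalNerModel",
    "NerConverter", "ContextualParser", "ChunkConverter", "Merge", "DeIdentification",
    "NerOutputCleaner", "PositionFinder", "ImageDrawRegions", "HW_Signature_Detector",
    "ImageDrawRegions", "ImageToPdf",
]

# Counter subtraction keeps only stages that occur more often in the second list.
extra_stages = Counter(signature_aware_stages) - Counter(base_stages)
missing_stages = Counter(base_stages) - Counter(signature_aware_stages)

print("added:", dict(extra_stages))
print("dropped:", dict(missing_stages))
```

Going by the lists alone, the only difference is one `HW_Signature_Detector` stage followed by a second `ImageDrawRegions` stage just before `ImageToPdf`, which matches the description's note that signatures are removed in addition to the text entities.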