Commit efe4f1b

updated annotators (#1123)
1 parent f2a55fc commit efe4f1b

4 files changed

+481
-0
lines changed

docs/en/annotators.md

Lines changed: 2 additions & 0 deletions
@@ -69,6 +69,8 @@ There are two types of Annotators:
6969
{% include templates/anno_table_entry.md path="" name="ImageAssembler" summary="Prepares images read by Spark into a format that is processable by Spark NLP."%}
7070
{% include templates/anno_table_entry.md path="" name="LanguageDetectorDL" summary="Language Identification and Detection by using CNN and RNN architectures in TensorFlow."%}
7171
{% include templates/anno_table_entry.md path="" name="Lemmatizer" summary="Finds lemmas out of words with the objective of returning a base dictionary word."%}
72+
{% include templates/licensed_table_entry.md name="LightDeIdentification" summary="Light version of DeIdentification."%}
73+
{% include templates/licensed_table_entry.md name="MultiChunk2Doc" summary="Merges the given chunks to create a document."%}
7274
{% include templates/anno_table_entry.md path="" name="MultiClassifierDL" summary="Multi-label Text Classification."%}
7375
{% include templates/anno_table_entry.md path="" name="MultiDateMatcher" summary="Matches standard date formats into a provided format."%}
7476
{% include templates/anno_table_entry.md path="" name="MultiDocumentAssembler" summary="Prepares data into a format that is processable by Spark NLP."%}
Lines changed: 281 additions & 0 deletions
@@ -0,0 +1,281 @@
1+
{%- capture title -%}
2+
LightDeIdentification
3+
{%- endcapture -%}
4+
5+
{%- capture model -%}
6+
model
7+
{%- endcapture -%}
8+
9+
{%- capture model_description -%}
10+
11+
Light DeIdentification is a lightweight version of DeIdentification. It replaces sensitive information
12+
in a text with masked or obfuscated (fake) values. It is designed to work with healthcare data,
13+
and it can be used to de-identify patient names, dates, and other sensitive information.
14+
It can also mask or obfuscate other kinds of identifying details, such as doctor names and hospital
15+
names.
16+
Additionally, it supports millions of embedded fakers,
17+
and custom external fakers can be supplied with the `setCustomFakers` function.
18+
It supports multiple languages, such as English, Spanish, French, German, and Arabic.
19+
It also supports multi-mode de-identification, applying different modes to different entities in the same pass, via the `setSelectiveObfuscationModes` function.
20+
21+
Parameters:
22+
23+
- `mode` *(str)*: Mode for the anonymizer ['mask'|'obfuscate']
24+
25+
- `dateEntities` *(list[str])*: List of date entities. Default: ['DATE', 'DOB', 'DOD']
26+
27+
- `obfuscateDate` *(Bool)*: When mode=='obfuscate', whether to obfuscate dates or not. This param helps keep date handling consistent by making `dateFormats` more visible.
28+
When set to `True`, make sure the `dateFormats` param fits your needs.
29+
If the value is `True` and obfuscation fails, the `unnormalizedDateMode` param is applied.
30+
When set to `False`, the date is masked as <DATE>.
31+
Default: False
32+
33+
- `unnormalizedDateMode` *(str)*: The mode to use when a date cannot be normalized (parsed). Options: [mask, obfuscate, skip]. Default: obfuscate.
34+
35+
- `days` *(Int)*: Number of days by which to displace dates during obfuscation. If not provided, a random integer between 1 and 60 is used.
36+
37+
- `useShiftDays` *(Bool)*: Whether to use the random shift day when the document has this in its metadata. Default: False
38+
39+
- `dateFormats` *(list[str])*: List of date formats to automatically displace if parsed.
40+
41+
- `region` *(str)*: The region to use for date parsing. This property is mainly used when obfuscating dates.
42+
It decides whether the first or the second part of a date such as 11/11/2023 is read as the day.
43+
Options: 'eu' for the European Union, 'us' for the USA. Default: 'eu'
44+
45+
- `obfuscateRefSource` *(str)*: The source of the replacement values used to obfuscate the entities. This property does not apply to date entities.
46+
The values are the following:
47+
custom: Takes the entities from the setCustomFakers function.
48+
faker: Takes the entities from the Faker module.
49+
both: Takes the entities randomly from both the setCustomFakers function and the Faker module.
50+
51+
- `language` *(str)*: The language used to select the regex file and some faker entities.
52+
The values are the following:
53+
'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'.
54+
55+
- `seed` *(Int)*: The seed used to select the entities in obfuscate mode. With a fixed seed,
56+
you can repeat an execution several times and get the same output.
57+
58+
- `maskingPolicy` *(str)*: Select the masking policy:
59+
same_length_chars: Replace the entity with a masking sequence of asterisks enclosed in square brackets, so that the total length of the masking sequence equals the length of the original text.
60+
Example: Smith -> [***].
61+
If the entity is shorter than 3 characters (like Jo, or 5), asterisks without brackets are returned.
62+
entity_labels: Replace the values with the corresponding entity labels.
63+
fixed_length_chars: Replace the entity with a masking sequence composed of a fixed number of asterisks.
64+
65+
- `fixedMaskLength` *(Int)*: The length of the masking sequence in case of fixed_length_chars masking policy.
66+
67+
- `sameLengthFormattedEntities` *(list[str])*: List of formatted entities whose obfuscated output keeps the same length as the original.
68+
The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE.
69+
70+
- `genderAwareness` *(Bool)*: Whether to use gender-aware names or not during obfuscation. This param affects only names.
71+
Setting it to True might decrease performance. Default: False
72+
73+
- `ageRanges` *(list[int])*: List of integers specifying the limits of the age groups to preserve during obfuscation.
74+
75+
- `selectiveObfuscationModes` *(dict[str, dict[str]])*: The dictionary of modes to enable multi-mode deIdentification.
76+
'obfuscate': Replace the values with random values.
77+
'mask_same_length_chars': Replace the value with asterisks enclosed in square brackets, keeping the original length (the asterisk count is the length minus two).
78+
'mask_entity_labels': Replace the values with the entity label.
79+
'mask_fixed_length_chars': Replace the value with a fixed number of asterisks. The length can be set with `setFixedMaskLength()`.
80+
'skip': Leave the values intact.
81+
Entities not listed in the dictionary are de-identified according to the `mode` param.
82+
83+
- `customFakers` *(dict[str, list[str]])*: The dictionary of custom fakers to specify the obfuscation terms for the entities.
84+
You can specify the entity and the terms to be used for obfuscation.
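
The three `maskingPolicy` values described above can be illustrated with a minimal, self-contained sketch. The `mask_entity` helper is hypothetical and is not part of the library; it only mirrors the documented behavior, and the default fixed length of 4 is an assumption chosen for the example.

```python
def mask_entity(text: str, label: str, policy: str, fixed_length: int = 4) -> str:
    """Hypothetical helper that mimics the documented masking policies."""
    if policy == "same_length_chars":
        # Entities shorter than 3 characters get asterisks only, no brackets.
        if len(text) < 3:
            return "*" * len(text)
        # The two brackets count toward the total length.
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "entity_labels":
        # Replace the value with its entity label, e.g. <PATIENT>.
        return f"<{label}>"
    if policy == "fixed_length_chars":
        # fixed_length is an assumed default; see fixedMaskLength.
        return "*" * fixed_length
    raise ValueError(f"unknown masking policy: {policy}")


print(mask_entity("Smith", "PATIENT", "same_length_chars"))  # [***]
print(mask_entity("Jo", "PATIENT", "same_length_chars"))     # **
print(mask_entity("Smith", "PATIENT", "entity_labels"))      # <PATIENT>
```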
85+
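
How `days` and `region` interact can be sketched in plain Python. This is only an illustration: `shift_date` and `REGION_FORMATS` are hypothetical names, the formats use Python `strptime` syntax rather than the `dateFormats` patterns accepted by the annotator, and returning `None` stands in for the case where `unnormalizedDateMode` would take over.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed mapping for the illustration: 'eu' reads day first, 'us' reads month first.
REGION_FORMATS = {"eu": "%d/%m/%Y", "us": "%m/%d/%Y"}


def shift_date(text: str, days: int, region: str = "eu") -> Optional[str]:
    """Displace a date by `days`, reading it according to `region`."""
    fmt = REGION_FORMATS[region]
    try:
        parsed = datetime.strptime(text, fmt)
    except ValueError:
        # The date does not match the expected format (unnormalized date).
        return None
    return (parsed + timedelta(days=days)).strftime(fmt)


print(shift_date("01/01/1980", 5, region="us"))  # 01/06/1980
print(shift_date("02/03/2023", 5, region="eu"))  # 07/03/2023
print(shift_date("02/03/2023", 5, region="us"))  # 02/08/2023
```

The same input string yields different results for 'eu' and 'us', which is why the region matters for ambiguous dates.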
86+
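
The fallback rule for `selectiveObfuscationModes` (entities not listed use `mode`) can be sketched as follows. The `resolve_mode` helper is hypothetical, and the dictionary shape shown here (each mode mapped to a list of entity labels) is an assumption made for the illustration, so check the API reference for the exact type.

```python
def resolve_mode(entity_label: str, selective_modes: dict, default_mode: str) -> str:
    """Return the mode listed for an entity label, or the default `mode`."""
    for mode, labels in selective_modes.items():
        if entity_label in labels:
            return mode
    # Entities missing from the dictionary fall back to the global mode.
    return default_mode


# Assumed shape: mode name -> list of entity labels.
selective_modes = {
    "obfuscate": ["PHONE"],
    "mask_entity_labels": ["ID"],
    "skip": ["DATE"],
}

print(resolve_mode("PHONE", selective_modes, "mask"))   # obfuscate
print(resolve_mode("DOCTOR", selective_modes, "mask"))  # mask
```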
87+
88+
{%- endcapture -%}
89+
90+
{%- capture model_input_anno -%}
91+
DOCUMENT, CHUNK
92+
{%- endcapture -%}
93+
94+
{%- capture model_output_anno -%}
95+
DOCUMENT
96+
{%- endcapture -%}
97+
98+
{%- capture model_python_medical -%}
99+
100+
from johnsnowlabs import nlp, medical
101+
102+
sentences = [
103+
['Record date: 01/01/1980'],
104+
['Johnson, M.D.'],
105+
['Gastby Hospital.'],
106+
['Camel Street.'],
107+
['My name is George.']
108+
]
109+
110+
input_df = spark.createDataFrame(sentences).toDF("text")
111+
112+
document_assembler = nlp.DocumentAssembler()\
113+
.setInputCol("text")\
114+
.setOutputCol("document")
115+
116+
sentence_detector = nlp.SentenceDetector()\
117+
.setInputCols(["document"])\
118+
.setOutputCol("sentence")
119+
120+
tokenizer = nlp.Tokenizer()\
121+
.setInputCols(["sentence"])\
122+
.setOutputCol("token")
123+
124+
word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \
125+
.setInputCols(["sentence", "token"]) \
126+
.setOutputCol("embeddings")
127+
128+
ner_tagger = nlp.NerDLModel.pretrained("deidentify_dl", "en", "clinical/models") \
129+
.setInputCols(["sentence", "token", "embeddings"]) \
130+
.setOutputCol("ner")
131+
132+
ner_converter = medical.NerConverterInternal()\
133+
.setInputCols(["sentence", "token", "ner"])\
134+
.setOutputCol("ner_chunk")
135+
136+
light_de_identification = medical.LightDeIdentification() \
137+
.setInputCols(["ner_chunk", "sentence"]) \
138+
.setOutputCol("dei") \
139+
.setMode("obfuscate") \
140+
.setObfuscateDate(True) \
141+
.setDateFormats(["MM/dd/yyyy"]) \
142+
.setDays(5) \
143+
.setObfuscateRefSource('custom') \
144+
.setCustomFakers({"DOCTOR": ["John"], "HOSPITAL": ["MEDICAL"], "STREET": ["Main Road"]}) \
145+
.setLanguage("en") \
146+
.setSeed(10) \
147+
.setDateEntities(["DATE"])
148+
149+
flattener = medical.Flattener()\
150+
.setInputCols("dei")
151+
152+
pipeline = nlp.Pipeline() \
153+
.setStages([
154+
document_assembler,
155+
sentence_detector,
156+
tokenizer,
157+
word_embeddings,
158+
ner_tagger,
159+
ner_converter,
160+
light_de_identification,
161+
flattener
162+
])
163+
164+
pipeline_model = pipeline.fit(input_df)
165+
output = pipeline_model.transform(input_df)
166+
output.show(truncate=False)
167+
168+
## Result
169+
170+
+-----------------------+---------+-------+---------------------+--------------------------+
171+
|dei_result |dei_begin|dei_end|dei_metadata_sentence|dei_metadata_originalIndex|
172+
+-----------------------+---------+-------+---------------------+--------------------------+
173+
|Record date: 01/06/1980|0 |22 |0 |0 |
174+
|John, M.D. |0 |9 |0 |0 |
175+
|MEDICAL. |0 |7 |0 |0 |
176+
|Main Road. |0 |9 |0 |0 |
177+
|My name is <PATIENT>. |0 |20 |0 |0 |
178+
+-----------------------+---------+-------+---------------------+--------------------------+
179+
180+
{%- endcapture -%}
181+
182+
{%- capture model_scala_medical -%}
183+
import spark.implicits._
184+
185+
val documentAssembler = new DocumentAssembler()
186+
.setInputCol("text")
187+
.setOutputCol("document")
188+
189+
val sentenceDetector = new SentenceDetector()
190+
.setInputCols(Array("document"))
191+
.setOutputCol("sentence")
192+
193+
val tokenizer = new Tokenizer()
194+
.setInputCols(Array("sentence"))
195+
.setOutputCol("token")
196+
197+
val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
198+
.setInputCols(Array("sentence", "token"))
199+
.setOutputCol("embeddings")
200+
201+
val clinical_sensitive_entities = MedicalNerModel.pretrained("ner_deid_enriched", "en", "clinical/models")
202+
.setInputCols(Array("sentence", "token", "embeddings"))
203+
.setOutputCol("ner")
204+
205+
val nerConverter = new NerConverterInternal()
206+
.setInputCols(Array("sentence", "token", "ner"))
207+
.setOutputCol("chunk")
208+
209+
val deIdentification = new LightDeIdentification()
210+
.setInputCols(Array("chunk", "sentence")).setOutputCol("dei")
211+
.setMode("obfuscate")
212+
.setObfuscateDate(true)
213+
.setDays(5)
214+
.setObfuscateRefSource("custom")
215+
.setCustomFakers(Map(
216+
"DOCTOR" -> Array("John"),
217+
"HOSPITAL" -> Array("MEDICAL"),
218+
"STREET" -> Array("Main Road")))
219+
.setLanguage("en")
220+
.setSeed(10)
221+
.setDateEntities(Array("DATE"))
222+
223+
224+
val flattener = new Flattener()
225+
.setInputCols("dei")
226+
227+
val data = Seq("""
228+
|Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson Ora.
229+
| MR # 7194334 Date: 01/13/93. PCP: Oliveira, 25 years-old, Record date: 2079-11-09.
230+
|Cocke County Baptist Hospital, 0295 Keats Street, Phone 55-555-5555.""".stripMargin
231+
).toDF("text")
232+
233+
val pipeline = new Pipeline().setStages(Array(
234+
documentAssembler,
235+
sentenceDetector,
236+
tokenizer,
237+
embeddings,
238+
clinical_sensitive_entities,
239+
nerConverter,
240+
deIdentification,
241+
flattener
242+
))
243+
244+
val result = pipeline.fit(data).transform(data)
245+
result.show(truncate = false)
246+
247+
// Result
248+
249+
+----------------------------------------------------+---------+-------+---------------------+--------------------------+
250+
|dei_result |dei_begin|dei_end|dei_metadata_sentence|dei_metadata_originalIndex|
251+
+----------------------------------------------------+---------+-------+---------------------+--------------------------+
252+
|Record date: 2093-01-18, John, M.D., Name: John. |0 |47 |0 |1 |
253+
|MR # 4358590 Date: 01/18/93. |48 |75 |1 |68 |
254+
|PCP: John, <AGE> years-old, Record date: 2079-11-14.|76 |127 |2 |97 |
255+
|MEDICAL, Main Road, Phone 91-483-8495. |128 |165 |3 |151 |
256+
+----------------------------------------------------+---------+-------+---------------------+--------------------------+
257+
258+
{%- endcapture -%}
259+
260+
261+
{%- capture model_api_link -%}
262+
[LightDeIdentification](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/deid/LightDeIdentification.html)
263+
{%- endcapture -%}
264+
265+
{%- capture model_python_api_link -%}
266+
[LightDeIdentification](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/deid/LightDeIdentification/index.html)
267+
{%- endcapture -%}
268+
269+
270+
271+
{% include templates/licensed_approach_model_medical_fin_leg_template.md
272+
title=title
273+
model=model
274+
model_description=model_description
275+
model_input_anno=model_input_anno
276+
model_output_anno=model_output_anno
277+
model_python_medical=model_python_medical
278+
model_scala_medical=model_scala_medical
279+
model_api_link=model_api_link
280+
model_python_api_link=model_python_api_link
281+
%}
