|
| 1 | +{%- capture title -%} |
| 2 | +LightDeIdentification |
| 3 | +{%- endcapture -%} |
| 4 | + |
| 5 | +{%- capture model -%} |
| 6 | +model |
| 7 | +{%- endcapture -%} |
| 8 | + |
| 9 | +{%- capture model_description -%} |
| 10 | + |
| 11 | +Light DeIdentification is a lightweight version of DeIdentification. It replaces sensitive information |
| 12 | +in a text with either masked values or obfuscated (fake) values. It is designed to work with healthcare data, |
| 13 | +and it can be used to de-identify patient names, dates, and other sensitive information. |
| 14 | +It can also obfuscate or mask other entity types, such as doctor names and hospital |
| 15 | +names. |
| 16 | +Additionally, it supports millions of embedded fakers, |
| 17 | +and, if desired, custom external fakers can be set with the `setCustomFakers` function. |
| 18 | +It also supports multiple languages, such as English, Spanish, French, German, and Arabic. |
| 19 | +Finally, it supports multi-mode de-identification, applying different modes to different entities at the same time, with the `setSelectiveObfuscationModes` function. |
| 20 | + |
| 21 | +Parameters: |
| 22 | + |
| 23 | +- `mode` *(str)*: Mode for the anonymizer ['mask'|'obfuscate'] |
| 24 | + |
| 25 | +- `dateEntities` *(list[str])*: List of date entities. Default: ['DATE', 'DOB', 'DOD'] |
| 26 | + |
| 27 | +- `obfuscateDate` *(Bool)*: When mode=='obfuscate', whether to obfuscate dates or not. This param helps in consistency to make dateFormats more visible. |
| 28 | + When set to `True`, make sure the `dateFormats` param fits your data. |
| 29 | + If the value is `True` and obfuscation fails, the `unnormalizedDateMode` param is applied. |
| 30 | + When set to `False`, the date is masked to <DATE>. |
| 31 | + Default: False |
| 32 | + |
| 33 | +- `unnormalizedDateMode` *(str)*: The mode to use if the date cannot be normalized, that is, it does not match a known format. Options: [mask, obfuscate, skip]. Default: obfuscate. |
| 34 | + |
| 35 | +- `days` *(Int)*: Number of days by which dates are displaced during obfuscation. If not provided, a random integer between 1 and 60 is used. |
| 36 | + |
| 37 | +- `useShiftDays` *(Bool)*: Whether to use the random shift day when the document has this in its metadata. Default: False |
| 38 | + |
| 39 | +- `dateFormats` *(list[str])*: List of date formats to automatically displace if parsed. |
| 40 | + |
| 41 | +- `region` *(str)*: The region to use for date parsing. This property is mainly used when obfuscating dates. |
| 42 | + It decides whether the first or the second part of a date such as 11/11/2023 is read as the day. |
| 43 | + Options: 'eu' for the European Union, 'us' for the USA. Default: 'eu' |
| 44 | + |
| 45 | +- `obfuscateRefSource` *(str)*: The source of the replacement values used to obfuscate the entities. This property does not apply to date entities. |
| 46 | + The values are the following: |
| 47 | + custom: Takes the entities from the setCustomFakers function. |
| 48 | + faker: Takes the entities from the Faker module. |
| 49 | + both: Takes the entities randomly from the setCustomFakers function and the Faker module. |
| 50 | + |
| 51 | +- `language` *(str)*: The language used to select the regex file and some faker entities. |
| 52 | + The values are the following: |
| 53 | + 'en' (English), 'de' (German), 'es' (Spanish), 'fr' (French), 'ar' (Arabic) or 'ro' (Romanian). Default: 'en'. |
| 54 | + |
| 55 | +- `seed` *(Int)*: The seed used to select the entities in obfuscate mode. With a fixed seed, |
| 56 | + you can replay an execution several times and get the same output. |
| 57 | + |
| 58 | +- `maskingPolicy` *(str)*: Select the masking policy: |
| 59 | + same_length_chars: Replace the entity with a masking sequence of asterisks enclosed in square brackets, so that the whole sequence, brackets included, has the same length as the original entity. |
| 60 | + Example: Smith -> [***]. |
| 61 | + If the entity is shorter than 3 chars (like Jo, or 5), asterisks without brackets are returned. |
| 62 | + entity_labels: Replace the values with the corresponding entity labels. |
| 63 | + fixed_length_chars: Replace the entity with a masking sequence composed of a fixed number of asterisks. |
| 64 | + |
| 65 | +- `fixedMaskLength` *(Int)*: The length of the masking sequence in case of fixed_length_chars masking policy. |
| 66 | + |
| 67 | +- `sameLengthFormattedEntities` *(list[str])*: List of formatted entities whose obfuscated output keeps the same length as the original value. |
| 68 | + The supported and default formatted entities are: PHONE, FAX, ID, IDNUM, BIOID, MEDICALRECORD, ZIP, VIN, SSN, DLN, LICENSE, PLATE. |
| 69 | + |
| 70 | +- `genderAwareness` *(Bool)*: Whether to use gender-aware names or not during obfuscation. This param affects only names. |
| 71 | + If the value is True, it might decrease performance. Default: False |
| 72 | + |
| 73 | +- `ageRanges` *(list[int])*: List of integers specifying the limits of the age groups to preserve during obfuscation. |
| 74 | + |
| 75 | +- `selectiveObfuscationModes` *(dict[str, list[str]])*: The dictionary of modes to enable multi-mode de-identification. |
| 76 | + 'obfuscate': Replace the values with random values. |
| 77 | + 'mask_same_length_chars': Replace the values with asterisks enclosed in square brackets, keeping the original length (brackets included). |
| 78 | + 'mask_entity_labels': Replace the values with the entity labels. |
| 79 | + 'mask_fixed_length_chars': Replace the values with a fixed number of asterisks. The length can be set with `setFixedMaskLength()`. |
| 80 | + 'skip': Leave the values intact. |
| 81 | + Entities that are not listed in the dictionary are de-identified according to the `mode` param. |
| 82 | + |
| 83 | +- `customFakers` *(dict[str, list[str]])*: The dictionary of custom fakers that specifies the obfuscation terms for the entities. |
| 84 | + Each key is an entity label and each value is the list of terms to be used when obfuscating that entity. |
| 85 | + |
| 86 | + |
| 87 | + |
| 88 | +{%- endcapture -%} |
| 89 | + |
| 90 | +{%- capture model_input_anno -%} |
| 91 | +DOCUMENT, CHUNK |
| 92 | +{%- endcapture -%} |
| 93 | + |
| 94 | +{%- capture model_output_anno -%} |
| 95 | +DOCUMENT |
| 96 | +{%- endcapture -%} |
| 97 | + |
| 98 | +{%- capture model_python_medical -%} |
| 99 | + |
| 100 | +from johnsnowlabs import nlp, medical |
| 101 | + |
| 102 | +sentences = [ |
| 103 | +['Record date: 01/01/1980'], |
| 104 | +['Johnson, M.D.'], |
| 105 | +['Gastby Hospital.'], |
| 106 | +['Camel Street.'], |
| 107 | +['My name is George.'] |
| 108 | +] |
| 109 | + |
| 110 | +input_df = spark.createDataFrame(sentences).toDF("text") |
| 111 | + |
| 112 | +document_assembler = nlp.DocumentAssembler()\ |
| 113 | + .setInputCol("text")\ |
| 114 | + .setOutputCol("document") |
| 115 | + |
| 116 | +sentence_detector = nlp.SentenceDetector()\ |
| 117 | + .setInputCols(["document"])\ |
| 118 | + .setOutputCol("sentence") |
| 119 | + |
| 120 | +tokenizer = nlp.Tokenizer()\ |
| 121 | + .setInputCols(["sentence"])\ |
| 122 | + .setOutputCol("token") |
| 123 | + |
| 124 | +word_embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") \ |
| 125 | + .setInputCols(["sentence", "token"]) \ |
| 126 | + .setOutputCol("embeddings") |
| 127 | + |
| 128 | +ner_tagger = nlp.NerDLModel.pretrained("deidentify_dl", "en", "clinical/models") \ |
| 129 | + .setInputCols(["sentence", "token", "embeddings"]) \ |
| 130 | + .setOutputCol("ner") |
| 131 | + |
| 132 | +ner_converter = medical.NerConverterInternal()\ |
| 133 | + .setInputCols(["sentence", "token", "ner"])\ |
| 134 | + .setOutputCol("ner_chunk") |
| 135 | + |
| 136 | +light_de_identification = medical.LightDeIdentification() \ |
| 137 | + .setInputCols(["ner_chunk", "sentence"]) \ |
| 138 | + .setOutputCol("dei") \ |
| 139 | + .setMode("obfuscate") \ |
| 140 | + .setObfuscateDate(True) \ |
| 141 | + .setDateFormats(["MM/dd/yyyy"]) \ |
| 142 | + .setDays(5) \ |
| 143 | + .setObfuscateRefSource('custom') \ |
| 144 | + .setCustomFakers({"DOCTOR": ["John"], "HOSPITAL": ["MEDICAL"], "STREET": ["Main Road"]}) \ |
| 145 | + .setLanguage("en") \ |
| 146 | + .setSeed(10) \ |
| 147 | + .setDateEntities(["DATE"]) |
| 148 | + |
| 149 | +flattener = medical.Flattener()\ |
| 150 | + .setInputCols("dei") |
| 151 | + |
| 152 | +pipeline = nlp.Pipeline() \ |
| 153 | + .setStages([ |
| 154 | + document_assembler, |
| 155 | + sentence_detector, |
| 156 | + tokenizer, |
| 157 | + word_embeddings, |
| 158 | + ner_tagger, |
| 159 | + ner_converter, |
| 160 | + light_de_identification, |
| 161 | + flattener |
| 162 | + ]) |
| 163 | + |
| 164 | +pipeline_model = pipeline.fit(input_df) |
| 165 | +output = pipeline_model.transform(input_df) |
| 166 | +output.show(truncate=False) |
| 167 | + |
| 168 | +## Result |
| 169 | + |
| 170 | ++-----------------------+---------+-------+---------------------+--------------------------+ |
| 171 | +|dei_result |dei_begin|dei_end|dei_metadata_sentence|dei_metadata_originalIndex| |
| 172 | ++-----------------------+---------+-------+---------------------+--------------------------+ |
| 173 | +|Record date: 01/06/1980|0 |22 |0 |0 | |
| 174 | +|John, M.D. |0 |9 |0 |0 | |
| 175 | +|MEDICAL. |0 |7 |0 |0 | |
| 176 | +|Main Road. |0 |9 |0 |0 | |
| 177 | +|My name is <PATIENT>. |0 |20 |0 |0 | |
| 178 | ++-----------------------+---------+-------+---------------------+--------------------------+ |
| 179 | + |
| 180 | +{%- endcapture -%} |
| 181 | + |
| 182 | +{%- capture model_scala_medical -%} |
| 183 | +import spark.implicits._ |
| 184 | + |
| 185 | +val documentAssembler = new DocumentAssembler() |
| 186 | + .setInputCol("text") |
| 187 | + .setOutputCol("document") |
| 188 | + |
| 189 | +val sentenceDetector = new SentenceDetector() |
| 190 | + .setInputCols(Array("document")) |
| 191 | + .setOutputCol("sentence") |
| 192 | + |
| 193 | +val tokenizer = new Tokenizer() |
| 194 | + .setInputCols(Array("sentence")) |
| 195 | + .setOutputCol("token") |
| 196 | + |
| 197 | +val embeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models") |
| 198 | + .setInputCols(Array("sentence", "token")) |
| 199 | + .setOutputCol("embeddings") |
| 200 | + |
| 201 | +val clinical_sensitive_entities = MedicalNerModel.pretrained("ner_deid_enriched", "en", "clinical/models") |
| 202 | + .setInputCols(Array("sentence", "token", "embeddings")) |
| 203 | + .setOutputCol("ner") |
| 204 | + |
| 205 | +val nerConverter = new NerConverterInternal() |
| 206 | + .setInputCols(Array("sentence", "token", "ner")) |
| 207 | + .setOutputCol("chunk") |
| 208 | + |
| 209 | +val deIdentification = new LightDeIdentification() |
| 210 | + .setInputCols(Array("chunk", "sentence")).setOutputCol("dei") |
| 211 | + .setMode("obfuscate") |
| 212 | + .setObfuscateDate(true) |
| 213 | + .setDays(5) |
| 214 | + .setObfuscateRefSource("custom") |
| 215 | + .setCustomFakers(Map( |
| 216 | + "DOCTOR" -> Array("John"), |
| 217 | + "HOSPITAL" -> Array("MEDICAL"), |
| 218 | + "STREET" -> Array("Main Road"))) |
| 219 | + .setLanguage("en") |
| 220 | + .setSeed(10) |
| 221 | + .setDateEntities(Array("DATE")) |
| 222 | + |
| 223 | + |
| 224 | +val flattener = new Flattener() |
| 225 | + .setInputCols("dei") |
| 226 | + |
| 227 | +val data = Seq(""" |
| 228 | + |Record date: 2093-01-13, David Hale, M.D., Name: Hendrickson Ora. |
| 229 | + | MR # 7194334 Date: 01/13/93. PCP: Oliveira, 25 years-old, Record date: 2079-11-09. |
| 230 | + |Cocke County Baptist Hospital, 0295 Keats Street, Phone 55-555-5555.""".stripMargin |
| 231 | +).toDF("text") |
| 232 | + |
| 233 | +val pipeline = new Pipeline().setStages(Array( |
| 234 | + documentAssembler, |
| 235 | + sentenceDetector, |
| 236 | + tokenizer, |
| 237 | + embeddings, |
| 238 | + clinical_sensitive_entities, |
| 239 | + nerConverter, |
| 240 | + deIdentification, |
| 241 | + flattener |
| 242 | +)) |
| 243 | + |
| 244 | +val result = pipeline.fit(data).transform(data) |
| 245 | +result.show(truncate = false) |
| 246 | + |
| 247 | +// Result |
| 248 | + |
| 249 | ++----------------------------------------------------+---------+-------+---------------------+--------------------------+ |
| 250 | +|dei_result |dei_begin|dei_end|dei_metadata_sentence|dei_metadata_originalIndex| |
| 251 | ++----------------------------------------------------+---------+-------+---------------------+--------------------------+ |
| 252 | +|Record date: 2093-01-18, John, M.D., Name: John. |0 |47 |0 |1 | |
| 253 | +|MR # 4358590 Date: 01/18/93. |48 |75 |1 |68 | |
| 254 | +|PCP: John, <AGE> years-old, Record date: 2079-11-14.|76 |127 |2 |97 | |
| 255 | +|MEDICAL, Main Road, Phone 91-483-8495. |128 |165 |3 |151 | |
| 256 | ++----------------------------------------------------+---------+-------+---------------------+--------------------------+ |
| 257 | + |
| 258 | +{%- endcapture -%} |
| 259 | + |
| 260 | + |
| 261 | +{%- capture model_api_link -%} |
| 262 | +[LightDeIdentification](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/deid/LightDeIdentification.html) |
| 263 | +{%- endcapture -%} |
| 264 | + |
| 265 | +{%- capture model_python_api_link -%} |
| 266 | +[LightDeIdentification](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/deid/LightDeIdentification/index.html) |
| 267 | +{%- endcapture -%} |
| 268 | + |
| 269 | + |
| 270 | + |
| 271 | +{% include templates/licensed_approach_model_medical_fin_leg_template.md |
| 272 | +title=title |
| 273 | +model=model |
| 274 | +model_description=model_description |
| 275 | +model_input_anno=model_input_anno |
| 276 | +model_output_anno=model_output_anno |
| 277 | +model_python_medical=model_python_medical |
| 278 | +model_scala_medical=model_scala_medical |
| 279 | +model_api_link=model_api_link |
| 280 | +model_python_api_link=model_python_api_link |
| 281 | +%} |
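The interaction of `days` and `region` described in the parameter list can be illustrated with a small, library-free Python sketch. The `shift_date` helper and the `REGION_PATTERNS` mapping are hypothetical and are not part of LightDeIdentification; the annotator itself is configured through `dateFormats`, so this only shows the idea of displacing a parsed date by a fixed number of days under an assumed day/month order.

```python
from datetime import datetime, timedelta

# Assumed mapping from the documented 'region' option to a strptime pattern
# for slash-separated dates (illustration only, not the library's own table).
REGION_PATTERNS = {"us": "%m/%d/%Y", "eu": "%d/%m/%Y"}


def shift_date(text: str, days: int, region: str = "eu") -> str:
    """Hypothetical helper: parse a date for the region, displace it, re-format it."""
    pattern = REGION_PATTERNS[region]
    shifted = datetime.strptime(text, pattern) + timedelta(days=days)
    return shifted.strftime(pattern)


print(shift_date("01/01/1980", 5, region="us"))  # 01/06/1980
print(shift_date("03/04/2023", 5, region="us"))  # 03/09/2023 (March 4 -> March 9)
print(shift_date("03/04/2023", 5, region="eu"))  # 08/04/2023 (3 April -> 8 April)
```

The first call reproduces the `Record date: 01/06/1980` row of the Python example, which uses `setDays(5)` with the `MM/dd/yyyy` format.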
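The three `maskingPolicy` values described in the parameter list can likewise be sketched in plain Python. The `mask_entity` helper is hypothetical and only mirrors the documented behavior (for example, Smith -> [***]); the default `fixed_mask_length` of 4 is an assumption chosen for illustration, not a documented default.

```python
def mask_entity(text: str, label: str, policy: str, fixed_mask_length: int = 4) -> str:
    """Hypothetical helper mirroring the documented masking policies."""
    if policy == "same_length_chars":
        # Entities shorter than 3 characters get asterisks without brackets.
        if len(text) < 3:
            return "*" * len(text)
        # Otherwise the two brackets count toward the total length.
        return "[" + "*" * (len(text) - 2) + "]"
    if policy == "entity_labels":
        # Replace the value with its entity label, e.g. <PATIENT>.
        return f"<{label}>"
    if policy == "fixed_length_chars":
        # Always emit the same number of asterisks, whatever the input length.
        return "*" * fixed_mask_length
    raise ValueError(f"unknown masking policy: {policy}")


print(mask_entity("Smith", "PATIENT", "same_length_chars"))   # [***]
print(mask_entity("Jo", "PATIENT", "same_length_chars"))      # **
print(mask_entity("Smith", "PATIENT", "entity_labels"))       # <PATIENT>
print(mask_entity("Smith", "PATIENT", "fixed_length_chars"))  # ****
```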