Commit 4ec9adc

release 6.0.0 (#1799)
1 parent c4bdeb2 commit 4ec9adc

8 files changed

+595
-4
lines changed
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
1+
{%- capture title -%}
2+
AnnotationConverter
3+
{%- endcapture -%}
4+
7+
8+
{%- capture model -%}
9+
model
10+
{%- endcapture -%}
11+
12+
{%- capture model_description -%}
13+
A flexible converter for transforming annotations in a DataFrame using custom logic.
14+
15+
This class allows users to define custom conversion functions (`f`) to modify annotations,
16+
enabling transformations like:
17+
- Assertion outputs → Chunk outputs
18+
- LLM outputs → Document outputs
19+
- Rule-based outputs → Updated outputs
20+
21+
The converter integrates with PySpark NLP-style pipelines (e.g., DocumentAssembler, Tokenizer)
22+
but operates purely in Python (not Scala).
23+
24+
Parameters:
25+
26+
- `f`: (FunctionParam) User-defined function to transform annotations.
27+
- `inputCol`: (Param[String]) Name of the input column containing annotations.
28+
- `outputCol`: (Param[String]) Name of the output column for converted annotations.
29+
- `outputAnnotatorType`: (Param[String]) Type of the output annotations (e.g., "token").
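
As a rough illustration of what a conversion function `f` might look like, the sketch below turns assertion-style annotations into chunk-style ones. It runs without Spark: `SimpleAnnotation` is a hypothetical stand-in for the Spark NLP `Annotation` class, and `assertion_to_chunk` and the `"assertion"` metadata key are illustrative names, not part of the library.

```python
from dataclasses import dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class SimpleAnnotation:
    # Hypothetical stand-in for Spark NLP's Annotation; field names are assumptions.
    annotator_type: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)


def assertion_to_chunk(annotations: List[SimpleAnnotation]) -> List[SimpleAnnotation]:
    """Relabel each annotation as a chunk and keep the assertion status in metadata."""
    converted = []
    for ann in annotations:
        # Copy the metadata so the input annotations are left untouched.
        metadata = dict(ann.metadata)
        metadata["assertion"] = ann.result
        converted.append(replace(ann, annotator_type="chunk", metadata=metadata))
    return converted
```

A function with this shape (a list of annotations in, a list of annotations out) is what would be passed as `f`; the real function would build Spark NLP `Annotation` objects instead of the stand-in.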
30+
31+
32+
{%- endcapture -%}
33+
34+
35+
{%- capture model_input_anno -%}
36+
ANY
37+
{%- endcapture -%}
38+
39+
{%- capture model_output_anno -%}
40+
ANY
41+
{%- endcapture -%}
42+
43+
{%- capture model_python_medical -%}
44+
from johnsnowlabs import nlp, medical
45+
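from sparknlp.base import DocumentAssembler
from sparknlp.annotator import Tokenizer
from sparknlp.annotation import Annotation
from sparknlp_jsl.annotator import AnnotationConverter
from pyspark.ml import Pipeline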
46+
47+
test_data = spark.createDataFrame([
48+
(1, """I like SparkNLP annotators such as MedicalBertForSequenceClassification and BertForAssertionClassification."""),
49+
]).toDF("id", "text")
50+
document_assembler = DocumentAssembler().setInputCol('text').setOutputCol('document')
51+
tokenizer = Tokenizer().setInputCols('document').setOutputCol('token')
52+
```
def myFunction(annotations):
    # Imported inside the function, as in the original example.
    import re

    new_annotations = []
    # Zero-width split point between a lowercase and an uppercase letter.
    pattern = r"(?<=[a-z])(?=[A-Z])"

    for annotation in annotations:
        text = annotation.result
        parts = re.split(pattern, text)
        begin = annotation.begin
        for part in parts:
            # Offsets are inclusive, so a part of length n ends at begin + n - 1.
            end = begin + len(part) - 1
            new_annotations.append(
                Annotation(
                    annotatorType="token",
                    begin=begin,
                    end=end,
                    result=part,
                    metadata=annotation.metadata,
                    embeddings=annotation.embeddings,
                )
            )
            begin = end + 1

    return new_annotations
```
78+
camel_case_tokenizer = AnnotationConverter(f=myFunction)\
79+
.setInputCol("token")\
80+
.setOutputCol("camel_case_token")\
81+
.setOutputAnnotatorType("token")
82+
83+
pipeline = Pipeline(stages=[document_assembler, tokenizer, camel_case_tokenizer])
84+
model = pipeline.fit(test_data)
85+
df = model.transform(test_data)
86+
df.selectExpr("explode(camel_case_token) as tokens").show(truncate=False)
87+
88+
89+
90+
# result
91+
92+
| tokens |
93+
|-------------------------------------------------------|
94+
| {token, 0, 0, I, {sentence -> 0}, []} |
95+
| {token, 2, 5, like, {sentence -> 0}, []} |
96+
| {token, 7, 11, Spark, {sentence -> 0}, []} |
97+
| {token, 12, 14, NLP, {sentence -> 0}, []} |
98+
| {token, 16, 25, annotators, {sentence -> 0}, []} |
99+
| {token, 27, 30, such, {sentence -> 0}, []} |
100+
| {token, 32, 33, as, {sentence -> 0}, []} |
101+
| {token, 35, 41, Medical, {sentence -> 0}, []} |
102+
| {token, 42, 45, Bert, {sentence -> 0}, []} |
103+
| {token, 46, 48, For, {sentence -> 0}, []} |
104+
| {token, 49, 56, Sequence, {sentence -> 0}, []} |
105+
| {token, 57, 70, Classification, {sentence -> 0}, []} |
106+
| {token, 72, 74, and, {sentence -> 0}, []} |
107+
| {token, 76, 79, Bert, {sentence -> 0}, []} |
108+
| {token, 80, 82, For, {sentence -> 0}, []} |
109+
| {token, 83, 91, Assertion, {sentence -> 0}, []} |
110+
| {token, 92, 105, Classification, {sentence -> 0}, []} |
111+
| {token, 106, 106, ., {sentence -> 0}, []} |
112+
113+
114+
{%- endcapture -%}
115+
116+
117+
118+
{%- capture model_api_link -%}
119+
[AnnotationConverter](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/AnnotationConverter.html)
120+
{%- endcapture -%}
121+
{%- capture model_python_api_link -%}
122+
123+
[AnnotationConverter](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/annotation_converter/index.html)
124+
{%- endcapture -%}
125+
126+
127+
{% include templates/licensed_approach_model_medical_fin_leg_template.md
128+
title=title
129+
model=model
130+
model_description=model_description
131+
model_input_anno=model_input_anno
132+
model_output_anno=model_output_anno
133+
model_python_medical=model_python_medical
134+
model_api_link=model_api_link
135+
model_python_api_link=model_python_api_link
136+
%}
Lines changed: 218 additions & 0 deletions
@@ -0,0 +1,218 @@
1+
{%- capture title -%}
2+
BertForAssertionClassification
3+
{%- endcapture -%}
4+
5+
{%- capture model -%}
6+
model
7+
{%- endcapture -%}
8+
9+
{%- capture model_description -%}
10+
BertForAssertionClassification extracts the assertion status from text by analyzing both the extracted entities
11+
and their surrounding context.
12+
13+
This classifier leverages pre-trained BERT models fine-tuned on biomedical text (e.g., BioBERT) and applies a
14+
sequence classification/regression head (a linear layer on the pooled output) to support multi-class document
15+
classification.
16+
17+
**Key features:**
18+
19+
- Accepts DOCUMENT and CHUNK type inputs and produces ASSERTION type annotations.
20+
- Emphasizes entity context by marking target entities with special tokens (e.g., [entity]), allowing the model to better focus on them.
21+
- Utilizes a transformer-based architecture (BERT for Sequence Classification) to achieve accurate assertion status prediction.
22+
23+
**Input Example:**
24+
25+
This annotator preprocesses the input text to emphasize the target entities as follows:
26+
[CLS] Patient with [entity] severe fever [entity].
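
The marking step shown above can be sketched in plain Python. This is only an illustration of the idea, assuming inclusive character offsets as in Spark NLP annotations; `mark_entity` is a hypothetical helper and not the annotator's internal implementation.

```python
def mark_entity(text: str, begin: int, end: int, marker: str = "[entity]") -> str:
    """Wrap text[begin..end] (inclusive end) in marker tokens and prepend [CLS]."""
    if not (0 <= begin <= end < len(text)):
        raise ValueError("span is outside the text")
    # Surround the target span with the marker on both sides.
    marked = f"{text[:begin]}{marker} {text[begin:end + 1]} {marker}{text[end + 1:]}"
    return f"[CLS] {marked}"


print(mark_entity("Patient with severe fever.", 13, 24))
```

Marking the span explicitly gives the classifier a positional cue for which entity the predicted status refers to when a sentence contains several candidates.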
27+
28+
Models from the HuggingFace 🤗 Transformers library are also compatible with
29+
Spark NLP 🚀. To see which models are compatible and how to import them, see
[Import Transformers into Spark NLP 🚀](https://github.com/JohnSnowLabs/spark-nlp/discussions/5669).
32+
33+
Parameters:
34+
35+
- `configProtoBytes`: ConfigProto from TensorFlow, serialized into a byte array.
36+
- `classificationCaseSensitive`: Whether to use case-sensitive classification. Default is True.
37+
38+
39+
40+
{%- endcapture -%}
41+
42+
43+
{%- capture model_input_anno -%}
44+
DOCUMENT, CHUNK
45+
{%- endcapture -%}
46+
47+
{%- capture model_output_anno -%}
48+
ASSERTION
49+
{%- endcapture -%}
50+
51+
{%- capture model_python_medical -%}
52+
from johnsnowlabs import nlp, medical
53+
54+
document_assembler = nlp.DocumentAssembler()\
55+
.setInputCol("text") \
56+
.setOutputCol("document")
57+
58+
sentence_detector = nlp.SentenceDetector()\
59+
.setInputCols("document")\
60+
.setOutputCol("sentence")
61+
62+
tokenizer = nlp.Tokenizer()\
63+
.setInputCols(["document"])\
64+
.setOutputCol("token")
65+
66+
embeddings = nlp.WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")\
67+
.setInputCols(["sentence", "token"])\
68+
.setOutputCol("embeddings")\
69+
.setCaseSensitive(False)
70+
71+
ner = medical.NerModel.pretrained("ner_clinical", "en", "clinical/models")\
72+
.setInputCols(["sentence", "token", "embeddings"])\
73+
.setOutputCol("ner")
74+
75+
ner_converter = medical.NerConverterInternal()\
76+
.setInputCols(["sentence", "token", "ner"])\
77+
.setOutputCol("ner_chunk")\
78+
.setWhiteList(["PROBLEM"])
79+
80+
assertion_classifier = medical.BertForAssertionClassification.pretrained("assertion_bert_classification_clinical", "en", "clinical/models")\
81+
.setInputCols(["sentence", "ner_chunk"])\
82+
.setOutputCol("assertion_class")
83+
84+
pipeline = nlp.Pipeline(stages=[
85+
document_assembler,
86+
sentence_detector,
87+
tokenizer,
88+
embeddings,
89+
ner,
90+
ner_converter,
91+
assertion_classifier
92+
])
93+
94+
text = """
95+
GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
96+
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
97+
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
98+
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
99+
EXTREMITIES: There is some edema, but no cyanosis and clubbing .
100+
"""
101+
102+
data = spark.createDataFrame([[text]]).toDF("text")
103+
result = pipeline.fit(data).transform(data)
104+
105+
106+
# result
107+
108+
+--------------------------------------------------------------+-----+----+---------+----------------------+
109+
|ner_chunk |begin|end |ner_label|assertion_class_result|
110+
+--------------------------------------------------------------+-----+----+---------+----------------------+
111+
|acute distress |43 |56 |PROBLEM |absent |
112+
|mild arcus senilis in the right |191 |221 |PROBLEM |present |
113+
|jugular venous pressure distention |380 |413 |PROBLEM |absent |
114+
|adenopathy in the cervical, supraclavicular, or axillary areas|428 |489 |PROBLEM |absent |
115+
|tender |514 |519 |PROBLEM |absent |
116+
|some fullness in the left upper quadrant |535 |574 |PROBLEM |possible |
117+
|some edema |660 |669 |PROBLEM |present |
118+
|cyanosis |679 |686 |PROBLEM |absent |
119+
|clubbing |692 |699 |PROBLEM |absent |
120+
+--------------------------------------------------------------+-----+----+---------+----------------------+
121+
122+
123+
{%- endcapture -%}
124+
125+
126+
{%- capture model_scala_medical -%}
127+
128+
import spark.implicits._
129+
130+
val documentAssembler = new DocumentAssembler()
131+
.setInputCol("text")
132+
.setOutputCol("document")

val sentenceDetector = new SentenceDetector()
  .setInputCols("document")
  .setOutputCol("sentence")
133+
134+
val tokenizer = new Tokenizer()
135+
.setInputCols("document")
136+
.setOutputCol("token")
137+
138+
val wordEmbeddings = WordEmbeddingsModel.pretrained("embeddings_clinical", "en", "clinical/models")
139+
.setInputCols("document", "token")
140+
.setOutputCol("embeddings")
141+
142+
val jslNer = MedicalNerModel.pretrained("ner_jsl", "en", "clinical/models")
143+
.setInputCols("sentence", "token", "embeddings")
144+
.setOutputCol("jsl_ner")
145+
146+
val jslNerConverter = new NerConverterInternal()
147+
.setInputCols("sentence", "token", "jsl_ner")
148+
.setOutputCol("ner_chunk")
149+
150+
val clinicalAssertion = BertForAssertionClassification.pretrained("assertion_bert_classification_clinical", "en", "clinical/models")
151+
.setInputCols("sentence", "ner_chunk")
152+
.setOutputCol("assertion")
153+
.setCaseSensitive(false)
154+
155+
val pipeline = new Pipeline().setStages(
156+
Array(
157+
documentAssembler,
158+
sentenceDetector,
159+
tokenizer,
160+
wordEmbeddings,
161+
jslNer,
162+
jslNerConverter,
163+
clinicalAssertion
164+
))
165+
166+
val text = """GENERAL: He is an elderly gentleman in no acute distress. He is sitting up in bed eating his breakfast. He is alert and oriented and answering questions appropriately.
HEENT: Sclerae showed mild arcus senilis in the right. Left was clear. Pupils are equally round and reactive to light. Extraocular movements are intact. Oropharynx is clear.
NECK: Supple. Trachea is midline. No jugular venous pressure distention is noted. No adenopathy in the cervical, supraclavicular, or axillary areas.
ABDOMEN: Soft and not tender. There may be some fullness in the left upper quadrant, although I do not appreciate a true spleen with inspiration.
EXTREMITIES: There is some edema, but no cyanosis and clubbing ."""
171+
172+
val df = Seq(text).toDF("text")
173+
val result = pipeline.fit(df).transform(df)
174+
175+
176+
# result
177+
+--------------------------------------------------------------+-----+----+---------+----------------------+
178+
|ner_chunk |begin|end |ner_label|assertion_class_result|
179+
+--------------------------------------------------------------+-----+----+---------+----------------------+
180+
|acute distress |43 |56 |PROBLEM |absent |
181+
|mild arcus senilis in the right |191 |221 |PROBLEM |present |
182+
|jugular venous pressure distention |380 |413 |PROBLEM |absent |
183+
|adenopathy in the cervical, supraclavicular, or axillary areas|428 |489 |PROBLEM |absent |
184+
|tender |514 |519 |PROBLEM |absent |
185+
|some fullness in the left upper quadrant |535 |574 |PROBLEM |possible |
186+
|some edema |660 |669 |PROBLEM |present |
187+
|cyanosis |679 |686 |PROBLEM |absent |
188+
|clubbing |692 |699 |PROBLEM |absent |
189+
+--------------------------------------------------------------+-----+----+---------+----------------------+
190+
191+
{%- endcapture -%}
192+
193+
{%- capture model_api_link -%}
194+
[BertForAssertionClassification](https://nlp.johnsnowlabs.com/licensed/api/com/johnsnowlabs/nlp/annotators/assertion/BertForAssertionClassification.html)
195+
{%- endcapture -%}
196+
197+
{%- capture model_python_api_link -%}
198+
199+
[BertForAssertionClassification](https://nlp.johnsnowlabs.com/licensed/api/python/reference/autosummary/sparknlp_jsl/annotator/assertion/bert_for_assertion_classification/index.html)
200+
201+
{%- endcapture -%}
202+
203+
{%- capture model_notebook_link -%}
204+
[BertForAssertionClassification](https://github.com/JohnSnowLabs/spark-nlp-workshop/blob/master/tutorials/Certification_Trainings/Healthcare/2.4.BertForAssertionClassification.ipynb)
205+
{%- endcapture -%}
206+
207+
{% include templates/licensed_approach_model_medical_fin_leg_template.md
208+
title=title
209+
model=model
210+
model_description=model_description
211+
model_input_anno=model_input_anno
212+
model_output_anno=model_output_anno
213+
model_python_medical=model_python_medical
214+
model_scala_medical=model_scala_medical
215+
model_api_link=model_api_link
216+
model_python_api_link=model_python_api_link
217+
model_notebook_link=model_notebook_link
218+
%}

docs/en/licensed_annotator_entries/ContextualEntityRuler.md

Lines changed: 0 additions & 4 deletions
@@ -47,10 +47,6 @@ Parameters:
4747

4848
{%- endcapture -%}
4949

50-
{%- capture model_input_anno -%}
51-
52-
53-
{%- endcapture -%}
5450

5551
{%- capture model_input_anno -%}
5652
DOCUMENT, TOKEN, CHUNK

docs/en/licensed_annotator_entries/DeIdentification.md

Lines changed: 23 additions & 0 deletions
@@ -127,6 +127,29 @@ Default: False.
127127

128128
- `fakerLengthOffset` : It specifies how much length deviation is accepted in obfuscation, with `keepTextSizeForObfuscation` enabled. It must be greater than 0.
129129

130+
- `consistentAcrossNameParts` : Whether to enforce consistency across the different parts of a name
131+
(e.g., first name, middle name, last name).
132+
133+
When set to `True`, the same transformation or obfuscation will be applied consistently to all parts
134+
of the same name entity, even if those parts appear separately.
135+
136+
For example, if "John Smith" is obfuscated as "Liam Brown", then:
137+
- When the full name "John Smith" appears, it will be replaced with "Liam Brown"
138+
- When "John" or "Smith" appear individually, they will still be obfuscated as "Liam" and "Brown" respectively,
139+
ensuring consistency in name transformation.
140+
141+
Default: True
142+
143+
- `groupByCol` : The column name used to group the dataset. This parameter is used in conjunction with
144+
`consistentObfuscation` to ensure consistent obfuscation within each group.
145+
When `groupByCol` is set, the dataset is partitioned into groups based on the values of the specified column.
146+
Default: `""` (empty string, meaning no grouping)
147+
148+
- `chunkMatching` : Performs entity chunk matching across rows or within groups in a DataFrame.
149+
This function is useful in de-identification pipelines where certain entity labels
150+
like "NAME" or "DATE" may be missing in some rows and need to be filled from other
151+
rows within the same group.
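
The idea behind `chunkMatching` can be illustrated with a small, Spark-free sketch: a label seen for a chunk in one row is reused for the same chunk text in other rows of the same group. The row layout and the `fill_missing_labels` helper are assumptions made for illustration and do not reflect the library's implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical row layout: (group id, chunk text, entity label or None)
Row = Tuple[str, str, Optional[str]]


def fill_missing_labels(rows: List[Row]) -> List[Row]:
    """Fill missing labels from other rows in the same group with the same chunk text."""
    known: Dict[str, Dict[str, str]] = defaultdict(dict)
    # First pass: remember the first label seen for each chunk within each group.
    for group, chunk, label in rows:
        if label is not None:
            known[group].setdefault(chunk.lower(), label)
    # Second pass: reuse a known label where one is missing.
    return [
        (group, chunk, label if label is not None else known[group].get(chunk.lower()))
        for group, chunk, label in rows
    ]
```

Labels never cross group boundaries in this sketch, so a chunk that is unlabeled everywhere in its group stays unlabeled.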
152+
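
The behavior described for `consistentAcrossNameParts` can be sketched as a per-part replacement cache, so that a part seen inside a full name receives the same replacement when it later appears alone. `ConsistentNameObfuscator` and its fake-name pools are hypothetical and only illustrate the idea; the real annotator selects replacements through its own faker logic.

```python
from typing import Dict, Iterable, Iterator


class ConsistentNameObfuscator:
    """Toy obfuscator that maps each name part to one fixed replacement."""

    def __init__(self, fake_first: Iterable[str], fake_last: Iterable[str]) -> None:
        # Hypothetical pools of replacement names, consumed in order.
        self._pools: Dict[str, Iterator[str]] = {
            "first": iter(fake_first),
            "last": iter(fake_last),
        }
        self._seen: Dict[str, str] = {}

    def _replace_part(self, part: str, kind: str) -> str:
        key = part.lower()
        if key not in self._seen:
            # Raises StopIteration if the pool runs out; acceptable for a sketch.
            self._seen[key] = next(self._pools[kind])
        return self._seen[key]

    def obfuscate(self, name: str) -> str:
        parts = name.split()
        replaced = []
        for i, part in enumerate(parts):
            # Treat the final part of a multi-part name as the last name.
            is_last = len(parts) > 1 and i == len(parts) - 1
            replaced.append(self._replace_part(part, "last" if is_last else "first"))
        return " ".join(replaced)
```

With pools `["Liam", "Noah"]` and `["Brown", "Davis"]`, obfuscating "John Smith" first and then "John" or "Smith" on their own reuses the cached "Liam" and "Brown".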
130153

131154
To create a configured DeIdentificationModel, please see the example of DeIdentification.
132155
{%- endcapture -%}
