<div class="h3-box" markdown="1">

## 5.4.0

Release date: 15-07-2024

## Visual NLP 5.4.0 Release Notes 🕶️

**We're glad to announce that Visual NLP 5.4.0 has been released. New transformers, notebooks, metrics, bug fixes, and more!!! 📢📢📢**

</div><div class="h3-box" markdown="1">

## Highlights 🔴

+ Improved table extraction capabilities in HocrToTextTable.
+ Dicom Transformers access to S3 directly.
+ New Options for the ImageToPdf transformer.
+ Support for rotated text regions in ImageToTextV2.
+ New Pdf-To-Pdf Pretrained Pipelines for De-Identification.
+ ImageToTextV3 support for HOCR output.
+ Performance Metrics for Deidentification Pipelines.
+ Other Changes & Bug Fixes.

</div><div class="h3-box" markdown="1">

## Improved table extraction capabilities in HocrToTextTable

Many issues related to column detection in our Table Extraction pipelines are addressed in this release; compared to the previous Visual NLP version, the metrics have improved. The table below shows F1-score (CAR, or Cell Adjacency Relationship*) performance on the ICDAR 19 Track B dataset at different IoU values for our two latest versions, in comparison with [other results](https://paperswithcode.com/paper/multi-type-td-tsr-extracting-tables-from/review/).

(*) This is the cell adjacency Table Extraction metric as defined by the ICDAR Table Extraction Challenge. The improvements are measured against the previous release of Visual NLP.
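
For reference, a table-extraction pipeline built around HocrToTextTable looks roughly like the sketch below. This is a sketch based on the public Visual NLP notebooks; the pretrained model name and the column names are assumptions, not part of this release note.

```python
from pyspark.ml import Pipeline
from sparkocr.transformers import BinaryToImage, ImageToHocr, ImageTableDetector, HocrToTextTable

# Read page images, OCR them to HOCR, detect table regions, and
# reconstruct the cell structure from the HOCR inside those regions.
binary_to_image = BinaryToImage().setInputCol("content").setOutputCol("image")

to_hocr = ImageToHocr().setInputCol("image").setOutputCol("hocr")

# Model name assumed from the public notebooks.
table_detector = ImageTableDetector \
    .pretrained("general_model_table_detection_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("table_regions")

hocr_to_table = HocrToTextTable() \
    .setInputCol("hocr") \
    .setRegionCol("table_regions") \
    .setOutputCol("tables")

pipeline = Pipeline(stages=[binary_to_image, to_hocr, table_detector, hocr_to_table])
```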

</div><div class="h3-box" markdown="1">

### Dicom Transformers access to S3 directly

Now Dicom Transformers can access S3 directly from the executors instead of reading through the Spark DataFrame. This is particularly advantageous when we only care about the metadata of each file, because we don't need to load the entire file into memory (see the sketch after this list). In addition:

* It reduces memory usage and allows processing of files larger than 2 GB (a limitation of Spark).
* It improves performance when computing statistics over large DICOM datasets.
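
A minimal sketch of the metadata use case, assuming a metadata-oriented stage such as `DicomToMetadata`; the stage name, the path-based input, and the S3 bucket below are illustrative assumptions:

```python
from pyspark.ml import Pipeline
from sparkocr.transformers import DicomToMetadata  # metadata-only Dicom stage; name assumed

# A DataFrame of S3 URIs only: no file content flows through Spark,
# so DICOM files larger than 2 GB are no longer an issue.
paths = [("s3a://my-bucket/dicom/image1.dcm",),
         ("s3a://my-bucket/dicom/image2.dcm",)]
df = spark.createDataFrame(paths, ["path"])

dicom_to_metadata = DicomToMetadata() \
    .setInputCol("path") \
    .setOutputCol("metadata")

result = Pipeline(stages=[dicom_to_metadata]).fit(df).transform(df)
result.select("path", "metadata").show(truncate=False)
```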

</div><div class="h3-box" markdown="1">

### New Options for ImageToPdf transformer

New options have been added to ImageToPdf, the transformer we use to create PDFs from images. We use it, for example, when we de-identify PDFs and want to write the redacted images back to obtain a redacted version of the PDF.

The new options control the size of the resulting PDF document through the resolution and compression of the images included in it (see the sketch after this list):

* compression: Compression type for images in the PDF document. It can be one of `CompressionType.LOSSLESS` or `CompressionType.JPEG`.
* resolution: Resolution in DPI used to render images into the PDF document. There are three sources for the resolution (in decreasing order of precedence): this parameter, the image schema of the input image, or the default value of 300 DPI.
* quality: Quality of images in the PDF document for JPEG compression. A value ranging from 1.0 (best quality) to 0.0 (best compression). Defaults to 0.75.
* aggregatePages: Whether to aggregate multiple pages into one PDF document.
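
A small sketch combining the new options; the setter names follow the usual Spark ML naming of the parameters above, and the `sparkocr.enums` import path and the `images_df` input are assumptions:

```python
from pyspark.ml import Pipeline
from sparkocr.transformers import ImageToPdf
from sparkocr.enums import CompressionType  # import path assumed

# Write page images back into a single compact PDF: JPEG compression
# at the default 0.75 quality, rendered at 200 DPI.
image_to_pdf = ImageToPdf() \
    .setInputCol("image") \
    .setOutputCol("pdf") \
    .setCompression(CompressionType.JPEG) \
    .setQuality(0.75) \
    .setResolution(200) \
    .setAggregatePages(True)

# images_df: a DataFrame with an "image" column produced by earlier stages.
pdf_df = Pipeline(stages=[image_to_pdf]).fit(images_df).transform(images_df)
```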

</div><div class="h3-box" markdown="1">

### Support for rotated text regions in ImageToTextV2

Text regions at the input of ImageToTextV2 now support rotation. Detected text regions come with an angle that represents the rotation of the detected text in the image.
Now, this angle is used to extract a straightened version of each region, which is fed to the OCR. The resulting text is placed into the returned output text using the center of the region to decide its final location. For example:

```
PATIENT INFORMATION: NAME: HOMER SIMPSON AGE: 40 YEARS
GENDER: MALE WEIGHT: CLASSIFIED (BUT LET'S JUST SAY ITS IN THE "ROBUST" CATEGORY)
HEIGHT: 6'0"
BML: OFF THE CHARTS (LITERALLY)
OCCUPATION: SAFETY INSPECTOR AT SPRINGFIELD NUCLEAR POWER PLANT
```
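
For context, a detection-plus-recognition pipeline around ImageToTextV2 looks roughly like the sketch below; the pretrained model names are taken from the public Visual NLP notebooks and should be treated as assumptions here:

```python
from pyspark.ml import Pipeline
from sparkocr.transformers import BinaryToImage, ImageTextDetectorV2, ImageToTextV2

binary_to_image = BinaryToImage().setInputCol("content").setOutputCol("image")

# Text detection: each detected region carries a rotation angle.
text_detector = ImageTextDetectorV2 \
    .pretrained("image_text_detector_v2", "en", "clinical/ocr") \
    .setInputCol("image") \
    .setOutputCol("text_regions")

# Recognition: rotated regions are straightened before OCR, and the
# recognized text is placed according to each region's center.
ocr = ImageToTextV2 \
    .pretrained("ocr_base_printed_v2", "en", "clinical/ocr") \
    .setInputCols(["image", "text_regions"]) \
    .setOutputCol("text")

pipeline = Pipeline(stages=[binary_to_image, text_detector, ocr])
```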

</div><div class="h3-box" markdown="1">

### New Pdf-To-Pdf Pretrained Pipelines for De-Identification

A new de-identification pipeline that consumes PDFs and produces de-identified PDFs: `pdf_deid_pdf_output`.
For a description of this pipeline, please check its [card on Models Hub](https://nlp.johnsnowlabs.com/2024/06/12/pdf_deid_subentity_context_augmented_pipeline_en.html), and also this [notebook example](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/SparkOcrPdfDeidSubentityContextAugmentedPipeline.ipynb).
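
A minimal usage sketch, assuming the pipeline is loaded through Spark NLP's `PretrainedPipeline` helper; the `clinical/ocr` bucket and the input path are assumptions:

```python
from sparknlp.pretrained import PretrainedPipeline

# Load the PDF-in/PDF-out de-identification pipeline.
deid_pipeline = PretrainedPipeline("pdf_deid_pdf_output", "en", "clinical/ocr")

# Input: binary PDF content; output: a column holding the de-identified PDFs.
pdf_df = spark.read.format("binaryFile").load("path/to/pdfs/*.pdf")
result = deid_pipeline.transform(pdf_df)
```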

</div><div class="h3-box" markdown="1">

### ImageToTextV3 support for HOCR output

ImageToTextV3 is an LSTM-based OCR model that can consume high-quality text regions to perform text recognition. Adding HOCR support to this annotator allows it to be placed in HOCR pipelines such as Visual NER or Table Extraction. Its main advantages compared to other OCR models are case sensitivity and high recall, due to the use of independent Text Detection models.
Check an example [here](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/SparkOcrImageTableRecognitionCaseSensitive.ipynb).

</div><div class="h3-box" markdown="1">

### Performance Metrics for Deidentification Pipelines

In order to make it easier for users to estimate runtime figures, we have published the [following metrics](https://nlp.johnsnowlabs.com/docs/en/ocr_benchmark). These metrics correspond to a pipeline that performs the following actions (a sketch of this flow follows the list):

* Extracts PDF pages as images.
* Performs OCR on these images.
* Runs the NLP de-identification stages (embeddings, NER, etc.).
* Maps PHI entities to regions.
* Writes PHI regions back to the PDF.

The goal is for these numbers to be used as proxies when estimating the hardware requirements of new jobs.
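
As an outline, the benchmarked flow maps to a Visual NLP pipeline shaped like the sketch below; the NLP de-identification stages are elided, and the column names and `setFilledRect` usage are assumptions:

```python
from pyspark.ml import Pipeline
from sparkocr.transformers import PdfToImage, ImageToText, ImageDrawRegions, ImageToPdf

# 1. Extract PDF pages as images.
pdf_to_image = PdfToImage().setInputCol("content").setOutputCol("image")

# 2. OCR the page images.
ocr = ImageToText().setInputCol("image").setOutputCol("text")

# 3. ...embeddings, NER, and PHI chunk-to-region mapping stages go here...

# 4. Draw solid rectangles over the detected PHI regions.
draw = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("phi_regions") \
    .setOutputCol("redacted_image") \
    .setFilledRect(True)

# 5. Write the redacted images back to a PDF.
to_pdf = ImageToPdf().setInputCol("redacted_image").setOutputCol("pdf")

pipeline = Pipeline(stages=[pdf_to_image, ocr, draw, to_pdf])
```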

</div><div class="h3-box" markdown="1">

### Other Changes & Bug Fixes

* The start() function now accepts a new `apple_silicon` parameter: whether to use Apple Silicon binaries. Defaults to `False` (see the snippet after this list).
* Bug fix: ImageDrawRegions was removing the image resolution after drawing regions.
* Bug fix: RasterFormatException in ImageToTextV2.
* Bug fix: PdfToTextTable, PptToTextTable, and DocToTextTable didn't include a `load()` method.
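
For reference, enabling the new flag when starting a session (assuming your license and secret are already configured in the environment):

```python
from sparkocr import start

# Request the Apple Silicon binaries when running on an M-series Mac.
spark = start(apple_silicon=True)
```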