Commit cf71287: Ocr default page (#1353)

* added release notes for OCR 5.2.0 * update in default page

docs/en/spark_ocr_versions/ocr_release_notes.md: 84 additions, 92 deletions
<div class="h3-box" markdown="1">

## 5.4.0

Release date: 15-07-2024

## Visual NLP 5.4.0 Release Notes 🕶️

**We're glad to announce that Visual NLP 5.4.0 has been released. New transformers, notebooks, metrics, bug fixes, and more!!! 📢📢📢**

</div><div class="h3-box" markdown="1">

## Highlights 🔴

+ Improvements in Table Processing.
+ Dicom Transformers access to S3 directly.
+ New options for the ImageToPdf transformer.
+ Support for rotated text regions in ImageToTextV2.
+ New Pdf-To-Pdf pretrained pipeline for de-identification.
+ ImageToTextV3 support for HOCR output.
+ Performance metrics for de-identification pipelines.
+ Other changes.

</div><div class="h3-box" markdown="1">

### Improvements in Table Processing

New RegionsMerger component for merging text regions and cell regions to improve accuracy in Table Extraction pipelines:

{:.table-model-big}
| PretrainedPipeline | Score Improvement(*) | Comments |
| ------------------ | -------------------- | -------- |
| table_extractor_image_to_text_v2 | 0.34 to 0.5 | Internally uses ImageToTextV2 (case insensitive) |
| table_extractor_image_to_text_v1 | 0.711 to 0.728 | Internally uses ImageToText (case sensitive) |

(*) This is the cell adjacency Table Extraction metric as defined by the ICDAR Table Extraction Challenge.
The improvements are measured against the previous release of Visual NLP.
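As a rough illustration of the metric (not Visual NLP code), cell adjacency F1 can be computed over sets of adjacency relations; the tuple format used here is a simplifying assumption:

```python
def car_f1(gt_relations, pred_relations):
    """F1 over cell adjacency relations. Each relation is a hashable
    tuple such as (cell_a, cell_b, direction); a prediction counts as
    a true positive only if the exact relation appears in the ground truth."""
    gt, pred = set(gt_relations), set(pred_relations)
    tp = len(gt & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gt) if gt else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = {("r0c0", "r0c1", "horizontal"), ("r0c0", "r1c0", "vertical")}
pred = {("r0c0", "r0c1", "horizontal")}
print(round(car_f1(gt, pred), 3))  # 0.667
```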
</div><div class="h3-box" markdown="1">

### Dicom Transformers access to S3 directly

Now Dicom Transformers can access S3 directly from the executors instead of reading through the Spark DataFrame. This is particularly advantageous when we only care about the metadata of each file, because we don't need to load the entire file into memory. Also,

* It reduces memory usage and allows processing of files larger than 2 GB (a limitation of Spark).
* It improves performance when computing statistics over large DICOM datasets.
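To see why metadata-only access is cheap, note that a DICOM file begins with a 128-byte preamble followed by the magic bytes `DICM`, so a reader can validate a file from a short prefix of the stream without fetching the whole object. A minimal sketch, using an in-memory buffer as a stand-in for an S3 stream:

```python
import io

def is_dicom(stream):
    """Check the DICOM magic bytes by reading only the 132-byte prefix
    (128-byte preamble + "DICM"), never the full file."""
    prefix = stream.read(132)
    return len(prefix) == 132 and prefix[128:132] == b"DICM"

# Stand-in for an S3 object stream; a real pipeline would stream from S3.
fake = io.BytesIO(b"\x00" * 128 + b"DICM" + b"\x00" * 1024)
print(is_dicom(fake))  # True
```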


</div><div class="h3-box" markdown="1">

### New options for the ImageToPdf transformer

New options have been added to ImageToPdf, the transformer we use to create PDFs from images, for example when we de-identify PDFs and want to write back a redacted version of the document.
The new options control the size of the resulting PDF document through the resolution and compression of the images included in it:

64+
* compression: Compression type for images in PDF document. It can be one of `CompressionType.LOSSLESS`, `CompressionType.JPEG`.
9765

98-
### New HocrMerger annotator
99-
HocrMerger is a new annotator whose purpose is to allow merging two streams of HOCRs texts into a single unified HOCR representation.
100-
This allows mixing object detection models with text to create a unified document representation that can be fed to other downstream models like Visual NER. The new Checkbox detection pipeline uses this approach.
66+
* resolution: Resolution in DPI used to render images into the PDF document. There are three sources for the resolution(in decreasing order or precedence): this parameter, the image schema in the input image, or the default value of 300DPI.
10167

68+
* quality: Quality of images in PDF document for JPEG compression. A value that ranges between 1.0(best quality) to 0.0(best compression). Defaults to 0.75.
69+
70+
* aggregatePages: Aggregate pages in one PDF document.
10271
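The precedence rule for resolution can be sketched in plain Python (illustrative only; the actual logic lives inside the transformer):

```python
DEFAULT_DPI = 300

def effective_resolution(param_dpi=None, schema_dpi=None):
    """Pick the rendering DPI: the explicit parameter wins, then the
    resolution stored in the input image schema, then the 300 DPI default."""
    if param_dpi is not None:
        return param_dpi
    if schema_dpi is not None:
        return schema_dpi
    return DEFAULT_DPI

print(effective_resolution(schema_dpi=200))  # 200
```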


</div><div class="h3-box" markdown="1">

### Support for rotated text regions in ImageToTextV2

Text regions at the input of ImageToTextV2 now support rotation. Detected text regions come with an angle that represents the rotation of the detected text in the image.
This angle is used to extract a straightened version of the region, which is fed to the OCR. The resulting text is placed into the returned output text using the center of the region to decide its final location.
See the following example,

![image](/assets/images/ocr/rotated_regions.png)

and the resulting (truncated) text,
```
SUBURBAN HOSPITAL
HEALTHCARE SYSTEM
APPROVED ROTATED TEXT
MEDICAL RECORD
PATIENT INFORMATION: NAME: HOMER SIMPSON AGE: 40 YEARS
GENDER: MALE WEIGHT: CLASSIFIED (BUT LET'S JUST SAY ITS IN THE "ROBUST" CATEGORY)
HEIGHT: 6'0"
BML: OFF THE CHARTS (LITERALLY)
OCCUPATION: SAFETY INSPECTOR AT SPRINGFIELD NUCLEAR POWER PLANT
```
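The geometry behind the straightening step can be illustrated with a small helper (an assumed sketch, not the annotator's actual code): rotating a region's half-extents around its center by the detected angle gives the corners of the rotated region, whose inverse rotation yields the axis-aligned patch fed to the OCR.

```python
import math

def region_corners(cx, cy, w, h, angle_deg):
    """Corners of a rotated text region: rotate the half-extents of an
    axis-aligned w-by-h box around the region center by the given angle."""
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [
        (cx + dx * cos_a - dy * sin_a, cy + dx * sin_a + dy * cos_a)
        for dx, dy in [(-w / 2, -h / 2), (w / 2, -h / 2),
                       (w / 2, h / 2), (-w / 2, h / 2)]
    ]

# A 40x10 region rotated 90 degrees occupies a 10x40 footprint.
corners = region_corners(0, 0, 40, 10, 90)
```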

</div><div class="h3-box" markdown="1">

### New Pdf-To-Pdf Pretrained Pipeline for De-Identification

New de-identification pipeline that consumes PDFs and produces de-identified PDFs: `pdf_deid_pdf_output`.
For a description of this pipeline, please check its
[card on Models Hub](https://nlp.johnsnowlabs.com/2024/06/12/pdf_deid_subentity_context_augmented_pipeline_en.html), and also this [notebook example](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/SparkOcrPdfDeidSubentityContextAugmentedPipeline.ipynb).

</div><div class="h3-box" markdown="1">

### ImageToTextV3 support for HOCR output

ImageToTextV3 is an LSTM-based OCR model that can consume high-quality text regions to perform text recognition. Adding HOCR support to this annotator allows it to be placed in HOCR pipelines like Visual NER or Table Extraction. Its main advantage compared to other OCR models is case sensitivity, along with high recall due to the use of independent Text Detection models.
Check an example [here](https://github.com/JohnSnowLabs/spark-ocr-workshop/blob/master/jupyter/SparkOcrImageTableRecognitionCaseSensitive.ipynb).

</div><div class="h3-box" markdown="1">

### Performance Metrics for De-identification Pipelines

To make it easier for users to estimate runtimes, we have published the [following metrics](https://nlp.johnsnowlabs.com/docs/en/ocr_benchmark). These metrics correspond to a pipeline that performs the following actions:

* Extract PDF pages as images.
* Perform OCR on these images.
* Run NLP de-identification stages (embeddings, NER, etc.).
* Map PHI entities to regions.
* Write PHI regions back to the PDF.

The goal is for these numbers to serve as proxies when estimating hardware requirements for new jobs.


</div><div class="h3-box" markdown="1">

### Other Changes & Bug Fixes

* The start() function now accepts the new `apple_silicon` parameter, which controls whether to use Apple Silicon binaries. Defaults to `False`.
* Bug fix: ImageDrawRegions no longer removes image resolution after drawing regions.
* Bug fix: RasterFormatException in ImageToTextV2.
* Bug fix: PdfToTextTable, PptToTextTable, and DocToTextTable didn't include a `load()` method.


</div><div class="prev_ver h3-box" markdown="1">

## Previous versions

</div>

{%- include docs-sparckocr-pagination.html -%}
