Visual nlp 5.4.1 release notes (#1509)
* added release notes for OCR 5.2.0

* added visual nlp 5.4.1 rn

* updated html idx
albertoandreottiATgmail authored Sep 29, 2024
1 parent 1c2da5e commit 4be6df1
Showing 3 changed files with 109 additions and 157 deletions.
1 change: 1 addition & 0 deletions docs/_includes/docs-sparckocr-pagination.html
@@ -10,6 +10,7 @@
</li>
</ul>
<ul class="pagination owl-carousel pagination_big">
<li><a href="release_notes_5_4_1">5.4.1</a></li>
<li><a href="release_notes_5_4_0">5.4.0</a></li>
<li><a href="release_notes_5_3_2">5.3.2</a></li>
<li><a href="release_notes_5_3_1">5.3.1</a></li>
131 changes: 53 additions & 78 deletions docs/en/spark_ocr_versions/ocr_release_notes.md
@@ -5,128 +5,103 @@ seotitle: Spark OCR | John Snow Labs
title: Spark OCR release notes
permalink: /docs/en/spark_ocr_versions/ocr_release_notes
key: docs-ocr-release-notes
modify_date: "2024-02-23"
modify_date: "2024-09-26"
show_nav: true
sidebar:
nav: sparknlp-healthcare
---

<div class="h3-box" markdown="1">

## 5.4.1

Release date: 26-09-2024

## Visual NLP 5.4.1 Release Notes 🕶️

**We are glad to announce that Visual NLP 5.4.1 has been released. This release comes with new models, notebooks, examples, and bug fixes!!! 📢📢📢**

</div><div class="h3-box" markdown="1">

## Highlights 🔴

* New models for information extraction from Scanned Forms, scoring F1 92.86 on the FUNSD Entity Extraction dataset and 89.23 on the FUNSD Relation Extraction task. This significantly outperforms the Form Understanding capabilities of the AWS, Azure, and Google ‘Form AI’ services.
* New blogpost on Form Extraction, in which we take a deep dive into the advantages of the models released today over different cloud providers.
* New PDF Deidentification Pipeline that ingests PDFs and returns a de-identified version of the PDF.

</div><div class="h3-box" markdown="1">

## Other Changes 🟡

* Memory Improvements for Models, cutting the memory requirements of most models by half.
* Minor enhancements, changes, and bug fixes that ensure the quality of the library continues to improve over time.

</div><div class="h3-box" markdown="1">

### New models for information extraction from Scanned Forms
These models score F1 92.86 on the FUNSD Entity Extraction dataset and 89.23 on the FUNSD Relation Extraction task, significantly outperforming the Form Understanding capabilities of the AWS, Azure, and Google ‘Form AI’ services.

</div><div class="h3-box" markdown="1">
Two new annotators that work together were added:

* VisualDocumentNerGeo: This is a Visual NER model based on Geolayout, which achieves an F1-score of 92.86 on the Entity Extraction part of FUNSD. To use it, call:

```
ner = VisualDocumentNerGeo() \
    .pretrained("visual_ner_geo_v1", "en", "clinical/ocr/")
```

* GeoRelationExtractor: This is a Relation Extraction model based on Geolayout, which achieves an F1-score of 89.45 on the Relation Extraction part of FUNSD. To use it, call:

```
re = GeoRelationExtractor() \
    .pretrained("visual_re_geo_v2", "en", "clinical/ocr/")
```

</div><div class="h3-box" markdown="1">
To see these two models combined on a real example, please check this [sample notebook](https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/FormRecognition/FormRecognitionGeo.ipynb).
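For orientation before opening the notebook, here is a minimal sketch of how the two annotators could be combined in a single pipeline. The upstream image-loading stage, column names, and input path are assumptions, not the notebook's exact code:

```
from pyspark.ml import Pipeline
from sparkocr.transformers import BinaryToImage

# Load scanned forms as images (standard Visual NLP entry point).
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

# Entity extraction on the scanned form.
ner = VisualDocumentNerGeo() \
    .pretrained("visual_ner_geo_v1", "en", "clinical/ocr/")

# Relation extraction over the detected entities.
re = GeoRelationExtractor() \
    .pretrained("visual_re_geo_v2", "en", "clinical/ocr/")

pipeline = Pipeline(stages=[binary_to_image, ner, re])

# Illustrative input path; `spark` is the session created by sparkocr.start().
df = spark.read.format("binaryFile").load("forms/*.png")
result = pipeline.fit(df).transform(df)
```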

</div><div class="h3-box" markdown="1">

### New PDF Deidentification Pipeline
A new PDF de-identification pipeline, `pdf_deid_subentity_context_augmented_pipeline`, has been added to the library. This pipeline ingests PDFs, applies PHI masking through bounding boxes, and re-creates the input PDF as a new document from which the PHI has been removed. The pipeline is ready to use and requires no configuration for most use cases.
You can check an example of this pipeline in action in [this notebook](https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrPdfDeidSubentityContextAugmentedPipeline.ipynb).
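As a rough usage sketch, the pipeline can be pulled down and applied like any other pretrained pipeline; this assumes the library's standard `PretrainedPipeline` API, and the input path is illustrative:

```
from sparkocr.pretrained import PretrainedPipeline

# Download the ready-to-use de-identification pipeline.
deid_pipeline = PretrainedPipeline(
    "pdf_deid_subentity_context_augmented_pipeline", "en", "clinical/ocr")

# Read input PDFs as binary files; `spark` comes from sparkocr.start().
pdfs = spark.read.format("binaryFile").load("input_pdfs/*.pdf")

# The output contains the re-created PDFs with PHI masked out.
result = deid_pipeline.transform(pdfs)
```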
</div><div class="h3-box" markdown="1">

### Memory Improvements for Models

All ONNX models have been refactored to reduce the memory footprint of the graph. There are two sources of memory problems in ML: models and data. Here we tackle model memory consumption by cutting the memory requirements of models by half.

</div><div class="h3-box" markdown="1">

### Minor enhancements, changes, and bug fixes

* New `display_xml()` function for displaying tables as XML: similar to the existing `display_tables()` function, but with XML output instead of Jupyter markdown.
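A quick sketch of the new helper next to the existing one; we assume `display_xml()` shares the import path and signature of `display_tables()`, which should be verified against the API reference:

```
# Assumed import path, mirroring display_tables().
from sparkocr.utils import display_tables, display_xml

# `tables` is the output of a table-extraction pipeline (illustrative).
tables = table_pipeline.transform(df)

display_tables(tables)  # existing behaviour: Jupyter markdown output
display_xml(tables)     # new: the same tables serialized as XML
```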

* Enhanced memory management in ImageToTextV2: the lifecycle of the ONNX session is now aligned with the Spark query plan, so models are instantiated only once per partition and no leaks occur across multiple calls to transform() on the same pipeline. This results in more efficient memory utilization.

* GPU support in Pix2Struct models: Pix2Struct checkpoints for Chart Extraction and Visual Question Answering can leverage the GPU like this:

```
visual_question_answering = VisualQuestionAnswering() \
    .pretrained("info_docvqa_pix2struct_jsl_base_opt", "en", "clinical/ocr") \
    .setUseGPU(True)
```

* Better support for different coordinate formats: we improved the way coordinates are handled within Visual NLP. Each annotator now detects whether the coordinates being fed to it are regular or rotated coordinates. Forget about calls like `ImageDrawRegions.setRotated()` to choose a specific input format; everything now happens automatically behind the scenes.
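For illustration, a minimal sketch of the simplified usage, with column names as assumptions; note the absence of any `setRotated()` call:

```
from sparkocr.transformers import ImageDrawRegions

# Draw previously detected regions onto the page image.
draw_regions = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("regions") \
    .setOutputCol("image_with_regions")
# Whether "regions" holds regular or rotated coordinates is now
# detected automatically by the annotator.
```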

* New blogpost on Form Extraction: we have recently released a new [Medium blogpost](https://medium.com/john-snow-labs/visual-document-understanding-benchmark-comparative-analysis-of-in-house-and-cloud-based-form-75f6fbf1ae5f) in which we compare our in-house models against different cloud providers, provide metrics, and discuss the results.

The key takeaways are, first, that Visual NLP's small models can beat the cloud providers while remaining fast and offering more deployment options; and second, that fine-tuning is mandatory to obtain a model usable in practice.


This release is compatible with Spark NLP 5.4.1 and Spark NLP for Healthcare 5.4.1.
</div><div class="h3-box" markdown="1">

</div><div class="prev_ver h3-box" markdown="1">

## Previous versions

</div>

{%- include docs-sparckocr-pagination.html -%}