Highlight OCR #1580

wgilling · 2020-08-12T17:24:33Z

Tesseract can be used to generate hOCR again, but displaying it poses several challenges.

This likely does not apply to objects viewed with the PDF.js viewer, because that viewer handles search highlighting itself.

Editing OCR would be much trickier, since the hOCR file would potentially need to be updated as well.
Displaying the rectangles per search term was a much easier concept for HTML, via the CSS classes in the hOCR file, but the challenge would be how to use these rectangles to make overlays in the OpenSeadragon viewer.
Make a corresponding action trigger that can be used to generate hOCR for any objects that have already been ingested (similar to the "Index node in Fedora" action).
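
As a rough illustration of the rectangle-to-overlay step, here is a minimal Python sketch. It assumes the hOCR marks each word with class `ocrx_word` and a `bbox x0 y0 x1 y1` entry in the `title` attribute (as Tesseract's hOCR output does), and that OpenSeadragon shows a single image at its default position, where viewport coordinates are pixel values divided by the image width. The names `HocrWords` and `overlays_for_term` are hypothetical, and exact case-insensitive matching stands in for whatever the search index really does.

```python
import re
from html.parser import HTMLParser

# hOCR stores pixel coordinates in the title attribute, e.g. "bbox 100 200 300 260; x_wconf 93".
BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if "ocrx_word" in (attributes.get("class") or "").split():
            match = BBOX_RE.search(attributes.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def overlays_for_term(hocr, term, image_width):
    """Return overlay rectangles in viewport units (image width == 1.0)."""
    parser = HocrWords()
    parser.feed(hocr)
    rects = []
    for text, (x0, y0, x1, y1) in parser.words:
        if text.lower() == term.lower():
            rects.append({
                "x": x0 / image_width,
                "y": y0 / image_width,  # y is also scaled by the width
                "width": (x1 - x0) / image_width,
                "height": (y1 - y0) / image_width,
            })
    return rects
```

Each returned rectangle could then be handed to the viewer as the location of an overlay element. How that is wired into the Islandora OpenSeadragon module is not covered here.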

jasonhildebrand · 2022-07-28T21:11:33Z

I understand that Islandora 8 does not support the ability to highlight search results when using openseadragon. I'm contributing our use case in the hopes that this feature will be prioritized soon.

In our case, we are digitizing to PDF files using ABBYY FineReader, which supports OCR of German Gothic script (Fraktur). It produces PDF files containing the scanned image, as well as the OCR'd text in a separate layer. You can open one of these files in a PDF reader and search it, and it will correctly highlight the location of the matching text.

When we import into Islandora 8, the PDF is converted to a service image, and this is displayed using openseadragon.
To support our use case, I suppose that Islandora would need to determine the location of matched text using the uploaded PDF (since this information is not contained in the JPG service file), then produce overlay information for openseadragon.

seth-shaw-asu · 2022-08-10T20:43:40Z

The Islandora-Lite folks @ the University of Toronto Scarborough (tagging @kstapelfeldt and @Natkeeran) did a demonstration of their setup during IslandoraCon 2022 which included improvements in viewer-supported OCR. I believe they were using annotations served via IIIF, but I don't recall details. I look forward to watching their presentation again when it gets posted.

Natkeeran · 2022-08-17T14:29:28Z

To clarify, it is an early prototype. Please see additional info here: https://github.com/digitalutsc/islandora_lite_docs/wiki/Mirador-Search-and-Annotations-(Prototype)

@alxp (UPEI) is also looking into this feature.

wgilling · 2022-08-24T17:19:33Z

I'd love to first explore the Mirador Search and Annotations (Prototype) and work with @alxp on this solution, since it seems like anybody who is already using Mirador would be able to use it.

Jordan Dukart referenced the mirador-textoverlay code at https://github.com/dbmdz/mirador-textoverlay and said that this is what UTSC and CMU were using. Mirador, however, likely takes an intermediate format rather than an hOCR file per page.

Also, Don Richards mentioned this https://dbmdz.github.io/solr-ocrhighlighting/0.8.1/ while he was researching the topic.

jasonhildebrand · 2022-10-31T17:38:00Z

FYI, we have implemented a solution to the use case I noted earlier. Here is our approach at a high level:

Implemented a microservice which accepts a PDF URL (of a PDF in Fedora) and search terms as input, then extracts text from the PDF file and returns bounding boxes of matching terms. We implement some fuzziness in order to match words which don't match 100% (because Solr may be doing word stemming, etc.).
Customized islandora/openseadragon to query the microservice, obtain the locations of matching terms, and create overlays to highlight the terms.
Implemented a REST view in Islandora that, given a node of model = Page, allows us to fetch the URLs of the original PDF file of that node. Our openseadragon customization uses this info.
Made a small customization to Islandora to retain search terms in the query string when clicking on a search result.
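
The fuzzy-matching step of such a microservice might look roughly like the sketch below. This is not the code described above: the thread does not say how the fuzziness was implemented, so Python's standard `difflib.SequenceMatcher` is used as a stand-in, and the `Word` type, the function names, and the 0.8 threshold are all hypothetical. PDF text extraction is left out, so the input is assumed to be a list of words with bounding boxes that has already been pulled from the PDF's text layer.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Word:
    """One word from the PDF text layer (hypothetical structure)."""
    text: str
    bbox: tuple  # (x0, y0, x1, y1) in page coordinates


def normalize(value):
    # Lowercase and drop punctuation so "Bibliothek," compares equal to "bibliothek".
    return "".join(ch for ch in value.lower() if ch.isalnum())


def is_fuzzy_match(term, word, threshold=0.8):
    """True when the similarity ratio of the normalized strings reaches the threshold."""
    a, b = normalize(term), normalize(word)
    if not a or not b:
        return False
    return SequenceMatcher(None, a, b).ratio() >= threshold


def find_matches(words, terms, threshold=0.8):
    """Return the bounding boxes of all words that fuzzily match any search term."""
    return [
        w.bbox
        for w in words
        if any(is_fuzzy_match(t, w.text, threshold) for t in terms)
    ]
```

A similarity threshold is only one way to tolerate stemming differences. Applying the same stemmer as the search index to both sides would be another, and may produce fewer false positives.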

This approach was driven largely by the format of our source PDFs (and the need to complete our project on-budget). I don't know whether it is of interest to the Islandora community or not, but thought I would post here in case anyone is interested.

kstapelfeldt added the enhancement label Sep 9, 2021

kstapelfeldt added the Type: enhancement label and removed the enhancement label Sep 25, 2021

rosiel mentioned this issue Oct 22, 2021

Use Case: OCR is searchable and i can tell it a language. #1957

Open

amyrb mentioned this issue Nov 10, 2021

Use Case: Book Viewer #1923

Open

alxp mentioned this issue Sep 7, 2022

Add hOCR option to Text Extraction Media Attachment action and IIIF Manifest Islandora/islandora#897

Merged
