Add functionality to merge cells in Google OCR prediction #103

Open: wants to merge 3 commits into base: add-new-ocr-metrics

Conversation

@samiuc (Contributor) commented May 20, 2025

The F1 scores for Google OCR regressed compared to the previous evaluations. On investigating, we found that we previously had functionality to merge cells before running the evaluations (Google OCR only). Here's an overview of what the new code does:

  • Join words that were incorrectly split during scanning
  • Reconnect special characters (hyphens, apostrophes, periods) with their words
  • Fix spacing issues in numbers, dates, and punctuation
  • Make the text easier to read and process by putting split words back together
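As a rough illustration of the "reconnect special characters" step listed above, here is a minimal sketch. It is not the PR's implementation: the `Word` dataclass, the `JOINERS` set, the `merge_words` function, and the `max_gap` threshold are all hypothetical names and values chosen for the example, and it assumes the words already belong to a single text line.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Word:
    """Hypothetical stand-in for one OCR word on a single text line."""
    text: str
    left: float   # x-coordinate of the left edge
    right: float  # x-coordinate of the right edge


# Assumed subset of characters that should stick to a neighbouring word.
JOINERS = set("-'.,:/")


def merge_words(words: list[Word], max_gap: float = 1.0) -> list[Word]:
    """Merge neighbouring words when the horizontal gap is at most
    `max_gap` and a joiner character sits at the boundary."""
    merged: list[Word] = []
    for word in sorted(words, key=lambda w: w.left):
        prev = merged[-1] if merged else None
        if prev is not None and prev.text and word.text:
            gap = word.left - prev.right
            at_joiner = prev.text[-1] in JOINERS or word.text[0] in JOINERS
            if gap <= max_gap and at_joiner:
                # Extend the previous word instead of starting a new one.
                merged[-1] = Word(
                    prev.text + word.text,
                    prev.left,
                    max(prev.right, word.right),
                )
                continue
        merged.append(word)
    return merged


line = [
    Word("well", 0.0, 10.0),
    Word("-", 10.5, 12.0),
    Word("known", 12.2, 25.0),
    Word("text", 30.0, 40.0),
]
print([w.text for w in merge_words(line)])
```

The sketch only covers joins around punctuation. Rejoining words that were split without any special character would need an additional rule, such as a tighter gap threshold, which is left out here.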

@samiuc samiuc requested a review from cau-git May 20, 2025 20:01
@cau-git (Contributor) left a comment

@samiuc since all of this is only required for Google OCR, please ensure that the code is kept in the google_prediction_provider.py module. We don't want this to spill into general docling-eval utils.

Also, one comment below.

@PeterStaar-IBM (Contributor) left a comment

See my comments.

return y_overlap / y_union if y_union > 0 else 0


def text_cell_to_word_dict(cell: TextCell):

please do not do this, keep using the TextCell!
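One way to read this request: the vertical-overlap check quoted above can work on the cell object directly, with no intermediate word dict. The sketch below is an assumption-laden illustration, not the PR's code. `Box` and `Cell` are hypothetical stand-ins (the real `TextCell` fields are not shown in this conversation), a top-left coordinate origin is assumed, and the way `y_overlap` and `y_union` are computed is a guess. Only the final return line mirrors the diff.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Box:
    top: float     # smaller y value, assuming a top-left origin
    bottom: float  # larger y value


@dataclass
class Cell:
    """Hypothetical stand-in for TextCell; the real class may differ."""
    text: str
    box: Box


def y_overlap_ratio(a: Cell, b: Cell) -> float:
    """Vertical intersection over vertical union of two cells."""
    # Height shared by both cells (zero if they do not overlap vertically).
    y_overlap = max(0.0, min(a.box.bottom, b.box.bottom) - max(a.box.top, b.box.top))
    # Height spanned by both cells together.
    y_union = max(a.box.bottom, b.box.bottom) - min(a.box.top, b.box.top)
    return y_overlap / y_union if y_union > 0 else 0.0


same_line = y_overlap_ratio(Cell("foo", Box(0.0, 10.0)), Cell("bar", Box(1.0, 11.0)))
other_line = y_overlap_ratio(Cell("foo", Box(0.0, 10.0)), Cell("baz", Box(20.0, 30.0)))
print(same_line, other_line)
```

Passing the cell objects through keeps the text and geometry together, so a merge step can build a new cell from the originals without converting back from a dict.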

@@ -42,10 +41,364 @@

_log = logging.getLogger(__name__)

SPECIAL_CHARS = list("*:;,.?()!@#$%^&[]{}/\\\"'~+-_<>=")

Don't use top-level constants; please make them part of a class!

4 participants