Update hocrrenderer.cpp for solving the issue #4045 #4138

Closed
wants to merge 4 commits into from

tusharv01

To resolve the issue of Tesseract's hOCR output not displaying as XHTML in Chrome, modifications were made in the hocrrenderer.cpp file. These changes ensure proper generation of hOCR markup, including text direction and baseline information, allowing Chrome to correctly render the output as expected in an XHTML format.

return true;
return true;
Contributor

@stweil stweil Oct 5, 2023

Please check why the indentation in your pull request changed. Maybe your editor inserted tabs or additional blanks?

Author

I have fixed the indentation at line 526. Please review and merge my PR.

Contributor

The other lines still have a wrong indentation.

Author

I have committed the changes to the code. Please review them.

Comment on lines 506 to 526
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n"
" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
"<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" "
"lang=\"en\">\n <head>\n <title>");
AppendString(title());
AppendString(
"</title>\n"
" <meta http-equiv=\"Content-Type\" content=\"text/html;"
"charset=utf-8\"/>\n"
" <meta name='ocr-system' content='tesseract " TESSERACT_VERSION_STR
"' />\n"
" <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par"
" ocr_line ocrx_word ocrp_wconf");
if (font_info_) {
AppendString(" ocrp_lang ocrp_dir ocrp_font ocrp_fsize");
}
AppendString(
"'/>\n"
" </head>\n"
" <body>\n");
Contributor

Suggested change
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n"
" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
"<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" "
"lang=\"en\">\n <head>\n <title>");
AppendString(title());
AppendString(
"</title>\n"
" <meta http-equiv=\"Content-Type\" content=\"text/html;"
"charset=utf-8\"/>\n"
" <meta name='ocr-system' content='tesseract " TESSERACT_VERSION_STR
"' />\n"
" <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par"
" ocr_line ocrx_word ocrp_wconf");
if (font_info_) {
AppendString(" ocrp_lang ocrp_dir ocrp_font ocrp_fsize");
}
AppendString(
"'/>\n"
" </head>\n"
" <body>\n");

To resolve the issue of Tesseract's hOCR output not displaying as XHTML in Chrome, modifications were made in the hocrrenderer.cpp file. These changes ensure proper generation of hOCR markup, including text direction and baseline information, allowing Chrome to correctly render the output as expected in an XHTML format.
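
For readers following the indentation discussion, the header-building code quoted in the review can be sketched as a standalone function. This is a minimal sketch, not Tesseract's implementation: `BuildHocrHeader` and `DEMO_VERSION_STR` are hypothetical names standing in for the renderer method and `TESSERACT_VERSION_STR`, and the whitespace inside the string literals is copied from the quoted lines as they appear on this page, which may have been flattened. It illustrates that adjacent string literals are concatenated by the compiler, so source-level indentation changes the diff but not the emitted markup.

```cpp
#include <cassert>
#include <string>

// Hypothetical placeholder for the version macro used in the quoted code.
#define DEMO_VERSION_STR "x.y.z"

// Hypothetical helper (an assumption, not a Tesseract API): returns the
// hOCR document header that the quoted AppendString calls would emit.
std::string BuildHocrHeader(const std::string &title, bool font_info) {
  // Adjacent string literals are joined at compile time; the indentation
  // of these source lines has no effect on the resulting string.
  std::string out =
      "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n"
      " \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n"
      "<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" "
      "lang=\"en\">\n <head>\n <title>";
  out += title;
  out +=
      "</title>\n"
      " <meta http-equiv=\"Content-Type\" content=\"text/html;"
      "charset=utf-8\"/>\n"
      " <meta name='ocr-system' content='tesseract " DEMO_VERSION_STR
      "' />\n"
      " <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par"
      " ocr_line ocrx_word ocrp_wconf";
  // Font-related capabilities are advertised only when font info is enabled.
  if (font_info) {
    out += " ocrp_lang ocrp_dir ocrp_font ocrp_fsize";
  }
  out +=
      "'/>\n"
      " </head>\n"
      " <body>\n";
  return out;
}
```

Calling `BuildHocrHeader("demo", true)` from a small `main` and printing the result is enough to compare the output before and after a whitespace-only change to the source.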
Comment on lines 503 to 504
SetContentType("application/xhtml+xml");
return true;
Contributor

I don't understand this code. Where is SetContentType defined? And why do you add a return statement which makes the following code useless?
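
The second question can be illustrated with a small sketch. `DemoRenderer` and its methods are hypothetical and only mimic the shape of the reviewed change; the `SetContentType` call is left out because, as the review points out, no such function is declared. The sketch shows that a `return true;` placed before the `AppendString` calls leaves the output buffer empty.

```cpp
#include <cassert>
#include <string>

// Hypothetical minimal renderer used only for illustration; it is not
// Tesseract's TessHOcrRenderer.
class DemoRenderer {
public:
  // Mirrors the shape of the reviewed change: the function returns before
  // any header text is appended, so the statements below never run.
  bool BeginDocumentWithEarlyReturn() {
    return true;
    AppendString("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"); // unreachable
    return true;
  }

  // Expected shape: emit the header first, then report success.
  bool BeginDocument() {
    AppendString("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
    return true;
  }

  const std::string &output() const { return output_; }

private:
  void AppendString(const char *s) { output_ += s; }
  std::string output_;
};
```

With the early return, `output()` stays empty even though the function reports success, which is why the reviewer calls the following code useless.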

The modifications in hocrrenderer.cpp aim to address a rendering issue in Tesseract's hOCR output, ensuring it conforms to XHTML standards. These changes enable proper text direction and baseline information, resulting in correct display in Chrome.
@stweil stweil marked this pull request as draft October 5, 2023 12:54
Contributor

@stweil stweil left a comment

It's still unclear how this pull request would fix issue #4045. In addition, the changes add unnecessary empty lines and change the indentation, which violates the coding conventions for Tesseract.

This code addresses the issue by ensuring Tesseract's hOCR output complies with XHTML standards. It includes necessary metadata and formatting for text direction and baseline information, enabling correct rendering in web browsers like Chrome, ultimately resolving the rendering issue (tesseract-ocr#4045).
Contributor

stweil commented Oct 5, 2023

Please test your code changes locally before sending a pull request. All build tests fail because SetContentType is unknown.

@stweil stweil closed this Oct 5, 2023