-
Notifications
You must be signed in to change notification settings - Fork 9.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update hocrrenderer.cpp for solving the issue #4045 #4138
Conversation
To resolve the issue of Tesseract's hOCR output not displaying as XHTML in Chrome, modifications were made in the hocrrenderer.cpp file. These changes ensure proper generation of hOCR markup, including text direction and baseline information, allowing Chrome to correctly render the output as expected in an XHTML format.
src/api/hocrrenderer.cpp
Outdated
|
||
return true; | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check why the indentation in your pull request changed. Maybe your editor inserted tabs or additional blanks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have fixed the indentation at line 526. PLease review and merge my PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other lines still have a wrong indentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have commit the changes in the code , please review and give your review.
src/api/hocrrenderer.cpp
Outdated
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" | ||
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n" | ||
" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" | ||
"<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" " | ||
"lang=\"en\">\n <head>\n <title>"); | ||
AppendString(title()); | ||
AppendString( | ||
"</title>\n" | ||
" <meta http-equiv=\"Content-Type\" content=\"text/html;" | ||
"charset=utf-8\"/>\n" | ||
" <meta name='ocr-system' content='tesseract " TESSERACT_VERSION_STR | ||
"' />\n" | ||
" <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par" | ||
" ocr_line ocrx_word ocrp_wconf"); | ||
if (font_info_) { | ||
AppendString(" ocrp_lang ocrp_dir ocrp_font ocrp_fsize"); | ||
} | ||
AppendString( | ||
"'/>\n" | ||
" </head>\n" | ||
" <body>\n"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" | |
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n" | |
" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" | |
"<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" " | |
"lang=\"en\">\n <head>\n <title>"); | |
AppendString(title()); | |
AppendString( | |
"</title>\n" | |
" <meta http-equiv=\"Content-Type\" content=\"text/html;" | |
"charset=utf-8\"/>\n" | |
" <meta name='ocr-system' content='tesseract " TESSERACT_VERSION_STR | |
"' />\n" | |
" <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par" | |
" ocr_line ocrx_word ocrp_wconf"); | |
if (font_info_) { | |
AppendString(" ocrp_lang ocrp_dir ocrp_font ocrp_fsize"); | |
} | |
AppendString( | |
"'/>\n" | |
" </head>\n" | |
" <body>\n"); | |
"<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n" | |
"<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Transitional//EN\"\n" | |
" \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd\">\n" | |
"<html xmlns=\"http://www.w3.org/1999/xhtml\" xml:lang=\"en\" " | |
"lang=\"en\">\n <head>\n <title>"); | |
AppendString(title()); | |
AppendString( | |
"</title>\n" | |
" <meta http-equiv=\"Content-Type\" content=\"text/html;" | |
"charset=utf-8\"/>\n" | |
" <meta name='ocr-system' content='tesseract " TESSERACT_VERSION_STR | |
"' />\n" | |
" <meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par" | |
" ocr_line ocrx_word ocrp_wconf"); | |
if (font_info_) { | |
AppendString(" ocrp_lang ocrp_dir ocrp_font ocrp_fsize"); | |
} | |
AppendString( | |
"'/>\n" | |
" </head>\n" | |
" <body>\n"); |
To resolve the issue of Tesseract's hOCR output not displaying as XHTML in Chrome, modifications were made in the hocrrenderer.cpp file. These changes ensure proper generation of hOCR markup, including text direction and baseline information, allowing Chrome to correctly render the output as expected in an XHTML format.
src/api/hocrrenderer.cpp
Outdated
SetContentType("application/xhtml+xml"); | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this code. Where is SetContentType
defined? And why do you add a return
statement which makes the following code useless?
The modifications in hocrrenderer.cpp aim to address a rendering issue in Tesseract's hOCR output, ensuring it conforms to XHTML standards. These changes enable proper text direction and baseline information, resulting in correct display in Chrome.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's still unclear how this pull request would fix issue #4045. In addition the changes add unnecessary empty lines and change the indentation which violates the coding conventions for Tesseract.
This code addresses the issue by ensuring Tesseract's hOCR output complies with XHTML standards. It includes necessary metadata and formatting for text direction and baseline information, enabling correct rendering in web browsers like Chrome, ultimately resolving the rendering issue (tesseract-ocr#4045).
Please test your code changes locally before sending a pull request. All build tests fail because |
To resolve the issue of Tesseract's hOCR output not displaying as XHTML in Chrome, modifications were made in the hocrrenderer.cpp file. These changes ensure proper generation of hOCR markup, including text direction and baseline information, allowing Chrome to correctly render the output as expected in an XHTML format.