Inconsistent information between getText("dict")['blocks'] and getText("html") #956

Yichen-fqyd · 2021-03-17T21:10:53Z

Yichen-fqyd
Mar 17, 2021

Hello,

I really appreciate this great repo!
I am trying to add bounding box information to the corresponding p tags in the extracted HTML, and I found two problems during the process.

There are more p tags than bboxes provided by "dict".
"dict" sometimes returns a (0.0, 0.0, 0.0, 0.0) bounding box, and sometimes has more entries than there are p tags.
Since both outputs come from the same function, I would expect them to carry the same information, but they do not. I am wondering why, and whether there is an easier way to add bounding box information to the HTML file.

Thanks

Answered by JorjMcKie

JorjMcKie · 2021-03-17T21:37:51Z

JorjMcKie
Mar 17, 2021
Maintainer

I am afraid this will not work.
The HTML, XHTML and XML extraction options are based on original MuPDF functions and as such must be accepted as they are.
The other options are of my own making.
To "my" functions, over time and upon request, I added corrective code where errors were reported, and introduced some extended features like reduced glyph heights or restricting the extracted text to a given clip rectangle.

So when you see a zero bbox in the *ML files, there is nothing I can do. Such things go back to inconsistent or erroneous PDF or font information. Any corrective code I may be using in my functions cannot be carried over to the *ML functions.
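Since the *ML outputs cannot be changed, one possible workaround for the original goal is to build the HTML directly from the "dict" output instead of trying to match it against getText("html"). The following is only a minimal sketch, not part of PyMuPDF: the helper names `is_degenerate` and `dict_to_html` and the `data-bbox` attribute are made up for illustration. It assumes the usual "dict" layout (blocks with type 0 for text, each holding "lines" with a "bbox" and a list of "spans" with "text").

```python
from html import escape


def is_degenerate(bbox, eps=1e-6):
    """Return True for empty boxes such as (0.0, 0.0, 0.0, 0.0)."""
    x0, y0, x1, y1 = bbox
    return (x1 - x0) <= eps or (y1 - y0) <= eps


def dict_to_html(page_dict):
    """Emit one <p> per text line, carrying the line bbox as an attribute.

    `page_dict` is expected to look like the result of getText("dict").
    Hypothetical helper: the tag layout and attribute name are assumptions.
    """
    parts = ["<div>"]
    for block in page_dict.get("blocks", []):
        if block.get("type") != 0:  # skip image blocks, which have no lines
            continue
        for line in block.get("lines", []):
            bbox = line["bbox"]
            if is_degenerate(bbox):  # drop zero-size boxes
                continue
            text = "".join(span["text"] for span in line.get("spans", []))
            coords = ",".join(f"{v:.1f}" for v in bbox)
            parts.append(f'<p data-bbox="{coords}">{escape(text)}</p>')
    parts.append("</div>")
    return "\n".join(parts)
```

With a page object in hand, this would be called as `dict_to_html(page.getText("dict"))`. The result loses the styling that getText("html") produces, so it only fits cases where text plus position is enough.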

4 replies

Yichen-fqyd Mar 17, 2021
Author

Thank you for the quick response,

Sorry for the confusion. I am just trying to add the information to the returned HTML file, so I am trying to find the correspondence between these two functions. Currently, in most cases, the lines found by "dict" correspond nicely to the p tags.

And zero bbox is actually returned by getText("dict")['blocks'] in a file
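
To narrow down where the two outputs diverge, it may help to compare the number of p tags with the number of "dict" lines and to list the lines whose bbox is empty. The sketch below is an illustration only: `count_p_tags` and `summarize_lines` are hypothetical helpers, and checking at line level (rather than block level) is an assumption about where the zero boxes show up.

```python
from html.parser import HTMLParser


class _PCounter(HTMLParser):
    """Counts opening <p> tags in an HTML string."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.count += 1


def count_p_tags(html_text):
    parser = _PCounter()
    parser.feed(html_text)
    parser.close()
    return parser.count


def summarize_lines(page_dict):
    """Return (total_lines, zero_boxes) for a getText("dict") style result.

    zero_boxes holds (block_index, line_index) pairs whose bbox has no area.
    """
    total = 0
    zero_boxes = []
    for bno, block in enumerate(page_dict.get("blocks", [])):
        for lno, line in enumerate(block.get("lines", [])):
            total += 1
            x0, y0, x1, y1 = line["bbox"]
            if x1 <= x0 or y1 <= y0:
                zero_boxes.append((bno, lno))
    return total, zero_boxes
```

Comparing `count_p_tags(page.getText("html"))` with the first value of `summarize_lines(page.getText("dict"))` shows how far apart the two outputs are for a page, and the index pairs point at the entries worth reporting.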

JorjMcKie Mar 17, 2021
Maintainer

> And zero bbox is actually returned by getText("dict")['blocks'] in a file

If you want and have no confidentiality concerns, let me have that file so I can check whether this issue is solved by the pending v1.18.10 code.

Yichen-fqyd Mar 18, 2021
Author

Can I send it to your email: jorj.x.mckie@outlook.de?

JorjMcKie Mar 18, 2021
Maintainer

Sure!
