Inconsistent information between getText("dict")['blocks'] and getText("html") #956

I am afraid this will not work.
The HTML, XHTML and XML extraction options are based on the original MuPDF functions and as such must be accepted as they are.
The other options are of my own making.
Over time and upon request, I added corrective code to "my" functions where errors were reported, and introduced some extended features such as reduced glyph heights or restricting the extracted text to a given clip rectangle.

So when you see a zero bbox in the *ML output, there is nothing I can do. Such things go back to inconsistent or erroneous PDF or font information. Any corrective code I may be using in my functions cannot be carried over to the *ML functions.
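If bounding boxes matter, the "dict" output is therefore the one to inspect. Below is a minimal sketch of a check for degenerate (zero-width or zero-height) span bboxes in a dict-shaped result. It assumes the usual blocks → lines → spans layout with a `bbox` entry of the form `(x0, y0, x1, y1)` on each span; the helper name `zero_area_spans` and the sample data are made up for illustration and are not part of PyMuPDF.

```python
def zero_area_spans(page_dict, eps=1e-6):
    """Return (block_no, line_no, span_no, text) for every span
    whose bbox has (almost) no width or no height."""
    hits = []
    for b_no, block in enumerate(page_dict.get("blocks", [])):
        # Assumption: type 0 marks a text block; other blocks have no "lines".
        if block.get("type", 0) != 0:
            continue
        for l_no, line in enumerate(block.get("lines", [])):
            for s_no, span in enumerate(line.get("spans", [])):
                x0, y0, x1, y1 = span["bbox"]
                if (x1 - x0) <= eps or (y1 - y0) <= eps:
                    hits.append((b_no, l_no, s_no, span.get("text", "")))
    return hits


# Hand-made sample mimicking the shape of a "dict" extraction result.
sample = {
    "blocks": [
        {
            "type": 0,
            "lines": [
                {
                    "spans": [
                        {"bbox": (10.0, 10.0, 50.0, 22.0), "text": "A"},
                        {"bbox": (50.0, 10.0, 50.0, 22.0), "text": "B"},  # zero width
                    ]
                }
            ],
        },
        {"type": 1},  # non-text block, skipped
    ]
}

print(zero_area_spans(sample))
```

With PyMuPDF installed, the same helper could presumably be fed the result of `page.getText("dict")` (spelled `page.get_text("dict")` in newer versions). Spans flagged here point to a problem in the PDF or font data itself rather than in the extraction step.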

Answer selected by Yichen-fqyd