Re-support font name prefixes in span font names #896

Closed
darraghmckay opened this issue Feb 12, 2021 · 9 comments

@darraghmckay

Is your feature request related to a problem? Please describe.

Before version 1.17.6, extracting the page text dict returned font names like
MJGHPI+TimesNewRomanPSMT
and
YCCJKF+TimesNewRomanPSMT

Now those same text spans both return
TimesNewRomanPSMT

Note the missing leading prefix.

This is a breaking change for our current implementation, and we cannot upgrade beyond 1.17.5 until we can specify that full font names should be used.
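
For context, the prefix in question is the subset tag that PDF producers put in front of an embedded font subset's name: six uppercase letters followed by a plus sign. A minimal sketch of separating the tag from the base name, using only the standard library (the helper name `split_subset_tag` is made up for illustration and is not part of PyMuPDF):

```python
import re

# Subset tag: exactly six uppercase ASCII letters, then "+", then the base font name.
_SUBSET_TAG = re.compile(r"^([A-Z]{6})\+(.+)$")


def split_subset_tag(font_name):
    """Return (tag, base_name); tag is None when the name has no subset prefix."""
    match = _SUBSET_TAG.match(font_name)
    if match is None:
        return None, font_name
    return match.group(1), match.group(2)


for name in ("MJGHPI+TimesNewRomanPSMT", "YCCJKF+TimesNewRomanPSMT", "TimesNewRomanPSMT"):
    print(name, "->", split_subset_tag(name))
```

The two prefixed names share a base name but carry different tags, which is the distinction that gets lost when the prefix is dropped.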

This is the line that was changed:

https://github.com/pymupdf/PyMuPDF/compare/60e0c1fd5abadf61905253ea2fa19f62cb28e66e..10341cea796e8cbde86959a590d87b2596c27085#diff-04606915a2aa7f21b7798f15aba6f7b29a8900c7ac7403b13f2237f8214749ecR184

Describe the solution you'd like
It would be nice to be able to do one of the following:

  1. Add a flag/option to get_text("dict") that specifies that font names should be returned as is
  2. Include the font's xref in the final span so that the font can be looked up in the page's font list
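
To illustrate why either option would help, here is a rough sketch with made-up data. It assumes font-list entries shaped like `(xref, basefont)` pairs, which is a simplification of what the page's font list actually returns, and the xref numbers are invented. Once the prefix is stripped, two distinct subsets collapse to one name, so a span's font name alone can no longer be mapped back to a single xref:

```python
from collections import defaultdict

# Hypothetical, simplified font list: (xref, basefont) pairs with invented xrefs.
page_fonts = [
    (12, "MJGHPI+TimesNewRomanPSMT"),
    (27, "YCCJKF+TimesNewRomanPSMT"),
    (31, "Helvetica"),
]


def strip_subset_tag(name):
    """Drop a leading 'ABCDEF+' subset tag if one is present."""
    tag, sep, base = name.partition("+")
    if sep and len(tag) == 6 and tag.isascii() and tag.isalpha() and tag.isupper():
        return base
    return name


def xrefs_by_name(fonts, strip):
    """Group xrefs by font name, optionally stripping subset tags first."""
    groups = defaultdict(list)
    for xref, basefont in fonts:
        key = strip_subset_tag(basefont) if strip else basefont
        groups[key].append(xref)
    return dict(groups)


full = xrefs_by_name(page_fonts, strip=False)
stripped = xrefs_by_name(page_fonts, strip=True)

print(full)      # every full name maps to exactly one xref
print(stripped)  # the stripped Times name maps to two xrefs
```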

Describe alternatives you've considered
I explored option 2 above, but I don't currently think that's possible.

Additional context
If you think this is in line with the project's vision, I will happily implement it.

@JorjMcKie
Collaborator

Ok, I understand.
Using the flags parameter for this won't work, because it is MuPDF's own bit switch field. Who knows what will happen to it in the future.
But I can offer to introduce a global parameter, that could be set like fitz.TOOLS.set_full_fontnames(True) or so - with False being the default.
This would be in effect for the rest of the current script, or until changed.

@darraghmckay
Author

That solution sounds great. Thanks for the suggestion

@JorjMcKie
Collaborator

JorjMcKie commented Feb 12, 2021

It is easy-peasy to do, at least for the dict / json / rawdict / rawjson outputs.
As you may know, the XML, XHTML and HTML outputs are original MuPDF code, so they will continue to behave as they currently do.
The other variants (text / words / blocks) do not report the font name, so hopefully that change will have no collateral impact ...
Please expect this to be included in the next version, 1.18.9.
If you want to test a pre-version, please let me know your environment, so I can point you to the right download location.
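
For readers following along, the font name in the dict-style outputs sits on each span. A small sketch of collecting the distinct span font names from such a dict; the sample dict below is hand-written to mimic the block / line / span nesting, and `span_font_names` is an illustrative helper, not a PyMuPDF function:

```python
def span_font_names(text_dict):
    """Collect distinct font names from a page text dict, in first-seen order."""
    seen = []
    for block in text_dict.get("blocks", []):
        # Image blocks carry no "lines" key, so default to an empty list.
        for line in block.get("lines", []):
            for span in line.get("spans", []):
                font = span.get("font")
                if font is not None and font not in seen:
                    seen.append(font)
    return seen


# Hand-written stand-in for the result of page.get_text("dict").
sample = {
    "blocks": [
        {"type": 0, "lines": [
            {"spans": [{"font": "MJGHPI+TimesNewRomanPSMT", "text": "Hello "},
                       {"font": "YCCJKF+TimesNewRomanPSMT", "text": "world"}]},
        ]},
        {"type": 1},  # image block, no text lines
    ]
}

print(span_font_names(sample))
```

With full font names enabled, the two spans stay distinguishable; with the default behaviour both would report `TimesNewRomanPSMT`. As far as I know, the released switch is exposed as `fitz.TOOLS.set_subset_fontnames(True)` rather than under the name floated above, so check the documentation for your installed version.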

@darraghmckay
Author

Excellent, thanks so much, that looks perfect

@JorjMcKie
Collaborator

You can find Linux and Mac OSX pre-version wheels in respective branches of this repo.
Drop me a note if you want to test Windows.

@darraghmckay
Author

Sorry for not getting back to you sooner. This worked exactly as expected, thanks.

@JorjMcKie
Collaborator

I am planning to publish an official new version within a week from now.

JorjMcKie added a commit that referenced this issue Feb 26, 2021
@JorjMcKie
Collaborator

Fixed in v1.18.9, which is currently being uploaded.

@darraghmckay
Author

Thanks again for this, it worked as expected.
