Skip to content

Supporting MuPDF file recognizer #4481

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

JorjMcKie
Copy link
Collaborator

We previously used slightly different logic for opening documents from files versus from memory. This fix strives to always use MuPDF's file type recognition and thus become independent from the value of file extensions as much as possible. This works for almost all document types. Exceptions are "txt", "fb2" and "mobi", where either a valid file extension must be present, or the respective filetype must be provided.

We previously used slightly different logic for opening documents from files versus from memory.
This fix strives to always use MuPDF's file type recognition and thus become independent from the value of file extensions as much as possible.
This works for almost all document types. Exceptions are "txt", "fb2" and "mobi", where either a valid file extension must be present, or the respective filetype must be provided.
@JorjMcKie JorjMcKie force-pushed the support-mupdf-recognizer branch from 1fb4d0d to 2c90d6c Compare May 1, 2025 09:14
Comment on lines +2979 to +2980
if not handler:
raise FileDataError("Failed to open stream as {magic}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this contradict the new docs which says:
:arg str filetype: A string specifying the type of document. Will be ignored in most [#f8]_ cases for :ref:a supported document type<Supported_File_Types>.
?
Shouldn't we should get a handler with fz_recognize_document_stream_content() here, so that the contents overrides the supplied magic where possible?

Or simply use fz_open_document_with_stream(), i.e. don't create an intermediate handler? I.e. remove the entire if magic: ... block in the code?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recognizer seems to always detect the type except in 3 cases: "txt", "fb2", "mobi". This means we have identical behavior between file-open versus stream-open.
Only those three cases still require the correct extension, respectively filetype. Everything else disregards file extension, respectively needs no filetype-help.

The two cases "fb2" and "mobi" can be resolved with a little improved recognizer, which will hopefully happen soon.

The case "txt" remains an exception because of principle reasons.

This is the background of my new logic.

Copy link
Collaborator

@jamie-lemon jamie-lemon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docs side of things is fine. No typos - all good! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants