Supporting MuPDF file recognizer #4481

Open
wants to merge 1 commit into
base: main
21 changes: 9 additions & 12 deletions docs/document.rst
@@ -176,15 +176,11 @@ For details on **embedded files** refer to Appendix 3.
* If ``stream`` is given, then the document is created from memory.
* If ``stream`` is `None`, then a document is created from the file given by ``filename``.

:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is always determined from the file content. The ``filetype`` parameter can be used to ensure that the detected type is as expected or, respectively, to force treating any file as plain text.
:arg str,pathlib filename: A UTF-8 string or ``pathlib.Path`` object containing a file path. The document type is (almost [#f8]_) always determined from the file content. The ``filetype`` parameter can be used to force treating any file as plain text. For plain text files, there is no unambiguous way to recognize the content. Therefore the file extension or the ``filetype`` parameter must be given.

:arg bytes,bytearray,BytesIO stream: A memory area containing file data. The document type is **always** detected from the data content. The ``filetype`` parameter is ignored except for undetected data content. In that case only, using ``filetype="txt"`` will treat the data as containing plain text.
:arg bytes,bytearray,BytesIO stream: A memory area containing file data. With few exceptions [#f8]_, the document type is detected from the data content.

:arg str filetype: A string specifying the type of document. This may be anything looking like a filename (e.g. "x.pdf"), in which case MuPDF uses the extension to determine the type, or a mime type like ``application/pdf``. Just using strings like "pdf" or ".pdf" will also work. Can be omitted for :ref:`a supported document type<Supported_File_Types>`.

If opening a file name / path only, it will be used to ensure that the detected type is as expected. An exception is raised for a mismatch. Using `filetype="txt"` will treat any file as containing plain text.

When opening from memory, this parameter is ignored except for undetected data content. Only in that case, using ``filetype="txt"`` will treat the data as containing plain text.
:arg str filetype: A string specifying the type of document. Will be ignored in most [#f8]_ cases for :ref:`a supported document type<Supported_File_Types>`. Text-based files usually have no unambiguous way to recognize the content. Therefore the file extension or the ``filetype`` parameter (especially when opening from memory) must usually be given.

:arg rect_like rect: a rectangle specifying the desired page size. This parameter is only meaningful for documents with a variable page layout ("reflowable" documents), like e-books or HTML, and ignored otherwise. If specified, it must be a non-empty, finite rectangle with top-left coordinates (0, 0). Together with parameter *fontsize*, each page will be accordingly laid out and hence also determine the number of pages.

@@ -208,13 +204,12 @@ For details on **embedded files** refer to Appendix 3.

>>> # from a file
>>> doc = pymupdf.open("some.xps")
>>> # handle wrong extension
>>> doc = pymupdf.open("some.file", filetype="xps") # assert expected type
>>> doc = pymupdf.open("some.file", filetype="txt") # treat as plain text
>>> # handle wrong / missing extension when required
>>> doc = pymupdf.open("some.file", filetype="mobi") # treat as MOBI e-book
>>>
>>> # from memory
>>> doc = pymupdf.open(stream=mem_area) # works for any supported type
>>> doc = pymupdf.open(stream=unknown-type, filetype="txt") # treat as plain text
>>> doc = pymupdf.open(stream=mem_area) # works for most supported types
>>> doc = pymupdf.open(stream=ambiguous, filetype="mobi") # treat as MOBI e-book
>>>
>>> # new empty PDF
>>> doc = pymupdf.open()
@@ -2211,4 +2206,6 @@ Other Examples

.. [#f7] This only works under certain conditions. For example, if there is normal text covered by some image on top of it, then this is undetectable and the respective text is **not** removed. Similar is true for white text on white background, and so on.

.. [#f8] Almost all supported document types -- including all images -- are detected by MuPDF's built-in content recognizer. Exceptions are many text-based formats like plain text, program source code, etc. which have no unambiguous way for content identification. The e-book formats MOBI (extension ``.mobi``) and FictionBook (extension ``.fb2``) are two other exceptions which will probably be covered by the recognition feature soon. In these cases, the respective file extensions **must** be present - or (especially when opening from memory) the ``filetype`` must specify the document type.

.. include:: footer.rst
13 changes: 5 additions & 8 deletions docs/how-to-open-a-file.rst
@@ -38,22 +38,19 @@ To open a file, do the following:
File Recognizer: Opening with :index:`a Wrong File Extension <pair: wrong; file extension>`
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer".
If you have a document with a wrong file extension for its type, do not worry: it will still be opened correctly, thanks to the integrated file "content recognizer" in the base library.

This component looks at the actual data in the file using a number of heuristics -- independent of the file extension. This of course is also true for file names **without** an extension.

Here is a list of details about how the file content recognizer works:

* When opening from a file name, use the ``filetype`` parameter if you need to make sure that the created :ref:`Document` is of the expected type. An exception is raised for any mismatch.
* Whether opening from a file name or from memory, the recognizer in most cases will determine the correct document type. It does not need or even look at the file extension - which is not available anyway when opening from memory.

* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Tex" document. Correspondingly, text files with other / no extensions, can successfully be opened using `filetype="txt"`.
* Text files are an exception: they do not contain recognizable internal structures at all. Here, the file extension ".txt" and the ``filetype`` parameter continue to play a role and are used to create a "Text" document. Correspondingly, text files with other / no extensions, can successfully be opened using `filetype="txt"`.

* Using `filetype="txt"` will treat **any** file as containing plain text when opened from a file name / path -- even when its content is a supported document type.
* Currently, two e-book formats, FictionBook and MOBI, are not automatically recognized. They require the extensions ".fb2" and ".mobi" respectively. Use the ``filetype`` parameter accordingly to open them from memory.

* When opening from a stream, the file content recognizer will ignore the ``filetype`` parameter entirely for known file types -- even in case of a mismatch or when `filetype="txt"` was specified.

* Streams with a known file type cannot be opened as plain text.
* Specifying ``filetype`` currently only has an effect when no match was found. Then using ``filetype="txt"`` will treat the file as containing plain text.
* Using `filetype="txt"` will treat **any** file as containing plain text -- even when its content is a supported document type.


----------
100 changes: 50 additions & 50 deletions src/__init__.py
@@ -2922,8 +2922,6 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0
else:
raise TypeError(f"bad stream: {type(stream)=}.")
stream = self.stream
if not (filename or filetype):
filename = 'pdf'
else:
self.stream = None

@@ -2951,6 +2949,17 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0
w = r.x1 - r.x0
h = r.y1 - r.y0

if from_file:
_, magic2 = os.path.splitext(filename)
if magic2.startswith("."):
magic2 = magic2[1:]
else:
magic2 = ""
if isinstance(filetype, str):
magic = filetype
else:
magic = ""

if stream is not None:
assert isinstance(stream, (bytes, memoryview))
if len(stream) == 0:
@@ -2962,65 +2971,56 @@ def __init__(self, filename=None, stream=None, filetype=None, rect=None, width=0
buffer_ = mupdf.fz_new_buffer_from_copied_data(c)
data = mupdf.fz_open_buffer(buffer_)
else:
# Pass raw bytes data to mupdf.fz_open_memory(). This assumes
# that the bytes string will not be modified; i think the
# original PyMuPDF code makes the same assumption. Presumably
# setting self.stream above ensures that the bytes will not be
# garbage collected?
data = mupdf.fz_open_memory(mupdf.python_buffer_data(c), len(c))
magic = filename
if not magic:
magic = filetype
# fixme: pymupdf does:
# handler = fz_recognize_document(gctx, filetype);
# if (!handler) raise ValueError( MSG_BAD_FILETYPE)
# but prefer to leave fz_open_document_with_stream() to raise.

try:
doc = mupdf.fz_open_document_with_stream(magic, data)
if magic:
handler = mupdf.ll_fz_recognize_document(magic)
if not handler:
raise FileDataError(f"Failed to open stream as {magic}")
Comment on lines +2979 to +2980
Collaborator

Does this contradict the new docs, which say:
:arg str filetype: A string specifying the type of document. Will be ignored in most [#f8]_ cases for :ref:`a supported document type<Supported_File_Types>`.
?
Shouldn't we get a handler with fz_recognize_document_stream_content() here, so that the content overrides the supplied magic where possible?

Or simply use fz_open_document_with_stream(), i.e. don't create an intermediate handler? In other words, remove the entire `if magic: ...` block in the code?

Collaborator Author

The recognizer seems to always detect the type except in three cases: "txt", "fb2", "mobi". This means we have identical behavior for file-open and stream-open.
Only those three cases still require the correct extension or filetype, respectively. Everything else disregards the file extension and needs no help from filetype.

The two cases "fb2" and "mobi" can be resolved with a slightly improved recognizer, which will hopefully happen soon.

The case "txt" remains an exception as a matter of principle.

This is the background of my new logic.
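
The rule described in this reply can be illustrated with a small, self-contained sketch. It is not code from the PR: `effective_type` and `HINT_ONLY_TYPES` are hypothetical names, `detected` stands in for whatever MuPDF's content recognizer would report, and the separate case of forcing plain text with `filetype="txt"` is deliberately not modeled.

```python
from typing import Optional

# Types that, per the reply above, the content recognizer does not detect.
HINT_ONLY_TYPES = frozenset({"txt", "fb2", "mobi"})


def effective_type(detected: Optional[str], hint: Optional[str]) -> Optional[str]:
    """Return the document type under the rule described in the reply.

    detected: type found by content recognition, or None if nothing matched.
    hint: file extension or filetype value supplied by the caller, or None.
    """
    if detected is not None:
        # Recognized content wins; the extension / filetype is disregarded.
        return detected
    # Normalize hints such as ".MOBI" or "mobi" to a bare lowercase name.
    normalized = (hint or "").lower().lstrip(".")
    if normalized in HINT_ONLY_TYPES:
        # Only these three types still depend on the hint.
        return normalized
    return None
```

Under this model, file-open and stream-open differ only in where the hint comes from: the file extension or the ``filetype`` argument.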

accel = mupdf.FzStream()
archive = mupdf.FzArchive(None)
doc = mupdf.ll_fz_document_handler_open(
handler,
data.m_internal,
accel.m_internal,
archive.m_internal,
None, # recognize_state
)
doc = mupdf.FzDocument(doc)
else:
doc = mupdf.fz_open_document_with_stream(magic, data)
except Exception as e:
if g_exceptions_verbose > 1: exception_info()
raise FileDataError('Failed to open stream') from e
else:
if filename:
if not filetype:
if magic == "txt":
handler = mupdf.ll_fz_recognize_document(magic)
else:
stream = mupdf.FzStream(filename)
handler = mupdf.ll_fz_recognize_document_stream_content(stream.m_internal, magic)
if not handler and magic2:
handler = mupdf.ll_fz_recognize_document_stream_content(stream.m_internal, magic2)
if handler:
#log( f'{handler.open=}')
#log( f'{dir(handler.open)=}')
try:
doc = mupdf.fz_open_document(filename)
accel = mupdf.FzStream()
archive = mupdf.FzArchive(None)
doc = mupdf.ll_fz_document_handler_open(
handler,
stream.m_internal,
accel.m_internal,
archive.m_internal,
None, # recognize_state
)
except Exception as e:
if g_exceptions_verbose > 1: exception_info()
raise FileDataError(f'Failed to open file {filename!r}.') from e
raise FileDataError(f'Failed to open file {filename!r}') from e
doc = mupdf.FzDocument(doc)
else:
handler = mupdf.ll_fz_recognize_document(filetype)
if handler:
if handler.open:
#log( f'{handler.open=}')
#log( f'{dir(handler.open)=}')
try:
stream = mupdf.FzStream(filename)
accel = mupdf.FzStream()
archive = mupdf.FzArchive(None)
if mupdf_version_tuple >= (1, 24, 8):
doc = mupdf.ll_fz_document_handler_open(
handler,
stream.m_internal,
accel.m_internal,
archive.m_internal,
None, # recognize_state
)
else:
doc = mupdf.ll_fz_document_open_fn_call(
handler.open,
stream.m_internal,
accel.m_internal,
archive.m_internal,
)
except Exception as e:
if g_exceptions_verbose > 1: exception_info()
raise FileDataError(f'Failed to open file {filename!r} as type {filetype!r}.') from e
doc = mupdf.FzDocument( doc)
else:
assert 0
else:
raise ValueError( MSG_BAD_FILETYPE)
raise ValueError(MSG_BAD_FILETYPE)
else:
pdf = mupdf.PdfDocument()
doc = mupdf.FzDocument(pdf)
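
As a reading aid for the new `magic` / `magic2` logic in this hunk, here is a minimal sketch of how the two hints appear to be derived. The function name `split_hints` is hypothetical, and because indentation is lost in this view, the sketch assumes that `magic2` falls back to `""` when no file name is given.

```python
import os
from typing import Optional, Tuple, Union


def split_hints(
    filename: Optional[Union[str, os.PathLike]],
    filetype: Optional[str],
) -> Tuple[str, str]:
    """Return (magic, magic2): the filetype hint and the bare file extension."""
    # magic: the caller-supplied filetype, or "" when none was given.
    magic = filetype if isinstance(filetype, str) else ""
    # magic2: the file extension without its leading dot, or "" when
    # there is no file name or the name has no extension.
    magic2 = ""
    if filename:
        _, ext = os.path.splitext(os.fspath(filename))
        if ext.startswith("."):
            magic2 = ext[1:]
    return magic, magic2
```

For example, `split_hints("some.file", "xps")` yields `("xps", "file")`: the explicit `filetype` and the extension are kept as two separate hints, which is how the diff can try `magic` first and fall back to `magic2`.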
64 changes: 64 additions & 0 deletions tests/resources/fb2-file.fb2
@@ -0,0 +1,64 @@
<?xml version="1.0" encoding="UTF-8"?>
<FictionBook xmlns="http://www.gribuser.ru/xml/fictionbook/2.0" xmlns:xlink="http://www.w3.org/1999/xlink">
<description>
<title-info>
<genre>computers</genre>
<author>
<first-name>Chris</first-name>
<last-name>Clark</last-name>
</author>
<book-title>Sample FB2 book</book-title>
<annotation>
<p>Short sample of a FictionBook2 book with simple metadata. Based on test_book.md from https://github.com/clach04/sample_reading_media</p>
</annotation>
<keywords>ebook,sample,markdown,fb2,FictionBook2</keywords>
</title-info>
<document-info>
<author>
<nickname>clach04</nickname>
<home-page>https://github.com/clach04/sample_reading_media</home-page>
</author>

<program-used>vim and scite</program-used>
<src-url>https://github.com/clach04/sample_reading_media</src-url>
<version>1.0</version>
<history>
<p>Initial version, written by hand.</p>
</history>
</document-info>
</description>
<body>
<title>
<p>This is a title</p>
</title>

<section id="test-header-h1">
<title>
<p>Test Header h1</p>
</title>

<p>A test paragraph.</p>
<p>Another test paragraph.</p>
</section>

<section id="another-test-header-h1">
<title>
<p>Another Test Header h1</p>
</title>

<section id="a-test-header-h2">
<title>
<p>A Test Header h2</p>
</title>

<section id="a-test-header-h3">
<title>
<p>A Test Header h3</p>
</title>

<p>Yet more copy</p>
</section>
</section>
</section>
</body>
</FictionBook>
Binary file added tests/resources/mobi-file.mobi
Binary file not shown.
18 changes: 18 additions & 0 deletions tests/resources/svg-file.svg
Binary file added tests/resources/xps-file.xps
Binary file not shown.