diff --git a/docs/changes.rst b/docs/changes.rst index 084e38ca4..a71560c4e 100644 --- a/docs/changes.rst +++ b/docs/changes.rst @@ -1,6 +1,23 @@ Change Logs =============== +Changes in Version 1.18.16 +--------------------------- +* **Fixed** issue `#1184 `_. Existing PDF widget fonts in a PDF are now accepted (i.e. not forcibly changed to a Base-14 font). + +* **Fixed** issue `#1154 `_. Text search hits should now be correct when ``clip`` is specified. + +* **Fixed** issue `#1152 `_. + +* **Fixed** issue `#1146 `_. + +* **Added** :attr:`Link.flags` and :meth:`Link.set_flags` to the :ref:`Link` class. Implements enhancement request `#1187 `_. + +* **Added** option to *simulate* :meth:`TextWriter.fill_textbox` output for predicting the number of lines that a given text would occupy in the textbox. + +* **Added** text output support as subcommand ``gettext`` to the ``fitz`` CLI module. Most importantly, original **physical text layout** reproduction is now supported. + + Changes in Version 1.18.15 --------------------------- * **Fixed** issue `#1088 `_. Removing an annotation's fill color should now work again both ways, using the ``fill_color=[]`` argument in :meth:`Annot.update` as well as ``fill=[]`` in :meth:`Annot.set_colors`. diff --git a/docs/conf.py b/docs/conf.py index 9845d1060..86a4483d0 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -43,7 +43,7 @@ # built documents. # # The full version, including alpha/beta/rc tags. -release = "1.18.15" +release = "1.18.16" # The short X.Y version version = release diff --git a/docs/document.rst b/docs/document.rst index c25495680..e3e229a0f 100644 --- a/docs/document.rst +++ b/docs/document.rst @@ -1103,6 +1103,8 @@ For details on **embedded files** refer to Appendix 3. :arg str user_pw: *(new in version 1.16.0)* set the document's user password. + .. note:: The method does not check whether a file of that name already exists. It will hence not ask for confirmation and will overwrite the file. 
It is your responsibility as a programmer to handle this. + .. method:: ez_save(*args, **kwargs) *(New in v1.18.11)* diff --git a/docs/faq.rst b/docs/faq.rst index f9ac92005..5a1fa4e25 100644 --- a/docs/faq.rst +++ b/docs/faq.rst @@ -2103,6 +2103,38 @@ If it is *False* or if you want to be on the safe side, pick one of the followin page.wrap_contents() >>> # start inserting text, images or annotations here + +Missing or Unreadable Extracted Text +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +This can be caused by a number of different problems. + +Problem: no text +^^^^^^^^^^^^^^^^ +Your PDF viewer does display text, but you cannot select it with your cursor, and text extraction delivers nothing. + +Cause +^^^^^^ +1. You may be looking at an image embedded in the PDF page (e.g. a scanned PDF). +2. The PDF creator used no font, but **simulated** text by painting it, using little lines and curves. E.g. a capital "D" could be painted by a line "|" and a left-open semi-circle, an "o" by an ellipse, and so on. + +Solution +^^^^^^^^^^ +Use OCR software like `OCRmyPDF `_ to insert a hidden text layer underneath the visible page. The resulting PDF should behave as expected. + +Problem: unreadable text +^^^^^^^^^^^^^^^^^^^^^^^^ +Text extraction does not deliver the text in readable order, duplicates some text, or is otherwise garbled. + +Cause +^^^^^^ +1. The single characters are readable as such (no "" symbols), but the sequence in which the text is **coded in the file** deviates from the reading order. The motivation behind this may be technical or protection of data against unwanted copies. +2. Many "" symbols occur, indicating that MuPDF could not interpret these characters. The PDF creator may have used a font that displays readable text, but obfuscates the unicode character that leads to the readable symbol (glyph). + +Solution +^^^^^^^^ +1. Use layout preserving text extraction: ``python -m fitz gettext file.pdf``. +2. 
If other text extraction tools also don't work, then the only solution again is OCR-ing the page. + -------------------------- Low-Level Interfaces diff --git a/docs/functions.rst b/docs/functions.rst index 19343b657..f7473bd95 100644 --- a/docs/functions.rst +++ b/docs/functions.rst @@ -40,6 +40,7 @@ Yet others are handy, general-purpose utilities. :meth:`Page.run` run a page through a device :meth:`Page.read_contents` PDF only: get complete, concatenated /Contents source :meth:`Page.wrap_contents` wrap contents with stacking commands +:meth:`Page._get_texttrace()` low-level text information :attr:`Page.is_wrapped` check whether contents wrapping is present :meth:`planish_line` matrix to map a line to the x-axis :meth:`paper_size` return width, height for a known paper format @@ -396,12 +397,79 @@ Yet others are handy, general-purpose utilities. ----- - .. method:: Page.wrap_contents + .. method:: Page.wrap_contents() Put string pair "q" / "Q" before, resp. after a page's */Contents* object(s) to ensure that any "geometry" changes are **local** only. Use this method as an alternative, minimalistic version of :meth:`Page.clean_contents`. Its advantage is a small footprint in terms of processing time and impact on the data size of incremental saves. +----- + + .. method:: Page._get_texttrace() + + *New in v1.18.16* + + Return low-level text information of the page (**all** document types). 
This is a list of Python dictionaries with the following content:: + + { + 'ascender': 0.75, # font ascender (1) + 'bidi': 0, # bidirectional level (1) + 'chars': ( # char information, tuple[tuple] + (32, # unicode (4) + 3, # glyph id (font dependent) + (470.3800354003906, # origin.x (1) + 755.3758544921875), # origin.y (1) + 2.495859366375953 # width (points) + ), + ), + 'color': (0.0,), # text color, tuple[float] (1) + 'colorspace': 1, # number of colorspace components (1) + 'descender': -0.25, # font descender (1) + 'dir': (1.0, 0.0), # writing direction (1) + 'flags': 4, # font flags (1) + 'font': 'Calibri', # font name (1) + 'linewidth': 0.5519999980926514, # last known line width value (3) + 'opacity': 1.0, # alpha value of the text (5) + 'scissor': (1.0, 1.0, -1.0, -1.0), # + 'size': 11.039999961853027, # font size (1) + 'spacewidth': 2.495859366375953, # width of space character (synthesized) + 'type': 0, # span type (2) + 'wmode': 0 # writing mode (1) + } + + Details: + + 1. Same meaning as explained in :ref:`TextPage`. + 2. There are 5 text span types: + + 0. Filled text -- equivalent to PDF text rendering mode 0 (``0 Tr``), only the characters' inside is shown. + 1. Stroked text -- equivalent to ``1 Tr``, only the character borders are shown. + 2. Clipped text -- details yet unknown. + 3. Clip-stroked text -- details yet unknown. + 4. Ignored text -- equivalent to ``3 Tr``. + + 3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of stroked lines. This value may not be provided in the data. In this case, a value of ``span["size"] * 0.05`` is generated. Often, "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent text span type for this case. Instead, the respective text is represented by two consecutive spans -- which are identical in every aspect, except that one has ``span["type"] == 0`` and the other one ``span["type"] == 1``. + 4. 
For data compactness, the character's unicode is provided here. Use function ``chr()`` for the character itself. + 5. The alpha / opacity value of the span's text, 0 <= opacity <= 1. Zero is invisible text, 1 (100%) covers what is behind. + + Here is a list of similarities and differences of ``page._get_texttrace()`` compared to ``page.get_text("rawdict")``: + + * The method is up to **twice as fast.** + * The returned information is very **much smaller in size.** + * Additional types of text **invisibility can be detected**: opacity = 0 and type = 4. + * Character bboxes are not provided; if needed, compute them from available information. + * If MuPDF returns unicode 0xFFFD (65533) for unrecognized characters, you may still be able to deduce required information from the glyph id. + * The ``span["chars"]`` **contains no spaces**, **unless** the document creator has coded them. They **will never be generated** as happens in :meth:`Page.get_text` methods. To provide some help for doing your own computations here, the width of a space character is given. This value is derived from the font where possible. Otherwise a synthetic value is taken. + * There is no effort to organize text as happens for a :ref:`TextPage` (the hierarchy of blocks, lines, spans, and characters). Characters are simply extracted in sequence, one by one, and put in a span. Whenever any of the span's characteristics change, a new span is started. So you may find characters with different ``origin.y`` values in the same span. You cannot assume that span characters are sorted in any particular order -- you must make sense of the info yourself, taking ``span["dir"]``, ``span["wmode"]``, etc. into account. + * Ligatures are represented like this: + - MuPDF handles these ligatures: "fi", "ff", "fl", "ft", "st", "ffi", and "ffl". If the page contains e.g. 
ligature "fi", you will find the following two character items subsequent to each other:: + + (102, glyph, (x, y), width) # 102 = ord("f") + (105, -1, (x, y), 0) # 105 = ord("i") + + - This means that the ligature character components are shown combined within the space given by width. It is up to you, how you want to handle these cases in your text extraction. This is similar to ``page.get_text("rawdict")``: a glyph id is never available there, but you can assume a ligature if you encounter one of the character combinations above, having the **same origin** and ``bbox.width = 0`` except for the first character. + + ----- .. attribute:: Page.is_wrapped @@ -412,13 +480,13 @@ Yet others are handy, general-purpose utilities. .. method:: Page.get_text_blocks(flags=None) - Deprecated wrapper for :meth:`TextPage.extractBLOCKS`. Use :meth:`Page.getText` with the "blocks" option instead. + Deprecated wrapper for :meth:`TextPage.extractBLOCKS`. Use :meth:`Page.get_text` with the "blocks" option instead. ----- .. method:: Page.get_text_words(flags=None) - Deprecated wrapper for :meth:`TextPage.extractWORDS`. Use :meth:`Page.getText` with the "words" option instead. + Deprecated wrapper for :meth:`TextPage.extractWORDS`. Use :meth:`Page.get_text` with the "words" option instead. ----- diff --git a/docs/images/img-layout-text.jpg b/docs/images/img-layout-text.jpg new file mode 100644 index 000000000..e33564525 Binary files /dev/null and b/docs/images/img-layout-text.jpg differ diff --git a/docs/link.rst b/docs/link.rst index 0c9694a02..34af9a975 100644 --- a/docs/link.rst +++ b/docs/link.rst @@ -12,10 +12,12 @@ There is a parent-child relationship between a link and its page. 
If the page ob ========================= ============================================ :meth:`Link.set_border` modify border properties :meth:`Link.set_colors` modify color properties +:meth:`Link.set_flags` modify link flags :attr:`Link.border` border characteristics :attr:`Link.colors` border line color :attr:`Link.dest` points to destination details :attr:`Link.is_external` external destination? +:attr:`Link.flags` link annotation flags :attr:`Link.next` points to next link :attr:`Link.rect` clickable area in untransformed coordinates. :attr:`Link.uri` link destination @@ -40,7 +42,7 @@ There is a parent-child relationship between a link and its page. If the page ob .. method:: set_colors(colors=None, stroke=None) - Changes the "stroke" color. + PDF only: Changes the "stroke" color. .. note:: In PDF, links are a subtype of annotations technically and **do not support fill colors**. However, to keep a consistent API, we do allow specifying a ``fill=`` parameter like with all annotations, which will be ignored with a warning. @@ -49,6 +51,19 @@ There is a parent-child relationship between a link and its page. If the page ob :arg dict colors: a dictionary containing color specifications. For accepted dictionary keys and values see below. The most practical way should be to first make a copy of the *colors* property and then modify this dictionary as required. :arg sequence stroke: see above. + .. method:: set_flags(flags) + + *New in v1.18.16* + + Set the PDF ``/F`` property of the link annotation. See :meth:`Annot.set_flags` for details. If not a PDF, this method is a no-op. + + + .. attribute:: flags + + *New in v1.18.16* + + Return the link annotation flags, an integer (see :attr:`Annot.flags` for details). Zero if not a PDF. + .. attribute:: colors diff --git a/docs/module.rst b/docs/module.rst index 2cdc39a93..d16a43036 100644 --- a/docs/module.rst +++ b/docs/module.rst @@ -1,46 +1,46 @@ .. 
_Module: ============================ -Using *fitz* as a Module +Module *fitz* ============================ -.. highlight:: python - *(New in version 1.16.8)* -PyMuPDF can also be used in the command line as a **module** to perform basic utility functions. +PyMuPDF can also be used in the command line as a **module** to perform utility functions. This feature should obsolete writing some of the most basic scripts. -This is work in progress and subject to changes. This feature should obsolete writing some of the most basic scripts. - -As a guideline we are using the feature set of MuPDF command line tools. Admittedly, there is some functional overlap. On the other hand, PDF embedded files are no longer supported by MuPDF, so PyMuPDF is offering something unique here. +Admittedly, there is some functional overlap with the MuPDF CLI ``mutool``. On the other hand, PDF embedded files are no longer supported by MuPDF, so PyMuPDF is offering something unique here. Invocation ----------- Invoke the module like this:: - python -m fitz command parameters + python -m fitz + +.. highlight:: python General remarks: -* Request help via *"-h"*, resp. command-specific help via *"command -h"*. -* Parameters may be abbreviated as long as the result is not ambiguous (Python 3.5 or later only). -* Several commands support parameters *-pages* and *-xrefs*. They are intended for down-selection. Please note that: +* Request help via ``"-h"``, resp. command-specific help via ``"command -h"``. +* Parameters may be abbreviated where this does not introduce ambiguities. +* Several commands support parameters ``-pages`` and ``-xrefs``. They are intended for down-selection. Please note that: - **page numbers** for this utility must be given **1-based**. - valid :data:`xref` numbers start at 1. - - Specify any number of either single integers or integer ranges, separated by one comma each. A **range** is a pair of integers separated by one hyphen "-". 
Integers must not exceed the maximum page number or resp. :data:`xref` number. To specify that maximum, the symbolic variable "N" may be used instead of an integer. Integers or ranges may occur several times, in any sequence and may overlap. If in a range the first number is greater than the second one, the respective items will be processed in reversed order. + - Specify a comma-separated list of either *single* integers or integer *ranges*. A **range** is a pair of integers separated by one hyphen "-". Integers must not exceed the maximum page, resp. xref number. To specify that maximum, the symbolic variable "N" may be used. Integers or ranges may occur several times, in any sequence and may overlap. If in a range the first number is greater than the second one, the respective items will be processed in reversed order. -* You can also use the fitz module inside your script:: +* How to use the module inside your script:: >>> from fitz.__main__ import main as fitz_command - >>> cmd = "clean input.pdf output.pdf -pages 1,N".split() # prepare command - >>> saved_parms = sys.argv[1:] # save original parameters - >>> sys.argv[1:] = cmd # store command - >>> fitz_command() # execute command - >>> sys.argv[1:] = saved_parms # restore original parameters + >>> cmd = "clean input.pdf output.pdf -pages 1,N".split() # prepare command line + >>> saved_parms = sys.argv[1:] # save original command line + >>> sys.argv[1:] = cmd # store new command line + >>> fitz_command() # execute module + >>> sys.argv[1:] = saved_parms # restore original command line + +* Use the following 2-liner and compile it with `Nuitka `_ in standalone mode. This will give you a CLI executable with all the module's features, that can be used on all compatible platforms without Python, PyMuPDF or MuPDF being installed. -* You can use the following 2-liner and compile it with `Nuitka `_ in either normal or standalone mode, if you want to distribute it. 
This will give you a command line utility with all the functions explained below:: +:: from fitz.__main__ import main main() @@ -408,4 +408,72 @@ Copy embedded files between PDFs:: restrict copy to these entries +Text Extraction +---------------- +*(New in v1.18.16)* + +Extract text from arbitrary supported documents **(not only PDF)** to a text file. Currently, there are three output formatting modes available: simple, block sorting and reproduction of physical layout. + +* **Simple** text extraction reproduces all text as it appears in the document pages -- no effort is made to rearrange in any particular reading order. +* **Block sorting** sorts text blocks (as identified by MuPDF) by ascending vertical, then horizontal coordinates. This should be sufficient to establish a "natural" reading order for basic pages of text. +* **Layout** strives to reproduce the original appearance of the input pages. You can expect results like this (produced by the command ``python -m fitz gettext -pages 1 demo1.pdf``): + +.. image:: images/img-layout-text.* + :scale: 60 + +.. note:: The "gettext" command offers functionality similar to the CLI tool ``pdftotext`` by XPDF software, http://www.foolabs.com/xpdf/ -- this is especially true for "layout" mode, which combines that tool's ``-layout`` and ``-table`` options. + + + +After each page of the output file, a formfeed character, ``chr(12)``, is written -- even if the input page has no text at all. This behavior can be controlled via options. + +.. note:: For "layout" mode, **only horizontal, left-to-right, top-to-bottom** text is supported; other text is ignored. In this mode, text is also ignored if its fontsize is too small. + + "Simple" and "blocks" mode in contrast output **all text** for any text size or orientation. 
+ +Command:: + + python -m fitz gettext -h + usage: fitz gettext [-h] [-password PASSWORD] [-mode {simple,blocks,layout}] [-pages PAGES] [-noligatures] + [-whitespace] [-extra-spaces] [-noformfeed] [-skip-empty] [-output OUTPUT] [-grid GRID] + [-fontsize FONTSIZE] + input + + ----------------- extract text in various formatting modes ---------------- + + positional arguments: + input input document path + + optional arguments: + -h, --help show this help message and exit + -password PASSWORD password for input document + -mode {simple,blocks,layout} + mode: simple, block sort, or layout (default) + -pages PAGES select pages, format: 1,5-7,50-N + -noligatures expand ligature characters (default False) + -whitespace keep whitespace characters (default False) + -extra-spaces fill gaps with spaces (default False) + -noformfeed write linefeeds, no formfeeds (default False) + -skip-empty suppress pages with no text (default False) + -output OUTPUT store text in this file (default filename.txt) + -grid GRID merge lines if closer than this (default 2) + -fontsize FONTSIZE only include text with a larger fontsize (default 3) + +.. note:: Command options may be abbreviated as long as no ambiguities are introduced. So the following do the same: + + * ``... -output text.txt -noligatures -noformfeed -whitespace -grid 3 -extra-spaces ...`` + * ``... -o text.txt -nol -nof -w -g 3 -e ...`` + + The output filename defaults to the input with its extension replaced by ``.txt``. As with other commands, you can select page ranges **(caution: 1-based!)** in ``mutool`` format, as indicated above. + +* **mode:** (str) select a formatting mode -- default is "layout". +* **noligatures:** (bool) corresponds to **not** :data:`TEXT_PRESERVE_LIGATURES`. If specified, ligatures (present in advanced fonts: glyphs combining multiple characters like "fi") are split up into their components (i.e. "f", "i"). Default is passing them through. 
+* **whitespace:** (bool) corresponds to :data:`TEXT_PRESERVE_WHITESPACE`. If specified, white space characters (like tabs) are passed through unchanged. Default is replacing them with one or more spaces. +* **extra-spaces:** (bool) corresponds to **not** :data:`TEXT_INHIBIT_SPACES`. If specified, large gaps between adjacent characters will be filled with one or more spaces. Default is off. +* **noformfeed:** (bool) instead of ``chr(12)`` (formfeed), write linebreaks ``\n`` at end of output pages. +* **skip-empty:** (bool) skip pages with no text. +* **grid:** (float) lines with a vertical coordinate difference of no more than this value (in points) will be merged into the same output line. Only relevant for "layout" mode. **Use with care:** the default 2 should be adequate in most cases. If **too large**, lines that are *intended* to be different in the original may be merged and will result in garbled and / or incomplete output. If **too low**, spurious separate output lines may be generated for text spans just because they are coded in a different font with slightly deviating properties. +* **fontsize:** (float) include text with fontsize larger than this value only (default 3). Only relevant for "layout" mode. + + .. highlight:: python diff --git a/docs/version.rst b/docs/version.rst index 361ba36a6..ab8ed98c8 100644 --- a/docs/version.rst +++ b/docs/version.rst @@ -1,6 +1,6 @@ Covered Version -------------------- -This documentation covers PyMuPDF v1.18.15 features as of **2021-07-10 00:00:01**. +This documentation covers PyMuPDF v1.18.16 features as of **2021-08-05 00:00:01**. .. note:: The major and minor versions of **PyMuPDF** and **MuPDF** will always be the same. Only the third qualifier (patch level) may deviate from that of MuPDF. 
\ No newline at end of file diff --git a/fitz/__main__.py b/fitz/__main__.py index d28d10e93..8f52d856f 100644 --- a/fitz/__main__.py +++ b/fitz/__main__.py @@ -581,7 +581,7 @@ def page_layout(page, textout, GRID, fontsize, noformfeed, skip_empty, flags): eop = b"\n" if noformfeed else bytes([12]) # -------------------------------------------------------------------- - def find_line_index(values: list[int], value: int) -> int: + def find_line_index(values: list, value: int) -> int: """Find the right row coordinate. Args: