upload v1.18.16

pymupdf · Aug 7, 2021 · be074d5
1 parent d1c6e30
commit be074d5
Showing 10 changed files with 228 additions and 26 deletions.
diff --git a/docs/changes.rst b/docs/changes.rst
@@ -1,6 +1,23 @@
 Change Logs
 ===============
 
+Changes in Version 1.18.16
+---------------------------
+* **Fixed** issue `#1184 <https://github.com/pymupdf/PyMuPDF/issues/1184>`_. Existing widget fonts in a PDF are now accepted (i.e. no longer forcibly changed to a Base-14 font).
+
+* **Fixed** issue `#1154 <https://github.com/pymupdf/PyMuPDF/issues/1154>`_. Text search hits should now be correct when ``clip`` is specified.
+
+* **Fixed** issue `#1152 <https://github.com/pymupdf/PyMuPDF/issues/1152>`_.
+
+* **Fixed** issue `#1146 <https://github.com/pymupdf/PyMuPDF/issues/1146>`_.
+
+* **Added** :attr:`Link.flags` and :meth:`Link.set_flags` to the :ref:`Link` class. Implements enhancement request `#1187 <https://github.com/pymupdf/PyMuPDF/issues/1187>`_.
+
+* **Added** option to *simulate* :meth:`TextWriter.fill_textbox` output for predicting the number of lines that a given text would occupy in the textbox.
+
+* **Added** text output support as subcommand ``gettext`` to the ``fitz`` CLI module. Most importantly, original **physical text layout** reproduction is now supported.
+
+
 Changes in Version 1.18.15
 ---------------------------
 * **Fixed** issue `#1088 <https://github.com/pymupdf/PyMuPDF/issues/1088>`_. Removing an annotation's fill color should now work again both ways, using the ``fill_color=[]`` argument in :meth:`Annot.update` as well as ``fill=[]`` in :meth:`Annot.set_colors`.

diff --git a/docs/conf.py b/docs/conf.py
@@ -43,7 +43,7 @@
 # built documents.
 #
 # The full version, including alpha/beta/rc tags.
-release = "1.18.15"
+release = "1.18.16"
 
 # The short X.Y version
 version = release

diff --git a/docs/document.rst b/docs/document.rst
@@ -1103,6 +1103,8 @@ For details on **embedded files** refer to Appendix 3.
 
       :arg str user_pw: *(new in version 1.16.0)* set the document's user password.
 
+      .. note:: The method does not check whether a file of that name already exists. It will therefore not ask for confirmation and will overwrite the file. It is your responsibility as a programmer to handle this.
+
     .. method:: ez_save(*args, **kwargs)
 
       *(New in v1.18.11)*

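The note added to the save method leaves overwrite protection to the caller. A minimal sketch of one way to do that follows. The helper name `checked_output_path` and its `overwrite` parameter are hypothetical and not part of PyMuPDF; only the standard library is used.

```python
from pathlib import Path


def checked_output_path(filename, overwrite=False):
    """Return filename as a Path, refusing to clobber an existing file.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    path = Path(filename)
    if path.exists() and not overwrite:
        raise FileExistsError(f"refusing to overwrite existing file: {path}")
    return path


# Intended use with an open PyMuPDF document `doc` (not executed here):
#     doc.save(str(checked_output_path("output.pdf")))
```

The existence check and the later save are two separate steps, so another process could still create the file in between. For most scripts this guard should be enough, but it is not an atomic operation.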
diff --git a/docs/faq.rst b/docs/faq.rst
@@ -2103,6 +2103,38 @@ If it is *False* or if you want to be on the safe side, pick one of the followin
             page.wrap_contents()
     >>> # start inserting text, images or annotations here
 
+
+Missing or Unreadable Extracted Text
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+Several different problems can lead to this.
+
+Problem: no text
+^^^^^^^^^^^^^^^^
+Your PDF viewer does display text, but you cannot select it with your cursor, and text extraction delivers nothing.
+
+Cause
+^^^^^^
+1. You may be looking at an image embedded in the PDF page (e.g. a scanned PDF).
+2. The PDF creator used no font, but **simulated** text by painting it, using little lines and curves. E.g. a capital "D" could be painted by a line "|" and a left-open semi-circle, an "o" by an ellipse, and so on.
+
+Solution
+^^^^^^^^^^
+Use OCR software such as `OCRmyPDF <https://pypi.org/project/ocrmypdf/>`_ to insert a hidden text layer underneath the visible page. The resulting PDF should behave as expected.
+
+Problem: unreadable text
+^^^^^^^^^^^^^^^^^^^^^^^^
+Text extraction does not deliver the text in readable order, duplicates some text, or is otherwise garbled.
+
+Cause
+^^^^^^
+1. The individual characters are readable as such (no "<?>" symbols), but the sequence in which the text is **coded in the file** deviates from the reading order. The motivation behind this may be technical, or it may be to protect the data against unwanted copying.
+2. Many "<?>" symbols occur, indicating that MuPDF could not interpret these characters. The PDF creator may have used a font that displays readable text, but obfuscates the unicode character that leads to the readable symbol (glyph).
+
+Solution
+^^^^^^^^
+1. Use layout preserving text extraction: ``python -m fitz gettext file.pdf``.
+2. If other text extraction tools also don't work, then the only solution again is OCR-ing the page.
+
 --------------------------
 
 Low-Level Interfaces

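The two FAQ problems added above (no text at all, or many "<?>" symbols) can be told apart with a simple check on the extracted string. The sketch below assumes that the "<?>" symbols correspond to the replacement character U+FFFD, which the `_get_texttrace()` documentation in this same commit mentions for unrecognized characters. The function names, the labels, and the 10% threshold are arbitrary choices for illustration, not part of PyMuPDF.

```python
REPLACEMENT_CHAR = "\ufffd"  # U+FFFD (65533), assumed marker for unrecognized characters


def replacement_ratio(text):
    """Fraction of non-whitespace characters that are U+FFFD."""
    visible = [c for c in text if not c.isspace()]
    if not visible:
        return 0.0
    return sum(c == REPLACEMENT_CHAR for c in visible) / len(visible)


def classify_extraction(text, threshold=0.1):
    """Return a rough label for extracted page text.

    "no-text":     nothing but whitespace was extracted
    "undecodable": more than `threshold` of the characters are U+FFFD
    "ok":          neither of the above
    The threshold is an arbitrary assumption; tune it for your documents.
    """
    if not text.strip():
        return "no-text"
    if replacement_ratio(text) > threshold:
        return "undecodable"
    return "ok"


# Intended use with a PyMuPDF page (not executed here):
#     label = classify_extraction(page.get_text())
```

A label of "no-text" or "undecodable" suggests that the page is a candidate for OCR, as described in the FAQ. Text that is merely out of reading order is not detected by this check.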
diff --git a/docs/functions.rst b/docs/functions.rst
@@ -40,6 +40,7 @@ Yet others are handy, general-purpose utilities.
 :meth:`Page.run`                     run a page through a device
 :meth:`Page.read_contents`           PDF only: get complete, concatenated /Contents source
 :meth:`Page.wrap_contents`           wrap contents with stacking commands
+:meth:`Page._get_texttrace()`        low-level text information
 :attr:`Page.is_wrapped`              check whether contents wrapping is present
 :meth:`planish_line`                 matrix to map a line to the x-axis
 :meth:`paper_size`                   return width, height for a known paper format
@@ -396,12 +397,79 @@ Yet others are handy, general-purpose utilities.
 
 -----
 
-   .. method:: Page.wrap_contents
+   .. method:: Page.wrap_contents()
 
       Put string pair "q" / "Q" before, resp. after a page's */Contents* object(s) to ensure that any "geometry" changes are **local** only.
 
       Use this method as an alternative, minimalistic version of :meth:`Page.clean_contents`. Its advantage is a small footprint in terms of processing time and impact on the data size of incremental saves.
 
+-----
+
+   .. method:: Page._get_texttrace()
+
+      *New in v1.18.16*
+
+      Return low-level text information of the page (**all** document types). This is a list of Python dictionaries with the following content::
+
+        {
+            'ascender': 0.75,                   # font ascender (1)
+            'bidi': 0,                          # bidirectional level (1)
+            'chars': (                          # char information, tuple[tuple]
+                  (32,                          # unicode (4)
+                   3,                           # glyph id (font dependent)
+                   (470.3800354003906,          # origin.x (1)
+                    755.3758544921875),         # origin.y (1)
+                   2.495859366375953            # width (points) 
+                  ),
+               ),
+            'color': (0.0,),                    # text color, tuple[float] (1)
+            'colorspace': 1,                    # number of colorspace components (1)
+            'descender': -0.25,                 # font descender (1)
+            'dir': (1.0, 0.0),                  # writing direction (1)
+            'flags': 4,                         # font flags (1)
+            'font': 'Calibri',                  # font name (1)
+            'linewidth': 0.5519999980926514,    # last known line width value (3)
+            'opacity': 1.0,                     # alpha value of the text (5)
+            'scissor': (1.0, 1.0, -1.0, -1.0),  # <ignore>
+            'size': 11.039999961853027,         # font size (1)
+            'spacewidth': 2.495859366375953,    # width of space character (synthesized)
+            'type': 0,                          # span type (2)
+            'wmode': 0                          # writing mode (1)
+        }
+
+      Details:
+
+      1. Same meaning as explained in :ref:`TextPage`.
+      2. There are 5 text span types:
+
+         0. Filled text -- equivalent to PDF text rendering mode 0 (``0 Tr``), only the interior of the characters is shown.
+         1. Stroked text -- equivalent to ``1 Tr``, only the character borders are shown.
+         2. Clipped text -- details yet unknown.
+         3. Clip-stroked text -- details yet unknown.
+         4. Ignored text -- equivalent to ``3 Tr``.
+
+      3. Line width in this context is important only for processing ``span["type"] != 0``: it determines the thickness of stroked lines. This value may not be provided in the data. In this case, a value of ``span["size"] * 0.05`` is generated. Often, an "artificial" bold text in PDF is created by ``2 Tr``. There is no equivalent text span type for this case. Instead, such text is represented by two consecutive spans -- which are identical in every aspect, except that one has ``span["type"] = 0`` and the other one ``span["type"] = 1``.
+      4. For data compactness, the character's unicode is provided here. Use function ``chr()`` for the character itself.
+      5. The alpha / opacity value of the span's text, 0 <= opacity <= 1. Zero is invisible text, 1 (100%) covers what is behind.
+
+      Here is a list of similarities and differences of ``page._get_texttrace()`` compared to ``page.get_text("rawdict")``:
+
+      * The method is up to **twice as fast.**
+      * The returned information is very **much smaller in size.**
+      * Additional types of text **invisibility can be detected**: opacity = 0 and type = 4.
+      * Character bboxes are not provided; if needed, compute them from available information.
+      * If MuPDF returns unicode 0xFFFD (65533) for unrecognized characters, you may still be able to deduce the required information from the glyph id.
+      * ``span["chars"]`` **contains no spaces** **unless** the document creator has coded them. Spaces **will never be generated**, as happens in the :meth:`Page.get_text` methods. To help with your own computations, the width of a space character is given. This value is derived from the font where possible; otherwise a synthetic value is used.
+      * There is no effort to organize text like it happens for a :ref:`TextPage` (the hierarchy of blocks, lines, spans, and characters). Characters are simply extracted in sequence, one by one, and put in a span. Whenever any of the span's characteristics change, a new span is started. So you may find characters with different ``origin.y`` values in the same span. You cannot assume that span characters are sorted in any particular order -- you must make sense of the info yourself, taking ``span["dir"]``, ``span["wmode"]``, etc. into account.
+      * Ligatures are represented like this:
+         - MuPDF handles these ligatures: "fi", "ff", "fl", "ft", "st", "ffi", and "ffl". If the page contains e.g. ligature "fi", you will find the following two character items subsequent to each other::
+         
+            (102, glyph, (x, y), width)  # 102 = ord("f")
+            (105, -1, (x, y), 0)         # 105 = ord("i")
+
+         - This means that the ligature character components are shown combined within the space given by width. It is up to you how to handle these cases in your text extraction. This is similar to ``page.get_text("rawdict")``: a glyph id is never available there, but you can assume a ligature if you encounter one of the character combinations above, having the **same origin** and ``bbox.width = 0`` except for the first character.
+
+
 -----
 
    .. attribute:: Page.is_wrapped
@@ -412,13 +480,13 @@ Yet others are handy, general-purpose utilities.
 
    .. method:: Page.get_text_blocks(flags=None)
 
-      Deprecated wrapper for :meth:`TextPage.extractBLOCKS`.  Use :meth:`Page.getText` with the "blocks" option instead.
+      Deprecated wrapper for :meth:`TextPage.extractBLOCKS`.  Use :meth:`Page.get_text` with the "blocks" option instead.
 
 -----
 
    .. method:: Page.get_text_words(flags=None)
 
-      Deprecated wrapper for :meth:`TextPage.extractWORDS`. Use :meth:`Page.getText` with the "words" option instead.
+      Deprecated wrapper for :meth:`TextPage.extractWORDS`. Use :meth:`Page.get_text` with the "words" option instead.
 
 -----
 

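The span dictionaries documented for `Page._get_texttrace()` above can be post-processed with plain Python. The sketch below works on a hand-made span that follows the documented layout (`chars` items are `(unicode, glyph id, origin, width)`). The helper names and the sample values are invented for illustration. The ligature check assumes the documented pattern of glyph id -1 with width 0 for trailing ligature components.

```python
def span_text(span):
    """Join the span's characters; item 0 of each char tuple is the unicode."""
    return "".join(chr(ch[0]) for ch in span["chars"])


def is_invisible(span):
    """True for the two invisibility cases named in the docs:
    opacity 0, or span type 4 (ignored text, ``3 Tr``)."""
    return span["opacity"] == 0 or span["type"] == 4


def ligature_tails(span):
    """Indices of chars that look like trailing ligature components:
    glyph id -1 and width 0, as in the documented "fi" example."""
    return [
        i for i, ch in enumerate(span["chars"])
        if ch[1] == -1 and ch[3] == 0
    ]


# Hand-made sample span (values invented, layout as documented above).
sample = {
    "chars": (
        (102, 7, (100.0, 700.0), 5.5),   # "f", carries the ligature width
        (105, -1, (100.0, 700.0), 0),    # "i", trailing ligature component
        (120, 91, (105.5, 700.0), 4.9),  # "x"
    ),
    "opacity": 1.0,
    "type": 0,
}

print(span_text(sample))       # text of the span
print(is_invisible(sample))    # visibility check
print(ligature_tails(sample))  # positions of ligature components

# Intended use with a PyMuPDF page (not executed here):
#     visible = [span_text(s) for s in page._get_texttrace() if not is_invisible(s)]
```

Because characters in a span are not guaranteed to be in reading order, joining them as in `span_text` only yields the sequence in which they were coded, which may still need sorting by origin.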
diff --git a/docs/images/img-layout-text.jpg b/docs/images/img-layout-text.jpg
diff --git a/docs/link.rst b/docs/link.rst
@@ -12,10 +12,12 @@ There is a parent-child relationship between a link and its page. If the page ob
 ========================= ============================================
 :meth:`Link.set_border`   modify border properties
 :meth:`Link.set_colors`   modify color properties
+:meth:`Link.set_flags`    modify link flags
 :attr:`Link.border`       border characteristics
 :attr:`Link.colors`       border line color
 :attr:`Link.dest`         points to destination details
 :attr:`Link.is_external`  external destination?
+:attr:`Link.flags`        link annotation flags
 :attr:`Link.next`         points to next link
 :attr:`Link.rect`         clickable area in untransformed coordinates.
 :attr:`Link.uri`          link destination
@@ -40,7 +42,7 @@ There is a parent-child relationship between a link and its page. If the page ob
 
    .. method:: set_colors(colors=None, stroke=None)
 
-      Changes the "stroke" color.
+      PDF only: Changes the "stroke" color.
 
       .. note:: In PDF, links are a subtype of annotations technically and **do not support fill colors**. However, to keep a consistent API, we do allow specifying a ``fill=`` parameter like with all annotations, which will be ignored with a warning.
 
@@ -49,6 +51,19 @@ There is a parent-child relationship between a link and its page. If the page ob
       :arg dict colors: a dictionary containing color specifications. For accepted dictionary keys and values see below. The most practical way should be to first make a copy of the *colors* property and then modify this dictionary as required.
       :arg sequence stroke: see above.
 
+   .. method:: set_flags(flags)
+
+      *New in v1.18.16*
+
+      Set the PDF ``/F`` property of the link annotation. See :meth:`Annot.set_flags` for details. If not a PDF, this method is a no-op.
+
+
+   .. attribute:: flags
+
+      *New in v1.18.16*
+
+      Return the link annotation flags, an integer (see :attr:`Annot.flags` for details). Zero if not a PDF.
+
 
    .. attribute:: colors