upload v1.18.2

pymupdf · Oct 27, 2020 · 071c56d
1 parent 1cc6d70

Showing 18 changed files with 581 additions and 178 deletions.
diff --git a/PKG-INFO b/PKG-INFO
@@ -1,6 +1,6 @@
 Metadata-Version: 1.1
 Name: PyMuPDF
-Version: 1.18.1
+Version: 1.18.2
 Author: Jorj McKie
 Author-email: jorj.x.mckie@outlook.de
 Maintainer: Jorj McKie
@@ -9,7 +9,7 @@ Home-page: https://github.com/pymupdf/PyMuPDF
 Download-url: https://github.com/pymupdf/PyMuPDF
 Summary: PyMuPDF is a Python binding for the PDF rendering library MuPDF
 Description:
-        Release date: October 18, 2020
+        Release date: October 27, 2020
 
         Authors
         =======
@@ -20,7 +20,7 @@ Description:
         Introduction
         ============
 
-        This is **version 1.18.1 of PyMuPDF**, a Python binding for `MuPDF <http://mupdf.com/>`_ - "a lightweight PDF and XPS viewer".
+        This is **version 1.18.2 of PyMuPDF**, a Python binding for `MuPDF <http://mupdf.com/>`_ - "a lightweight PDF and XPS viewer".
 
         MuPDF can access files in PDF, XPS, OpenXPS, epub, comic and fiction book formats, and it is known for both, its top performance and high rendering quality.
 

diff --git a/README.md b/README.md
@@ -1,8 +1,8 @@
-# PyMuPDF 1.18.1
+# PyMuPDF 1.18.2
 
 ![logo](https://github.com/pymupdf/PyMuPDF/blob/master/demo/pymupdf.jpg)
 
-Release date: October 18, 2020
+Release date: October 27, 2020
 
 **Travis-CI:** [![Build Status](https://travis-ci.org/JorjMcKie/py-mupdf.svg?branch=master)](https://travis-ci.org/JorjMcKie/py-mupdf)
 
@@ -14,7 +14,7 @@ On **[PyPI](https://pypi.org/project/PyMuPDF)** since August 2016: [![](https://
 
 # Introduction
 
-This is **version 1.18.1 of PyMuPDF**, a Python binding with support for [MuPDF 1.18.*](http://mupdf.com/) - "a lightweight PDF, XPS, and E-book viewer".
+This is **version 1.18.2 of PyMuPDF**, a Python binding with support for [MuPDF 1.18.*](http://mupdf.com/) - "a lightweight PDF, XPS, and E-book viewer".
 
 MuPDF can access files in PDF, XPS, OpenXPS, CBZ, EPUB and FB2 (e-books) formats, and it is known for its top performance and high rendering quality.
 

diff --git a/docs/annot.rst b/docs/annot.rst
@@ -90,7 +90,7 @@ There is a parent-child relationship between an annotation and its page. If the
 
       *(New in 1.18.0)*
 
-      Retrieves the content of the annotation in a variety of formats -- much like the same method for :ref:`Page`.. This currently only delivers relevant data for annotation types 'FreeText' and 'Stamp'. Other type will return an empty string (or equivalent objects).
+      Retrieves the content of the annotation in a variety of formats -- much like the same method for :ref:`Page`. This currently only delivers relevant data for annotation types 'FreeText' and 'Stamp'. Other types return an empty string (or equivalent objects).
 
       :arg str opt: the desired format - one of the following values. Please note that this method works exactly like the same-named method of :ref:`Page`.
 

diff --git a/docs/changes.rst b/docs/changes.rst
@@ -1,6 +1,21 @@
 Change Logs
 ===============
 
+Changes in Version 1.18.2
+---------------------------
+This version contains some interesting improvements for text searching: any number of search hits is now returned thanks to the removal of the **hit_max** parameter. In addition, the new **clip** parameter allows restricting the search area. Searching now detects hyphenation at line breaks and finds hyphenated words accordingly.
+
+* **Fixed** issue `#575 <https://github.com/pymupdf/PyMuPDF/issues/575>`_: if using ``quads=False`` in text searching, then overlapping rectangles on the same line are joined. Previously, parts of the search string, which belonged to different "marked content" items, each generated their own rectangle -- just as if occurring on separate lines.
+* **Added** :attr:`Document.isRepaired`, which is true if the PDF was repaired on open.
+* **Added** :meth:`Document.setXmlMetadata` which either updates or creates PDF XML metadata. Implements issue `#691 <https://github.com/pymupdf/PyMuPDF/issues/691>`_.
+* **Added** :meth:`Document.getXmlMetadata`, which returns PDF XML metadata.
+* **Changed** creation of PDF documents: they will now always carry a PDF identification (``/ID`` field) in the document trailer. Implements issue `#691 <https://github.com/pymupdf/PyMuPDF/issues/691>`_.
+* **Changed** :meth:`Page.searchFor`: a new parameter ``clip`` is accepted to restrict the search to this rectangle. Correspondingly, the attribute :attr:`TextPage.rect` is now respected by :meth:`TextPage.search`.
+* **Changed** parameter ``hit_max`` in :meth:`Page.searchFor` and :meth:`TextPage.search` is now obsolete: methods will return all hits.
+* **Changed** character **selection criteria** in :meth:`Page.getText`: a character is now considered to be part of a ``clip`` if its bbox is fully contained. Before this, a non-empty intersection was sufficient.
+* **Changed** :meth:`Document.scrub` to support a new option ``redact_images``. This addresses issue `#697 <https://github.com/pymupdf/PyMuPDF/issues/697>`_.
+
+
 Changes in Version 1.18.1
 ---------------------------
 * **Fixed** issue `#692 <https://github.com/pymupdf/PyMuPDF/issues/692>`_. PyMuPDF now detects and recovers from more cyclic resource dependencies in PDF pages and for the first time reports them in the MuPDF warnings store.
@@ -11,7 +26,7 @@ Changes in Version 1.18.1
 
 Changes in Version 1.18.0
 ---------------------------
-This is first PyMuPDF version supporting MuPDF v1.18. The goal here is on extending PyMuPDF's own functionality -- apart from bug fixing. Subsequent PyMuPDF patches may address features new in MuPDF.
+This is the first PyMuPDF version supporting MuPDF v1.18. The focus here is on extending PyMuPDF's own functionality -- apart from bug fixing. Subsequent PyMuPDF patches may address features new in MuPDF.
 
 * **Fixed** issue `#519 <https://github.com/pymupdf/PyMuPDF/issues/519>`_. This upstream bug occurred occasionally for some pages only and seems to be fixed now: page layout should no longer be ruined in these cases.
 

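The 1.18.2 change log entry for :meth:`Page.getText` describes a switch from "non-empty intersection" to "bbox fully contained" when deciding whether a character belongs to a ``clip``. The following is a minimal pure-Python sketch of the difference between the two tests. It is an illustration only and not PyMuPDF's implementation: the ``Box`` type and the function names ``intersects`` and ``contains`` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle: (x0, y0) top-left, (x1, y1) bottom-right."""
    x0: float
    y0: float
    x1: float
    y1: float


def intersects(clip: Box, bbox: Box) -> bool:
    """Older criterion as described: any non-empty overlap is sufficient."""
    return (min(clip.x1, bbox.x1) > max(clip.x0, bbox.x0)
            and min(clip.y1, bbox.y1) > max(clip.y0, bbox.y0))


def contains(clip: Box, bbox: Box) -> bool:
    """Newer criterion as described: the bbox must lie fully inside the clip."""
    return (clip.x0 <= bbox.x0 and clip.y0 <= bbox.y0
            and bbox.x1 <= clip.x1 and bbox.y1 <= clip.y1)


# Hypothetical coordinates: one character inside the clip, one straddling its right edge.
clip = Box(0, 0, 100, 20)
inside = Box(10, 5, 18, 15)
straddling = Box(96, 5, 104, 15)

for name, bbox in (("inside", inside), ("straddling", straddling)):
    print(name, "intersects:", intersects(clip, bbox), "contained:", contains(clip, bbox))
```

A character straddling the clip border passes the intersection test but fails the containment test, so under the new criterion it would no longer be selected.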
diff --git a/docs/conf.py b/docs/conf.py
@@ -40,7 +40,7 @@
 # built documents.
 #
 # The full version, including alpha/beta/rc tags.
-release = "1.18.1"
+release = "1.18.2"
 
 # The short X.Y version
 version = release

diff --git a/docs/document.rst b/docs/document.rst
@@ -49,6 +49,7 @@ For details on **embedded files** refer to Appendix 3.
 :meth:`Document.getSigFlags`            PDF only: determine signature state
 :meth:`Document.getToC`                 create a table of contents
 :meth:`Document.getTOC`                 alias of previous
+:meth:`Document.getXmlMetadata`         PDF only: read the XML metadata
 :meth:`Document.insertPage`             PDF only: insert a new page
 :meth:`Document.insertPDF`              PDF only: insert pages from another PDF
 :meth:`Document.layout`                 re-paginate the document (if supported)
@@ -76,6 +77,7 @@ For details on **embedded files** refer to Appendix 3.
 :meth:`Document.setTOC_item`            PDF only: change a single TOC item
 :meth:`Document.setToC`                 PDF only: set the table of contents (TOC)
 :meth:`Document.setTOC`                 PDF only: alias of previous
+:meth:`Document.setXmlMetadata`         PDF only: create or update document XML metadata
 :meth:`Document.updateObject`           PDF only: replace object source
 :meth:`Document.updateStream`           PDF only: replace stream source
 :meth:`Document.write`                  PDF only: writes document to memory
@@ -90,6 +92,7 @@ For details on **embedded files** refer to Appendix 3.
 :attr:`Document.isFormPDF`              is this a Form PDF?
 :attr:`Document.isPDF`                  is this a PDF?
 :attr:`Document.isReflowable`           is this a reflowable document?
+:attr:`Document.isRepaired`             PDF only: has this PDF been repaired during open?
 :attr:`Document.lastLocation`           (chapter, pno) of last page
 :attr:`Document.metadata`               metadata
 :attr:`Document.name`                   filename of document
@@ -513,10 +516,23 @@ For details on **embedded files** refer to Appendix 3.
 
     .. method:: setMetadata(m)
 
-      PDF only: Sets or updates the metadata of the document as specified in *m*, a Python dictionary. As with :meth:`select`, these changes become permanent only when you save the document. Incremental save is supported.
+      PDF only: Sets or updates the metadata of the document as specified in *m*, a Python dictionary.
 
       :arg dict m: A dictionary with the same keys as *metadata* (see below). All keys are optional. A PDF's format and encryption method cannot be set or changed and will be ignored. If any value should not contain data, do not specify its key or set the value to *None*. If you use *{}* all metadata information will be cleared to the string *"none"*. If you want to selectively change only some values, modify a copy of *doc.metadata* and use it as the argument. Arbitrary unicode values are possible if specified as UTF-8-encoded.
 
+    .. method:: getXmlMetadata()
+
+      PDF only: Get the document XML metadata.
+
+      :rtype: str
+      :returns: XML metadata of the document. Empty string if not present or not a PDF.
+
+    .. method:: setXmlMetadata(xml)
+
+      PDF only: Sets or updates XML metadata of the document.
+
+      :arg str xml: the new XML metadata. Should be valid XML; however, this method does no checking and accepts any string.
+
     .. method:: setToC(toc, collapse=1)
 
     .. method:: setTOC(toc, collapse=1)
@@ -529,7 +545,7 @@ For details on **embedded files** refer to Appendix 3.
 
       :arg sequence toc:
 
-          A Python sequence (list or tuple) with **all bookmark entries** that should form the new table of contents. Output variants of :meth:`getToC` are acceptable. To completely remove the table of contents specify an empty sequence or None. Each item must be a list with the following format.
+          A list or tuple with **all bookmark entries** that should form the new table of contents. Output variants of :meth:`getToC` are acceptable. To completely remove the table of contents specify an empty sequence or None. Each item must be a list with the following format.
 
           * [lvl, title, page [, dest]] where
 
@@ -592,7 +608,7 @@ For details on **embedded files** refer to Appendix 3.
 
       Check whether the document can be saved incrementally. Use it to choose the right option without encountering exceptions.
 
-    .. method:: scrub(attached_files=True, clean_pages=True, embedded_files=True, hidden_text=True, javascript=True, metadata=True, redactions=True, remove_links=True, reset_fields=True, reset_responses=True, xml_metadata=True)
+    .. method:: scrub(attached_files=True, clean_pages=True, embedded_files=True, hidden_text=True, javascript=True, metadata=True, redactions=True, redact_images=0, remove_links=True, reset_fields=True, reset_responses=True, xml_metadata=True)
 
       PDF only: *(New in v1.16.14)* Remove potentially sensitive data from the PDF. This function is inspired by the similar "Sanitize" function in Adobe Acrobat products. The process is configurable by a number of options, which are all *True* by default.
 
@@ -603,6 +619,7 @@ For details on **embedded files** refer to Appendix 3.
       :arg bool javascript: Remove JavaScript sources.
       :arg bool metadata: Remove PDF standard metadata.
       :arg bool redactions: Apply redaction annotations.
+      :arg int redact_images: how to handle images if applying redactions. One of 0 (ignore), 1 (blank out overlaps) or 2 (remove).
       :arg bool remove_links: Remove all links.
       :arg bool reset_fields: Reset all form fields to their defaults.
       :arg bool reset_responses: Remove all responses from all annotations.
@@ -664,7 +681,7 @@ For details on **embedded files** refer to Appendix 3.
       :rtype: bytes
       :returns: a bytes object containing the complete document.
 
-    .. method:: searchPageFor(pno, text, hit_max=16, quads=False)
+    .. method:: searchPageFor(pno, text, quads=False)
 
        Search for "text" on page number "pno". Works exactly like the corresponding :meth:`Page.searchFor`. Any integer -inf < pno < pageCount is acceptable.
 
@@ -1054,7 +1071,7 @@ For details on **embedded files** refer to Appendix 3.
 
       *False* if this is not a PDF or has no form fields, otherwise the number of root form fields (fields with no ancestors).
 
-      Changed in version 1.16.4 Returns the total number of (root) form fields.
+      *(Changed in version 1.16.4)* Returns the total number of (root) form fields.
 
       :type: bool,int
 
@@ -1064,6 +1081,14 @@ For details on **embedded files** refer to Appendix 3.
 
       :type: bool
 
+    .. attribute:: isRepaired
+
+      *(New in v1.18.2)*
+
+      *True* if the PDF has been repaired during open (because of major structure issues). Always *False* for non-PDF documents. If true, more details have been stored in ``TOOLS.mupdf_warnings()``, and :meth:`Document.can_save_incrementally` will return *False*.
+
+      :type: bool
+
     .. attribute:: needsPass
 
       Indicates whether the document is password-protected against access. This indicator remains unchanged -- **even after the document has been authenticated**. Precludes incremental saves if true.

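Since ``setXmlMetadata`` is documented above as accepting any string without checking, a caller may want to verify well-formedness first. The sketch below shows one possible way to do that with the Python standard library. The helper name ``is_well_formed`` and the sample ``xmp`` string are hypothetical, and the sample is a deliberately minimal fragment, not a complete XMP packet.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml: str) -> bool:
    """Return True if the string parses as well-formed XML."""
    try:
        ET.fromstring(xml)
    except ET.ParseError:
        return False
    return True


# Hypothetical, minimal metadata fragment used only for illustration.
xmp = (
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
    "</x:xmpmeta>"
)

print(is_well_formed(xmp))
print(is_well_formed("<a><b></a>"))

# With an open PDF document `doc` (assumed, not created here), the guarded
# call would then look like:
# if is_well_formed(xmp):
#     doc.setXmlMetadata(xmp)
```

This only checks XML syntax; it does not validate the content against any metadata schema.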
diff --git a/docs/faq.rst b/docs/faq.rst
@@ -728,15 +728,12 @@ There is a standard search function to search for arbitrary text on a page: :met
 
 This method has advantages and drawbacks. Pros are
 
-* the search string can contain blanks and wrap across lines
-* upper or lower cases are treated equal
+* The search string can contain blanks and wrap across lines
+* Upper and lower case characters are treated as equal
+* Word hyphenation at line ends is detected and resolved
 * return may also be a list of :ref:`Quad` objects to precisely locate text that is **not parallel** to either axis.
 
-Disadvantages:
-
-* you cannot determine the number of found items beforehand: if *hit_max* items are returned you do not know whether you have missed any.
-
-But you have other options::
+But you also have other options::
 
  import sys
  import fitz
@@ -1580,8 +1577,9 @@ This deals with splitting up pages of a PDF in arbitrary pieces. For example, yo
 
     # that's it, save output file
     doc.save("poster-" + src.name,
-             garbage = 3,                       # eliminate duplicate objects
-             deflate = True)                    # compress stuff where possible
+             garbage=3,  # eliminate duplicate objects
+             deflate=True,  # compress stuff where possible
+    )
 
 
 This shows what happens to an input page:
@@ -1652,7 +1650,7 @@ This deals with joining PDF pages to form a new PDF with pages each combining tw
                          spage.number)      # input page number
 
     # by all means, save new file using garbage collection and compression
-    doc.save("4up-" + infile, garbage = 3, deflate = True)
+    doc.save("4up-" + infile, garbage=3, deflate=True)
 
 Example effect:
 
@@ -1858,20 +1856,20 @@ Problem
 ^^^^^^^^^
 There are two scenarios:
 
-1. Updating an annotation, which has been created by some other software, via a PyMuPDF script.
-2. Creating an annotation with PyMuPDF and later changing it using some other PDF application.
+1. **Updating** an annotation with PyMuPDF which was created by some other software.
+2. **Creating** an annotation with PyMuPDF and later changing it with some other software.
 
-In both cases you may experience unintended changes like a different annotation icon or text font, the fill color or line dashing have disappeared, line end symbols have changed their size or even have disappeared too, etc.
+In both cases you may experience unintended changes: a different annotation icon or text font, a fill color or line dashing that has disappeared, line end symbols that have changed their size or disappeared altogether, etc.
 
 Cause
 ^^^^^^
-Annotation maintenance is handled differently by each PDF maintenance application (if it is supported at all). For any given PDF application, some annotation types may not be supported at all or only partly, or some details may be handled in a different way than with another application.
+Annotation maintenance is handled differently by each PDF maintenance application. Some annotation types may not be supported at all or only partly, and some details may be handled differently than in another application. **There is no standard.**
 
 Almost always a PDF application also comes with its own icons (file attachments, sticky notes and stamps) and its own set of supported text fonts. For example:
 
 * (Py-) MuPDF only supports these 5 basic fonts for 'FreeText' annotations: Helvetica, Times-Roman, Courier, ZapfDingbats and Symbol -- no italics / no bold variations. When changing a 'FreeText' annotation created by some other app, its font will probably not be recognized nor accepted and be replaced by Helvetica.
 
-* PyMuPDF fully supports the PDF text markers, but these types cannot be updated with Adobe Acrobat Reader.
+* PyMuPDF supports all PDF text markers (highlight, underline, strikeout, squiggly), but these types cannot be updated with Adobe Acrobat Reader.
 
 In most cases there also exists limited support for line dashing which causes existing dashes to be replaced by straight lines. For example: