Add support for /Kids and /Limits in page labels #2560

stefan6419846 · 2024-03-30T08:56:27Z

Currently, /Kids and /Limits are not supported for page labels.

Some examples that may help with the actual implementation can be found in #1519.

stefan6419846 · 2024-03-30T11:32:37Z

For an example file and the corresponding data, see #2561 (comment).

The relevant parts of the PDF 1.7 specification (ISO 32000-1:2008) are "Table 37 – Entries in a number tree node dictionary" and "Table 159 – Entries in a page label dictionary".
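To make the two tables more concrete, here is a rough sketch of what a page-label number tree with /Kids and /Limits can look like. It uses plain Python dicts and lists instead of pypdf objects, the values are made up for illustration, and `check_limits` is a hypothetical helper that is not part of pypdf. It only shows that /Limits holds the least and greatest key found in the /Nums array of a leaf node.

```python
# Hypothetical page-label number tree, modelled with plain dicts and lists.
# Root node: has /Kids (and therefore no /Nums of its own).
page_labels = {
    "/Kids": [
        {
            # Leaf node: /Limits = [least key, greatest key] of its /Nums.
            "/Limits": [0, 4],
            # /Nums alternates keys (page indices) and page label dictionaries.
            "/Nums": [0, {"/S": "/r"}, 4, {"/S": "/D"}],
        },
        {
            "/Limits": [20, 20],
            # /S = numbering style, /P = label prefix, /St = first number.
            "/Nums": [20, {"/S": "/D", "/P": "A-", "/St": 1}],
        },
    ]
}


def check_limits(node):
    """Return True if /Limits matches the least and greatest key in /Nums."""
    keys = node["/Nums"][0::2]  # every second entry is a key
    return node["/Limits"] == [min(keys), max(keys)]


# Both leaf nodes of the made-up tree are consistent.
consistent = all(check_limits(kid) for kid in page_labels["/Kids"])
```

In this made-up tree, pages 0 to 3 would use lowercase roman numerals, pages 4 to 19 decimal numbers, and pages from index 20 onward decimal numbers with the prefix "A-".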

stefan6419846 · 2024-03-30T12:25:10Z

I just gave it a try and it seems like the following patch is sufficient to generate the correct page numbers for the aforementioned document:

diff --git a/pypdf/_page_labels.py b/pypdf/_page_labels.py
index 6f41067..3a43f2a 100644
--- a/pypdf/_page_labels.py
+++ b/pypdf/_page_labels.py
@@ -57,7 +57,7 @@ a       Lowercase letters (a to z for the first 26 pages,
                            aa to zz for the next 26, and so on)
 """
 
-from typing import Iterator, Optional, Tuple, cast
+from typing import Iterator, List, Optional, Tuple, cast
 
 from ._protocols import PdfCommonDocProtocol
 from ._utils import logger_warning
@@ -131,7 +131,8 @@ def index2label(reader: PdfCommonDocProtocol, index: int) -> str:
     if "/PageLabels" not in root:
         return str(index + 1)  # Fallback
     number_tree = cast(DictionaryObject, root["/PageLabels"].get_object())
-    if "/Nums" in number_tree:
+
+    def handle_nums(dictionary_object: DictionaryObject) -> str:
         # [Nums] shall be an array of the form
         #   [ key 1 value 1 key 2 value 2 ... key n value n ]
         # where each key_i is an integer and the corresponding
@@ -139,7 +140,7 @@ def index2label(reader: PdfCommonDocProtocol, index: int) -> str:
         # The keys shall be sorted in numerical order,
         # analogously to the arrangement of keys in a name tree
         # as described in 7.9.6, "Name Trees."
-        nums = cast(ArrayObject, number_tree["/Nums"])
+        nums = cast(ArrayObject, dictionary_object["/Nums"])
         i = 0
         value = None
         start_index = 0
@@ -165,16 +166,18 @@ def index2label(reader: PdfCommonDocProtocol, index: int) -> str:
         start = value.get("/St", 1)
         prefix = value.get("/P", "")
         return prefix + m[value.get("/S")](index - start_index + start)
-    if "/Kids" in number_tree or "/Limits" in number_tree:
-        logger_warning(
-            (
-                "/Kids or /Limits found in PageLabels. "
-                "Please share this PDF with pypdf: "
-                "https://github.com/py-pdf/pypdf/pull/1519"
-            ),
-            __name__,
-        )
-    # TODO: Implement /Kids and /Limits for number tree
+
+    if "/Nums" in number_tree:
+        return handle_nums(number_tree)
+
+    if "/Kids" in number_tree:
+        kids: List[DictionaryObject] = number_tree["/Kids"]
+        for kid in kids:
+            limits: List[int] = kid["/Limits"]
+            if limits[0] <= index <= limits[1]:
+                return handle_nums(kid)
+
+    logger_warning(f"Could not reliably determine page label for {index}.")
     return str(index + 1)  # Fallback if /Nums is not in the number_tree

This is more or less the same as before; the only addition is that it checks the /Limits of each /Kids entry to find the node whose key range contains the current index.
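One case the /Limits check above might not cover: a label range runs until the next key, so a page index can lie after the greatest key of one kid but before the least key of the next, and would then match no /Limits at all. The following is only a sketch of an alternative lookup, not the patch above and not what pypdf actually does. It works on plain dicts rather than pypdf objects, the names `iter_nums` and `find_label_range` are hypothetical, and it assumes the kids are stored in ascending key order as the specification requires.

```python
from bisect import bisect_right
from typing import Any, Dict, Iterator, Optional, Tuple

Node = Dict[str, Any]


def iter_nums(node: Node) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (key, value) pairs of a number tree, descending into /Kids."""
    nums = node.get("/Nums", [])
    for i in range(0, len(nums) - 1, 2):
        yield nums[i], nums[i + 1]
    for kid in node.get("/Kids", []):
        yield from iter_nums(kid)


def find_label_range(root: Node, index: int) -> Optional[Tuple[int, Dict[str, Any]]]:
    """Return the (start_index, label dictionary) pair that covers index.

    The covering entry is the one with the greatest key <= index.
    Returns None if no key is less than or equal to index.
    """
    pairs = list(iter_nums(root))
    keys = [key for key, _ in pairs]
    pos = bisect_right(keys, index)
    if pos == 0:
        return None
    return pairs[pos - 1]


# Made-up tree with one intermediate node and two leaf nodes.
tree = {
    "/Kids": [
        {
            "/Limits": [0, 4],
            "/Kids": [
                {"/Limits": [0, 4], "/Nums": [0, {"/S": "/r"}, 4, {"/S": "/D"}]},
            ],
        },
        {"/Limits": [20, 20], "/Nums": [20, {"/S": "/D", "/P": "A-"}]},
    ]
}

# Index 10 lies between the /Limits of both kids, but belongs to the
# range that starts at key 4.
start_index, value = find_label_range(tree, 10)
label_number = 10 - start_index + value.get("/St", 1)
```

Flattening the whole tree on every call is simple but not efficient for large documents; a real implementation would more likely use /Limits to prune subtrees and only fall back to the preceding node when the index lies in a gap.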

stefan6419846 added a commit that referenced this issue Mar 30, 2024

DEV: Remove page labels PR link from message

84aefba

Maintaining/validating example images inside a PR is complicated. Rather use the existing issue #2560 if there are new findings.

stefan6419846 mentioned this issue Mar 30, 2024

DEV: Remove page labels PR link from message #2561

Merged

pubpub-zz pushed a commit that referenced this issue Mar 30, 2024

DEV: Remove page labels PR link from message (#2561)

a36f9b0

stefan6419846 mentioned this issue Mar 30, 2024

ENH: Add support for /Kids in page labels #2562

Merged

MartinThoma added the is-feature (A feature request) label Mar 31, 2024

MartinThoma mentioned this issue Mar 31, 2024

ENH: Add support for page labels #1519

Merged

pubpub-zz closed this as completed in #2562 Apr 3, 2024
