apply_redactions() deleting text outside of annotated box #3257

Closed
rnik12 opened this issue Mar 12, 2024 · 7 comments

@rnik12

rnik12 commented Mar 12, 2024

Description of the bug

See the attached PDFs: Original -> Annotated -> Redacted.

You can see that part of the word "insult" is removed, although it lies outside the annotated boxes:

Original -
"The woman whom it was intended to insult or whose privacy was intruded upon."

Redacted -
"The woman whom it was intended to ult or whose privacy was intruded upon."

How to reproduce the bug

Code

import fitz  # PyMuPDF

fitz.TOOLS.set_small_glyph_heights(True)

filepath = "original.pdf"

doc = fitz.open(filepath)


def get_redact_box(bbox):
    return [round(x, 2) for x in bbox]


LEFT_START = 115
RIGHT_END = 482


def redact_text_outside_bounds(page_number):
    """Redact every text line that lies entirely outside the horizontal bounds."""

    if page_number < 1 or page_number > len(doc):
        print("Invalid page number.")
        return

    page = doc[page_number - 1]  # Get the page object (zero-based indexing)
    blocks = page.get_text("dict", sort=True)["blocks"]

    for block in blocks:
        # Image blocks have no "lines" key, so fall back to an empty list.
        for line in block.get("lines", []):
            bbox = line["bbox"]
            x0, y0, x1, y1 = bbox

            if x0 <= x1 and (x1 <= LEFT_START or x0 >= RIGHT_END):
                page.add_redact_annot(
                    get_redact_box(bbox),
                    fontname="helv",
                    fontsize=8,
                    align=fitz.TEXT_ALIGN_CENTER,
                )

        # text = ""
        # for line in block["lines"]:
        #     for d in line["spans"]:
        #         text += " " + d["text"]
        # print()
        # print(bbox)
        # print(text)

    page.apply_redactions()


if __name__ == "__main__":
    pages = range(1, doc.page_count+1)
    # pages = range(128, 129)
    for page_no in pages:
        redact_text_outside_bounds(page_no)

    doc.save("test.pdf")
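The selection rule in the script can be checked in isolation, without a PDF. The following is a minimal sketch: `is_outside_bounds` is a hypothetical helper name, and the sample boxes are made up for illustration, not taken from original.pdf.

```python
LEFT_START = 115
RIGHT_END = 482


def is_outside_bounds(bbox, left=LEFT_START, right=RIGHT_END):
    """Return True if the box lies entirely left of `left` or right of `right`."""
    x0, _, x1, _ = bbox
    return x0 <= x1 and (x1 <= left or x0 >= right)


# Hypothetical line bounding boxes (x0, y0, x1, y1), invented for this sketch.
sample_boxes = {
    "margin number": (90.0, 200.0, 100.5, 210.0),
    "body text": (120.0, 200.0, 470.0, 210.0),
    "straddles left bound": (110.0, 200.0, 130.0, 210.0),
}

# Only boxes that are completely outside the bounds get selected.
selected = [name for name, box in sample_boxes.items() if is_outside_bounds(box)]
print(selected)
```

Under this rule a body-text line is never selected for redaction, so any loss of body text has to come from how the redactions are applied rather than from which rectangles are chosen.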

original.pdf
annotated.pdf
redacted.pdf

PyMuPDF version

1.23.26

Operating system

MacOS

Python version

3.12

@JorjMcKie
Collaborator

Cannot reproduce.
Tried several of my own rectangles and there was no problem.
To reduce my effort to understand your code, please provide an example where you simply draw a rectangle and then demonstrate that unintended text was deleted.
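A reduced test of the kind requested here might look like the sketch below. It redacts a single rectangle and lists words that were outside the rectangle before redaction but are gone afterwards. The helper names (`intersects`, `damaged_words`, `check_page`) are hypothetical, and the comparison is by word text only, which is an assumption made to keep the example short.

```python
from collections import Counter


def intersects(bbox, rect):
    """True if two (x0, y0, x1, y1) boxes overlap with non-zero area."""
    return (
        bbox[0] < rect[2] and bbox[2] > rect[0]
        and bbox[1] < rect[3] and bbox[3] > rect[1]
    )


def damaged_words(before, after, rect):
    """Words that lay outside `rect` before redaction but are missing afterwards.

    `before` and `after` are word tuples shaped like the output of
    page.get_text("words"): (x0, y0, x1, y1, text, ...).
    """
    outside = Counter(w[4] for w in before if not intersects(w[:4], rect))
    remaining = Counter(w[4] for w in after)
    return sorted((outside - remaining).elements())


def check_page(pdf_path, page_index, rect):
    """Redact one rectangle on one page and report unintended losses."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    doc = fitz.open(pdf_path)
    page = doc[page_index]
    before = page.get_text("words")
    page.add_redact_annot(fitz.Rect(rect))
    page.apply_redactions()
    after = page.get_text("words")
    return damaged_words(before, after, rect)
```

Calling `check_page("original.pdf", 0, rect)` with a rectangle around one margin number would then return an empty list if nothing outside the rectangle was touched; the page index and rectangle are placeholders to be chosen for the file at hand.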

@rnik12
Author

rnik12 commented Mar 12, 2024

@JorjMcKie I've updated the code and set the page range to 1 to 2. The bug still reproduces. Can you check?

@JorjMcKie
Collaborator

This is the result with the most recent code version: all the numbers are removed and nothing else.
I suppose this is intended?
test.pdf

@rnik12
Author

rnik12 commented Mar 12, 2024

@JorjMcKie That's strange. I'm attaching screenshots from my MacBook M1, using Python 3.12 and the latest PyMuPDF, 1.23.26.

Maybe the M1 architecture is causing this issue? Perhaps a dependency or the float handling is not being compiled properly.

Screenshot 2024-03-12 at 8 06 44 PM Screenshot 2024-03-12 at 8 08 57 PM
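When a bug shows up on one machine but not on another, a small environment report can help separate platform effects from version effects. The sketch below uses only the standard library; `environment_report` is a hypothetical helper name.

```python
import platform


def environment_report():
    """Collect the basic platform facts that matter when comparing machines."""
    return {
        "system": platform.system(),    # operating system name
        "machine": platform.machine(),  # CPU architecture
        "python": platform.python_version(),
    }


report = environment_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

Adding the PyMuPDF and MuPDF versions to this report would be a natural extension. The attributes `fitz.VersionBind` and `fitz.VersionFitz` are assumed here to hold those strings, so check them against the installed release.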

@JorjMcKie
Collaborator

Let me try it on Linux again. I don't have a Mac to cross-check.

@JorjMcKie
Collaborator

Got it!
The problem is in the base library, MuPDF, where it has already been fixed. The next version of PyMuPDF will be linked to the fixed MuPDF, so the error will go away.

JorjMcKie added the "upstream bug" (bug outside this package) and "Fixed in next release" labels Mar 12, 2024
@julian-smith-artifex-com
Collaborator

Fixed in 1.24.0.
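A script that depends on this fix could guard against older installations. The following is a minimal sketch: `version_tuple` and `has_redaction_fix` are hypothetical helpers, and the version string is assumed to come from the installed package (for example `fitz.VersionBind`, if that attribute is available).

```python
import re

FIXED_IN = (1, 24, 0)  # release named in this thread as containing the fix


def version_tuple(version):
    """Parse the leading 'major.minor.patch' part of a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def has_redaction_fix(version):
    """True if `version` is at least the release that contains the fix."""
    return version_tuple(version) >= FIXED_IN


print(has_redaction_fix("1.23.26"))  # version from the original report
print(has_redaction_fix("1.24.0"))
```

Comparing integer tuples rather than raw strings avoids mis-ordering versions such as 1.24.10 and 1.24.9.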
