Skip to content

Conversation

@flooie
Copy link
Contributor

@flooie flooie commented Apr 9, 2025

Cross Hatch [XXXX] style redactions always fail because we require redactions to be one solid color.

This PR replaces that check with a more nuanced check of a block. It uses adaptive thresholding to
determine if a pixel is white or black and then determines if the redaction is a solid color.

Additionally, updates the readme around testing and adds a test for our Cross Hatch example

flooie and others added 4 commits April 9, 2025 15:47
Use adaptive thresholding on blocks
to allow the inclusion of XXXX redactions
Add new test for cross hatch redactions

Update change log and
Update readme to provide instructions
on testing
@mlissner
Copy link
Member

mlissner commented Apr 9, 2025

Looks like you forgot to include the PDF? Not sure how tests passed that way or what's up with the lint. 🤔

Comment on lines 233 to 240
def test_inspect_works_with_crosshatch(self):
"""Test gray scaled redactions"""
path = root_path / "bad_cross_hatched_redactions.pdf"
with open(path, "rb") as f:
data = f.read()

redactions = xray.inspect(data)
self.assertTrue(redactions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have a very similar test up on line 147. It seems like the cross hatch was working before?

Do we have an example where xray wasn't working? If so, we should add it to OcclusionTest instead of InspectApiTest. This is for making sure the API works properly, OcclusionTest is where we test weird PDF examples.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    def test_cross_hatches_are_ok(self):
        path = root_path / "bad_cross_hatched_redactions.pdf"
        with fitz.open(path) as pdf:
            page = pdf[0]
            chars = get_intersecting_chars(page, get_good_rectangles(page))
        self.assertEqual(len(chars), 639)

this test was doing something else. It was simply testing if we could find the characters. It was not actually identifying the redactions. Part of what I think tripped you up was that this file returns Zero redactions.

    bad_redactions = []
    for redaction in redactions:
        pixmap = page.get_pixmap(
            # Use gray for simplicity and speed, though this risks missing a
            # bad redaction.
            colorspace=fitz.csRGB,
            clip=fitz.Rect(redaction["bbox"]),
        )
        if multi_colored(pixmap, threshold=128): <---   was `if not pixmap.is_unicolor:`
            # There's some degree of variation in the colors of the pixels.
            # ∴ it's not a uniform box and it's not a bad redaction.
            # filename = f'{redaction["text"].replace("/", "_")}.png'
            # pixmap.save(filename)
            continue
        bad_redactions.append(redaction)
    return bad_redactions

those occlusions would get decimated here. if not pixmap.is_unicolor: says is the whole pixmap here one color. if it is we get to keep it. but none of the cross hatches is solid black ... its a mix of black and gray.

This test is validating that crosshatch redactions are viewed as failed redactions.

long story short is you just need to run inspect on this pdf to see that it returns {}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. You're right. But can we move this to IntegrationTest then?

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, a couple more thoughts!

Comment on lines 298 to 304
def multi_colored(pixmap, threshold=128) -> bool:
"""Check if a pixmap contains black and white pixels.
:param pixmap: The pixmap to check
:param threshold: The threshold to use for black and white pixels
:return true if black and white else false
"""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer if this was is_black_and_white, without the threshold parameter, and if the threshold could be hardcoded with a comment explaining how it worked.

If that's not a great idea, I'd love more documentation about what threshold means and why it works.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can drop the parameter and hard code it.

I added this comment

Every pixel with an intensity below 128 is set to black,
and every pixel with an intensity of 128 or above is set to white

This allows us to convert gray scale [X][X] to black but otherwise retain
text.

I think its more about detecting light and dark pixels not black and white but assigns black or white

so detect_light_and_dark ?

Comment on lines 233 to 240
def test_inspect_works_with_crosshatch(self):
"""Test gray scaled redactions"""
path = root_path / "bad_cross_hatched_redactions.pdf"
with open(path, "rb") as f:
data = f.read()

redactions = xray.inspect(data)
self.assertTrue(redactions)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gotcha. You're right. But can we move this to IntegrationTest then?

@flooie flooie requested a review from mlissner April 10, 2025 12:05
@flooie
Copy link
Contributor Author

flooie commented Apr 10, 2025

@mlissner thanks for the feedback ... I updated it.

Copy link
Member

@mlissner mlissner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good. One more thing that'd help me make sense of this.

Comment on lines +383 to +387
self.assertNotEqual(
bad_redactions,
{},
msg="Missing cross hatch bad redactions.",
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be better if the assertion checked for what we did want, so that the test can show what's expected. For example, above we do:

self.assertEqual(len(redactions[1]), 3)

Better to say: "I assert this document has X bad redactions" instead of "I assert this document does not have zero bad redactions."

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants