-
-
Notifications
You must be signed in to change notification settings - Fork 46
169 fix failing redactions #170
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Use adaptive thresholding on blocks to allow the inclusion of XXXX redactions
Add new test for cross hatch redactions Update change log and Update readme to provide instructions on testing
for more information, see https://pre-commit.ci
|
Looks like you forgot to include the PDF? Not sure how tests passed that way or what's up with the lint. 🤔 |
tests/test_utils.py
Outdated
| def test_inspect_works_with_crosshatch(self): | ||
| """Test gray scaled redactions""" | ||
| path = root_path / "bad_cross_hatched_redactions.pdf" | ||
| with open(path, "rb") as f: | ||
| data = f.read() | ||
|
|
||
| redactions = xray.inspect(data) | ||
| self.assertTrue(redactions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have a very similar test up on line 147. It seems like the cross hatch was working before?
Do we have an example where xray wasn't working? If so, we should add it to OcclusionTest instead of InspectApiTest. This is for making sure the API works properly, OcclusionTest is where we test weird PDF examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
def test_cross_hatches_are_ok(self):
path = root_path / "bad_cross_hatched_redactions.pdf"
with fitz.open(path) as pdf:
page = pdf[0]
chars = get_intersecting_chars(page, get_good_rectangles(page))
self.assertEqual(len(chars), 639)
this test was doing something else. It was simply testing if we could find the characters. It was not actually identifying the redactions. Part of what I think tripped you up was that this file returns Zero redactions.
bad_redactions = []
for redaction in redactions:
pixmap = page.get_pixmap(
# Use gray for simplicity and speed, though this risks missing a
# bad redaction.
colorspace=fitz.csRGB,
clip=fitz.Rect(redaction["bbox"]),
)
if multi_colored(pixmap, threshold=128): <--- was `if not pixmap.is_unicolor:`
# There's some degree of variation in the colors of the pixels.
# ∴ it's not a uniform box and it's not a bad redaction.
# filename = f'{redaction["text"].replace("/", "_")}.png'
# pixmap.save(filename)
continue
bad_redactions.append(redaction)
return bad_redactions
those occlusions would get decimated here. if not pixmap.is_unicolor: says is the whole pixmap here one color. if it is we get to keep it. but none of the cross hatches is solid black ... its a mix of black and gray.
This test is validating that crosshatch redactions are viewed as failed redactions.
long story short is you just need to run inspect on this pdf to see that it returns {}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. You're right. But can we move this to IntegrationTest then?
mlissner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, a couple more thoughts!
xray/pdf_utils.py
Outdated
| def multi_colored(pixmap, threshold=128) -> bool: | ||
| """Check if a pixmap contains black and white pixels. | ||
| :param pixmap: The pixmap to check | ||
| :param threshold: The threshold to use for black and white pixels | ||
| :return true if black and white else false | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer if this was is_black_and_white, without the threshold parameter, and if the threshold could be hardcoded with a comment explaining how it worked.
If that's not a great idea, I'd love more documentation about what threshold means and why it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can drop the parameter and hard code it.
I added this comment
Every pixel with an intensity below 128 is set to black,
and every pixel with an intensity of 128 or above is set to white
This allows us to convert gray scale [X][X] to black but otherwise retain
text.
I think its more about detecting light and dark pixels not black and white but assigns black or white
so detect_light_and_dark ?
tests/test_utils.py
Outdated
| def test_inspect_works_with_crosshatch(self): | ||
| """Test gray scaled redactions""" | ||
| path = root_path / "bad_cross_hatched_redactions.pdf" | ||
| with open(path, "rb") as f: | ||
| data = f.read() | ||
|
|
||
| redactions = xray.inspect(data) | ||
| self.assertTrue(redactions) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. You're right. But can we move this to IntegrationTest then?
|
@mlissner thanks for the feedback ... I updated it. |
mlissner
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good. One more thing that'd help me make sense of this.
| self.assertNotEqual( | ||
| bad_redactions, | ||
| {}, | ||
| msg="Missing cross hatch bad redactions.", | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this might be better if the assertion checked for what we did want, so that the test can show what's expected. For example, above we do:
self.assertEqual(len(redactions[1]), 3)
Better to say: "I assert this document has X bad redactions" instead of "I assert this document does not have zero bad redactions."
Cross Hatch [XXXX] style redactions always fail because we require redactions to be one solid color.
This PR replaces that check with a more nuanced check of a block. It uses adaptive thresholding to
determine if a pixel is white or black and then determines if the redaction is a solid color.
Additionally, updates the readme around testing and adds a test for our Cross Hatch example