How to get drawing order of objects? #1279
-
I'm building a tool for identifying bad redactions that look like this:

So far, the tool we've developed works pretty well, but it produces a lot of false positives on PDFs that have rectangles under the text instead of on top of it. This is fine, for example, but our program identifies the gray rectangles as a problem:

A colleague of mine who's good with PDFs says that the way to know whether text is on top of a rectangle or under it is to look at the "drawing order" of the PDF. My code currently uses

Is there something I'm missing or a trick I should be using? Thank you!
Replies: 10 comments 18 replies
-
The observation of your colleague is correct: I have been thinking about all this a lot, but I have no solution at hand yet. Your use case is valid, though not a frequent one, if you allow this comment.

You could take a look at the trace device's output directly. This is XML. I have attached an example for you. It contains a PDF with a rectangle upon which some text is written. The xml file has been created by

Looking at the XML source, you can easily see that the drawing comes before all the text:

<document filename="text-oc.pdf">
<page number="1" mediabox="0 0 595 842">
<set_default_colorspaces gray="DeviceGray" rgb="DeviceRGB" cmyk="DeviceCMYK" oi="None"/>
<layer name="ocg2"/>
<group bbox="35 35 365 345" isolated="0" knockout="1" blendmode="Normal" alpha="1">
<fill_path winding="nonzero" colorspace="DeviceGray" color=".5" alpha=".5" transform="1 0 0 -1 0 842">
<moveto x="45" y="507"/>
<lineto x="355" y="507"/>
<lineto x="355" y="797"/>
<lineto x="45" y="797"/>
<closepath/>
</fill_path>
<stroke_path linewidth="1" miterlimit="10" linecap="0,0,0" linejoin="0" dash_phase="0" dash="3 1" colorspace="DeviceRGB" color="0 0 1" alpha=".5" transform="1 0 0 -1 0 842">
<moveto x="45" y="507"/>
<lineto x="355" y="507"/>
<lineto x="355" y="797"/>
<lineto x="45" y="797"/>
<closepath/>
</stroke_path>
</group>
<end_layer/>
<layer name="ocg1"/>
<fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 842">
<span font="Ubuntu-Italic" wmode="0" bidi="0" trm="11 0 0 11">
<g unicode="D" glyph="39" x="52.2" y="781.748" adv=".691"/>
<g unicode="e" glyph="72" x="59.801004" y="781.748" adv=".518"/>
<g unicode="r" glyph="85" x="65.499" y="781.748" adv=".372"/>
... and so on ...

The commands between

A few more notes:
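As a rough illustration of reading the drawing order out of such a trace, here is a minimal Python sketch. It assumes the trace XML is already available as a string (the sample below is a trimmed copy of the excerpt above, closed off so that it parses), and it only handles `moveto` / `lineto` points. The function name `drawing_order` is hypothetical and not part of any library.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the trace excerpt above, closed off so that it parses.
TRACE = """<document filename="text-oc.pdf">
<page number="1" mediabox="0 0 595 842">
<group bbox="35 35 365 345" isolated="0" knockout="1" blendmode="Normal" alpha="1">
<fill_path winding="nonzero" colorspace="DeviceGray" color=".5" alpha=".5" transform="1 0 0 -1 0 842">
<moveto x="45" y="507"/>
<lineto x="355" y="507"/>
<lineto x="355" y="797"/>
<lineto x="45" y="797"/>
<closepath/>
</fill_path>
</group>
<fill_text colorspace="DeviceGray" color="0" transform="1 0 0 -1 0 842">
<span font="Ubuntu-Italic" wmode="0" bidi="0" trm="11 0 0 11">
<g unicode="D" glyph="39" x="52.2" y="781.748" adv=".691"/>
<g unicode="e" glyph="72" x="59.801004" y="781.748" adv=".518"/>
</span>
</fill_text>
</page>
</document>"""


def drawing_order(trace_xml):
    """Return fill_path / fill_text items in the order they appear in the trace."""
    root = ET.fromstring(trace_xml)
    items = []
    # iter() walks the tree in document order, which is the drawing order here.
    for elem in root.iter():
        if elem.tag == "fill_path":
            # Assumption: only moveto/lineto points are considered (no curves).
            pts = [
                (float(p.get("x")), float(p.get("y")))
                for p in elem
                if p.tag in ("moveto", "lineto")
            ]
            if pts:
                xs, ys = zip(*pts)
                items.append(("fill_path", (min(xs), min(ys), max(xs), max(ys))))
        elif elem.tag == "fill_text":
            text = "".join(g.get("unicode") for g in elem.iter("g"))
            items.append(("fill_text", text))
    return items


order = drawing_order(TRACE)
print([kind for kind, _ in order])  # ['fill_path', 'fill_text']
```

The coordinates are left untransformed, as they appear in the trace; comparing them with text positions would still require applying each element's `transform`.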
-
You got me thinking about a way out of all this. How about the following:
Then you would be able to tell whether some text bbox is contained in some drawing (or rather "fill-path") bbox encountered further down in the bbox list ... Your reaction?
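The containment check described above could look roughly like this in Python. It is only a sketch: it assumes the bbox list is a sequence of `(type, bbox)` tuples in drawing order, the type name `"fill-text"` is an assumption made here by analogy with `"fill-path"`, and the helper names are hypothetical.

```python
def contains(outer, inner):
    """True if rectangle `inner` lies completely inside rectangle `outer`."""
    ox0, oy0, ox1, oy1 = outer
    ix0, iy0, ix1, iy1 = inner
    return ox0 <= ix0 and oy0 <= iy0 and ix1 <= ox1 and iy1 <= oy1


def covered_text(bbox_log):
    """Return (index, bbox) of text items contained in a fill-path drawn later."""
    hits = []
    for i, (kind, bbox) in enumerate(bbox_log):
        if kind != "fill-text":
            continue
        # Only drawings further down the list can sit on top of this text.
        for later_kind, later_bbox in bbox_log[i + 1:]:
            if later_kind == "fill-path" and contains(later_bbox, bbox):
                hits.append((i, bbox))
                break
    return hits


log = [
    ("fill-path", (45, 45, 355, 335)),   # gray background, drawn first
    ("fill-text", (52, 50, 200, 62)),    # text on top of the background: fine
    ("fill-text", (52, 400, 200, 412)),  # text drawn first ...
    ("fill-path", (50, 398, 210, 414)),  # ... then a rectangle over it
]
print(covered_text(log))  # [(2, (52, 400, 200, 412))]
```

With this ordering rule, the gray background from the original question would no longer be reported, because it is drawn before the text.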
-
😂
-
Some comments / questions:
-
It is actually independent of the percentage approach: even a character with a minimized bounding box could be partially contained in / covered by a drawing, couldn't it? Please find your wheel here. I hope you can access it. It is wrapped in a ZIP called "linux-wheel3x".
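To make the partial-coverage point concrete, here is a small Python sketch that computes which fraction of a character's bounding box is overlapped by a drawing's rectangle. Both boxes are assumed to be `(x0, y0, x1, y1)` tuples in the same coordinate system, and `covered_fraction` is a hypothetical helper, not a library function.

```python
def covered_fraction(char_bbox, drawing_bbox):
    """Fraction (0.0 to 1.0) of char_bbox's area overlapped by drawing_bbox."""
    cx0, cy0, cx1, cy1 = char_bbox
    dx0, dy0, dx1, dy1 = drawing_bbox
    char_area = (cx1 - cx0) * (cy1 - cy0)
    if char_area <= 0:
        return 0.0  # degenerate character box
    # Width and height of the intersection rectangle (non-positive if disjoint).
    width = min(cx1, dx1) - max(cx0, dx0)
    height = min(cy1, dy1) - max(cy0, dy0)
    if width <= 0 or height <= 0:
        return 0.0
    return (width * height) / char_area


print(covered_fraction((10, 10, 20, 20), (15, 0, 30, 30)))   # 0.5
print(covered_fraction((10, 10, 20, 20), (0, 0, 100, 100)))  # 1.0
```

Whatever threshold is applied to this fraction is a separate decision; even a tightly fitted character box can come out anywhere between 0 and 1.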
-
I have done more work on this:
I made a new wheel with those changes. To test the functionality, you might start with this script:

As you will see, it is much shorter, and also much faster, than the other solution. It also requires no use of
-
I was finally able to test this today and rewrite my code to use it. A few responses; sorry to take so long:
Anyway, I've got a working method now, so that's good. Thanks again for all your help and efforts! I'm excited to find out how many court records are revealing things they shouldn't be, but I'm going to hold off on running that to see if you have thoughts about performance, since Lambda charges by the millisecond.

PS, if you're curious, this is the news that kept me from returning to this. It was on the cover of the WSJ. Kind of a big deal for us, but sorry it impacted this discussion. I doubt this project will do as well, but I do hope to get some press for it and try to get some reforms to protect people's privacy.
-
@mlissner - latest news: Question for you:
If you agree, the character items would look like
I also think I should always compute the character heights as if
-
@mlissner - There are the following changes:
Official publication of v1.19.0 is still a few days away, but the features you are presumably interested in will not be touched anymore. I will now focus on the new integrated Tesseract OCR support and get rid of remaining rough edges there.

Here is a ZIP containing your example PDF with some added black drawings covering text.
@mlissner -
Here you can find my newest version 1.19.0.
There are the following changes:
page.get_texttrace()
Official publication of v1.19.0 is still a few days away, but the features you are presumably interested in will not be touched anymore. I will now focus on the new integrated Tesseract OCR support and get rid of remaining rough edges there.
But please do continue to ask questions and express admonitions you may have on cove…