Convert all text's color in PDF to black while ensuring text is selectable #1640
-
I'm looking for ways to change the color of all text in a PDF to black. Any ideas? The font replacement sample scripts are useful, but they add their own complexity. For instance, the space between words is sometimes much larger than in the original PDF.
-
That's not as easy as it sounds, unfortunately. You could of course walk through the page's |
-
Sure, my bad. This creator was more creative in hiding color specs, but this script makes tabula rasa:

    import fitz

    # any line ending with one of these color / graphics state operators is dropped
    skips = (b"k", b"K", b"rg", b"RG", b"sc", b"SC", b"scn", b"SCN", b"gs", b"cs")
    doc = fitz.open("Demo.pdf")
    for page in doc:
        page.clean_contents()  # clean and normalize the page's /Contents
        xref = page.get_contents()[0]
        lines = page.read_contents().splitlines()
        for i in range(len(lines)):
            if lines[i].endswith(skips):
                lines[i] = b""
        doc.update_stream(xref, b"\n".join(lines))
    doc.ez_save("x.pdf", pretty=True)

The last page is now completely black, BTW 😂

There will still be cases that won't work even with this version. I told you before, it won't be a completely trivial thing:
-
Here is a better version, which makes only the text completely black; other stuff will be a light gray. So the last page is readable again.

    import fitz

    # any line ending with one of these color / graphics state operators is dropped
    skips = (b"k", b"K", b"rg", b"RG", b"sc", b"SC", b"scn", b"SCN", b"gs", b"cs")
    doc = fitz.open("Demo.pdf")
    for page in doc:
        page.clean_contents()  # clean and normalize the page's /Contents
        xref = page.get_contents()[0]
        lines = page.read_contents().splitlines()
        for i in range(len(lines)):
            if lines[i].endswith(skips):
                lines[i] = b""
                continue
            if lines[i] == b"q":
                lines[i] = b"q 0.9 g 0.9 G"  # light gray for non-text content
            elif lines[i] == b"BT":
                lines[i] = b"BT 0 g 0 G"  # black inside a text object
            elif lines[i] == b"ET":
                lines[i] = b"ET 0.9 g 0.9 G"  # back to light gray after the text
        doc.update_stream(xref, b"\n".join(lines))
    doc.ez_save("x.pdf", pretty=True)
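The per-line logic of the script above can be pulled into a small helper that works on plain bytes, which makes it easy to try on a hand-written content stream without opening a PDF. This is only a sketch: the name `blacken_text_lines` and the sample stream are made up for illustration, and it assumes the stream has already been normalized by `page.clean_contents()` so that each operator sits on its own line.

```python
# Color-setting operators whose lines are dropped (same tuple as in the script above).
SKIPS = (b"k", b"K", b"rg", b"RG", b"sc", b"SC", b"scn", b"SCN", b"gs", b"cs")


def blacken_text_lines(lines):
    """Return a new list of content stream lines: color operators removed,
    black injected for text objects, light gray for everything else.

    Hypothetical helper name; the rules mirror the script above.
    """
    out = []
    for line in lines:
        if line.endswith(SKIPS):
            out.append(b"")                # drop the original color instruction
        elif line == b"q":
            out.append(b"q 0.9 g 0.9 G")   # light gray after saving the graphics state
        elif line == b"BT":
            out.append(b"BT 0 g 0 G")      # black inside a text object
        elif line == b"ET":
            out.append(b"ET 0.9 g 0.9 G")  # back to light gray after the text
        else:
            out.append(line)               # everything else is kept unchanged
    return out


# Hand-written sample stream, for illustration only (not taken from Demo.pdf).
sample = b"q\n1 0 0 rg\nBT\n/F1 12 Tf\n(Hello) Tj\nET\nQ".splitlines()
result = blacken_text_lines(sample)
```

In the script itself, the inner loop could then presumably be replaced by `lines = blacken_text_lines(lines)` before the call to `doc.update_stream`.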
-
The meaning of all those cryptic operators can be looked up in the PDF spec in chapter "Appendix A: Operator Summary", page 985 (old version) or 643 (new version).
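For quick reference, the operators that the scripts in this thread touch can be collected in a small lookup table. The descriptions below are paraphrased from memory of the operator summary, so treat them as a reading aid rather than as spec wording; the names `OPERATORS` and `describe` are made up for this sketch.

```python
# Operators used in this thread, with short paraphrased descriptions.
OPERATORS = {
    b"g": "set gray level for fill (non-stroking) operations",
    b"G": "set gray level for stroking operations",
    b"rg": "set RGB color for fill (non-stroking) operations",
    b"RG": "set RGB color for stroking operations",
    b"k": "set CMYK color for fill (non-stroking) operations",
    b"K": "set CMYK color for stroking operations",
    b"cs": "set color space for fill (non-stroking) operations",
    b"CS": "set color space for stroking operations",
    b"sc": "set color for fill (non-stroking) operations",
    b"SC": "set color for stroking operations",
    b"scn": "set color for fill operations (also pattern and other special spaces)",
    b"SCN": "set color for stroking operations (also pattern and other special spaces)",
    b"gs": "set parameters from a graphics state parameter dictionary",
    b"q": "save graphics state",
    b"Q": "restore graphics state",
    b"BT": "begin text object",
    b"ET": "end text object",
}


def describe(line):
    """Describe the operator of one content stream line.

    Assumes the operator is the last whitespace-separated token on the line,
    which holds for streams normalized by page.clean_contents().
    """
    tokens = line.split()
    if not tokens:
        return "empty line"
    return OPERATORS.get(tokens[-1], "operator not covered by this table")


rg_meaning = describe(b"1 0 0 rg")
tj_meaning = describe(b"(Hello) Tj")
```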
-
MuPDF and Ghostscript are both products of Artifex. When they acquired the "Fitz" project, their original plan was to replace Ghostscript with it, but they changed their minds and Fitz became the product MuPDF.
-
Once you have located an XObject that the page invokes, the same code snippet can be used for its stream, because an XObject is a stream object. So the logic goes like this:

    page.clean_contents()  # clean the page /Contents, but also each XObject!
    xobjects = page.get_xobjects()  # again each item has an xref as first subitem
    for xobj in xobjects:
        xref = xobj[0]
        lines = doc.xref_stream(xref).splitlines()
        # etc.: logic like above
        doc.update_stream(xref, b"\n".join(lines))
    # that's about it!
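Putting the page stream and the XObject streams through one loop could look roughly like the sketch below. The function names `rewrite_page_streams` and `main`, the `rewrite` callback, and the file names passed to `main` are hypothetical; the PyMuPDF calls are the ones already used in this thread. The sketch assumes that after `page.clean_contents()` the page has a single `/Contents` stream, as the scripts above do.

```python
def rewrite_page_streams(doc, page, rewrite):
    """Apply `rewrite` to the page's /Contents and to each XObject stream.

    `rewrite` is a hypothetical callback that takes a list of bytes lines
    and returns the modified list. Returns the xrefs that were updated.
    """
    page.clean_contents()  # clean the page /Contents, but also each XObject
    xrefs = [page.get_contents()[0]]
    xrefs += [item[0] for item in page.get_xobjects()]  # xref is the first subitem
    for xref in xrefs:
        lines = doc.xref_stream(xref).splitlines()
        doc.update_stream(xref, b"\n".join(rewrite(lines)))
    return xrefs


def drop_color_lines(lines):
    """Example callback: blank out lines ending in a color operator."""
    skips = (b"k", b"K", b"rg", b"RG", b"sc", b"SC", b"scn", b"SCN", b"gs", b"cs")
    return [b"" if line.endswith(skips) else line for line in lines]


def main(path_in, path_out):
    """Usage sketch; needs PyMuPDF installed, so it is not called here."""
    import fitz

    doc = fitz.open(path_in)
    for page in doc:
        rewrite_page_streams(doc, page, drop_color_lines)
    doc.ez_save(path_out, pretty=True)
```

Passing the rewriting step in as a callback keeps the stream handling separate from the decision about which operators to drop or replace, so either of the two scripts above could be plugged in.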
-
This failure is a storage violation inside PyMuPDF. It happens only intermittently, and only if the document is deleted by garbage collection (usually automatically at the end of the script) while at least one page object still exists. As an interim workaround, do one of the following:
-
As per the error I reported: |