ghostscript
This command writes a compressed copy of input.pdf to "output.pdf":
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dDownsampleColorImages=true -dColorImageResolution=150 -dNOPAUSE -dBATCH -sOutputFile=output.pdf input.pdf
-> This also works for documents that do not allow text highlighting, markup, etc.
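The one-liner above hard-codes input.pdf. A minimal sketch of a wrapper that takes the file as an argument could look like this. The function name compress_pdf and its default values are illustrative assumptions, not part of the original notes; the gs switches are the ones from the command above.

```shell
# Sketch: compress_pdf INPUT [OUTPUT] [DPI]
# (hypothetical helper; defaults mirror the one-liner above)
compress_pdf() (
    input=$1
    output=${2:-output.pdf}   # default output name, as in the one-liner
    dpi=${3:-150}             # target resolution for color images
    if [ ! -f "$input" ]; then
        echo "compress_pdf: no such file: $input" >&2
        exit 1                # leaves only the function's subshell
    fi
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dDownsampleColorImages=true -dColorImageResolution="$dpi" \
       -dNOPAUSE -dBATCH -sOutputFile="$output" "$input"
)
```

Usage: compress_pdf scan.pdf scan-small.pdf 110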
- gs (sudo apt-get install ghostscript)
- pdftk (sudo apt-get install pdftk)
- pdfjam (sudo apt-get install pdfjam)
- pdftocairo (sudo apt-get install poppler-utils)
gs -o output.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf
#                                                ^^^^^^^^^
#                                                = quality (prepress = highest)
# ----------------------------------------------------------------------------
# Set quality:
# -dPDFSETTINGS=/screen (screen-view-only quality, 72 dpi images)
# -dPDFSETTINGS=/ebook (low quality, 150 dpi images)
# -dPDFSETTINGS=/printer (high quality, 300 dpi images)
# -dPDFSETTINGS=/prepress (high quality, color preserving, 300 dpi imgs)
# -dPDFSETTINGS=/default (general-purpose output, possibly at the expense of a larger file)
# ----------------------------------------------------------------------------
# More fine grained reductions:
# -dDownsampleColorImages=true -dColorImageResolution=110
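Which preset gives an acceptable trade-off depends on the document, so it can help to try them side by side. The following is a sketch only: compare_presets is a hypothetical helper name, and the output naming scheme is an assumption. It writes one file per preset and prints the resulting sizes.

```shell
# Sketch: compare_presets INPUT.pdf
# Writes INPUT_<preset>.pdf for each preset and prints its size in bytes.
compare_presets() (
    input=$1
    base=${input%.pdf}        # strip the extension for the output names
    for preset in screen ebook printer prepress; do
        out="${base}_${preset}.pdf"
        gs -o "$out" -sDEVICE=pdfwrite -dPDFSETTINGS="/$preset" "$input" || exit 1
        printf '%s\t%s bytes\n' "$out" "$(wc -c < "$out" | tr -d ' ')"
    done
)
```

Usage: compare_presets input.pdf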
If rescuing a PDF with ghostscript fails, but a poppler-based viewer such as evince displays it correctly (even if Adobe Reader, for example, does not), give poppler-utils a try:
pdftocairo -pdf input.pdf output.pdf
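The "try ghostscript first, then poppler" order can be sketched as a small function. The name repair_pdf is hypothetical. Note that the exit status only catches hard failures; a tool may exit successfully and still produce a damaged file, so the result should be checked in a viewer.

```shell
# Sketch: repair_pdf INPUT OUTPUT
# Tries ghostscript first and falls back to pdftocairo if gs reports failure.
repair_pdf() (
    input=$1
    output=$2
    if gs -o "$output" -sDEVICE=pdfwrite "$input"; then
        echo "repaired with ghostscript"
    elif pdftocairo -pdf "$input" "$output"; then
        echo "repaired with pdftocairo"
    else
        echo "repair_pdf: both tools failed on $input" >&2
        exit 1
    fi
)
```

Usage: repair_pdf broken.pdf fixed.pdf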
Make all pages the same size using pdfjam, with either --papersize (explicit dimensions with units) or --paper (a named format, e.g. --paper a4paper):
pdfjam --outfile output.pdf --papersize '{5.5in,8.5in}' input.pdf
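Since the two options differ only in how the size is written, a thin wrapper can pick the right one. This is a sketch under the assumption that a size containing a comma means explicit dimensions and anything else is a named format; resize_pages is a hypothetical name.

```shell
# Sketch: resize_pages INPUT OUTPUT SIZE
# SIZE is either "width,height" with units (e.g. 5.5in,8.5in)
# or a named paper format (e.g. a4paper).
resize_pages() (
    input=$1
    output=$2
    size=$3
    case $size in
        *,*) pdfjam --outfile "$output" --papersize "{$size}" "$input" ;;
        *)   pdfjam --outfile "$output" --paper "$size" "$input" ;;
    esac
)
```

Usage: resize_pages input.pdf output.pdf 5.5in,8.5in or resize_pages input.pdf output.pdf a4paper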
Using pdftk with the cat option:
pdftk longPdf.pdf cat 12-15 60 65-end output outfile_p12-15+p60+65-lastPage.pdf
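Typing the page ranges twice (once for pdftk, once in the output name) is error-prone. A sketch that derives the file name from the ranges, assuming a naming scheme similar to the example above; extract_pages is a hypothetical helper:

```shell
# Sketch: extract_pages INPUT.pdf RANGE...
# Builds the output name from the ranges and prints it on success.
extract_pages() (
    input=$1
    shift                                    # remaining arguments are page ranges
    suffix=$(printf '%s' "$*" | tr ' ' '+')  # "12-15 60" -> "12-15+60"
    output="${input%.pdf}_p${suffix}.pdf"
    pdftk "$input" cat "$@" output "$output" && echo "$output"
)
```

Usage: extract_pages longPdf.pdf 12-15 60 65-end writes longPdf_p12-15+60+65-end.pdf.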
gs -dNOPAUSE -sDEVICE=pdfwrite -sOutputFile=merged.pdf -dBATCH *.pdf
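One pitfall of the glob: *.pdf is expanded by the shell in its sort order, and on a second run in the same directory it also matches merged.pdf itself. A sketch of a wrapper that takes the inputs explicitly and refuses to read its own output (merge_pdfs is a hypothetical name):

```shell
# Sketch: merge_pdfs OUTPUT INPUT...
# Merges the inputs in the given order; aborts if OUTPUT is among them.
merge_pdfs() (
    output=$1
    shift
    [ "$#" -gt 0 ] || { echo "usage: merge_pdfs OUTPUT INPUT..." >&2; exit 1; }
    for f in "$@"; do
        if [ "$f" = "$output" ]; then
            echo "merge_pdfs: $output is also listed as input" >&2
            exit 1
        fi
    done
    gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile="$output" "$@"
)
```

Usage: merge_pdfs merged.pdf chapter-*.pdf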
The sad story: it is really hard to get PDFs with an OCR-generated text overlay (i.e. searchable PDFs).
From version 4 upwards (which uses neural nets), tesseract produces quite good results. It is currently an alpha version, and you need some hacks to compile it from source, e.g. commenting out the if statement where leptonica is searched for and using only cmake's find_package().
tesseract test-onepager.tif -l deu out pdf
# test-onepager.tif  input image (tiff)
# -l deu             language
# out                output base name (the result is written to out.pdf)
# pdf                output a pdf instead of plain text
For other languages you may need to download additional tessdata files into the 'tessdata' directory, which is probably located in /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata.
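Whether the language data is actually installed can be checked with tesseract --list-langs before running the OCR. The following sketch wraps both steps; ocr_to_pdf is a hypothetical name, and the default language deu is taken from the example above.

```shell
# Sketch: ocr_to_pdf IMAGE [LANG]
# Checks that LANG is installed, then writes a searchable PDF next to IMAGE.
ocr_to_pdf() (
    input=$1
    lang=${2:-deu}
    base=${input%.*}          # scan.tif -> scan (tesseract appends .pdf)
    # --list-langs prints a header line followed by one language per line;
    # some releases write it to stderr, hence 2>&1.
    if ! tesseract --list-langs 2>&1 | grep -qx "$lang"; then
        echo "ocr_to_pdf: language data for '$lang' not found" >&2
        exit 1
    fi
    tesseract "$input" "$base" -l "$lang" pdf
)
```

Usage: ocr_to_pdf test-onepager.tif deu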
Exports PDFs with text overlay using the pdf option, as above.
OK, this is not the best option (it is a Windows tool and NOT open source), but it works quite well. There is also a portable version, and it is said to work under Linux using Wine or PlayOnLinux (not tested by me).
PDF-XChange Editor has an integrated OCR engine which works well, but no Tesseract integration.
Exports pdf's with text overlay.
In my experience the best option but non-free.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License *.
Code (snippets) are licensed under a MIT License *.
* Unless stated otherwise