ghostscript

Marcel Schmalzl edited this page Apr 10, 2024 · 6 revisions

Ghostscript

Compress PDFs

This command writes a compressed copy of the PDF passed as input.pdf to "output.pdf" by downsampling color images to 150 dpi:

gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dDownsampleColorImages=true -dColorImageResolution=150 -dNOPAUSE  -dBATCH -sOutputFile=output.pdf input.pdf

-> This also works for documents that do not allow text highlighting, markup and the like.
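
For repeated use, the command above can be wrapped in a small shell function. This is only a sketch: the name `compress_pdf` and its argument order are made up for illustration, and the two grayscale flags are an addition that applies the same resolution to grayscale images.

```shell
# compress_pdf INPUT OUTPUT [DPI]
# Hypothetical helper (not part of Ghostscript): rewrites INPUT through
# pdfwrite, downsampling color and grayscale images to DPI (default 150).
compress_pdf() {
    in=$1
    out=$2
    dpi=${3:-150}
    # Writing onto the input file would destroy it while it is being read.
    if [ "$in" = "$out" ]; then
        echo "compress_pdf: input and output must differ" >&2
        return 1
    fi
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 \
       -dDownsampleColorImages=true -dColorImageResolution="$dpi" \
       -dDownsampleGrayImages=true -dGrayImageResolution="$dpi" \
       -dNOPAUSE -dBATCH \
       -sOutputFile="$out" "$in"
}

# Usage: compress_pdf input.pdf output.pdf 110
```

A lower DPI value gives smaller files at the cost of image quality. Text and vector graphics are not affected by these flags.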

Other PDF tools

Requirements

  • gs (sudo apt-get install ghostscript)
  • pdftk (sudo apt-get install pdftk)
  • pdfjam (sudo apt-get install pdfjam)
  • pdftocairo (sudo apt-get install poppler-utils)

Repair broken PDFs

First try: Ghostscript (also useful for PDF compression)

gs -o output.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress input.pdf 
#                                                ^^^^^^
#                                    = quality (prepress = highest)
# ----------------------------------------------------------------------------
# Set quality:
# -dPDFSETTINGS=/screen   (low quality, screen viewing only, 72 dpi images)
# -dPDFSETTINGS=/ebook    (medium quality, 150 dpi images)
# -dPDFSETTINGS=/printer  (high quality, 300 dpi images)
# -dPDFSETTINGS=/prepress (high quality, color preserving, 300 dpi images)
# -dPDFSETTINGS=/default  (general-purpose output, possibly larger files)
# ----------------------------------------------------------------------------
# Finer-grained reduction:
# -dDownsampleColorImages=true -dColorImageResolution=110
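
Which preset gives the best trade-off depends on the document, so it can help to try several and compare the file sizes. A minimal sketch, assuming the input file name ends in .pdf; the function name `compare_presets` and the output naming scheme are invented for this example:

```shell
# compare_presets INPUT
# Hypothetical helper: writes one output file per preset next to INPUT
# (e.g. doc-ebook.pdf for doc.pdf) and prints the resulting sizes.
compare_presets() {
    in=$1
    base=${in%.pdf}
    for preset in screen ebook printer prepress; do
        out="${base}-${preset}.pdf"
        gs -o "$out" -sDEVICE=pdfwrite -dPDFSETTINGS="/$preset" "$in" \
            >/dev/null || return 1
        # wc -c prints the size in bytes; tr strips padding some systems add
        printf '%-10s %s bytes\n' "$preset" "$(wc -c < "$out" | tr -d ' ')"
    done
}

# Usage: compare_presets input.pdf
```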

Second try: poppler-utils

If the rescue with Ghostscript fails but a viewer such as Evince displays the file correctly (even if, for example, Adobe Reader does not), give poppler-utils a try:

pdftocairo -pdf input.pdf output.pdf
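
The two attempts can be chained so that pdftocairo only runs when Ghostscript reports an error. This is a sketch under the assumption that a non-zero exit status means the repair failed; the function name `repair_pdf` is made up for illustration:

```shell
# repair_pdf INPUT OUTPUT
# Hypothetical helper: tries Ghostscript first and falls back to
# pdftocairo if Ghostscript exits with an error.
repair_pdf() {
    in=$1
    out=$2
    if gs -o "$out" -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress "$in" \
            >/dev/null 2>&1; then
        echo "repaired with ghostscript: $out"
    elif pdftocairo -pdf "$in" "$out"; then
        echo "repaired with pdftocairo: $out"
    else
        echo "repair failed: $in" >&2
        return 1
    fi
}

# Usage: repair_pdf input.pdf output.pdf
```

The exit status is only a rough signal. A tool may exit cleanly and still produce a file with missing content, so it is worth opening the result in a viewer.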

Unify page size

Give all pages the same size with pdfjam, using either --papersize (explicit dimensions with units) or --paper (a named format, e.g. --paper a4paper).

pdfjam --outfile output.pdf --papersize '{5.5in,8.5in}' input.pdf
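
To check whether a document actually mixes page sizes, before or after running pdfjam, the per-page output of pdfinfo (part of poppler-utils) can be summarized. A minimal sketch; the function name `page_sizes` is hypothetical and the parsing assumes pdfinfo's usual "Page N size: ..." lines:

```shell
# page_sizes INPUT
# Hypothetical helper: lists each distinct page size in INPUT together
# with the number of pages that use it.
page_sizes() {
    in=$1
    # Total page count, taken from the "Pages:" line of pdfinfo
    pages=$(pdfinfo "$in" | awk '/^Pages:/ {print $2}')
    # -f/-l select the page range for which per-page details are printed
    pdfinfo -f 1 -l "$pages" "$in" \
        | sed -n 's/^Page *[0-9]* size: *//p' \
        | sort | uniq -c
}

# Usage: page_sizes input.pdf
```

If more than one line is printed, the pages differ in size. The named-format variant of the command above would be `pdfjam --outfile output.pdf --paper a4paper input.pdf`.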

Cut out pages / Slicing

Using pdftk with the cat option:

pdftk longPdf.pdf cat 12-15 60 65-end output outfile_p12-15+p60+65-lastPage.pdf
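
The same range syntax can be used to split a document in two at a given page. A small sketch, assuming the input file name ends in .pdf; the function name `split_at` and the output file names are invented for this example:

```shell
# split_at INPUT PAGE
# Hypothetical helper: writes pages 1..PAGE-1 to <name>_part1.pdf and
# pages PAGE..end to <name>_part2.pdf.
split_at() {
    in=$1
    page=$2
    base=${in%.pdf}
    # A split before page 1 would leave the first part empty.
    if [ "$page" -le 1 ]; then
        echo "split_at: PAGE must be greater than 1" >&2
        return 1
    fi
    pdftk "$in" cat "1-$((page - 1))" output "${base}_part1.pdf" &&
    pdftk "$in" cat "${page}-end" output "${base}_part2.pdf"
}

# Usage: split_at longPdf.pdf 60
```

Ranges may also run backwards: `cat end-1` writes the pages in reverse order.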

OCR

The sad story: it is really hard to produce PDFs with an OCR-generated text overlay (i.e. searchable PDFs).

Tesseract

From version 4 onwards (which uses neural nets), Tesseract produces quite good results. At the time of writing it is an alpha version, and you need some hacks to compile it from source, e.g. commenting out the if statement that searches for Leptonica and relying only on CMake's find_package().

tesseract test-onepager.tif -l deu out pdf
#         ^^^^^^^^^^^^^^^^^ ^^^^^^ ^^^ ^^^
#         input image       |      |   output a PDF instead of plain text
#         (a TIFF here)     |      output base name (result: out.pdf)
#                           language (deu = German)

For other languages you may need to download additional tessdata files into the 'tessdata' directory (probably /usr/share/tesseract-ocr/tessdata or /usr/share/tessdata).

With the pdf option shown above, Tesseract exports PDFs with a text overlay.
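
Since Tesseract works on images, an existing scanned PDF first has to be split into page images. One possible pipeline is sketched below, using pdftoppm (poppler-utils) and pdftk from the requirements list. The function name `ocr_pdf`, the 300 dpi resolution and the temporary file layout are assumptions of this sketch, not a tested recipe.

```shell
# ocr_pdf INPUT OUTPUT [LANG]
# Hypothetical helper: rasterizes each page of INPUT, runs tesseract on
# every page image and merges the per-page PDFs into OUTPUT.
ocr_pdf() {
    in=$1
    out=$2
    lang=${3:-eng}
    tmp=$(mktemp -d) || return 1
    # One TIFF per page: page-1.tif, page-2.tif, ... (numbers are
    # zero-padded for longer documents, so the glob below stays ordered)
    pdftoppm -r 300 -tiff "$in" "$tmp/page" || return 1
    for img in "$tmp"/page-*.tif; do
        # Writes ${img%.tif}.pdf with the recognized text as an overlay
        tesseract "$img" "${img%.tif}" -l "$lang" pdf || return 1
    done
    pdftk "$tmp"/page-*.pdf cat output "$out" || return 1
    rm -r "$tmp"
}

# Usage: ocr_pdf scan.pdf scan-searchable.pdf deu
```

On failure the temporary directory is left in place for inspection. `tesseract --list-langs` shows which language files are installed.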

PDF-XChange Editor

OK. This is not the best option (it is a Windows tool and NOT open source), but it works quite well. There is also a portable version, and it should work under Linux using Wine or PlayOnLinux (not tested on my side).

PDF-XChange Editor has an integrated OCR engine that works well, but it has no Tesseract integration.

Exports PDFs with text overlay.

Adobe Acrobat

In my experience the best option, but non-free.
