worknotes.txt


TODO:
- /orientation endpoint is really slow! May be evaluate only couple of pages... or find faster algorithm.
- /ocr endpoints need way to create different directories based on settings (like rotate/:angle for example)
- /noteshrink endpoint needs settings differentation also.
- /extracted/images need extracted/png and extracted/tiff endpoints
- /combined there should be option to create searchable PDF with visible text but without images (original layout preserved).
- /images/pdf creating new PDF from user's own images does not work since "Refusing to work on images with alpha channel"
- add /rotate/pdf/ endpoint for rotating images in PDF (pdfjam)?


PDF OCR

 mkdir images -p && pdftoppm -r 300 -cropbox 2000.pdf images/page

 mkdir images -p && pdftoppm -cropbox 2000.pdf images/page

tesseract -l fin files.txt out pdf

tesseract -l fin files.txt out


convert \
page-000.jpg -threshold 60% \
-define connected-components:area-threshold=5 \
-define connected-components:mean-color=true \
-connected-components 8 \
-bordercolor white -border 100x100 -fuzz 25500 -trim  pageout.jpg

convert \
page-001.jpg -colorspace Gray -threshold 60% \
-define connected-components:area-threshold=5 \
-define connected-components:mean-color=true \
-connected-components 8 \
-shave 10x10 -bordercolor white -border 10x10 -fuzz 25500 -trim  pageout.jpg


convert \
page-001.jpg -colorspace Gray -shave 10x10 -bordercolor white -border 1x1 -threshold 60% \
-define connected-components:area-threshold=5 \
-define connected-components:mean-color=true \
-connected-components 8 \
-fuzz 25500 -trim  pageout.jpg

convert page-000.jpg -shave 10x10 -bordercolor white -border 100x100 -fuzz 25500 -trim  pageout.jpg

For autocropping:
https://github.com/polm/ndl-crop

apt-get install python3-opencv


Kraken OCR?

page_dewarp (requires opencv):
https://github.com/mzucker/page_dewarp