worknotes.txt
TODO:
- /orientation endpoint is really slow! Maybe evaluate only a couple of pages, or find a faster algorithm.
- /ocr endpoints need a way to create different directories based on settings (for example rotate/:angle).
- /noteshrink endpoint also needs settings differentiation.
- /extracted/images needs extracted/png and extracted/tiff endpoints.
- /combined: there should be an option to create a searchable PDF with visible text but without images (original layout preserved).
- /images/pdf: creating a new PDF from the user's own images does not work because of the error "Refusing to work on images with alpha channel".
- Add a /rotate/pdf/ endpoint for rotating images in a PDF (pdfjam)?
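
One possible workaround for the alpha channel problem in /images/pdf is to flatten each uploaded image onto a white background before building the PDF. This is only a sketch: strip_alpha is a hypothetical helper name, white is an assumed background color, and it has not been tried against the endpoint itself.

```shell
# Hypothetical helper: write a copy of an image with the alpha channel removed.
# Transparent areas are composited over white (an assumption, pick what suits).
strip_alpha() {
    src=$1
    dst=$2
    convert "$src" -background white -alpha remove -alpha off "$dst"
}
```

Usage would be something like: strip_alpha upload.png upload-flat.png, then pass upload-flat.png to the PDF step.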

PDF OCR:
mkdir -p images && pdftoppm -r 300 -cropbox 2000.pdf images/page
mkdir -p images && pdftoppm -cropbox 2000.pdf images/page

tesseract -l fin files.txt out pdf
tesseract -l fin files.txt out
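
The tesseract commands above read page image paths from files.txt, one per line. A minimal sketch for generating that list from the pdftoppm output follows. make_file_list is a hypothetical helper name, and it assumes the pages are named page-* with zero-padded numbers, so that glob order matches page order.

```shell
# Hypothetical helper: list all page images in a directory, one path per line.
# Relies on zero-padded page numbers so that glob order equals page order.
make_file_list() {
    dir=$1
    out=$2
    set -- "$dir"/page-*
    # If the glob matched nothing, $1 is still the literal pattern.
    [ -e "$1" ] || { echo "no page images in $dir" >&2; return 1; }
    printf '%s\n' "$@" > "$out"
}
```

Usage would be something like: make_file_list images files.txt && tesseract -l fin files.txt out pdf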

convert \
  page-000.jpg -threshold 60% \
  -define connected-components:area-threshold=5 \
  -define connected-components:mean-color=true \
  -connected-components 8 \
  -bordercolor white -border 100x100 -fuzz 25500 -trim pageout.jpg

convert \
  page-001.jpg -colorspace Gray -threshold 60% \
  -define connected-components:area-threshold=5 \
  -define connected-components:mean-color=true \
  -connected-components 8 \
  -shave 10x10 -bordercolor white -border 10x10 -fuzz 25500 -trim pageout.jpg

convert \
  page-001.jpg -colorspace Gray -shave 10x10 -bordercolor white -border 1x1 -threshold 60% \
  -define connected-components:area-threshold=5 \
  -define connected-components:mean-color=true \
  -connected-components 8 \
  -fuzz 25500 -trim pageout.jpg

convert page-000.jpg -shave 10x10 -bordercolor white -border 100x100 -fuzz 25500 -trim pageout.jpg
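
The commands above each handle a single page. To run the simplest variant (shave, white border, fuzzy trim) over every page, something like the following might do. trim_pages is a hypothetical helper name, and the page-*.jpg naming and separate output directory are assumptions.

```shell
# Hypothetical helper: apply the shave/border/trim command to every page image.
# Writes results under the same file names into a separate output directory.
trim_pages() {
    in_dir=$1
    out_dir=$2
    mkdir -p "$out_dir"
    for src in "$in_dir"/page-*.jpg; do
        # Skip the unexpanded pattern when there are no matching files.
        [ -e "$src" ] || continue
        convert "$src" -shave 10x10 -bordercolor white -border 100x100 \
            -fuzz 25500 -trim "$out_dir/$(basename "$src")"
    done
}
```

Usage would be something like: trim_pages images trimmed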
For autocropping:
https://github.com/polm/ndl-crop
apt-get install python3-opencv
Kraken OCR?
page_dewarp (requires opencv):
https://github.com/mzucker/page_dewarp