Commit cf9f462

extract text from images in a given directory
1 parent 342805f commit cf9f462

2 files changed: +60 -0 lines changed

@@ -0,0 +1,24 @@
# Extract text from images in a given directory

## Description
This script extracts the text from every image in a specified directory and appends it to a `.txt` file (`output.txt` in the current working directory). The text appears in the order in which `os.listdir` returns the images, which is not guaranteed to be sorted.

## Requirements

`$ pip install Pillow`
`$ pip install pytesseract`

`pytesseract` is only a wrapper, so the Tesseract program itself must also be installed. Download and install the required tesseract.exe file here: https://osdn.net/projects/sfnet_tesseract-ocr-alt/downloads/tesseract-ocr-setup-3.02.02.exe/

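Before running the script it can help to confirm that a Tesseract executable is actually reachable. The following is only a minimal sketch: `find_tesseract` and its parameters are hypothetical names that are not part of this repo, and it uses nothing beyond the standard library.

```python
import shutil
from pathlib import Path


def find_tesseract(names=("tesseract", "tesseract.exe"), extra_paths=()):
    """Return the path of the first Tesseract executable found, or None.

    `names` are looked up on PATH; `extra_paths` are explicit file paths
    to try afterwards (for example an install location you chose yourself).
    Both parameters are assumptions made for this sketch.
    """
    for name in names:
        found = shutil.which(name)
        if found:
            return found
    for candidate in extra_paths:
        if Path(candidate).is_file():
            return str(candidate)
    return None
```

If the function returns a path, that string is what you would assign to `pytesseract.pytesseract.tesseract_cmd` in `image-text.py`; if it returns `None`, Tesseract is probably not installed or not on `PATH`.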
## Steps To Execution
- Fork this repo and navigate to the `Extract Text From Image` folder in your local copy.
- Edit `image-text.py`: set the images directory string and the path to `tesseract.exe`.
- Run the script: `$ python image-text.py`
- After a short while you will have the .txt file with the extracted text.
- Enjoy, and good luck with your freelance copy-typing jobs! (That is how the idea for this script came about: I really couldn't type out the text in TONS of image files, lol.)

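The steps above ask you to edit the script by hand. As an alternative, the directory could be passed on the command line. This is a sketch of that idea only, not how `image-text.py` currently works; `parse_args`, `image_dir`, and `--output` are hypothetical names.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the images directory and an optional output file path."""
    parser = argparse.ArgumentParser(
        description="Extract text from images in a given directory."
    )
    parser.add_argument(
        "image_dir", type=Path, help="directory that contains the images"
    )
    parser.add_argument(
        "--output",
        type=Path,
        default=Path("output.txt"),  # same default file name the script uses
        help="text file that the extracted text is appended to",
    )
    # argv=None makes argparse read sys.argv[1:], which is the normal case.
    return parser.parse_args(argv)
```

With something like this in place, the run step would become `$ python image-text.py path/to/images --output result.txt`, and nothing in the file would need editing between runs.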
## Code Output
`"IMAGE_TITLE" done` is printed for each image in the directory once its text has been extracted.
`Text extract script completed!` is printed at the end of the script.

Hit `Ctrl-C` to exit the script.
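
The two final messages (completed, or exited on `Ctrl-C`) can be illustrated with a small wrapper. This sketch catches `KeyboardInterrupt` instead of installing a signal handler as the script does, so treat it as a swapped-in technique; `run_with_status` is a hypothetical name.

```python
def run_with_status(work):
    """Run `work()` and report how it ended.

    Returns 0 after normal completion and 1 if the user pressed Ctrl-C,
    printing the same messages the script prints in each case.
    """
    try:
        work()
    except KeyboardInterrupt:
        # Python raises KeyboardInterrupt in the main thread on Ctrl-C
        # when the default SIGINT handler is in place.
        print("Text extraction script exited!")
        return 1
    print("Text extract script completed!")
    return 0
```

A caller could pass the return value to `sys.exit` so that an interrupted run ends with a non-zero exit status.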
@@ -0,0 +1,36 @@
import os
import signal
import sys

import pytesseract
from PIL import Image


def handler(signum, frame):
    # Exit with a message when the user presses Ctrl-C.
    print("Text extraction script exited!")
    sys.exit(1)


signal.signal(signal.SIGINT, handler)

# Set this to the directory that contains the images.
directory = r"image files directory"

# Set this to the full path of tesseract.exe
# (on Windows machines, check Program Files (x86)).
pytesseract.pytesseract.tesseract_cmd = r"tesseract.exe directory"

for filename in os.listdir(directory):
    # Only the extensions from the original check are processed.
    if not filename.endswith((".img", ".jpeg", ".jpg")):
        continue
    image = os.path.join(directory, filename)

    with Image.open(image) as img:
        text = pytesseract.image_to_string(img, lang="eng")

    # Append one section per image to output.txt.
    with open("output.txt", "a", encoding="utf-8") as o:
        o.write("\n\n\n[NEW IMAGE]\n")
        o.write(image)
        o.write("\n")
        o.write(text)
    print(filename + " done")

print("Text extract script completed!")
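
The loop above can also be written as a small function that is easier to test. The sketch below is one possible restructuring, not the committed code: `list_images`, `default_ocr`, `extract_directory`, and the `IMAGE_SUFFIXES` set (including `.png`) are all assumptions. It sorts the file names so the output order is predictable, matches extensions case-insensitively, and takes the OCR step as a parameter so it can be replaced in tests.

```python
from pathlib import Path

# Assumed set of extensions for this sketch; adjust to your own files.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def list_images(image_dir):
    """Return the image files in `image_dir`, sorted by path."""
    return sorted(
        p
        for p in Path(image_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def default_ocr(image_path):
    """Run Tesseract on one image and return the recognized text."""
    # Imported here so the rest of the module works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang="eng")


def extract_directory(image_dir, output_path="output.txt", ocr=default_ocr):
    """Append the text of every image in `image_dir` to `output_path`.

    Returns the number of images processed.
    """
    images = list_images(image_dir)
    with open(output_path, "a", encoding="utf-8") as out:
        for image_path in images:
            text = ocr(image_path)
            out.write("\n\n\n[NEW IMAGE]\n")
            out.write(f"{image_path}\n")
            out.write(text)
            print(f"{image_path.name} done")
    return len(images)
```

Passing `ocr` as an argument is a design choice made for this sketch: a test can supply a plain function that returns fixed text, so the file handling can be checked without Tesseract installed.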