Perform batch OCR processing on your images and screenshots, and gather the text into one formatted file with Gemini 1.5 Flash.
Ever save lots of screenshots or bookmarks to reference later, but never actually go back to them? This tool is intended to make extracting the text from those images and screenshots much easier.
Gemini 1.5 Flash was chosen for the job because it greatly outperforms open-source OCR solutions like Tesseract and EasyOCR, while being quicker and free or cheaper to use than many other multimodal LLMs.
- Batch OCR processing of images
- Easy to customize system prompt
- `.json` output by default, with optional formatted `.md` and `.docx` file types
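To make the batch flow concrete, here is a minimal sketch of how images in a folder could be processed one by one and gathered into a single `.json` file. It is not the tool's actual implementation: the function names, the `file`/`text` record layout, and the list of image suffixes are assumptions for illustration, and the OCR step is passed in as a plain callable so the sketch does not depend on any particular Gemini client API.

```python
import json
from pathlib import Path
from typing import Callable

# Assumed set of image types; the real tool may accept others.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".webp"}


def batch_ocr(image_dir: Path, ocr: Callable[[Path], str]) -> list[dict]:
    """Run `ocr` on every image in `image_dir`, in filename order."""
    results = []
    for path in sorted(image_dir.iterdir()):
        # Skip subfolders and anything that does not look like an image.
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            results.append({"file": path.name, "text": ocr(path)})
    return results


def write_json(results: list[dict], out_path: Path) -> None:
    """Gather all extracted text into one JSON file."""
    out_path.write_text(
        json.dumps(results, indent=2, ensure_ascii=False), encoding="utf-8"
    )
```

In practice the `ocr` argument would wrap a call to the model with the system prompt; keeping it as a parameter also makes the batch logic easy to test with a stub.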
You will need to have the following: uv (used to run the tool) and a Gemini API key.
- Clone the repo
  `git clone https://github.com/yuandere/screenshotscribe`
- Create a `.env` file in the project directory and add your API key
  `GEMINI_API_KEY = XXXXXX`
- Move any images you want processed into the folder `/images_to_process`
- Run
  `uv run screenshotscribe`
  optionally add the `-t` flag to specify output file type: (j)son, (m)arkdown, or (d)ocx
  `uv run screenshotscribe -t m`
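As an illustration of how the `-t` flag could map to an output file type, here is a small `argparse` sketch. The function name, the `OUTPUT_TYPES` mapping, and the restriction to single-letter values are assumptions; the actual CLI may parse its arguments differently.

```python
import argparse

# Hypothetical mapping from the short -t values to file extensions.
OUTPUT_TYPES = {"j": ".json", "m": ".md", "d": ".docx"}


def parse_output_type(argv=None) -> str:
    """Return the output file extension selected by -t (JSON by default)."""
    parser = argparse.ArgumentParser(prog="screenshotscribe")
    parser.add_argument(
        "-t",
        dest="output_type",
        choices=sorted(OUTPUT_TYPES),
        default="j",
        help="output file type: (j)son, (m)arkdown, or (d)ocx",
    )
    args = parser.parse_args(argv)
    return OUTPUT_TYPES[args.output_type]
```

With this sketch, running without `-t` selects `.json`, and `-t m` selects `.md`.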
Contributions are welcome. Feel free to create a pull request or submit an issue for new features, fixing bugs, or improving documentation.
Distributed under the MIT License. See LICENSE.txt for more information.