This tool is used to detect birds in ancient manuscripts. There are two different approaches:
- Using LLaVA (a large language model with vision), as in the file `llava_imgs_tcli.py`. A newer and more general implementation can be found in the file `llava_bird_detector_automated.py`.
- Using GroundingDINO, a state-of-the-art object detector, as in the file `groundingdino_birds.py`.

We prefer the second method because it is faster, lets us adjust the detection threshold, and has a better negative predictive value (NPV). With it, we can exclude all images without birds and thus avoid manually checking every image in a manuscript. Both methods still produce many false positives (FPs).
Examples of TP, FP, FN, and TN images are shown below:
- Example of an image with a correct bird detection (TP)
- Example of an image with incorrect bird detections (FPs). A correct bird is also detected.
- Example of an image with a bird that was not detected (FN)
- Example of an image with no detections and no bird present (TN)
This Python script performs object detection in images, in our case bird detection. It uses a pre-trained model from the GroundingDINO project; instructions for installing the dependencies can be found at the project link, and the model weights should also be downloaded from that repository.
It returns a file named `output_dino.txt` with the name of each image, its page number in the PDF, and whether it contains a bird (yes or no). It also reports the total processing time for all images (< 1 s/image).
- `model`: The pre-trained model loaded from a path (needs to be specified by the user).
- `path_imgs`: The directory path where the images to be processed are stored (needs to be specified by the user).
- `all_imgs`: List of all images in the `path_imgs` directory.
- `save_imgs_path`: The directory path where the annotated images will be saved.
- `IMAGE_PATH`: The path of the current image being processed.
- `TEXT_PROMPT`: The label for the object of interest ("bird" in our case).
- `BOX_TRESHOLD`: The confidence threshold for object detection (0.4 in our case).
- `TEXT_TRESHOLD`: The confidence threshold for displaying the label (0.4 in our case).
- `image_source`, `image`: The original and the processed image.
- Load the pre-trained model.
- Get the list of all images in the specified directory.
- Create a directory for saving the annotated images if it doesn't exist.
- For each image in the directory:
  - Load the image.
  - Predict the bounding boxes, confidence scores, and labels.
  - Annotate the image with the predictions.
  - Save the annotated image.
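For orientation, here is a minimal sketch of that loop, assuming the standard GroundingDINO inference helpers (`load_model`, `load_image`, `predict`, `annotate`). The config file, weight path, image folders, and output formatting are placeholders, not the exact values used in `groundingdino_birds.py`:

```python
import os
import cv2
from groundingdino.util.inference import load_model, load_image, predict, annotate

# Placeholder paths: point these at your own config, weights, and image folder.
model = load_model("GroundingDINO_SwinT_OGC.py", "groundingdino_swint_ogc.pth")
path_imgs = "manuscript_images/"
save_imgs_path = "annotated_images/"
os.makedirs(save_imgs_path, exist_ok=True)

TEXT_PROMPT = "bird"
BOX_TRESHOLD = 0.4
TEXT_TRESHOLD = 0.4

with open("output_dino.txt", "w") as out:
    for page, name in enumerate(sorted(os.listdir(path_imgs)), start=1):
        IMAGE_PATH = os.path.join(path_imgs, name)
        image_source, image = load_image(IMAGE_PATH)
        boxes, logits, phrases = predict(
            model=model,
            image=image,
            caption=TEXT_PROMPT,
            box_threshold=BOX_TRESHOLD,
            text_threshold=TEXT_TRESHOLD,
        )
        # Any box above the threshold counts as a bird detection.
        annotated = annotate(image_source=image_source, boxes=boxes,
                             logits=logits, phrases=phrases)
        cv2.imwrite(os.path.join(save_imgs_path, name), annotated)
        out.write(f"{name} {page} {'yes' if len(boxes) > 0 else 'no'}\n")
```

The yes/no decision is simply "at least one box above `BOX_TRESHOLD`", which is what makes the method useful for ruling out bird-free pages.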
This script is meant to be run standalone. It does not take any command-line arguments; all configuration is done within the script itself.
This script was created based on the issue haotian-liu/LLaVA#540.
The files `tcli` and `cli` should be placed inside the `LLaVA\llava\serve` folder (the repository is cloned from git below).
If errors occur, consider increasing `time.sleep(25)` to `time.sleep(30)`.
It returns a file named `output_llava.txt` with the name of each image, its page number in the PDF, and whether it contains a bird (yes or no). It also reports the time it took to process one image (the final one) and the total processing time for all images (~35 s/image).
To run this script, you need to clone the LLaVA repository and set up the environment as follows:
- Clone the LLaVA repository: `git clone https://github.com/haotian-liu/LLaVA.git`
- Navigate to the LLaVA directory: `cd LLaVA`
- Create a new conda environment named `llava` with Python 3.10: `conda create -n llava python=3.10 -y`
- Activate the `llava` environment: `conda activate llava`
- Upgrade pip to enable PEP 660 support: `pip install --upgrade pip`
- Install the LLaVA package: `pip install -e .`
Below are some important libraries that are used, along with a brief explanation of each:
- `subprocess`: Allows you to spawn new processes, connect to their input/output/error pipes, and obtain their return codes.
- `select`: Allows high-efficiency I/O multiplexing.
- `fcntl`: Performs file control and I/O control on file descriptors.
- `errno`: Defines symbolic names for the standard errno error codes.
- `selectors`: High-level I/O multiplexing.
- `img_path`: The path where the images are stored (needs to be specified by the user).
- `all_imgs`: A sorted list of all images in the directory specified by `img_path`.
- `all_paths`: A list of paths for each image in `all_imgs`.
- `commands`: A list of commands to be executed. In this case, it is a command to run the `tcli` module of the `llava.serve` package with the specified model path and the load-4bit option.
A newer implementation of the above script that can easily be generalized to tasks other than detection. With it, we can ask the LLM any question about the image.
This script is used to extract images from a PDF file.
- `PyMuPDF` (imported as `fitz`): A Python library for reading PDF documents and extracting their contents, including embedded images. It should be installed with: `pip install PyMuPDF`
- `pdf_path`: The path to the PDF file from which images are to be extracted (needs to be specified by the user).
- `output_dir`: The directory where the extracted images will be stored (needs to be specified by the user).
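A minimal sketch of such an extraction loop is shown below, assuming the script relies on PyMuPDF (`fitz`), as the install command suggests; the file names and paths are placeholders:

```python
import os
import fitz  # PyMuPDF

pdf_path = "manuscript.pdf"          # placeholder input PDF
output_dir = "manuscript_images"     # placeholder output folder
os.makedirs(output_dir, exist_ok=True)

doc = fitz.open(pdf_path)
for page_number, page in enumerate(doc, start=1):
    # page.get_images() lists the image XObjects embedded on this page.
    for img_index, img in enumerate(page.get_images(full=True), start=1):
        xref = img[0]
        info = doc.extract_image(xref)   # raw image bytes plus original extension
        name = f"page{page_number:04d}_img{img_index}.{info['ext']}"
        with open(os.path.join(output_dir, name), "wb") as f:
            f.write(info["image"])
doc.close()
```

Keeping the page number in the file name makes it easy to report it later in `output_dino.txt` and `output_llava.txt`.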