This repository is a proof-of-concept (POC) for detecting duplicate images using facial recognition techniques. It leverages DeepFace for face detection and feature embedding, and provides options for both single-process and multi-process deduplication. Celery integration allows for distributed task processing, making it suitable for larger datasets.
- Face Recognition:
  - Detect faces in images.
  - Generate feature embeddings using DeepFace.
- Duplicate Detection:
  - Compare embeddings to identify duplicate or similar images.
  - Flexible backend and model configurations.
- Multi-Process Support:
  - Local execution with single or multiple processes.
  - Distributed task processing using Celery.
- Performance Metrics:
  - Tracks encoding and deduplication times.
  - Provides a comprehensive HTML report.
uv venv
uv sync
export DEEPFACE_HOME=$PWD
This sets the working directory as the DeepFace home, ensuring all necessary models and configurations are correctly accessed.
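To check where model weights will end up after setting DEEPFACE_HOME, a small helper like the one below can be used. It is a sketch that assumes DeepFace stores weights under a `.deepface/weights` directory inside the configured home and falls back to the user's home directory when the variable is unset; `deepface_weights_dir` is a hypothetical name, not part of this project or of DeepFace.

```python
import os
from pathlib import Path


def deepface_weights_dir() -> Path:
    """Return the directory where DeepFace is assumed to keep model weights.

    Assumption: weights live in "<home>/.deepface/weights", where <home> is
    the value of DEEPFACE_HOME if set, otherwise the user's home directory.
    """
    home = os.environ.get("DEEPFACE_HOME", str(Path.home()))
    return Path(home) / ".deepface" / "weights"
```

Printing `deepface_weights_dir()` after running the export above should show a path inside the project directory, if the assumption about the layout holds for the installed DeepFace version.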
Prepare a directory with images (e.g., data/IMAGES) and run:
dedupe data/IMAGES -p 1
For improved performance on larger datasets, specify the number of processes (e.g., 4):
dedupe data/IMAGES -p 4
In the first terminal, start a Celery worker:
watchmedo auto-restart --directory=./src/ --pattern "*.py" --recursive -- celery -A recognizeapp.c.app worker
In a second terminal, run the deduplication task with Celery:
dedupe data/IMAGES -p 4 --queue
Flower provides a web interface to monitor Celery workers and tasks.
In a separate terminal, start Flower:
watchmedo auto-restart --directory=./src/ --pattern "*.py" --recursive -- celery -A recognizeapp.c.app flower
Open your browser and navigate to:
http://localhost:5555
The project uses DeepFace for face detection and embedding generation. Supported models and backends include:
- Models: VGG-Face, Facenet, DeepFace, ArcFace, and others.
- Backends: opencv, mtcnn, retinaface, and more.
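The DeepFace.represent() function generates a feature vector (embedding) for each detected face. The vector's dimensionality depends on the chosen model (for example, 128 for Facenet and 512 for ArcFace). The embeddings are then compared to identify duplicates.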
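As an illustration of the embedding step, here is a minimal sketch of a wrapper around DeepFace.represent(). It assumes a recent DeepFace version in which represent() returns a list of dictionaries with an "embedding" key and raises ValueError when no face is found with enforce_detection=True. The names `encode_image` and `_deepface_represent` are hypothetical and are not taken from this repository.

```python
from typing import Any, Callable, Dict, List, Union

NO_FACE_DETECTED = "NO_FACE_DETECTED"


def _deepface_represent(img_path: str, **kwargs: Any) -> List[Dict[str, Any]]:
    # Imported lazily so encode_image can be exercised without DeepFace installed.
    from deepface import DeepFace

    return DeepFace.represent(img_path=img_path, **kwargs)


def encode_image(
    img_path: str,
    model_name: str = "VGG-Face",
    detector_backend: str = "opencv",
    represent: Callable[..., List[Dict[str, Any]]] = _deepface_represent,
) -> Union[str, List[List[float]]]:
    """Return one embedding per detected face, or NO_FACE_DETECTED."""
    try:
        faces = represent(
            img_path,
            model_name=model_name,
            detector_backend=detector_backend,
            enforce_detection=True,
        )
    except ValueError:
        # Assumption: DeepFace signals "no face found" with ValueError
        # when enforce_detection is True.
        return NO_FACE_DETECTED
    return [face["embedding"] for face in faces]
```

The `represent` parameter exists only so the wrapper can be tested with a stand-in function; in normal use it would be left at its default.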
- Encoding:
  - Images are processed to extract face embeddings.
  - Images without detectable faces are flagged as NO_FACE_DETECTED.
- Comparison:
  - Feature vectors are compared using cosine similarity or distance-based metrics.
  - Duplicate pairs are identified if the similarity exceeds a defined threshold.
- Report Generation:
  - Results are saved in JSON format, and an HTML report is generated.
  - Metrics like total time, new images processed, and findings are included.
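The comparison and reporting steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation: it assumes one embedding per image, uses cosine similarity, and picks an arbitrary threshold of 0.6. The function names and the shape of the JSON output are hypothetical.

```python
import json
import math
from itertools import combinations
from typing import Dict, List, Sequence, Union

NO_FACE_DETECTED = "NO_FACE_DETECTED"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(
    encodings: Dict[str, Union[str, Sequence[float]]],
    threshold: float = 0.6,  # hypothetical value, tune per model
) -> List[dict]:
    """Compare every pair of images and keep pairs above the threshold."""
    # Skip images that were flagged during encoding.
    usable = sorted(
        (path, emb) for path, emb in encodings.items() if emb != NO_FACE_DETECTED
    )
    findings = []
    for (path_a, emb_a), (path_b, emb_b) in combinations(usable, 2):
        similarity = cosine_similarity(emb_a, emb_b)
        if similarity > threshold:
            findings.append(
                {"first": path_a, "second": path_b, "similarity": round(similarity, 4)}
            )
    return findings


def findings_to_json(findings: List[dict]) -> str:
    """Serialize findings so they can be written to a results file."""
    return json.dumps({"findings": findings}, indent=2)
```

Comparing all pairs is quadratic in the number of images, which is one reason the encoding and comparison work is worth spreading over several processes.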
- For local execution, Python's multiprocessing module is used to parallelize encoding and deduplication.
- Distributed task execution with Celery allows for scaling across multiple machines.
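For the local case, the parallel encoding step might look roughly like the sketch below, which uses multiprocessing.Pool. The function `encode_all` and its parameters are hypothetical; the project's actual code may split the work differently.

```python
from multiprocessing import Pool
from typing import Any, Callable, Dict, List


def encode_all(
    paths: List[str],
    encode: Callable[[str], Any],
    processes: int = 1,
) -> Dict[str, Any]:
    """Apply `encode` to every path, optionally across several processes.

    `encode` must be a picklable, module-level function when processes > 1.
    """
    if processes <= 1:
        # Single-process path: no pool overhead, easier to debug.
        return {path: encode(path) for path in paths}
    with Pool(processes=processes) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = pool.map(encode, paths)
    return dict(zip(paths, results))
```

On platforms that start worker processes with "spawn", the call with processes greater than 1 should be made from code guarded by `if __name__ == "__main__":`.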
- -p, --processes: Number of processes to use (default: CPU count).
- --queue: Use Celery for distributed task processing.
- --reset: Reset findings and encodings before processing.
- --report: Generate an HTML report after deduplication.
- --model-name: Specify the model name to use (e.g., VGG-Face, ArcFace, Facenet, ...).
- --detector-backend: Specify the detector backend to use (e.g., retinaface, mtcnn, ...).
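To make the option semantics concrete, here is a sketch of an equivalent parser written with argparse. It mirrors the flags listed above, but it is not the project's CLI code, which may use a different library. The default of opencv for the detector backend is an assumption, and `build_parser` is a hypothetical name.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented dedupe options."""
    parser = argparse.ArgumentParser(
        prog="dedupe", description="Detect duplicate images by face embeddings."
    )
    parser.add_argument("path", help="Directory containing the images.")
    parser.add_argument(
        "-p", "--processes", type=int, default=os.cpu_count(),
        help="Number of processes to use (default: CPU count).",
    )
    parser.add_argument("--queue", action="store_true",
                        help="Use Celery for distributed task processing.")
    parser.add_argument("--reset", action="store_true",
                        help="Reset findings and encodings before processing.")
    parser.add_argument("--report", action="store_true",
                        help="Generate an HTML report after deduplication.")
    parser.add_argument("--model-name", default="VGG-Face",
                        help="Model name, e.g. VGG-Face, ArcFace, Facenet.")
    # Assumption: opencv as the default backend; the project may differ.
    parser.add_argument("--detector-backend", default="opencv",
                        help="Detector backend, e.g. retinaface, mtcnn.")
    return parser
```

For example, `build_parser().parse_args(["data/IMAGES", "-p", "4", "--queue"])` yields a namespace with `processes` set to 4 and `queue` set to True.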
- NO_FACE_DETECTED for Valid Images:
  - Ensure the correct model (e.g., VGG-Face, ArcFace; the default is VGG-Face) and the correct detector backend are specified.
  - Try setting enforce_detection=False in DeepFace.
- Celery Worker Not Starting:
  - Check that watchmedo is installed: uv sync.
  - Verify Celery configurations.
- Performance Issues:
  - Use multiple processes for large datasets.
  - Choose lighter models (at some cost in accuracy).
- Add support for storing encodings in databases (e.g., postgres, redis).
- Integrate GPU acceleration for faster embedding generation.
- Extend reporting capabilities with more detailed analytics.