Google Images Scraper is a Python tool designed to scrape high-resolution images from Google Images based on provided links. It now supports multi-threading for faster scraping. This tool overcomes the limitations of some browser extensions that only download image thumbnails.
- Clone the repository:
    git clone https://github.com/jwiedeman/google-images-scraper.git
- Navigate to the project directory:
    cd google-images-scraper
- Create the virtual environment:
    python -m venv .venv
- Activate the virtual environment:
    # For Linux
    source .venv/bin/activate
    # For Windows PowerShell
    .venv/Scripts/Activate.ps1
    # For Windows Command Prompt
    .venv/Scripts/activate.bat
- Install the required dependencies:
    pip install -r requirements.txt
- Run the scraper by executing the following command:
    python main.py
This script will fetch high-resolution images from Google Images based on the provided links using multi-threading for faster scraping.
You can customize the behavior of the scraper by modifying the config.yaml file.
- sender_email: The email address used for sending notifications.
- receiver_email: The email address to receive notifications.
- sender_email_password: The password for the sender's email account.
- send_email: Set to True or False to turn email sending on or off.
Note: If you want to use the email notifications functionality with a Gmail account, it's recommended to generate an App Password instead of using your account password.
- search_queries: List of search queries to use when scraping Google Images. You can add or remove queries as needed.
- images_limit: The maximum number of images to download per category. Google tends to load at most about 250 images, and sometimes fewer, so 200 is recommended.
- csv_downloads: Directory to store CSV files containing the original link to each downloaded image.
- image_downloads: Directory to store downloaded images.

The project contains the following files:
- downloader.py: Contains the class that downloads images using multi-threading.
- email_service.py: Provides functionality for email notifications (if needed).
- scraper.py: The main scraper class, which initiates the scraping process with multi-threading.
- config.yaml: Configuration file to set up email and scraping parameters.
- link_saver.py: Handles saving image links.
- main.py: The main entry point for running the Google Images Scraper.
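As a rough illustration of the config.yaml keys described above, the sketch below checks a settings dictionary for the documented keys. The helper name validate_config, the expected types, and the example values are all assumptions made for illustration; the project may load and validate its configuration differently. A plain dict stands in for the parsed YAML file.

```python
# Hypothetical helper, not part of the project: it only illustrates the
# documented config.yaml keys. The expected types are assumptions.
EXPECTED_KEYS = {
    "sender_email": str,
    "receiver_email": str,
    "sender_email_password": str,
    "send_email": bool,
    "search_queries": list,
    "images_limit": int,
    "csv_downloads": str,
    "image_downloads": str,
}


def validate_config(config):
    """Return a list of problems found in a parsed config dictionary."""
    problems = []
    for key, expected_type in EXPECTED_KEYS.items():
        if key not in config:
            problems.append(f"missing key: {key}")
        elif not isinstance(config[key], expected_type):
            problems.append(f"{key} should be of type {expected_type.__name__}")
    return problems


# Example values are placeholders, not real settings.
example_config = {
    "sender_email": "sender@example.com",
    "receiver_email": "receiver@example.com",
    "sender_email_password": "app-password-goes-here",
    "send_email": False,
    "search_queries": ["red apples", "green apples"],
    "images_limit": 200,
    "csv_downloads": "csv_downloads",
    "image_downloads": "image_downloads",
}

problems = validate_config(example_config)
```

An empty list means every documented key is present with the assumed type; anything else lists what to fix before running main.py.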
In main.py, an instance of the Scraper class is created as follows:
sc = Scraper(num_threads=5, show_ui=True)
- num_threads: The number of threads, which is also the total number of browser instances. More threads generally result in faster scraping but increase resource usage. Adjust this value based on your system's capabilities and requirements.
- show_ui: Determines whether Selenium runs in headless mode. When set to True, the browser UI is shown during scraping. When set to False, Selenium runs in headless mode, meaning the browser operates in the background without a visible UI. Choose the setting that fits your preference and needs.
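To give a feel for what num_threads controls, here is a minimal sketch of running several queries in parallel with a fixed number of worker threads. It is not the project's implementation: scrape_query and run_queries are hypothetical names, and the placeholder worker only returns a string where the real scraper would drive a Selenium browser instance.

```python
from concurrent.futures import ThreadPoolExecutor


def scrape_query(query):
    """Placeholder worker: a real worker would drive one browser instance."""
    return f"scraped: {query}"


def run_queries(queries, num_threads=5):
    """Process queries in parallel using at most num_threads worker threads."""
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # map() returns results in the same order as the input queries.
        return list(pool.map(scrape_query, queries))


results = run_queries(["cats", "dogs", "birds"], num_threads=2)
```

With num_threads=2, at most two queries are processed at once, which is the trade-off between speed and resource usage described above.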
The rest of the process is straightforward:
- Run the scraper by executing main.py:
    python main.py
- The scraper will start fetching high-resolution images from Google Images based on the provided links and configurations, using the specified number of threads and UI visibility.
- Monitor the scraping progress and any notifications sent via email, as configured in config.yaml.
Contributions to Google Images Scraper are welcome and encouraged! To contribute, follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make your changes and test thoroughly.
- Commit your changes with descriptive commit messages.
- Push your changes to your fork.
- Open a pull request, explaining the changes you've made.
This project is licensed under the MIT License.
- Image deduplication: when saving images, the scraper now uses imagehash to ensure the same image isn't already saved.
- Saved image names use the next index after the count of images in the result folder, which avoids overwriting images with the same name on subsequent crawls.
- Removed sleeps to speed up the process; the scraper now waits for element visibility and continues immediately.
- Updated selectors for element interaction.
- Added a new case for "scroll more": there are now 3 distinct messages that may block scrolling unless clicked.
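The index-based naming mentioned in the list above can be sketched as follows. This is an assumption about how such naming might work, not the project's code: next_image_path is a hypothetical helper, the set of image suffixes is a guess, and numbering from count + 1 is one possible reading of "next index".

```python
import tempfile
from pathlib import Path

# Assumed set of suffixes that count as images in the result folder.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def next_image_path(folder, suffix=".jpg"):
    """Return a path named after the next index in the folder (hypothetical)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    # Count the images already present and use count + 1 as the new name.
    existing = [p for p in folder.iterdir() if p.suffix.lower() in IMAGE_SUFFIXES]
    return folder / f"{len(existing) + 1}{suffix}"


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    first = next_image_path(tmp)
    first.write_bytes(b"placeholder image data")
    second = next_image_path(tmp)
    names = (first.name, second.name)
```

One caveat of any count-based scheme is that deleting files from the folder by hand can make a later index collide with an existing name, so a real implementation may want an extra existence check.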
- ML Image augmentation export
- Export downloaded images as a YoloV[X] dataset.
- Speed up image downloading process
- Check image links against the downloaded-links CSV before downloading, to avoid downloading an image and computing its hash only to discard it.
- Import search terms via CSV
- CLI commands to clear folders, export, resume where left off
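The planned pre-download check in the list above could look roughly like this. It is only a sketch of the idea: the CSV layout (a header row with a column named link) and the function names load_downloaded_links and filter_new_links are assumptions, and an in-memory file stands in for a real CSV on disk.

```python
import csv
import io


def load_downloaded_links(csv_file, column="link"):
    """Read already-downloaded links from an open CSV file (assumed layout)."""
    reader = csv.DictReader(csv_file)
    return {row[column] for row in reader if row.get(column)}


def filter_new_links(candidate_links, downloaded_links):
    """Keep only links that were not downloaded before, preserving order."""
    seen = set(downloaded_links)
    new_links = []
    for link in candidate_links:
        if link not in seen:
            seen.add(link)  # also drops repeats within the same batch
            new_links.append(link)
    return new_links


# In-memory stand-in for a CSV written by a previous crawl.
existing_csv = io.StringIO(
    "link\nhttps://example.com/a.jpg\nhttps://example.com/b.jpg\n"
)
downloaded = load_downloaded_links(existing_csv)
to_download = filter_new_links(
    [
        "https://example.com/a.jpg",
        "https://example.com/c.jpg",
        "https://example.com/c.jpg",
    ],
    downloaded,
)
```

Only links that pass this filter would need to be downloaded and hashed, which is where the efficiency gain would come from.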
This program lets you download tons of images from Google Images. Please do not download or use any image that violates its copyright terms.