Python library that scrapes images from Flickr in parallel for a given list of locations, such as rome and paris. It extracts the filename and geo information of each image and inserts them into a SQLite database. If an image has no geo information, the Bing Maps API is used to look up coordinates for the generic location that was searched (for example, paris).
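The fallback described above can be sketched roughly as follows. This is only an illustration: `resolve_coordinates`, the `geocode` callable (a stand-in for a Bing Maps lookup) and the `cache` dictionary are hypothetical names, not taken from the project code.

```python
from typing import Callable, Dict, Optional, Tuple

Coordinates = Tuple[float, float]  # (latitude, longitude)


def resolve_coordinates(
    photo_geo: Optional[Coordinates],
    search_text: str,
    geocode: Callable[[str], Coordinates],
    cache: Dict[str, Coordinates],
) -> Coordinates:
    """Return the photo's own coordinates, or fall back to the searched location."""
    if photo_geo is not None:
        return photo_geo
    # Look up each searched location at most once and reuse the result.
    if search_text not in cache:
        cache[search_text] = geocode(search_text)
    return cache[search_text]


def example_geocode(search_text: str) -> Coordinates:
    # Hypothetical stand-in for a real geocoding request.
    return {"paris": (48.8566, 2.3522)}[search_text]


cache: Dict[str, Coordinates] = {}
print(resolve_coordinates(None, "paris", example_geocode, cache))
```

Caching by search text means the geocoding service is queried once per location rather than once per image.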
- Python3
- Pip3: python3 get-pip.py
- API keys: get the Flickr and Bing Maps API keys and insert them into scrape-flickr/config.py
The following tables are created by this code:
- image_metadata: stores image information such as filename and geo information. It has the following fields:
  - id: unique image id
  - filename: title of the image
  - latitude: latitude of the location in the image
  - longitude: longitude of the location in the image
- default_geo_info: stores the fallback geo information, obtained from the Bing Maps API, for images that lack their own. It has the following fields:
  - search_text: location that was searched on Flickr
  - latitude: latitude of the location
  - longitude: longitude of the location
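As a minimal sketch of the two tables listed above, the following creates them with Python's built-in `sqlite3` module. The table and field names come from the list; the column types, the primary keys and the `create_tables` helper are assumptions and may differ from the project's actual schema.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    """Create both tables if they do not exist yet (column types are assumed)."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS image_metadata (
            id        TEXT PRIMARY KEY,  -- unique image id
            filename  TEXT,              -- title of the image
            latitude  REAL,
            longitude REAL
        );
        CREATE TABLE IF NOT EXISTS default_geo_info (
            search_text TEXT PRIMARY KEY,  -- location searched on Flickr
            latitude    REAL,
            longitude   REAL
        );
        """
    )
    conn.commit()


# An in-memory database keeps the example free of side effects.
conn = sqlite3.connect(":memory:")
create_tables(conn)
```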
This code has been tested on macOS and Windows 10.
- Run the shell script, which creates a virtual environment named scraper and installs the required Python packages:
    sh setup.sh
- Activate the virtual environment, if it is not already active:
    . scraper/bin/activate
- Run the code from the project directory py-scrape-flickr:
    python scrape-flickr/scrape_flickr.py paris rome "new york" [--photos_per_page] [-h]
It takes the following arguments:
- locations: one or more locations separated by spaces; put double quotes around any location that contains a space
- --photos_per_page (optional): number of photos to retrieve per page of results (max=500)
- -h (optional): show usage
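The arguments above could be parsed along these lines with `argparse`. This is a sketch, not the project's actual parser: the helper names, the help texts and the default of 500 for `--photos_per_page` are assumptions.

```python
import argparse

MAX_PHOTOS_PER_PAGE = 500  # upper bound stated in the argument list


def photos_per_page(value: str) -> int:
    """Convert the option value to an int and enforce the 1..500 range."""
    number = int(value)
    if not 1 <= number <= MAX_PHOTOS_PER_PAGE:
        raise argparse.ArgumentTypeError(
            f"must be between 1 and {MAX_PHOTOS_PER_PAGE}"
        )
    return number


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Scrape Flickr images for the given locations."
    )
    # One or more locations; the shell keeps quoted names such as "new york" together.
    parser.add_argument("locations", nargs="+", help="locations to search for")
    parser.add_argument(
        "--photos_per_page",
        type=photos_per_page,
        default=MAX_PHOTOS_PER_PAGE,  # assumed default
        help="number of photos to retrieve per page (max 500)",
    )
    return parser


# Parse an explicit list instead of sys.argv so the example is reproducible.
args = build_parser().parse_args(["paris", "rome", "new york", "--photos_per_page", "100"])
print(args.locations, args.photos_per_page)
```

`argparse` adds the `-h` option automatically, so it needs no explicit definition.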
The database scraper.db is created in the project folder (py-scrape-flickr) the first time the code is run.
- Check the results:
    sqlite3 scraper.db "select * from image_metadata;"
- Running time for various input sizes on a 4-processor system:
  - 3 locations: 16 mins
  - 6 locations: 60 mins
  - 10 locations: 104 mins
  These times vary with the locations searched, the number of images available for them, the number of processors on the system, and the speed of the internet connection.
- Run the unit tests:
    python -m unittest discover scrape-flickr/