This package is intended EXCLUSIVELY for academic research and demonstrative purposes. The authors accept no responsibility for how the tool is used or for the consequences of its use. Keep in mind that running this code is forbidden; it is only a demonstration of how scraping works. If you decide to run it anyway, you assume full responsibility for the consequences.
To use the crawlers for Facebook, LinkedIn and X, the Chrome browser and ChromeDriver must be installed. ChromeDriver is a standalone server that implements the W3C WebDriver standard, an open source tool for automated testing of web apps.
Navigate to Google Chrome’s official website and download the appropriate version for your operating system.
If you use Ubuntu, you can instead install Chrome through the snap store. With this method, chromedriver should be installed automatically. You can check whether it is already present on your machine (for example with the snippet below); if it is, you can skip the chromedriver installation described in the next step.
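A minimal sketch for that check, assuming the chromedriver executable would be visible on your PATH:

```python
# check whether chromedriver is reachable and print its version
import shutil
import subprocess

path = shutil.which("chromedriver")  # looks up the executable on PATH
if path is None:
    print("chromedriver not found on PATH: follow the installation step below")
else:
    # print the installed version so it can be compared with Chrome's version later
    print(subprocess.run([path, "--version"], capture_output=True, text=True).stdout.strip())
```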
Notice that this is required for most of the presented packages to work.
Open Google Chrome, click on the three vertical dots on the top right -> Help -> About Google Chrome. Note down the version number.
Visit the ChromeDriver download page and download the version that matches your Chrome version.
unzip chromedriver_win32.zip # For Windows
tar -xvf chromedriver_mac64.zip # For Mac
tar -xvf chromedriver_linux64.tar.gz # For Linux
Move the chromedriver binary to /usr/bin/ or any location in your system's PATH:
sudo mv chromedriver /usr/bin/
Tested on chromium and chromedriver v.123.0.6312.122
Creating a new virtual environment is highly recommended every time you use a crawler for a different social network.
In the home folder of a linux system:
python3 -m venv ./venv
Activate it:
source venv/bin/activate
Navigate to the folder of the social you want to work with and install the requirements of the package:
pip install -r requirements.txt
Run google-chrome from the terminal, or find Google Chrome in your applications and launch it.
chromedriver --version
This should match the Google Chrome version you noted earlier.
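If you prefer to check this programmatically, the sketch below compares the major versions of the two binaries; it assumes the google-chrome and chromedriver executables are on your PATH:

```python
# compare the major versions of Chrome and ChromeDriver
import subprocess

def major_version(cmd):
    # both binaries print something like "Google Chrome 123.0.6312.122"
    out = subprocess.run([cmd, "--version"], capture_output=True, text=True).stdout
    return next(tok.split(".")[0] for tok in out.split() if tok[0].isdigit())

chrome = major_version("google-chrome")
driver = major_version("chromedriver")
print("OK" if chrome == driver else f"Mismatch: Chrome {chrome} vs ChromeDriver {driver}")
```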
pip show selenium
This will display Selenium package details, confirming its installation in your virtual environment.
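As a final sanity check, the snippet below opens Chrome through Selenium and prints a page title. This is a minimal sketch and assumes Selenium 4 (which can locate chromedriver on the PATH automatically) and a working chromedriver installation:

```python
# smoke test: open Chrome through Selenium and print the page title
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run without opening a window
driver = webdriver.Chrome(options=options)
driver.get("https://example.org")
print(driver.title)  # should print "Example Domain"
driver.quit()
```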
Each social media platform was designed with a unique architecture, which means that the methods for data extraction can vary. However, we have standardized the formatting in the parser across different platforms for ease of use, although some variations in the arguments may still occur.
When a command is issued, the program initiates by logging into the social media platform with the necessary credentials (if required) and begins scraping the specified information. Upon completion, the program generates a .json file that is compatible with a MongoDB environment, making it ready for analysis.
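For example, once a run has finished, the resulting file can be loaded into a MongoDB collection with a few lines of pymongo. This is a minimal sketch: the file name, connection string, database and collection names are placeholders, and it assumes the output is a JSON array of documents:

```python
# load a crawler output file into a MongoDB collection (illustrative names only)
import json
from pymongo import MongoClient

with open("output.json", encoding="utf-8") as f:     # placeholder file name
    documents = json.load(f)                          # expects a JSON array of objects

client = MongoClient("mongodb://localhost:27017")     # assumes a local MongoDB instance
collection = client["crawler_db"]["posts"]            # placeholder database/collection
collection.insert_many(documents if isinstance(documents, list) else [documents])
```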
If a .json file is created but not completely written, for example because execution was interrupted, the crawler still saves the file. If a file for the desired social profile page already exists, the crawler resumes from the last post id stored in the file and appends to it. However, the file must be valid JSON for this to work, so you should validate the output file (e.g. fix the line where the crawler started appending) to make sure it is a valid .json.
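A quick way to check whether an interrupted output file is still valid JSON before resuming (a minimal sketch; the file name is a placeholder):

```python
# verify that a (possibly interrupted) crawler output file is valid JSON
import json

path = "profile_output.json"   # placeholder: the file produced by the crawler
try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
    print(f"{path} is valid JSON: the crawler can safely append to it")
except json.JSONDecodeError as e:
    # fix the file manually around this position (e.g. the line where appending started)
    print(f"{path} is NOT valid JSON: error at line {e.lineno}, column {e.colno}")
```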
Below you will find more information on the parameters that can be used with each social media scraper.
- Add target date for crawling as a parameter
python main.py -u '<your_username>' -p '<your_password>' -q '<name_of_fb_page_to_query>' -n <number_of_desired_items>
You can use this package from the command line; it will:
- Log in to a session using a fake user agent (making your scraper look like a regular browser) and create a session.json, or reuse a previously created one by selecting the existing session.json (see the sketch after this list)
- The session.json is needed to simulate a saved login from the same device. In real life, you log in to Facebook on a device once and can then use it for a long time without logging in again;
- If you are willing to take risks or to change account, you can manually delete the created session.json file and the package will run exactly as it did the first time you launched it.
- Search for the requested profile
- Save found information in a .json file that can be used in a MongoDB environment
- Close the driver
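For context, the general idea behind such a saved-session file with Selenium looks roughly like the sketch below. This is illustrative only, not the package's actual implementation; the URL and file name simply follow the description above:

```python
# illustrative sketch of cookie-based session persistence with Selenium
import json, os
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.facebook.com")

if os.path.exists("session.json"):
    # reuse the saved session: load the cookies into the browser and refresh
    with open("session.json") as f:
        for cookie in json.load(f):
            driver.add_cookie(cookie)
    driver.refresh()
else:
    # ... perform the normal login flow here, then persist the cookies ...
    with open("session.json", "w") as f:
        json.dump(driver.get_cookies(), f)
```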
Parameter | Type | Description |
---|---|---|
username | (str) | Username that will be used to access the Facebook account |
password | (str) | Password of the username that will be used to access the Facebook account |
query | (str) | Profile to be searched on Facebook |
num_posts | (int) | Number of posts to scrape, starting from the most recent one. Default: 3 |
The Instagram package is available in two versions tailored to specific use cases:
- Version 1: Offers detailed analysis for up to the first dozen posts, ideal for in-depth data gathering on recent content.
- Version 2: Provides essential information suitable for long-term analysis, enabling broader insights over extended periods.
For further details, please refer to the README file included in the package.
- get profile information about the current and previous work experiences
An example command is shown below (a detailed explanation follows):
# python3 main.py --chromedriver '</path/to/chromedriver>' --username '<your_username>' --password '<your_password>' --query '<profile_name>' --numposts <number_of_posts>
You can use this package from the command line; it will:
- Log in to a new session and create a last_cookies.json, or reuse a previously created one by selecting the existing last_cookies.json
- The last_cookies.json is needed to simulate a saved login from the same device. In real life, you log in to LinkedIn on a device once and can then use it for a long time without logging in again;
- If you are willing to take risks or to change account, you can manually delete the created last_cookies.json file and the package will run exactly as it did the first time you launched it.
- Search for the requested profile
- Save found information in a .json file that can be used in a MongoDB environment
- Close the driver
Parameter | Type | Description |
---|---|---|
chromedriver | (str) | Path to the ChromeDriver used to automate the browser |
username | (str) | Username that will be used to access the LinkedIn account |
password | (str) | Password of the username that will be used to access the LinkedIn account |
query | (str) | Profile name, extracted from the URL, to be searched on LinkedIn |
num_posts | (int) | Number of posts to scrape, starting from the most recent one. Default: 3 |
The package used for the X social media platform builds upon postget created by alessandriniluca. We extend our gratitude to the original authors for their foundational work.
An example command is shown below (a detailed explanation follows):
python3 main.py --username '<your_username>' --password '<your_password>' --query '<query_to_be_performed>' --email_address '<mail_of_the_account>' --num_scrolls 10 --wait_scroll_base 3 --wait_scroll_epsilon 1 --mode 1
postget searches for images and tweets in two different ways:

- Mode `0`, or simple search: it simply greps all the links of images and videos detected while scrolling
- Mode `1`, or complete search: it also greps all the information of a tweet, such as the id of the discussion and the author. Notice that if you want to restrict the search between two tweet ids, it is necessary to operate in this mode.
Why keep both? Because in mode `1` many things can go wrong: it is enough for a single div lookup to fail for the entire search to crash.
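To make this concrete: in mode 1 every extra field comes from a separate element lookup, so each lookup needs to be guarded individually. A minimal sketch of that idea (the selector and field names are placeholders, not the package's actual code):

```python
# illustrative: guard each per-tweet lookup so one missing div does not abort the whole search
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

def extract_tweet_info(tweet_element):
    info = {}
    try:
        # placeholder selector: the real markup changes frequently
        info["author"] = tweet_element.find_element(By.CSS_SELECTOR, "div[data-testid='User-Name']").text
    except NoSuchElementException:
        info["author"] = None  # skip the field instead of crashing the entire search
    return info
```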
You can use this package from the command line; postget will:
- log in
- search for the query according to the operating mode
- perform scrolls
- print the image and video preview links or the tweet information, according to the operating mode
- save the found information in a .json file that can be used in a MongoDB environment
- close the driver
Notice that this means that a second call will imply a new login phase.
Parameter | Type | Description |
---|---|---|
username | (str) | Username that will be used to access the Twitter account |
password | (str) | Password of the username that will be used to access the Twitter account |
query | (str) | Query to be searched on Twitter |
wait_scroll_base | (int) | Base time to wait between one scroll and the next (in seconds, default 15) |
wait_scroll_epsilon | (float) | Random time added to the base wait between scrolls, in order to avoid being detected as a bot (in seconds, default 5) |
num_scrolls | (int) | Number of scrolls to be performed, default 10 |
since_id | (int) | Id from which tweets will be saved (tweets with an id lower than this value are discarded). If set to -1 (default value), this parameter is not considered. Notice that it is considered only if max_id is also set, and works only in search mode 1 |
max_id | (int) | Id until which tweets will be saved (tweets with an id greater than this value are discarded). Notice that it is considered only if since_id is also set, and works only in search mode 1 |
mode | (int) | Selects the operating mode, default 0 |
since | (str) | Date (excluded) from which tweets will be returned. Format: YYYY-MM-DD, UTC time. Temporarily supported only for mode 1. If since_time or until_time is also set, this is ignored. Wrong formats are ignored |
until | (str) | Date (included) until which tweets will be returned. Format: YYYY-MM-DD, UTC time. Temporarily supported only for mode 1. If since_time or until_time is also set, this is ignored. Wrong formats are ignored |
since_time | (str) | Time from which tweets will be returned. Format: timestamp in SECONDS, UTC time. Temporarily supported only for mode 1 |
until_time | (str) | Time until which tweets will be returned. Format: timestamp in SECONDS, UTC time. Temporarily supported only for mode 1 |
headless | (bool if imported, just type --headless if called from command line) | If specified, runs the browser in headless mode. Unfortunately something changed since the first version of postget and this no longer works; a section has been added to the roadmap for it |
chromedriver | (str) | Custom path to the chromedriver. If not specified, the code will try to find the chromedriver path automatically |
email_address | (str) | Email of the account, required since you may sometimes be asked to insert it to verify the account |
root | (bool if imported, just type --root if called from command line) | If specified, adds the --no-sandbox option to the Chrome options, needed to run as root. Please notice that running as root is not safe for security reasons |
A couple of words on advanced filters:

- `since_id` and `max_id`: if one of them is not set, or is set to the default value, the other one will be ignored even if correctly set. If both are correctly set, tweets with an id within `[since_id, max_id]` will be saved (extremes included).
- Precedence among `since_id`, `max_id`, `since`, `until`, `since_time`, `until_time`:
  - Defining even just one parameter among `since` or `until` invalidates `since_id` and `max_id` (they simply will not be considered).
  - Defining even just one parameter among `since_time` or `until_time` invalidates `since` and `until` (they simply will not be considered). The same reasoning applies to `since_id` and `max_id` when one of `since_time` or `until_time` is defined.
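For example, combining the flags above, a mode-1 search restricted to a date window (recall that since/until take precedence over since_id/max_id) could look like the following; the exact flag spelling for since and until is assumed to mirror the other command-line options:

python3 main.py --username '<your_username>' --password '<your_password>' --email_address '<mail_of_the_account>' --query '<query_to_be_performed>' --mode 1 --since 2024-01-01 --until 2024-01-31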
The web crawler uses Scrapy, an open source library for easily crawling HTML in Python. In addition, we used a couple of other libraries to preprocess the crawled text. As mentioned in the Virtual Environment section, all the libraries required for the project are listed in the requirements.txt file (refer to that section for instructions on how to install them).
Since there are different websites and blogs to crawl, the website needs to be specified. There are 6 different crawlers (called spiders in Scrapy), one for each website/blog.
Blog Name | Crawler Name | Link |
---|---|---|
DAL 15 AL 25 | gazzetta | https://dal15al25.gazzetta.it/ |
SKY SPORT | sky_sport | https://sport.sky.it/argomenti/ |
To run a specific crawler you will need to run the following command in the folder of the web crawler (i.e. web/):
scrapy crawl [crawler_name]
`[crawler_name]` can be replaced by any of the crawler names specified in the table above, e.g. gazzetta or sky_sport. Add `-L WARNING` to stop Scrapy from printing the scraped items.
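For example, to run the gazzetta spider without logging every scraped item:

scrapy crawl gazzetta -L WARNING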
Please note that the JSON output of the crawler is stored in `web/[crawler_name]_output.json`. Each time that specific crawler is run, the file is overwritten. If you want to change the name of the file manually, you can modify line 14 of `web/web_crawler/pipelines.py` and change the file name before running the crawler.
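For reference, a JSON-writing pipeline of this kind typically looks roughly like the sketch below; this is illustrative only and not the actual contents of `web/web_crawler/pipelines.py`:

```python
# Illustrative Scrapy pipeline that collects items and writes them to a JSON file.
# The output file name below is the kind of setting that the real pipelines.py controls.
import json

class JsonWriterPipeline:
    def open_spider(self, spider):
        # one output file per spider, e.g. gazzetta_output.json
        self.file = open(f"{spider.name}_output.json", "w", encoding="utf-8")
        self.items = []

    def process_item(self, item, spider):
        self.items.append(dict(item))
        return item

    def close_spider(self, spider):
        json.dump(self.items, self.file, ensure_ascii=False, indent=2)
        self.file.close()
```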