This Twitter scraper uses Selenium to crawl data from Twitter without authentication.
- Uses Redis to detect duplicate crawls
- Uses MySQL to store data
- Uses the Python database ORM SQLAlchemy
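As a rough illustration of how Redis can be used for duplicate detection, the sketch below hashes the tweet content with SHA-256 and records the digest in a Redis set. This is not the project's actual code: the function names, the `seen_tweets` key, and the `InMemorySet` stand-in are all hypothetical. The only assumption about the client is a redis-py style `sadd(name, *values)` method that returns the number of members actually added.

```python
import hashlib


def tweet_digest(content: str) -> str:
    """Return the SHA-256 hex digest of the tweet text (the dedup key)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def is_new_tweet(client, content: str, key: str = "seen_tweets") -> bool:
    """Add the digest to a set and report whether it was previously unseen.

    `client` is any object with a redis-py style `sadd(name, *values)`
    method; `key` is a hypothetical set name chosen for this sketch.
    """
    return client.sadd(key, tweet_digest(content)) == 1


class InMemorySet:
    """Hypothetical stand-in for a Redis client, so the sketch runs offline."""

    def __init__(self):
        self._sets = {}

    def sadd(self, name, *values):
        members = self._sets.setdefault(name, set())
        added = 0
        for value in values:
            if value not in members:
                members.add(value)
                added += 1
        return added
```

With a running Redis server, a `redis.Redis()` client could be passed in place of `InMemorySet()`; a tweet would then be stored only when `is_new_tweet` returns `True`.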
To use the latest version, install twitter-scraper from source by following these steps:
Linux and macOS:
git clone git@github.com:FmKnight/Selenium-Twitter-Scraper.git
cd Selenium-Twitter-Scraper
pip3 install -r requirements.txt
tweet_craweler.py: run this file to crawl tweets that match specific keywords. Each record contains the following fields:
user_info_craweler.py: run this file to crawl a specific user's info. Each record contains the following fields:
- following
- followers
- Change tweet duplicate detection from user_id + time to a SHA-256 digest of the tweet content
- Add logs to monitor the running process
- Change crawling from one-time to time-span-based
- Refactor the running process and add more condition checks
- Crawl tweets for specific keywords
- Crawl a specific user's info
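One way to picture the time-span-based crawling mentioned above is to split a date range into smaller windows and build one search query per window. The sketch below is only an illustration under that assumption, not the project's implementation: `date_windows` and `build_query` are hypothetical helpers, and the queries rely on Twitter's `since:` and `until:` search operators.

```python
from datetime import date, timedelta


def date_windows(start: date, end: date, days: int = 1):
    """Yield (since, until) pairs covering [start, end) in steps of `days`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    step = timedelta(days=days)
    since = start
    while since < end:
        # Clamp the last window so it never runs past `end`.
        until = min(since + step, end)
        yield since, until
        since = until


def build_query(keyword: str, since: date, until: date) -> str:
    """Combine a keyword with the since:/until: search operators."""
    return f"{keyword} since:{since.isoformat()} until:{until.isoformat()}"
```

Each generated query could then be handed to the Selenium-driven search in turn, so an interrupted run can resume from the last completed window instead of starting over.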