A collection of web scrapers and automation scripts built with Python, designed to extract data from various platforms including e-commerce sites, social media, job boards, and more.
This repository contains multiple scraping modules for different websites and services:
- Amazon Scrapers: Extract product data using ASINs with varying complexity levels (basic, intermediate, advanced Scrapy-based)
- Automation Scripts: Automate daily tasks like Excel manipulation, email sending, PDF processing, and more
- Booking Scrapers: Scrape hotel booking data
- Indeed Job Scrapers: Extract job listings and details
- Instagram Scrapers: Collect profile and post data from Instagram
- TikTok Scrapers: Scrape video and user data from TikTok
- YouTube Scrapers: Extract channel, video, and category information
- REI Scrapers: Scrape product data from Recreational Equipment Inc.
- AliExpress Scrapers: Extract product listings and details
- Selenium Scripts: Various browser automation scripts for dynamic content
- Whisky Scrapers: Specialized scrapers for whisky-related data
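The Amazon scrapers listed above are keyed by ASIN. As a minimal sketch of how an ASIN could be validated before a request is made, the helper below assumes the common 10-character alphanumeric ASIN format and the usual `/dp/<ASIN>` product URL layout. The names `normalize_asin` and `product_url` are hypothetical and are not taken from this repository.

```python
import re

# Assumption: an ASIN is exactly 10 uppercase letters or digits.
ASIN_PATTERN = re.compile(r"[A-Z0-9]{10}")


def normalize_asin(raw: str) -> str:
    """Trim and upper-case a candidate ASIN, raising ValueError if it is malformed."""
    asin = raw.strip().upper()
    if not ASIN_PATTERN.fullmatch(asin):
        raise ValueError(f"not a valid ASIN: {raw!r}")
    return asin


def product_url(asin: str, domain: str = "www.amazon.com") -> str:
    """Build a product page URL, assuming the /dp/<ASIN> path layout."""
    return f"https://{domain}/dp/{normalize_asin(asin)}"
```

Validating up front keeps malformed identifiers out of the request queue, so a scraper spends no requests on pages that cannot exist.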
- Python: Core programming language
- Scrapy: Web crawling framework
- Selenium: Browser automation for JavaScript-heavy sites
- BeautifulSoup: HTML parsing
- Django: Web framework for the Instagram scraper
- Requests: HTTP library
- Google API Client: For YouTube API integration
- OpenPyXL: Excel file manipulation
- Pandas: Data processing (implied in automation scripts)
Crawlers/
├── project.py # Main project file with scraping examples
├── requirements.txt # Python dependencies
├── SECURITY.md # Security policy
├── amazon/ # Amazon product scrapers
│ ├── basic/ # Simple Amazon scraper
│ ├── intermediate/ # Medium complexity scraper
│ └── amzscrapy/ # Advanced Scrapy-based scraper
├── automation/ # Life automation scripts
│ ├── Excel Automation - Medium.ipynb
│ ├── Send Emails.py
│ └── various subfolders for different automations
├── booking/ # Hotel booking scraper
├── indeed/ # Job search scraper
├── insta/ # Instagram scraper
├── instagram/ # Django-based Instagram scraper
├── rei/ # REI product scraper
├── scrapy_ali/ # AliExpress Scrapy project
├── scrapone/ # Additional Scrapy project
├── seleniums/ # Selenium automation scripts
├── tiktok/ # TikTok scraper
├── wisky/ # Whisky data scraper
├── youtube/ # YouTube scraper
└── youtube_api/ # YouTube API integration
- Clone the repository:
git clone https://github.com/alphaoumardev/Crawlers.git
cd Crawlers
- Install Python dependencies:
pip install -r requirements.txt
- For Selenium-based scrapers, ensure you have the appropriate web drivers installed (ChromeDriver, GeckoDriver, etc.)
- For Django-based modules (like instagram/), run migrations if needed:
cd instagram
python manage.py migrate
Each scraper module has its own usage instructions. Generally:
- Scrapy-based modules:
cd <module_directory>
scrapy crawl <spider_name>
- Standalone Python scripts:
python <script_name>.py
- Django-based Instagram module:
cd instagram
python manage.py runserver
Note: Always respect website terms of service and robots.txt when scraping. Use appropriate delays and consider the legal implications of web scraping in your jurisdiction.
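One way to act on the robots.txt and delay advice is sketched below, using only Python's standard `urllib.robotparser`. This is an illustration rather than code from this repository: the helper names, the `demo-bot` user agent in the usage note, and the one-second default delay are all assumptions. The `fetch` argument is any callable that downloads a URL, for example a thin wrapper around `requests.get`.

```python
import time
from urllib import robotparser


def build_robot_parser(robots_txt: str) -> robotparser.RobotFileParser:
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent: str, default_delay: float = 1.0) -> float:
    """Use the site's Crawl-delay if it declares one, else fall back to a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_delay


def allowed_urls(parser, user_agent: str, urls):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def crawl_politely(urls, parser, user_agent, fetch, sleep=time.sleep):
    """Fetch permitted URLs one at a time, pausing between consecutive requests."""
    delay = polite_delay(parser, user_agent)
    results = {}
    for index, url in enumerate(allowed_urls(parser, user_agent, urls)):
        if index > 0:
            sleep(delay)  # wait before every request except the first
        results[url] = fetch(url)
    return results
```

In use, the robots.txt text would be downloaded once per site, passed to `build_robot_parser`, and the resulting parser reused for every URL on that site, e.g. `crawl_politely(urls, parser, "demo-bot", fetch)`. Passing `sleep` as a parameter keeps the function easy to test without real waiting.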
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Please see SECURITY.md for information about reporting security vulnerabilities.
This project is licensed under the MIT License - see the LICENSE file for details.
This repository is for educational purposes only. Web scraping may violate the terms of service of websites. Always ensure you have permission to scrape data and comply with applicable laws and regulations.