This repository contains multiple web scraping examples using different Python tools and techniques. It serves as a comparison of BeautifulSoup and Scrapy, along with some advanced variations like threading, asyncio, and middleware usage.
Beautiful_Soup_scrapers/
├── bs4_asyncio_aiohttp.py # Scraper using asyncio + aiohttp + BeautifulSoup
├── bs4_requests_threads.py # Scraper using threading + requests + BeautifulSoup
└── only_bs4_and_requests.py # Simple scraper using requests + BeautifulSoup
Scrapy_scrapers/
├── badhtml_messy_project/ # Scrapy + BeautifulSoup project for scraping badhtml.com
├── quotes_project/ # Standard Scrapy project scraping quotes
└── quotes_project_hasdata/ # Scrapy project using HasData API middleware
banner.png # banner image
README.md
- only_bs4_and_requests.py: Basic web scraping using requests and BeautifulSoup.
- bs4_requests_threads.py: Uses Python threading to run multiple requests concurrently and reduce total scrape time.
- bs4_asyncio_aiohttp.py: Uses asyncio and aiohttp for asynchronous scraping.
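
The threaded and asynchronous variants share one shape: fan out one fetch per URL, then collect the page bodies for parsing. The block below is a minimal sketch of that pattern, not the code from the scripts in this repository. The names `fetch`, `fetch_all_threaded`, and `fetch_all_async` are hypothetical, `urllib.request` stands in for `requests` so the sketch needs no third-party packages, and the `async_fetcher` argument stands in for whatever `aiohttp` call the real script makes.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Blocking download of one page (stdlib stand-in for requests.get)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all_threaded(urls, fetcher=fetch, max_workers=8):
    """Run a blocking fetcher on a thread pool; returns bodies in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps results in the same order as urls.
        return list(pool.map(fetcher, urls))


async def fetch_all_async(urls, async_fetcher, limit=8):
    """Await an async fetcher for every URL, at most `limit` at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(url):
        # The semaphore caps how many fetches are in flight at once.
        async with semaphore:
            return await async_fetcher(url)

    # gather returns results in the same order as urls.
    return await asyncio.gather(*(bounded(url) for url in urls))
```

Calling `fetch_all_threaded` with a list of real URLs performs network requests; each returned body can then be handed to `BeautifulSoup(html, "html.parser")` for extraction. Passing the fetcher as an argument keeps the concurrency logic separate from the HTTP client, so the same functions can be exercised offline with a fake fetcher.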
- quotes_project/: Standard Scrapy project for scraping quotes from quotes.toscrape.com.
- quotes_project_hasdata/: Scrapy project enhanced with HasData API as middleware.
- badhtml_messy_project/quotes_project/: Project combining Scrapy and BeautifulSoup to scrape badhtml.com.
python Beautiful_Soup_scrapers/only_bs4_and_requests.py
python Beautiful_Soup_scrapers/bs4_requests_threads.py
python Beautiful_Soup_scrapers/bs4_asyncio_aiohttp.py

Navigate to the project folder:
cd Scrapy_scrapers/quotes_project
scrapy crawl quotes -o quotes.json

For quotes_project_hasdata:
cd Scrapy_scrapers/quotes_project_hasdata
scrapy crawl quotes -o quotes_hasdata.json

For badhtml_messy_project (Scrapy + BeautifulSoup):
cd Scrapy_scrapers/badhtml_messy_project/quotes_project
scrapy crawl badhtml -o badhtml.json

- Each scraper saves results in structured JSON format.
- BeautifulSoup scripts are simpler and better suited to small projects.
- Scrapy projects scale to larger crawls and support pipelines, middlewares, and asynchronous scraping.
- The threading and asyncio scripts demonstrate the speedup gained by issuing multiple requests concurrently.
- The article with results: Scrapy vs. Beautiful Soup: The 2025 Engineering Benchmark
- Discord: Join the community
- Star this repo if helpful ⭐
