A comparative collection of web-scraping approaches using BeautifulSoup, Requests, aiohttp, threading, and Scrapy, including middleware integration with HasData API and messy-HTML parsing examples.

HasData/bs4-vs-scrapy


Web Scraping Experiments: BeautifulSoup vs Scrapy

This repository contains multiple web scraping examples using different Python tools and techniques. It compares BeautifulSoup and Scrapy side by side, along with advanced variations such as threading, asyncio, and middleware usage.

Table of Contents

  1. Project Structure
  2. Included Scrapers
  3. How to Run
  4. Notes

Project Structure


Beautiful_Soup_scrapers/
├── bs4_asyncio_aiohttp.py        # Scraper using asyncio + aiohttp + BeautifulSoup
├── bs4_requests_threads.py       # Scraper using threading + requests + BeautifulSoup
└── only_bs4_and_requests.py      # Simple scraper using requests + BeautifulSoup

Scrapy_scrapers/
├── badhtml_messy_project/        # Scrapy + BeautifulSoup project for messy HTML (badhtml.com)
├── quotes_project/               # Standard Scrapy project scraping quotes
└── quotes_project_hasdata/       # Scrapy project using HasData API middleware

banner.png                        # banner image
README.md

Included Scrapers

1. BeautifulSoup + Requests

  • only_bs4_and_requests.py
    Basic web scraping using requests and BeautifulSoup.

  • bs4_requests_threads.py
    Uses Python threading to run multiple requests concurrently.

  • bs4_asyncio_aiohttp.py
    Uses asyncio and aiohttp for asynchronous scraping.
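
The scripts above share the same core parsing step. Here is a minimal sketch of the requests + BeautifulSoup pattern behind only_bs4_and_requests.py; the selectors mirror quotes.toscrape.com's markup, and an inline HTML snippet stands in for a live response so the sketch runs offline (the actual script's structure may differ):

```python
# Minimal sketch of the requests + BeautifulSoup pattern.
# The inline HTML stands in for requests.get(url).text so the
# example runs without a network connection.
from bs4 import BeautifulSoup

html = """
<div class="quote">
  <span class="text">Quality is not an act, it is a habit.</span>
  <small class="author">Aristotle</small>
</div>
"""

def parse_quotes(page: str) -> list[dict]:
    # quotes.toscrape.com wraps each quote in div.quote with
    # span.text / small.author children
    soup = BeautifulSoup(page, "html.parser")
    return [
        {
            "text": q.select_one(".text").get_text(strip=True),
            "author": q.select_one(".author").get_text(strip=True),
        }
        for q in soup.select("div.quote")
    ]

quotes = parse_quotes(html)
print(quotes[0]["author"])  # Aristotle
```

In the live script, the page would come from `requests.get("https://quotes.toscrape.com/").text` instead of the inline string.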

2. Scrapy

  • quotes_project/
    Standard Scrapy project for scraping quotes from quotes.toscrape.com.

  • quotes_project_hasdata/
    Scrapy project enhanced with the HasData API as a downloader middleware.

  • badhtml_messy_project/quotes_project/
    Combines Scrapy with BeautifulSoup to handle the messy HTML on badhtml.com.

How to Run

BeautifulSoup scripts

python Beautiful_Soup_scrapers/only_bs4_and_requests.py
python Beautiful_Soup_scrapers/bs4_requests_threads.py
python Beautiful_Soup_scrapers/bs4_asyncio_aiohttp.py

Scrapy projects

Navigate to the project folder:

cd Scrapy_scrapers/quotes_project
scrapy crawl quotes -o quotes.json

For quotes_project_hasdata:

cd Scrapy_scrapers/quotes_project_hasdata
scrapy crawl quotes -o quotes_hasdata.json
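
Routing requests through the HasData API is typically wired up in the project's settings.py. The fragment below is a hypothetical sketch of that wiring: the middleware class path, its priority, and the API-key setting name are assumptions for illustration, not the project's actual names — check quotes_project_hasdata's settings and middlewares files for the real ones.

```python
# Hypothetical settings.py fragment. The middleware path and the
# HASDATA_API_KEY setting name are illustrative assumptions.
DOWNLOADER_MIDDLEWARES = {
    "quotes_project.middlewares.HasDataProxyMiddleware": 543,
}
HASDATA_API_KEY = "your-api-key-here"  # placeholder credential
```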

For badhtml_messy_project (Scrapy + BeautifulSoup):

cd Scrapy_scrapers/badhtml_messy_project/quotes_project
scrapy crawl badhtml -o badhtml.json

Notes

  • Each scraper saves its results as structured JSON.
  • The BeautifulSoup scripts are simpler and better suited to small projects.
  • The Scrapy projects scale better and support pipelines, middlewares, and asynchronous requests.
  • The advanced scripts (threading, asyncio) demonstrate how concurrency speeds up fetching many pages.
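
The asyncio speedup noted above comes from issuing all requests concurrently with asyncio.gather. The sketch below shows that fan-out pattern with the aiohttp call replaced by a stub coroutine so it runs offline; in bs4_asyncio_aiohttp.py itself, fetch would presumably await a request on an aiohttp ClientSession instead:

```python
# Sketch of the asyncio fan-out pattern. The network call is stubbed
# with asyncio.sleep so the example runs offline; a real fetch would
# use aiohttp and then hand the HTML to BeautifulSoup.
import asyncio

async def fetch(url: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for an aiohttp GET
    return f"<html>{url}</html>"

async def main(urls: list[str]) -> list[str]:
    # gather schedules every fetch at once, so total wall time is
    # roughly one request's latency rather than the sum of all of them
    return await asyncio.gather(*(fetch(u) for u in urls))

urls = [f"https://quotes.toscrape.com/page/{i}/" for i in range(1, 4)]
pages = asyncio.run(main(urls))
print(len(pages))  # 3
```

The threading variant achieves the same overlap with OS threads (e.g. a thread pool mapping a blocking requests call over the URL list), while asyncio does it in a single thread with cooperative scheduling.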

