nickpsal/python_scrapper

🕸️ Universal Python Article Scraper

A flexible Python web scraper that extracts articles from Joomla, WordPress, Drupal, and JavaScript-heavy websites without requiring manual structure configuration.


📂 Contents

  • universal_scraper.py: Scraper for static HTML pages
  • js_scraper.py: Scraper with JavaScript support using Selenium

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/nickpsal/python_scrapper.git
    cd python_scrapper
  2. Create a virtual environment and install dependencies:

    python3 -m venv venv
    source venv/bin/activate   # Linux/macOS
    .\venv\Scripts\activate    # Windows
    
    pip install requests beautifulsoup4 newspaper3k selenium webdriver-manager lxml

⚙️ Usage

Run the scraper with a category URL:

    python scraper.py category_url

For example:

    python scraper.py https://www.cosmopolitan.com/style-beauty/fashion

The script tries the Static Scraper first. If that finds no articles, it automatically switches to the Dynamic JS Scraper.
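
The static-then-dynamic fallback described above could look roughly like the sketch below. It is only an illustration: `scrape_with_fallback`, the `Article` dict shape, and the two scraper callables are hypothetical names, not taken from the repository.

```python
from typing import Callable, Dict, List

# Assumption: each article is a plain dict with "title", "url" and "content"
# keys. The repository may use a different representation.
Article = Dict[str, str]
Scraper = Callable[[str], List[Article]]


def scrape_with_fallback(url: str, static_scraper: Scraper,
                         dynamic_scraper: Scraper) -> List[Article]:
    """Try the static scraper first; use the dynamic one only if it finds nothing."""
    articles = static_scraper(url)
    if articles:
        return articles
    return dynamic_scraper(url)


if __name__ == "__main__":
    # Stand-in scrapers, for demonstration only.
    def empty_static(url: str) -> List[Article]:
        return []

    def fake_dynamic(url: str) -> List[Article]:
        return [{"title": "Example", "url": url, "content": "..."}]

    print(scrape_with_fallback("https://example.com/news", empty_static, fake_dynamic))
```

Passing the two scrapers in as arguments keeps the fallback rule separate from how each scraper fetches pages.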

📑 Static Scraper (universal_scraper.py)

Used for basic HTML pages without heavy JavaScript.

📌 Outputs each article with:

  • Title
  • URL
  • Main content

🧠 Dynamic JS Scraper (js_scraper.py)

Uses Selenium for sites that load content via JavaScript.
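
As a rough sketch of fetching a JavaScript-rendered page with Selenium and webdriver-manager: the function name `fetch_rendered_html`, the choice of headless Chrome, and the fixed wait are all assumptions made for illustration, not details confirmed by js_scraper.py.

```python
def fetch_rendered_html(url: str, wait_seconds: float = 3.0) -> str:
    """Load `url` in headless Chrome and return the HTML after scripts have run."""
    # Imported lazily so this module can be loaded without Selenium installed.
    import time
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    options.add_argument("--headless=new")  # assumption: run Chrome without a window
    driver = webdriver.Chrome(
        service=Service(ChromeDriverManager().install()),  # downloads a matching driver
        options=options,
    )
    try:
        driver.get(url)
        time.sleep(wait_seconds)  # crude fixed wait for client-side rendering
        return driver.page_source
    finally:
        driver.quit()  # always release the browser process
```

The returned HTML can then be parsed the same way as a static page. A fixed sleep is the simplest option; an explicit wait on a known element would be more reliable where the page structure is known.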

🔍 How It Works

  • 🔗 Detects potential articles by URL patterns (e.g., a /2024/ path segment or a slug containing -)
  • 📰 Tries to extract content using newspaper3k
  • 🔁 Falls back to BeautifulSoup if necessary
  • 💾 Displays the title and full article text
  • 💤 Adds a 1-second delay to avoid overwhelming servers
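
The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names (`looks_like_article`, `extract_article`, `scrape_articles`), the exact URL heuristics, and the dict layout are hypothetical, and the scripts in the repository may differ in detail.

```python
import re
import time
from typing import Dict, Iterable, List, Optional
from urllib.parse import urlparse

# Assumption: a four-digit year as its own path segment, e.g. /2024/.
YEAR_SEGMENT = re.compile(r"/(19|20)\d{2}(/|$)")


def looks_like_article(url: str) -> bool:
    """Heuristic: the path has a year segment or its last segment is a hyphenated slug."""
    path = urlparse(url).path
    if YEAR_SEGMENT.search(path):
        return True
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    return "-" in slug


def extract_article(url: str) -> Optional[Dict[str, str]]:
    """Try the newspaper library first, then fall back to BeautifulSoup."""
    try:
        from newspaper import Article  # lazy import of the `newspaper` module
        article = Article(url)
        article.download()
        article.parse()
        if article.text.strip():
            return {"title": article.title, "url": url, "content": article.text}
    except Exception:
        pass  # any failure falls through to the BeautifulSoup path

    import requests
    from bs4 import BeautifulSoup
    try:
        response = requests.get(url, timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        return None
    soup = BeautifulSoup(response.text, "lxml")
    title = soup.title.get_text(strip=True) if soup.title else ""
    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    content = "\n".join(p for p in paragraphs if p)
    return {"title": title, "url": url, "content": content} if content else None


def scrape_articles(urls: Iterable[str], delay: float = 1.0) -> List[Dict[str, str]]:
    """Extract every URL that looks like an article, pausing between requests."""
    results = []
    for url in urls:
        if not looks_like_article(url):
            continue
        article = extract_article(url)
        if article:
            results.append(article)
        time.sleep(delay)  # be polite to the server
    return results
```

Checking only the last path segment for a hyphen means a category URL such as https://www.cosmopolitan.com/style-beauty/fashion is not mistaken for an article, even though an earlier segment contains a hyphen.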

✅ Supported Websites

Tested with:

  • Joomla, WordPress, Drupal (static or SEO-friendly URLs)
  • GlamourMagazine (JS-rendered content)
  • Any site with accessible <a href="..."> article links
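
For sites of the last kind, collecting candidate links from `<a href="...">` tags might look like the sketch below. It uses the standard-library `html.parser` instead of BeautifulSoup so that it runs without extra packages; `LinkCollector` and `collect_links` are hypothetical names, not part of the repository.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute, de-duplicated URLs from <a href="..."> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip empty links, in-page anchors and non-HTTP schemes.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            return
        absolute = urljoin(self.base_url, href)  # resolve relative links
        if absolute not in self.links:
            self.links.append(absolute)


def collect_links(html: str, base_url: str) -> List[str]:
    """Return every distinct link found in `html`, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The resulting list can then be filtered by URL pattern before any article page is fetched.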

📦 Requirements

requests
beautifulsoup4
newspaper3k
selenium
webdriver-manager
lxml

📜 License

MIT License © Nick Psal
