A flexible Python web scraper that extracts articles from Joomla, WordPress, Drupal, and JavaScript-heavy websites without requiring manual structure configuration.
- universal_scraper.py: Scraper for static HTML pages
- js_scraper.py: Scraper with JavaScript support using Selenium
- Clone the repository:
    git clone https://github.com/nickpsal/python_scrapper.git
    cd python_scrapper
- Create a virtual environment and install dependencies:
    python3 -m venv venv
    source venv/bin/activate    # Linux/macOS
    .\venv\Scripts\activate     # Windows
    pip install requests beautifulsoup4 newspaper3k selenium webdriver-manager lxml
Run the scraper with a category URL:
    python scraper.py category_url
For example:
    python scraper.py https://www.cosmopolitan.com/style-beauty/fashion
The script tries the static scraper first.
If the static scraper finds no articles, the script automatically switches to the dynamic JS scraper.
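The static-then-dynamic fallback described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function name `scrape_with_fallback` and the idea that each scraper is a callable returning a list of article dictionaries are assumptions made for the example.

```python
from typing import Callable, List

# Assumption: each scraper takes a category URL and returns a list of
# article dictionaries. The real scrapers may use a different shape.
ArticleList = List[dict]
Scraper = Callable[[str], ArticleList]


def scrape_with_fallback(
    url: str, static_scraper: Scraper, dynamic_scraper: Scraper
) -> ArticleList:
    """Try the static scraper first; use the dynamic one only if it finds nothing."""
    articles = static_scraper(url)
    if articles:
        return articles
    # An empty result suggests the page renders its links with JavaScript.
    return dynamic_scraper(url)


if __name__ == "__main__":
    # Stub scrapers standing in for the static and Selenium-based ones.
    found = scrape_with_fallback(
        "https://example.com/category",
        static_scraper=lambda u: [],
        dynamic_scraper=lambda u: [{"title": "Example", "url": u}],
    )
    print(found)
```

Passing the scrapers in as callables keeps the fallback logic testable without a network connection or a browser driver.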
Used for basic HTML pages without heavy JavaScript.
Outputs all articles with:
- Title
- URL
- Main content
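One way to represent the three output fields is a small record type. The sketch below is illustrative only: the `Article` class and the `format_article` helper are hypothetical names, not identifiers taken from the project.

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical record holding the three fields the scraper outputs."""
    title: str
    url: str
    content: str


def format_article(article: Article) -> str:
    """Render one article as plain text: title, URL, then the main content."""
    return f"Title: {article.title}\nURL: {article.url}\n\n{article.content}"


if __name__ == "__main__":
    print(format_article(Article("Example", "https://example.com/a-post", "Body text.")))
```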
Uses Selenium for sites that load content via JavaScript.
- Detects potential articles by URL patterns (e.g., /2024/, slug with -)
- Tries to extract content using newspaper3k
- Falls back to BeautifulSoup if necessary
- Displays title and full article text
- Adds 1-second delay to avoid overwhelming servers
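The URL-pattern detection and the delay between requests could look roughly like the sketch below. It uses only the standard library. The exact rules (a four-digit year path segment, or a final path segment containing hyphens) are assumptions based on the examples in the list above, and the names `looks_like_article`, `filter_article_links`, and `visit_politely` are hypothetical.

```python
import re
import time
from urllib.parse import urlparse

# Assumed heuristics: a year segment such as /2024/, or a hyphenated final slug.
YEAR_SEGMENT = re.compile(r"/(19|20)\d{2}/")
HYPHEN_SLUG = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)+$", re.IGNORECASE)


def looks_like_article(url: str) -> bool:
    """Return True if the URL path matches one of the assumed article patterns."""
    path = urlparse(url).path
    if YEAR_SEGMENT.search(path):
        return True
    segments = [s for s in path.split("/") if s]
    if not segments:
        return False
    # Ignore a trailing extension such as ".html" before checking the slug.
    last = segments[-1].rsplit(".", 1)[0]
    return bool(HYPHEN_SLUG.match(last))


def filter_article_links(links):
    """Keep likely article URLs, preserving order and dropping duplicates."""
    seen = set()
    result = []
    for link in links:
        if link not in seen and looks_like_article(link):
            seen.add(link)
            result.append(link)
    return result


def visit_politely(urls, handler, delay_seconds=1.0):
    """Call handler(url) for each URL, sleeping between consecutive requests."""
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)
        handler(url)
```

A heuristic like this will also match some non-article pages, such as a hyphenated category slug, so a real scraper would still need to check that content extraction succeeds for each candidate.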
Tested with:
- Joomla, WordPress, Drupal (static or SEO-friendly URLs)
- GlamourMagazine (JS-rendered content)
- Any site with accessible <a href="..."> article links
requests
beautifulsoup4
newspaper3k
selenium
webdriver-manager
lxml
MIT License © Nick Psal