News Aggregator - Multi-Source News Scraper

A news aggregation tool with keyword-based and optional ML-based categorization, a tkinter GUI, and JSON/HTML export.

Features

Multi-source Support: Scrapes both HTML pages and RSS/Atom feeds
Smart Categorization:
- Keyword-based auto-categorization (10+ categories)
- Optional ML-based categorization (scikit-learn)
Deduplication: Automatic removal of duplicate articles
Multiple Interfaces:
- Command-line interface (CLI)
- Graphical user interface (GUI) with tkinter
Rich Export Options: JSON and responsive HTML reports
Image Support: Extracts and displays article images
Advanced Filtering: Search, filter by category/source
Configurable Sources: Load from external JSON files
Logging: Complete activity logging
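
As an illustration of the deduplication feature listed above, duplicates could be detected by comparing normalized URLs. This is a minimal sketch, not the code in news_scraper.py: the dedupe_articles and normalize_url helpers, the dict-based article shape, and the choice of the URL as the duplicate key are all assumptions made for the example.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host; drop query string, fragment, and trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, "", ""))


def dedupe_articles(articles):
    """Keep the first article seen for each normalized URL (hypothetical helper)."""
    seen = set()
    unique = []
    for article in articles:
        key = normalize_url(article["url"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique


# Hypothetical sample data: the first two entries point at the same story.
sample = [
    {"title": "A", "url": "https://Example.com/story/"},
    {"title": "A (again)", "url": "https://example.com/story?utm_source=rss"},
    {"title": "B", "url": "https://example.com/other"},
]
unique_articles = dedupe_articles(sample)
print([a["title"] for a in unique_articles])
```

Dropping the query string removes tracking parameters, but it would also merge pages that differ only by query, so a real implementation may want a narrower rule.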

Installation

pip install -r requirements.txt

Usage

Graphical Interface

python news_gui.py

Features:

ML categorization toggle
Real-time scraping with progress feedback
Category filtering
Article details with images
Export to JSON/HTML
Open articles in browser

Python API

from news_scraper import NewsScraper

scraper = NewsScraper()

# Scrape a single HTML source
articles = scraper.scrape_generic_news(
    'https://news.example.com',
    'Example News'
)

# Scrape an RSS feed
rss_articles = scraper.scrape_rss_feed(
    'https://example.com/rss',
    'RSS Source'
)

# Scrape multiple sources (see default_sources in the script for examples)
sources = [
    {'name': 'Quanta Magazine', 'url': 'https://www.quantamagazine.org/feed/', 'type': 'rss'},
    {'name': 'Artificial Lawyer', 'url': 'https://www.artificiallawyer.com/feed/', 'type': 'rss'}
]
all_articles = scraper.scrape_multiple_sources(sources)

# Filter by category
tech_news = scraper.filter_by_category('Technology')

# Search
results = scraper.search_articles('quantum')

# Export
scraper.export_to_json('news_data.json')
scraper.export_to_html('news_report.html')

Default Categories

Technology: AI, software, hardware, internet, blockchain, quantum computing
Science: Research, discoveries, space exploration
Business: Economy, finance, markets, companies
Politics: Government, elections, policy
Sports: All sports news and events
Health: Medicine, wellness, healthcare
World: International news and global affairs
Law: Legal news, courts, justice
Entertainment: Movies, music, TV, culture
General: Uncategorized news

Advanced Features

ML-Based Categorization

Train a custom classifier:

from news_scraper import NewsScraper

scraper = NewsScraper()
train_data = [
    {"text": "AI breakthrough in technology", "category": "technology"},
    {"text": "Stock market rises", "category": "business"},
    # ... more training examples
]
scraper.train_ml_classifier(train_data)
# Model saved as ml_model.pkl

Load existing model:

scraper.load_ml_classifier('ml_model.pkl')

External Source Configuration

Create sources.json:

[
  {
    "name": "BBC News",
    "url": "http://feeds.bbci.co.uk/news/rss.xml",
    "type": "rss"
  }
]

Load and use:

sources = scraper.load_sources_from_file('sources.json')
articles = scraper.scrape_multiple_sources(sources)
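
Before passing a hand-edited sources file to the scraper, it can help to check that each entry has the three fields shown above. The following is a sketch under stated assumptions: validate_sources is a hypothetical helper that is not part of news_scraper.py, and the allowed type values ('rss' and 'html') are inferred from the supported source kinds rather than confirmed by the project.

```python
import json

ALLOWED_TYPES = {"rss", "html"}  # assumed set of source types
REQUIRED_FIELDS = ("name", "url", "type")


def validate_sources(raw_json):
    """Parse a sources list and return only well-formed entries (hypothetical helper)."""
    entries = json.loads(raw_json)
    if not isinstance(entries, list):
        raise ValueError("sources file must contain a JSON array")
    valid = []
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        # Every required field must be a non-empty string.
        if not all(isinstance(entry.get(k), str) and entry[k] for k in REQUIRED_FIELDS):
            continue
        if entry["type"] not in ALLOWED_TYPES:
            continue
        if not entry["url"].startswith(("http://", "https://")):
            continue
        valid.append(entry)
    return valid


raw = """[
  {"name": "BBC News", "url": "http://feeds.bbci.co.uk/news/rss.xml", "type": "rss"},
  {"name": "Broken", "url": "not-a-url", "type": "rss"}
]"""
checked_sources = validate_sources(raw)
print([s["name"] for s in checked_sources])
```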

Image Extraction

Image URLs are extracted automatically from:

HTML <img> tags
RSS <enclosure> tags
RSS <media:content> tags

Access via:

for article in scraper.articles:
    if article.image:
        print(f"Image URL: {article.image}")
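
For the two RSS cases listed above, the lookup could work roughly as follows. This is an illustrative sketch using only the standard library, not the project's actual parser (which may rely on BeautifulSoup or lxml): extract_item_image is a hypothetical function, and the sample feed is made up for the example.

```python
import xml.etree.ElementTree as ET

MEDIA_NS = "http://search.yahoo.com/mrss/"  # Media RSS namespace


def extract_item_image(item):
    """Return the first image URL found in an RSS <item>, or None (hypothetical helper)."""
    # Case 1: <enclosure url="..." type="image/..."/>
    for enclosure in item.findall("enclosure"):
        if enclosure.get("type", "").startswith("image/") and enclosure.get("url"):
            return enclosure.get("url")
    # Case 2: <media:content url="..." medium="image"/> or type="image/..."
    for media in item.findall(f"{{{MEDIA_NS}}}content"):
        is_image = (
            media.get("medium") == "image"
            or media.get("type", "").startswith("image/")
        )
        if is_image and media.get("url"):
            return media.get("url")
    return None


# Made-up feed covering both tag styles plus an item without an image.
feed = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>With enclosure</title>
      <enclosure url="https://example.com/a.jpg" type="image/jpeg" length="1234"/>
    </item>
    <item>
      <title>With media content</title>
      <media:content url="https://example.com/b.png" medium="image"/>
    </item>
    <item>
      <title>No image</title>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(feed)
image_urls = [extract_item_image(item) for item in root.iter("item")]
print(image_urls)
```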

Auto-Categorization Example

category_keywords = {
    'technology': ['quantum', 'ai', 'artificial intelligence', 'tech', 'software'],
    'law': ['law', 'court', 'legal', 'justice'],
    'science': ['science', 'research', 'discovery']
}
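
A keyword map like the one above could drive categorization along these lines. This is a sketch of one possible matching rule, not necessarily the one news_scraper.py uses: the categorize function, the whole-word matching, the "most hits wins" tie-breaking, and the lowercase 'general' fallback label are assumptions.

```python
import re

category_keywords = {
    "technology": ["quantum", "ai", "artificial intelligence", "tech", "software"],
    "law": ["law", "court", "legal", "justice"],
    "science": ["science", "research", "discovery"],
}


def categorize(text, keywords=category_keywords, default="general"):
    """Return the category whose keywords match most often as whole words."""
    lowered = text.lower()
    best, best_hits = default, 0
    for category, words in keywords.items():
        # \b anchors keep short keywords such as "ai" from matching inside "trail".
        hits = sum(
            len(re.findall(r"\b" + re.escape(word) + r"\b", lowered))
            for word in words
        )
        if hits > best_hits:
            best, best_hits = category, hits
    return best


print(categorize("Supreme Court issues legal ruling"))
print(categorize("New AI software released"))
print(categorize("The mountain trail reopened"))
```

Whole-word matching matters for short keywords: a plain substring test would count "ai" inside "mountain" and "trail" and misfile the last headline under technology.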

Export Formats

JSON

{
  "scraped_at": "2025-11-11T12:00:00",
  "total_articles": 45,
  "articles": [
    {
      "title": "Quantum Computing Breakthrough",
      "url": "https://...",
      "source": "Quanta Magazine",
      "category": "Technology",
      "summary": "...",
      "publish_date": "2025-11-11",
      "scraped_at": "2025-11-11T12:00:00"
    }
  ]
}
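
An export with this structure can be consumed with the standard library alone. The sketch below assumes the field names shown above; summarize_export is a hypothetical helper, and the inline sample stands in for a file such as news_data.json, which would normally be read with json.load.

```python
import json
from collections import Counter


def summarize_export(data):
    """Count articles per category and check the stored total (hypothetical helper)."""
    counts = Counter(article["category"] for article in data["articles"])
    return {
        "total_matches": data["total_articles"] == len(data["articles"]),
        "by_category": dict(counts),
    }


# Inline sample mirroring the export structure; values are made up.
sample_export = json.loads("""{
  "scraped_at": "2025-11-11T12:00:00",
  "total_articles": 3,
  "articles": [
    {"title": "T1", "url": "https://example.com/1", "source": "S", "category": "Technology"},
    {"title": "T2", "url": "https://example.com/2", "source": "S", "category": "Law"},
    {"title": "T3", "url": "https://example.com/3", "source": "S", "category": "Technology"}
  ]
}""")

summary = summarize_export(sample_export)
print(summary)
```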

HTML Report

Responsive, mobile-friendly design
Statistics summary
Grouped by category
Clickable links
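
The grouping and linking described above could be produced roughly like this. This is a simplified sketch, not the template used by export_to_html: render_report is a hypothetical function, articles are assumed to be plain dicts, and the responsive styling is omitted.

```python
from collections import defaultdict
from html import escape


def render_report(articles):
    """Render articles grouped by category as a small HTML fragment (hypothetical helper)."""
    grouped = defaultdict(list)
    for article in articles:
        grouped[article["category"]].append(article)

    # Statistics summary, then one section per category in alphabetical order.
    parts = [f"<p>{len(articles)} articles in {len(grouped)} categories</p>"]
    for category in sorted(grouped):
        parts.append(f"<h2>{escape(category)}</h2>")
        parts.append("<ul>")
        for article in grouped[category]:
            href = escape(article["url"], quote=True)
            title = escape(article["title"])
            parts.append(f'  <li><a href="{href}">{title}</a></li>')
        parts.append("</ul>")
    return "\n".join(parts)


# Made-up articles; the first title and URL contain characters that need escaping.
report_articles = [
    {"title": "Quantum <b>leap</b>", "url": "https://example.com/q?a=1&b=2", "category": "Technology"},
    {"title": "Court ruling", "url": "https://example.com/c", "category": "Law"},
]
html_fragment = render_report(report_articles)
print(html_fragment)
```

Escaping titles and URLs keeps markup scraped from third-party sites from being interpreted as part of the report.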

Supported Sources

News websites (HTML)
Blogs
RSS/Atom feeds

Example Use Cases

Media Monitoring: Track news on quantum computing, law, and AI
Research: Academic or industry research
Content Curation: Custom news bulletins
Trend Analysis: Follow emerging topics
LegalTech: Aggregate legal technology news

Technologies Used

Python 3.8+
Requests: HTTP requests
BeautifulSoup: HTML/XML parsing
lxml: Fast XML/HTML parsing
scikit-learn: Machine learning categorization
Pillow: Image processing for GUI
tkinter: GUI framework
JSON: Data serialization

Project Structure

05-news-scraper/
├── news_scraper.py         # Core scraping and categorization
├── news_gui.py             # GUI application
├── tests/                  # Unit tests
├── memory-bank/            # Project documentation
├── example_sources.json    # Example source configuration
├── example_categories.json # Example category configuration
├── requirements.txt        # Dependencies
├── README.md               # This file
├── LICENSE                 # MIT License
└── .gitignore              # Git ignore rules

GUI Features

ML Toggle: Enable/disable machine learning categorization
Category Filter: Filter articles by category in dropdown
Image Preview: View article images in detail window
Browser Integration: Open articles directly in default browser
Export Options: Save to JSON or HTML with file dialog
Threaded Scraping: UI stays responsive while scraping runs

API Examples

Basic Scraping

from news_scraper import NewsScraper

scraper = NewsScraper()
articles = scraper.scrape_rss_feed('http://example.com/rss', 'Example')
print(f"Found {len(articles)} articles")

Filtering and Search

# Filter by category
tech_news = scraper.filter_by_category('Technology')

# Filter by source
bbc_news = scraper.filter_by_source('BBC News')

# Search articles
results = scraper.search_articles('artificial intelligence')

Statistics

categories = scraper.get_categories()
for category, count in categories.items():
    print(f"{category}: {count} articles")

Customization

Add a Custom Category

scraper.category_keywords['custom'] = ['keyword1', 'keyword2']

Add a Custom Selector

article_selectors = [
    'article',
    'div.custom-news-class'
]

License

MIT License

Developer

Celal AYDIN
