celal1330/news-scraper
News Aggregator - Multi-Source News Scraper

News aggregation tool with ML-based categorization, GUI interface, and multi-format export capabilities.

Features

  • Multi-source Support: HTML pages and RSS/Atom feeds
  • Smart Categorization:
    • Keyword-based auto-categorization (10+ categories)
    • Optional ML-based categorization (scikit-learn)
  • Deduplication: Automatic removal of duplicate articles
  • Multiple Interfaces:
    • Command-line interface (CLI)
    • Graphical user interface (GUI) with tkinter
  • Rich Export Options: JSON and responsive HTML reports
  • Image Support: Extracts and displays article images
  • Advanced Filtering: Keyword search, plus filtering by category or source
  • Configurable Sources: Load from external JSON files
  • Logging: Complete activity logging
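
The deduplication step is not described further in this README. A minimal sketch of one common approach, assuming articles are plain dicts with `url` and `title` keys; the helper names `normalize_url` and `deduplicate` are hypothetical and may differ from what `news_scraper.py` actually does:

```python
from urllib.parse import urlsplit


def normalize_url(url):
    """Lowercase the host and drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    query = f"?{parts.query}" if parts.query else ""
    return f"{parts.scheme}://{parts.netloc.lower()}{path}{query}"


def deduplicate(articles):
    """Keep the first article seen for each normalized URL or normalized title."""
    seen_urls, seen_titles = set(), set()
    unique = []
    for article in articles:
        url_key = normalize_url(article["url"])
        # Collapse whitespace and case so near-identical titles compare equal.
        title_key = " ".join(article["title"].lower().split())
        if url_key in seen_urls or title_key in seen_titles:
            continue
        seen_urls.add(url_key)
        seen_titles.add(title_key)
        unique.append(article)
    return unique
```

Matching on either key catches the same story syndicated under two URLs, at the cost of occasionally merging unrelated articles that share a generic title.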

Installation

pip install -r requirements.txt

Usage

Graphical Interface

python news_gui.py

Features:

  • ML categorization toggle
  • Real-time scraping with progress feedback
  • Category filtering
  • Article details with images
  • Export to JSON/HTML
  • Open articles in browser

Python API

from news_scraper import NewsScraper

scraper = NewsScraper()

# Scrape a single HTML source
articles = scraper.scrape_generic_news(
    'https://news.example.com',
    'Example News'
)

# Scrape an RSS feed
rss_articles = scraper.scrape_rss_feed(
    'https://example.com/rss',
    'RSS Source'
)

# Scrape multiple sources (see default_sources in the script for examples)
sources = [
    {'name': 'Quanta Magazine', 'url': 'https://www.quantamagazine.org/feed/', 'type': 'rss'},
    {'name': 'Artificial Lawyer', 'url': 'https://www.artificiallawyer.com/feed/', 'type': 'rss'}
]
all_articles = scraper.scrape_multiple_sources(sources)

# Filter by category
tech_news = scraper.filter_by_category('Technology')

# Search
results = scraper.search_articles('quantum')

# Export
scraper.export_to_json('news_data.json')
scraper.export_to_html('news_report.html')

Default Categories

  1. Technology: AI, software, hardware, internet, blockchain, quantum computing
  2. Science: Research, discoveries, space exploration
  3. Business: Economy, finance, markets, companies
  4. Politics: Government, elections, policy
  5. Sports: All sports news and events
  6. Health: Medicine, wellness, healthcare
  7. World: International news and global affairs
  8. Law: Legal news, courts, justice
  9. Entertainment: Movies, music, TV, culture
  10. General: Uncategorized news

Advanced Features

ML-Based Categorization

Train a custom classifier:

from news_scraper import NewsScraper

scraper = NewsScraper()
train_data = [
    {"text": "AI breakthrough in technology", "category": "technology"},
    {"text": "Stock market rises", "category": "business"},
    # ... more training examples
]
scraper.train_ml_classifier(train_data)
# Model saved as ml_model.pkl

Load existing model:

scraper.load_ml_classifier('ml_model.pkl')

External Source Configuration

Create sources.json:

[
  {
    "name": "BBC News",
    "url": "http://feeds.bbci.co.uk/news/rss.xml",
    "type": "rss"
  }
]

Load and use:

sources = scraper.load_sources_from_file('sources.json')
articles = scraper.scrape_multiple_sources(sources)
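
Before handing a sources file to the scraper, it can help to validate it. The following is a rough sketch using only the standard library; `read_sources` is a hypothetical helper (not the project's `load_sources_from_file`), and the `"html"` type value is an assumption, since only `"rss"` appears in this README:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"name", "url", "type"}
KNOWN_TYPES = {"rss", "html"}  # assumption: "html" is a guess at the non-RSS type name


def read_sources(path):
    """Load a list of source dicts from a JSON file and validate each entry."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        raise ValueError("sources file must contain a JSON array")
    for index, source in enumerate(data):
        missing = REQUIRED_KEYS - set(source)
        if missing:
            raise ValueError(f"source #{index} is missing keys: {sorted(missing)}")
        if source["type"] not in KNOWN_TYPES:
            raise ValueError(f"source #{index} has unknown type {source['type']!r}")
    return data
```

Failing early with the index of the offending entry tends to be easier to debug than a scraping error several requests later.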

Image Extraction

Image URLs are extracted automatically from:

  • HTML <img> tags
  • RSS <enclosure> tags
  • RSS <media:content> tags

Access via:

for article in scraper.articles:
    if article.image:
        print(f"Image URL: {article.image}")
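
For the two RSS cases, the lookup could work roughly as follows. This is a sketch with the standard library's `xml.etree.ElementTree`, not the project's own code; `image_from_rss_item` is a hypothetical name, and the order of preference (enclosure first, then `media:content`) is an assumption:

```python
import xml.etree.ElementTree as ET

# Namespace URI defined by the Media RSS specification for the "media:" prefix.
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}


def image_from_rss_item(item):
    """Return the first image URL found on an RSS <item> element, or None."""
    enclosure = item.find("enclosure")
    # Enclosures can also carry audio or video, so check the MIME type.
    if enclosure is not None and enclosure.get("type", "").startswith("image/"):
        return enclosure.get("url")
    media = item.find("media:content", MEDIA_NS)
    if media is not None and media.get("url"):
        return media.get("url")
    return None
```

Items with neither element simply yield `None`, which matches the `if article.image:` check in the snippet above.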

Auto-Categorization Example

category_keywords = {
    'technology': ['quantum', 'ai', 'artificial intelligence', 'tech', 'software'],
    'law': ['law', 'court', 'legal', 'justice'],
    'science': ['science', 'research', 'discovery']
}
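
How a keyword table like this turns into a category is not spelled out here. One plausible scoring scheme, sketched under assumptions: the function name `categorize`, the whole-word matching, and the lowercase `'general'` fallback are illustrative choices and not confirmed by the project:

```python
import re

category_keywords = {
    'technology': ['quantum', 'ai', 'artificial intelligence', 'tech', 'software'],
    'law': ['law', 'court', 'legal', 'justice'],
    'science': ['science', 'research', 'discovery'],
}


def categorize(text, keywords=category_keywords, default='general'):
    """Return the category with the most whole-word keyword hits, else the default."""
    lowered = text.lower()
    best_category, best_score = default, 0
    for category, words in keywords.items():
        # \b keeps short keywords such as "ai" from matching inside "said" or "rain".
        score = sum(
            len(re.findall(rf"\b{re.escape(word)}\b", lowered)) for word in words
        )
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

Ties go to whichever category appears first in the dictionary, so the ordering of `category_keywords` matters in this sketch.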

Export Formats

JSON

{
  "scraped_at": "2025-11-11T12:00:00",
  "total_articles": 45,
  "articles": [
    {
      "title": "Quantum Computing Breakthrough",
      "url": "https://...",
      "source": "Quanta Magazine",
      "category": "Technology",
      "summary": "...",
      "publish_date": "2025-11-11",
      "scraped_at": "2025-11-11T12:00:00"
    }
  ]
}
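
An exported file with this layout can be read back with the standard library alone. A small sketch, assuming the keys shown above (`total_articles`, `articles`, `category`); `summarize_export` is a hypothetical helper, not part of the project:

```python
import json
from collections import Counter


def summarize_export(path):
    """Return (total_articles, per-category counts) for an exported JSON file."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    counts = Counter(article["category"] for article in data["articles"])
    return data["total_articles"], counts
```

For example, `summarize_export('news_data.json')` would return the stored total together with a `Counter` keyed by category name.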

HTML Report

  • Responsive, mobile-friendly design
  • Statistics summary
  • Grouped by category
  • Clickable links
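
To illustrate how a report with these properties might be assembled, here is a deliberately minimal sketch. It assumes articles are dicts with `title`, `url`, and `category` keys; `render_report` is a hypothetical function, and the real `export_to_html` output is certainly richer:

```python
from collections import defaultdict
from html import escape


def render_report(articles):
    """Build a minimal HTML page with a summary line and articles grouped by category."""
    grouped = defaultdict(list)
    for article in articles:
        grouped[article["category"]].append(article)

    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        # The viewport meta tag is what lets the page scale on mobile screens.
        '<meta name="viewport" content="width=device-width, initial-scale=1">',
        "<title>News Report</title></head><body>",
        f"<p>{len(articles)} articles in {len(grouped)} categories</p>",
    ]
    for category in sorted(grouped):
        parts.append(f"<h2>{escape(category)} ({len(grouped[category])})</h2><ul>")
        for article in grouped[category]:
            link = escape(article["url"], quote=True)
            parts.append(f'<li><a href="{link}">{escape(article["title"])}</a></li>')
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Escaping titles and URLs matters because scraped text is untrusted and may contain markup.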

Supported Sources

  • News websites (HTML)
  • Blogs
  • RSS/Atom feeds

Example Use Cases

  1. Media Monitoring: Track news on quantum computing, law, and AI
  2. Research: Academic or industry research
  3. Content Curation: Custom news bulletins
  4. Trend Analysis: Follow emerging topics
  5. LegalTech: Aggregate legal technology news

Technologies Used

  • Python 3.8+
  • Requests: HTTP requests
  • BeautifulSoup: HTML/XML parsing
  • lxml: Fast XML/HTML parsing
  • scikit-learn: Machine learning categorization
  • Pillow: Image processing for GUI
  • tkinter: GUI framework
  • JSON: Data serialization

Project Structure

05-news-scraper/
├── news_scraper.py         # Core scraping and categorization
├── news_gui.py             # GUI application
├── tests/                  # Unit tests
├── memory-bank/            # Project documentation
├── example_sources.json    # Example source configuration
├── example_categories.json # Example category configuration
├── requirements.txt        # Dependencies
├── README.md               # This file
├── LICENSE                 # MIT License
└── .gitignore              # Git ignore rules

GUI Features

  • ML Toggle: Enable/disable machine learning categorization
  • Category Filter: Filter articles by category in dropdown
  • Image Preview: View article images in detail window
  • Browser Integration: Open articles directly in default browser
  • Export Options: Save to JSON or HTML with file dialog
  • Threaded scraping: The UI stays responsive while scraping runs
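
A common way to keep a tkinter UI responsive is to scrape on a worker thread and pass results back through a queue. The sketch below shows only that pattern and leaves tkinter out; `scrape_in_background` and `drain` are hypothetical names, and whether `news_gui.py` is structured this way is an assumption:

```python
import queue
import threading


def scrape_in_background(scrape, sources, results):
    """Call scrape(source) for each source on a worker thread, posting to a queue."""
    def worker():
        for source in sources:
            try:
                results.put(("ok", source, scrape(source)))
            except Exception as exc:  # report the failure and continue with the next source
                results.put(("error", source, exc))
        results.put(("done", None, None))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def drain(results):
    """Collect whatever is currently queued without blocking the caller."""
    messages = []
    while True:
        try:
            messages.append(results.get_nowait())
        except queue.Empty:
            return messages
```

In a tkinter application, `drain` would typically be called from a callback rescheduled with the widget's `after` method, so that only the main thread touches widgets.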

API Examples

Basic Scraping

from news_scraper import NewsScraper

scraper = NewsScraper()
articles = scraper.scrape_rss_feed('http://example.com/rss', 'Example')
print(f"Found {len(articles)} articles")

Filtering and Search

# Filter by category
tech_news = scraper.filter_by_category('Technology')

# Filter by source
bbc_news = scraper.filter_by_source('BBC')

# Search articles
results = scraper.search_articles('artificial intelligence')
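
The matching rules behind these calls are not documented here. As a sketch of plausible behaviour, assuming case-insensitive matching and a substring match on the source name (the `Article` dataclass and the free functions below are hypothetical stand-ins, not the project's classes):

```python
from dataclasses import dataclass


@dataclass
class Article:
    """Hypothetical stand-in for the project's article objects."""
    title: str
    source: str
    category: str
    summary: str = ""


def filter_by_category(articles, category):
    """Case-insensitive exact match on the category name."""
    wanted = category.lower()
    return [a for a in articles if a.category.lower() == wanted]


def filter_by_source(articles, source):
    """Case-insensitive substring match, so 'BBC' also matches 'BBC News'."""
    wanted = source.lower()
    return [a for a in articles if wanted in a.source.lower()]


def search_articles(articles, query):
    """Case-insensitive substring search over title and summary."""
    needle = query.lower()
    return [
        a for a in articles
        if needle in a.title.lower() or needle in a.summary.lower()
    ]
```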

Statistics

categories = scraper.get_categories()
for category, count in categories.items():
    print(f"{category}: {count} articles")

Customization

Add a Custom Category

scraper.category_keywords['custom'] = ['keyword1', 'keyword2']

Add a Custom Selector

article_selectors = [
    'article',
    'div.custom-news-class'
]

License

MIT License

Developer

Celal AYDIN
