News aggregation tool with optional ML-based categorization, a graphical user interface (GUI), and multi-format export (JSON and HTML).
- Multi-source Support: Both HTML and RSS/Atom feeds
- Smart Categorization:
  - Keyword-based auto-categorization (10+ categories)
  - Optional ML-based categorization (scikit-learn)
- Deduplication: Automatic removal of duplicate articles
- Multiple Interfaces:
  - Command-line interface (CLI)
  - Graphical user interface (GUI) with tkinter
- Rich Export Options: JSON and responsive HTML reports
- Image Support: Extracts and displays article images
- Advanced Filtering: Search, filter by category/source
- Configurable Sources: Load from external JSON files
- Logging: Complete activity logging
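The deduplication step listed above is not described in detail in this README. One common approach is to key each article on a normalized URL and fall back to a normalized title. The following is a minimal sketch under that assumption; the dict shape with `url` and `title` keys and the helper names `normalize_url` and `deduplicate` are hypothetical and are not part of `NewsScraper`.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lowercase scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def deduplicate(articles):
    """Keep the first article seen for each normalized URL or title."""
    seen = set()
    unique = []
    for article in articles:
        url_key = ("url", normalize_url(article["url"]))
        # Collapse whitespace and case so trivially different titles compare equal
        title_key = ("title", " ".join(article["title"].lower().split()))
        if url_key in seen or title_key in seen:
            continue
        seen.update((url_key, title_key))
        unique.append(article)
    return unique


articles = [
    {"url": "https://Example.com/a/", "title": "Hello World"},
    {"url": "https://example.com/a#top", "title": "Different"},
    {"url": "https://example.com/b", "title": "hello   world"},
    {"url": "https://example.com/c", "title": "Third"},
]
unique_articles = deduplicate(articles)
```

Matching on title as well as URL catches the same story syndicated under different links, at the cost of occasionally merging distinct articles that share a headline.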
pip install -r requirements.txt
python news_gui.py
Features:
- ML categorization toggle
- Real-time scraping with progress feedback
- Category filtering
- Article details with images
- Export to JSON/HTML
- Open articles in browser
from news_scraper import NewsScraper
scraper = NewsScraper()
# Scrape a single HTML source
articles = scraper.scrape_generic_news(
    'https://news.example.com',
    'Example News'
)
# Scrape an RSS feed
rss_articles = scraper.scrape_rss_feed(
    'https://example.com/rss',
    'RSS Source'
)
# Scrape multiple sources (see default_sources in the script for examples)
sources = [
    {'name': 'Quantum Magazine', 'url': 'https://www.quantamagazine.org/feed/', 'type': 'rss'},
    {'name': 'Artificial Lawyer', 'url': 'https://www.artificiallawyer.com/feed/', 'type': 'rss'}
]
all_articles = scraper.scrape_multiple_sources(sources)
# Filter by category
tech_news = scraper.filter_by_category('Technology')
# Search
results = scraper.search_articles('quantum')
# Export
scraper.export_to_json('news_data.json')
scraper.export_to_html('news_report.html')
- Technology: AI, software, hardware, internet, blockchain, quantum computing
- Science: Research, discoveries, space exploration
- Business: Economy, finance, markets, companies
- Politics: Government, elections, policy
- Sports: All sports news and events
- Health: Medicine, wellness, healthcare
- World: International news and global affairs
- Law: Legal news, courts, justice
- Entertainment: Movies, music, TV, culture
- General: Uncategorized news
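Keyword-based categorization over categories like these can be sketched as a scoring loop: count keyword hits per category and fall back to General when nothing matches. This is an illustrative sketch only; the `categorize` function and the small `CATEGORY_KEYWORDS` table are assumptions and may differ from what `news_scraper.py` actually does.

```python
import re

# Hypothetical keyword table for illustration, not the tool's real configuration
CATEGORY_KEYWORDS = {
    "Technology": ["ai", "software", "quantum computing", "blockchain"],
    "Law": ["court", "legal", "justice"],
    "Science": ["research", "discovery", "space"],
}


def categorize(text, keywords=CATEGORY_KEYWORDS, default="General"):
    """Return the category whose keywords match most often, else the default."""
    lowered = text.lower()
    best_category, best_score = default, 0
    for category, words in keywords.items():
        # Word boundaries stop short keywords such as "ai" matching inside "said"
        score = sum(
            len(re.findall(r"\b" + re.escape(word) + r"\b", lowered))
            for word in words
        )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


law_label = categorize("Court rules on legal challenge")
tech_label = categorize("New AI software released")
fallback_label = categorize("He said the weather was fine")
```

On a tie, the category listed first in the table wins, so the order of entries acts as a simple priority.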
Train a custom classifier:
from news_scraper import NewsScraper
scraper = NewsScraper()
train_data = [
    {"text": "AI breakthrough in technology", "category": "technology"},
    {"text": "Stock market rises", "category": "business"},
    # ... more training examples
]
scraper.train_ml_classifier(train_data)
# Model saved as ml_model.pkl
Load existing model:
scraper.load_ml_classifier('ml_model.pkl')
Create sources.json:
[
  {
    "name": "BBC News",
    "url": "http://feeds.bbci.co.uk/news/rss.xml",
    "type": "rss"
  }
]
Load and use:
sources = scraper.load_sources_from_file('sources.json')
articles = scraper.scrape_multiple_sources(sources)
Articles automatically extract images from:
- HTML <img> tags
- RSS <enclosure> tags
- RSS <media:content> tags
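For the two RSS cases, image lookup can be sketched with the standard library's `xml.etree.ElementTree`. The example below is an assumption about how such extraction might work, not the scraper's actual code; `extract_image_url` and `SAMPLE_FEED` are hypothetical names, and the HTML `<img>` case is not covered.

```python
import xml.etree.ElementTree as ET

# Media RSS namespace used by <media:content>
MEDIA_NS = {"media": "http://search.yahoo.com/mrss/"}


def extract_image_url(item):
    """Return the first image URL found on an RSS <item>, or None."""
    # <enclosure> can also carry audio or video, so check the MIME type
    for enclosure in item.findall("enclosure"):
        if enclosure.get("type", "").startswith("image/") and enclosure.get("url"):
            return enclosure.get("url")
    # <media:content> may declare either medium="image" or an image MIME type
    for media in item.findall("media:content", MEDIA_NS):
        is_image = media.get("medium") == "image" or media.get("type", "").startswith("image/")
        if is_image and media.get("url"):
            return media.get("url")
    return None


SAMPLE_FEED = """<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <item>
      <title>With enclosure</title>
      <enclosure url="https://example.com/a.jpg" type="image/jpeg" length="1024"/>
    </item>
    <item>
      <title>With media content</title>
      <media:content url="https://example.com/b.png" medium="image"/>
    </item>
    <item>
      <title>No image</title>
      <enclosure url="https://example.com/c.mp3" type="audio/mpeg" length="2048"/>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE_FEED)
image_urls = [extract_image_url(item) for item in root.iter("item")]
```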
Access via:
for article in scraper.articles:
    if article.image:
        print(f"Image URL: {article.image}")
category_keywords = {
    'technology': ['quantum', 'ai', 'artificial intelligence', 'tech', 'software'],
    'law': ['law', 'court', 'legal', 'justice'],
    'science': ['science', 'research', 'discovery']
}
{
  "scraped_at": "2025-11-11T12:00:00",
  "total_articles": 45,
  "articles": [
    {
      "title": "Quantum Computing Breakthrough",
      "url": "https://...",
      "source": "Quantum Magazine",
      "category": "Technology",
      "summary": "...",
      "publish_date": "2025-11-11",
      "scraped_at": "2025-11-11T12:00:00"
    }
  ]
}
- Responsive, mobile-friendly design
- Statistics summary
- Grouped by category
- Clickable links
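A report with these properties can be sketched in a few lines: group the articles by category, emit a summary line, and escape every title and URL before writing it into the page. This is a simplified illustration rather than the output of `export_to_html`; `render_report` and the dict-shaped articles are hypothetical.

```python
from collections import defaultdict
from html import escape


def render_report(articles):
    """Build a small HTML page with articles grouped by category."""
    grouped = defaultdict(list)
    for article in articles:
        grouped[article.get("category", "General")].append(article)

    parts = [
        "<!DOCTYPE html>",
        '<html><head><meta charset="utf-8">',
        # The viewport tag lets the page scale on mobile browsers
        '<meta name="viewport" content="width=device-width, initial-scale=1">',
        "<title>News Report</title></head><body>",
        f"<p>{len(articles)} articles in {len(grouped)} categories</p>",
    ]
    for category in sorted(grouped):
        parts.append(f"<h2>{escape(category)} ({len(grouped[category])})</h2><ul>")
        for article in grouped[category]:
            link = escape(article["url"], quote=True)
            parts.append(f'<li><a href="{link}">{escape(article["title"])}</a></li>')
        parts.append("</ul>")
    parts.append("</body></html>")
    return "\n".join(parts)


sample_articles = [
    {"title": "A & B", "url": "https://example.com/a?x=1&y=2", "category": "Technology"},
    {"title": "C", "url": "https://example.com/c", "category": "Law"},
    {"title": "D", "url": "https://example.com/d", "category": "Technology"},
]
report_html = render_report(sample_articles)
```

Escaping matters here because titles and URLs come from third-party sites and may contain characters such as `&` or `<`.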
- News websites (HTML)
- Blogs
- RSS/Atom feeds
- Media Monitoring: Track news on quantum, law, and AI
- Research: Academic or industry research
- Content Curation: Custom news bulletins
- Trend Analysis: Follow emerging topics
- LegalTech: Aggregate legal technology news
- Python 3.8+
- Requests: HTTP requests
- BeautifulSoup: HTML/XML parsing
- lxml: Fast XML/HTML parsing
- scikit-learn: Machine learning categorization
- Pillow: Image processing for GUI
- tkinter: GUI framework
- JSON: Data serialization
05-news-scraper/
├── news_scraper.py # Core scraping and categorization
├── news_gui.py # GUI application
├── tests/ # Unit tests
├── memory-bank/ # Project documentation
├── example_sources.json # Example source configuration
├── example_categories.json # Example category configuration
├── requirements.txt # Dependencies
├── README.md # This file
├── LICENSE # MIT License
└── .gitignore # Git ignore rules
- ML Toggle: Enable/disable machine learning categorization
- Category Filter: Filter articles by category in dropdown
- Image Preview: View article images in detail window
- Browser Integration: Open articles directly in default browser
- Export Options: Save to JSON or HTML with file dialog
- Thread-safe: Non-blocking UI during scraping
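A non-blocking UI is commonly achieved by scraping on a worker thread and handing results to the GUI thread through a queue, so that only the main thread touches widgets. The sketch below shows that pattern without tkinter so it runs anywhere; `scrape_in_background`, `drain`, and `fake_scrape` are hypothetical names, and `news_gui.py` may be structured differently.

```python
import queue
import threading


def scrape_in_background(sources, scrape_one, results):
    """Run scrape_one(source) for each source on a worker thread.

    Articles, errors, and progress are pushed onto the results queue.
    """
    def worker():
        for index, source in enumerate(sources, start=1):
            try:
                results.put(("articles", scrape_one(source)))
            except Exception as exc:  # report the failure and keep going
                results.put(("error", f"{source}: {exc}"))
            results.put(("progress", index / len(sources)))
        results.put(("done", None))

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return thread


def drain(results):
    """Collect every message currently queued without blocking."""
    messages = []
    while True:
        try:
            messages.append(results.get_nowait())
        except queue.Empty:
            return messages


def fake_scrape(source):
    """Stand-in for a real scraping call, used only for this example."""
    if source == "bad":
        raise ValueError("boom")
    return [f"{source}-article"]


results = queue.Queue()
worker_thread = scrape_in_background(["a", "bad"], fake_scrape, results)
worker_thread.join(timeout=5)
messages = drain(results)
```

In a tkinter application, a polling function would call `drain` and reschedule itself with `root.after(100, poll)` instead of joining the thread, which keeps the event loop responsive while scraping continues.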
from news_scraper import NewsScraper
scraper = NewsScraper()
articles = scraper.scrape_rss_feed('http://example.com/rss', 'Example')
print(f"Found {len(articles)} articles")
# Filter by category
tech_news = scraper.filter_by_category('Technology')
# Filter by source
bbc_news = scraper.filter_by_source('BBC')
# Search articles
results = scraper.search_articles('artificial intelligence')
categories = scraper.get_categories()
for category, count in categories.items():
    print(f"{category}: {count} articles")
scraper.category_keywords['custom'] = ['keyword1', 'keyword2']
article_selectors = [
    'article',
    'div.custom-news-class'
]
MIT License
Celal AYDIN