A microservices-based news article aggregation and summarization system that collects articles from RSS feeds, scrapes their content, and generates AI-powered summaries using Large Language Models.
- RSS Feed Aggregation: Collects articles from multiple news sources (Times of India, The Hindu, India Today, BBC)
- Web Scraping: Extracts full article content from URLs
- AI-Powered Summarization: Uses Facebook's BART model to generate concise article summaries
- Message Queue Processing: Asynchronous processing using RabbitMQ
- Database Storage: PostgreSQL for persistent storage
- Modular Architecture: Microservices design with independent services
The system consists of three main microservices that work together:
RSS Feeds → Database → Queue → Scraper → Queue → LLM Summarizer → Database
- RSS Service (kalinga): Aggregates RSS feeds and stores article metadata
- Scraping Service (bundelkhand): Fetches full article content from URLs
- Summarization Service (amarkantak): Generates AI-powered summaries
- Language: Python 3.x
- Database: PostgreSQL (via SQLAlchemy)
- Message Queue: RabbitMQ
- ML/AI: Transformers (HuggingFace), PyTorch, BART model
- Web Scraping: Scrapy, BeautifulSoup4
- Database Migrations: Alembic
- Containerization: Docker Compose
- Python 3.8+
- PostgreSQL 15+
- RabbitMQ
- Docker and Docker Compose (optional, for containerized setup)
- `uv` package manager (or pip/venv)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd project_skim
  ```

- Create a virtual environment:

  ```bash
  make create_venv
  # or manually:
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Set up environment variables

  Create a `.env` file in the root directory:

  ```
  DATABASE_URL=postgresql://postgres:postgres@localhost:5432/skim
  ```

- Start infrastructure services

  Start PostgreSQL:

  ```bash
  docker-compose -f docker-compose-db.yml up -d
  ```

  Start RabbitMQ:

  ```bash
  docker-compose -f docker-compose-msg-queue.yml up -d
  ```

- Run database migrations:

  ```bash
  make apply-db
  # or manually:
  alembic upgrade head
  ```
The database connection is configured via the `DATABASE_URL` environment variable:

```
postgresql://<user>:<password>@<host>:<port>/<database>
```
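As a rough illustration, the connection handler in `database/connection.py` might build a SQLAlchemy engine from this variable. The function and session names below are hypothetical, not the project's actual API:

```python
# Hypothetical sketch of database/connection.py; names are illustrative,
# not the project's actual API.
import os

from sqlalchemy import create_engine
from sqlalchemy.orm import sessionmaker

# Read the connection string set in .env, e.g.
# postgresql://postgres:postgres@localhost:5432/skim
DATABASE_URL = os.environ["DATABASE_URL"]

# pool_pre_ping avoids stale connections in long-running services
engine = create_engine(DATABASE_URL, pool_pre_ping=True)
SessionLocal = sessionmaker(bind=engine)


def get_session():
    """Yield a session and make sure it is closed afterwards."""
    session = SessionLocal()
    try:
        yield session
    finally:
        session.close()
```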
The LLM model settings are in config/constants.py:
- Model: `facebook/bart-large-cnn`
- Token size: 1024
- Chunk size: 300
- Device: Auto (GPU if available, else CPU)
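A minimal sketch of how the summarizer might load the model with automatic device selection; the actual loading code lives in `llm_explorer/model_handler.py` and may differ:

```python
# Illustrative sketch only; the real logic lives in llm_explorer/model_handler.py.
import torch
from transformers import pipeline

MODEL_NAME = "facebook/bart-large-cnn"
MAX_TOKENS = 1024   # model's maximum input length
CHUNK_SIZE = 300    # words per chunk for long articles (see constants.py)

# Pick GPU if available, otherwise fall back to CPU
device = 0 if torch.cuda.is_available() else -1

summarizer = pipeline("summarization", model=MODEL_NAME, device=device)

summary = summarizer(
    "Long article text goes here...",
    max_length=150,
    min_length=40,
    do_sample=False,
)[0]["summary_text"]
```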
Configure RSS feed URLs in rss_feeds/config/feed_urls.py:
- Times of India
- The Hindu
- India Today
- BBC News
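The exact layout of `feed_urls.py` is not shown here; one plausible shape is a simple mapping from source name to feed URL. The keys and URLs below are placeholders to illustrate the idea and should be checked against the actual config:

```python
# Hypothetical shape of rss_feeds/config/feed_urls.py.
# Keys and URLs are illustrative placeholders, not taken from the project.
FEED_URLS = {
    "times_of_india": "https://timesofindia.indiatimes.com/rssfeedstopstories.cms",
    "the_hindu": "https://www.thehindu.com/feeder/default.rss",
    "india_today": "https://www.indiatoday.in/rss/home",
    "bbc": "https://feeds.bbci.co.uk/news/rss.xml",
}
```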
Queue names are defined in config/config.py:
- `rss_to_scraping`: Queue from the RSS service to the scraper
- `scraping_to_summmarisation`: Queue from the scraper to the summarization service
- Management UI: http://localhost:15672
- Default credentials: `admin`/`admin`
- Port: 5672
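For illustration, publishing to one of these queues with the pika client (a common RabbitMQ library, assumed here rather than confirmed by the stack list) might look like the following; the project's actual handlers live in `msg_queue/`:

```python
# Illustrative only; the project's real queue handlers live in msg_queue/.
# pika is assumed here as the RabbitMQ client.
import json

import pika

credentials = pika.PlainCredentials("admin", "admin")
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host="localhost", port=5672, credentials=credentials)
)
channel = connection.channel()

# Declare the queue used between the RSS service and the scraper
channel.queue_declare(queue="rss_to_scraping", durable=True)

message = {"article_url": "https://example.com/story", "source": "bbc"}
channel.basic_publish(
    exchange="",
    routing_key="rss_to_scraping",
    body=json.dumps(message),
)
connection.close()
```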
Run the RSS feed aggregation service:

```bash
make run-rss
# or manually:
python main.py kalinga
```

Run the scraping service:

```bash
make run-scrap
# or manually:
python main.py bundelkhand
```

Run the summarization service:

```bash
make run-summ
# or manually:
python main.py amarkantak
```

Run all services:

```bash
make run-all
# or manually:
python main.py mahabharat
```
Generate a new migration:

```bash
make gen-db NAME="description_of_change"
# or manually:
alembic revision --autogenerate -m "description_of_change"
```

Apply migrations:
```bash
make apply-db
# or manually:
alembic upgrade head
```

Update requirements.txt:
```bash
make gen-req
```

```
project_skim/
├── article_extractors/           # Article extraction modules
├── config/                       # Configuration files
│   ├── config.py                 # Service and queue names
│   ├── constants.py              # Model configuration and prompts
│   └── env.py                    # Environment variable handling
├── curncher/                     # Task management
├── database/                     # Database-related code
│   ├── models/                   # SQLAlchemy models
│   ├── repository/               # Data access layer
│   └── connection.py             # Database connection handler
├── llm_explorer/                 # LLM summarization service
│   ├── main.py                   # Service entry point
│   ├── model_handler.py          # Model loading and inference
│   └── helpers.py                # Utility functions
├── migrations/                   # Alembic database migrations
├── msg_queue/                    # Message queue handlers
├── rss_feeds/                    # RSS feed aggregation service
│   ├── parsers/                  # RSS feed parsers for different sources
│   ├── core/                     # Core aggregation logic
│   └── main.py                   # Service entry point
├── scraper/                      # Web scraping service
│   ├── pre_processing/           # Article preprocessing modules
│   └── main.py                   # Service entry point
├── docker-compose-db.yml         # PostgreSQL Docker setup
├── docker-compose-msg-queue.yml  # RabbitMQ Docker setup
├── main.py                       # Main entry point for all services
├── Makefile                      # Convenience commands
└── requirements.txt              # Python dependencies
```
- Fetches articles from configured RSS feeds
- Parses feed data using source-specific parsers
- Stores article metadata in the `raw_articles` table
- Publishes articles to the scraping queue
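As a rough sketch of this RSS flow, fetching a feed and extracting item metadata with BeautifulSoup's XML parser could look like the following; `requests` and the field handling are assumptions, and the project's real source-specific parsers live in `rss_feeds/parsers/`:

```python
# Illustrative sketch; the project's source-specific parsers live in rss_feeds/parsers/.
# requests is assumed here; field handling is simplified.
import requests
from bs4 import BeautifulSoup


def fetch_feed_items(feed_url: str, source: str) -> list[dict]:
    """Download an RSS feed and return basic article metadata."""
    response = requests.get(feed_url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "xml")
    items = []
    for item in soup.find_all("item"):
        items.append({
            "title": item.title.get_text(strip=True) if item.title else None,
            "article_url": item.link.get_text(strip=True) if item.link else None,
            "published_date": item.pubDate.get_text(strip=True) if item.pubDate else None,
            "source": source,
        })
    return items
```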
- Consumes articles from the RSS queue
- Scrapes full article content from URLs
- Handles source-specific preprocessing (e.g., TOI preprocessing)
- Stores article data in the `summarized_articles` table
- Publishes the article body to the summarization queue
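A minimal sketch of content extraction with BeautifulSoup; the paragraph-based heuristic and function name are illustrative, since the real extraction logic lives in `article_extractors/` and `scraper/`:

```python
# Illustrative sketch; real extraction logic lives in article_extractors/ and scraper/.
# The paragraph-based heuristic here is an assumption, not the project's approach.
import requests
from bs4 import BeautifulSoup


def extract_article_body(article_url: str) -> str:
    """Fetch a page and return its visible paragraph text."""
    response = requests.get(article_url, timeout=10)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")

    # Drop non-content tags before extracting text
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    paragraphs = [p.get_text(" ", strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)
```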
- Consumes articles from the scraping queue
- Uses BART model to generate summaries
- Handles long articles by chunking when necessary
- Updates articles in the database with summaries
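The chunking step might work roughly as below: split the body into ~300-word pieces (the chunk size from `constants.py`), summarize each, and join the partial summaries. The helper names are hypothetical and the real code is in `llm_explorer/`:

```python
# Hypothetical sketch of the chunk-then-summarize step; the real code is in llm_explorer/.
def chunk_text(text: str, chunk_size: int = 300) -> list[str]:
    """Split text into chunks of roughly `chunk_size` words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


def summarize_long_article(summarizer, body: str) -> str:
    """Summarize each chunk separately and join the partial summaries."""
    parts = []
    for chunk in chunk_text(body):
        result = summarizer(chunk, max_length=130, min_length=30, do_sample=False)
        parts.append(result[0]["summary_text"])
    return " ".join(parts)
```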
- `raw_articles`: Stores initial RSS feed article metadata
  - id, title, article_url, source, image_url, published_date, processed
- `summarized_articles`: Stores scraped articles with summaries
  - id, title, article_url, source, body, img_src, published_date, category_id, raw_article_id
- `article_category`: Categories for articles
  - id, name, logo_src, description
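For orientation, a `raw_articles` model along these lines might be defined with SQLAlchemy; the column types are assumptions inferred from the field names above, and the actual models live in `database/models/`:

```python
# Hypothetical sketch of the raw_articles model; the real definition lives in
# database/models/. Column types are inferred from the field names above.
from sqlalchemy import Boolean, Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class RawArticle(Base):
    __tablename__ = "raw_articles"

    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)
    article_url = Column(String, nullable=False, unique=True)
    source = Column(String)
    image_url = Column(String)
    published_date = Column(DateTime)
    processed = Column(Boolean, default=False)  # set once the scraper has handled it
```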
Required environment variables:
- `DATABASE_URL`: PostgreSQL connection string
- The system processes articles asynchronously through message queues
- Long articles are automatically chunked before summarization
- The BART model requires sufficient GPU memory for optimal performance
- All services log their activities for debugging and monitoring
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests (if available)
- Submit a pull request
[Add your license information here]