Smart-Go-WebCrawler

A high-performance, context-aware web crawler built in Go with PostgreSQL backend. Features intelligent duplicate detection, content quality analysis, and priority-based crawling.

🚀 Features

Smart Crawling

Context-Aware: Analyzes page content and structure to make intelligent crawling decisions
Priority-Based: Uses sophisticated algorithms to prioritize high-value content
Duplicate Detection: Advanced hash-based duplicate detection to avoid redundant crawling
Content Quality Analysis: Evaluates page quality using multiple metrics
Adaptive Rate Limiting: Dynamic rate limiting based on page priority and server response

Performance Optimizations

Concurrent Workers: Configurable worker pool for parallel processing
Database Optimization: Efficient PostgreSQL schema with proper indexing
Memory Management: Optimized memory usage for large-scale crawling
Connection Pooling: HTTP connection reuse for better performance

Traditional vs Smart Comparison

The project includes comprehensive benchmarking between traditional breadth-first crawling and the smart approach, typically showing:

30-50% faster crawling due to better prioritization
60-80% reduction in duplicate content processing
40-60% improvement in content quality of crawled pages
Reduced server load through smarter request patterns

🛠️ Setup (Windows)

Prerequisites

Go 1.21 or later
PostgreSQL 12 or later

Installation

Install PostgreSQL

# Download and install PostgreSQL from https://www.postgresql.org/download/windows/
# Or use Chocolatey:
choco install postgresql

Create Database

-- Connect to PostgreSQL as superuser
psql -U postgres

-- Create database and user
CREATE DATABASE smart_crawler;
CREATE USER crawler_user WITH PASSWORD 'your_password';
GRANT ALL PRIVILEGES ON DATABASE smart_crawler TO crawler_user;

Build

go mod init smart-crawler
go mod tidy

# Build the project
go build -o smart-crawler.exe main.go

Environment Configuration Create a .env file in the project root:

DATABASE_URL=postgres://crawler_user:your_password@localhost/smart_crawler?sslmode=disable
USER_AGENT=SmartCrawler/1.0
REQUEST_TIMEOUT=30
RATE_LIMIT=100

🏃‍♂️ Usage

Basic Crawling

# Smart crawling (default)
./smart-crawler.exe -url="https://example.com" -depth=3 -workers=10

# Traditional crawling
./smart-crawler.exe -mode=traditional -url="https://example.com" -depth=3 -workers=10

# Performance benchmark
./smart-crawler.exe -mode=benchmark -url="https://example.com" -depth=2 -workers=5

Command Line Options

-mode: Crawler mode (smart, traditional, benchmark)
-url: Starting URL to crawl
-depth: Maximum crawl depth (default: 3)
-workers: Number of concurrent workers (default: 10)

🏗️ Architecture

Project Structure

smart-crawler/
├── main.go              # Application entry point
├── config/             
│   └── config.go        # Configuration management
├── models/             
│   └── models.go        # Data models and structures
├── crawler/            
│   ├── traditional.go   # Traditional BFS crawler
│   └── smart.go         # Smart context-aware crawler
├── database/           
│   └── postgres.go      # PostgreSQL operations
├── utils/              
│   └── utils.go         # Utility functions
├── benchmark/          
│   └── benchmark.go     # Performance benchmarking
└── README.md

Database Schema

-- Pages table stores crawled content
pages (
    id SERIAL PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    title TEXT,
    content TEXT,
    status_code INTEGER,
    content_type TEXT,
    size BIGINT,
    load_time_ms BIGINT,
    depth INTEGER,
    parent_url TEXT,
    crawled_at TIMESTAMP,
    hash TEXT,
    importance_score FLOAT,
    content_quality FLOAT,
    link_density FLOAT
);

-- Links table stores page relationships
links (
    id SERIAL PRIMARY KEY,
    source_id BIGINT REFERENCES pages(id),
    target_id BIGINT REFERENCES pages(id),
    url TEXT NOT NULL,
    anchor TEXT,
    rel TEXT
);

-- Crawl queue for smart crawler
crawl_queue (
    id SERIAL PRIMARY KEY,
    url TEXT UNIQUE NOT NULL,
    priority INTEGER,
    depth INTEGER,
    parent_url TEXT,
    scheduled_at TIMESTAMP,
    attempts INTEGER,
    status TEXT
);

🧠 Smart Crawler Algorithm

1. Content Analysis

Quality Scoring: Analyzes text length, structure, meta tags
Importance Calculation: Considers semantic content, navigation depth
Link Density: Evaluates ratio of links to content

2. Priority Calculation

priority = base_priority + 
           anchor_text_bonus + 
           semantic_bonus + 
           page_importance_bonus - 
           navigation_penalty

3. Duplicate Detection

Content Hashing: MD5 hash comparison for exact duplicates
Similarity Detection: Future enhancement for near-duplicate detection

4. Adaptive Rate Limiting

Priority-Based: Higher priority pages get faster processing
Server-Respectful: Dynamic delays based on server response

🚀 Performance Optimizations

Go-Specific Optimizations

Goroutine Pool: Efficient worker management
Channel-Based Communication: Non-blocking inter-goroutine communication
Connection Pooling: HTTP client reuse
Memory Management: Efficient string handling and buffer reuse

Database Optimizations

Indexed Queries: Strategic indexing on frequently queried columns
Batch Operations: Bulk inserts for better performance
Connection Pooling: Database connection reuse
Prepared Statements: Query optimization

🔧 Configuration Options

Environment Variables

DATABASE_URL=postgres://user:pass@localhost/db?sslmode=disable
USER_AGENT=SmartCrawler/1.0
REQUEST_TIMEOUT=30
RATE_LIMIT=100

Crawler Parameters

Workers: 1-50 (optimal: 5-15 for most sites)
Depth: 1-10 (optimal: 2-5 for comprehensive crawling)
Rate Limit: 1-1000 requests/minute

🛡️ Best Practices

Respectful Crawling

robots.txt: Respects robots.txt directives
Rate Limiting: Configurable delays between requests
User Agent: Clear identification in headers
Error Handling: Graceful handling of server errors

Resource Management

Memory Usage: Efficient memory management for large crawls
Connection Limits: Respects server connection limits
Timeout Management: Proper timeout handling
Graceful Shutdown: Clean shutdown on interruption

🔍 Monitoring and Logging

Built-in Metrics

Pages processed per second
Error rates and types
Memory usage statistics
Database performance metrics

Logging Levels

Info: General crawling progress
Warning: Recoverable errors
Error: Critical failures
Debug: Detailed debugging information

🚧 Future Enhancements

Planned Features

JavaScript Rendering: Headless browser integration
Machine Learning: AI-powered content classification
Distributed Crawling: Multi-node crawling support
Real-time Analytics: Live crawling dashboard
API Interface: RESTful API for remote control

Scalability Improvements

Horizontal Scaling: Multi-instance coordination
Cloud Storage: S3/GCS integration
Message Queues: Redis/RabbitMQ integration
Microservices: Service decomposition

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
benchmark		benchmark
config		config
crawler		crawler
database		database
models		models
utils		utils
README.md		README.md
go.mod		go.mod
go.sum		go.sum
main.go		main.go
sample_env_file		sample_env_file

pranav11024/smart-crawler

Folders and files

Latest commit

History

Repository files navigation