This is a private, category-specific search engine designed for personal use. It crawls, indexes, and serves search results for exactly five categories: Technology, Business, AI, Sports, and Politics.
The system operates on a daily refresh cycle, ensuring all indexed data is fresh (maximum 30 days old) and relevant. It is built for local development and testing, with Google Drive serving as the canonical data store.
The system follows a pipeline architecture with distinct phases:
Seed URLs → Normalize → Schedule → Fetch → Parse → Index → Search
Advanced Crawler Engine ⭐ NEW
Database-backed crawling system with intelligent scheduling, robots.txt compliance, and global URL deduplication. Features:
- URL Normalization: SHA256-based deduplication across all categories
- Robots.txt Service: 24-hour cached compliance with crawl-delay support
- Crawl Scheduler: Priority-based scheduling with depth and freshness factors
- Inverted Index: Database-backed TF-IDF/BM25 scoring for fast retrieval
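The priority-based scheduling listed above could be sketched roughly as follows. This is a language-neutral Python illustration, not the project's Laravel code: the function names, the 50/50 weighting, and the one-day freshness window are all assumptions chosen only to show how depth and freshness might combine into a single score.

```python
import heapq
from datetime import datetime, timedelta, timezone
from typing import Optional


def crawl_priority(depth: int, last_crawled: Optional[datetime], now: datetime) -> float:
    """Hypothetical score in [0, 1]: shallow and stale pages rank higher."""
    depth_factor = 1.0 / (1 + depth)  # depth 0 (a seed URL) gives 1.0
    if last_crawled is None:
        freshness_factor = 1.0  # never crawled, so most urgent
    else:
        hours = (now - last_crawled).total_seconds() / 3600
        freshness_factor = min(hours / 24.0, 1.0)  # saturates after one day
    return 0.5 * depth_factor + 0.5 * freshness_factor  # assumed equal weights


def build_queue(candidates, now):
    """candidates: iterable of (url, depth, last_crawled) tuples."""
    heap = []
    for url, depth, last_crawled in candidates:
        score = crawl_priority(depth, last_crawled, now)
        heapq.heappush(heap, (-score, url))  # negate: heapq is a min-heap
    return heap


def pop_next(heap):
    """Return the URL with the highest priority."""
    return heapq.heappop(heap)[1]
```

A seed URL that has never been crawled scores highest, while a deep page fetched an hour ago sinks to the bottom of the queue.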
Laravel Orchestration Layer
Laravel serves as the orchestration framework, providing job queuing, scheduling, and API routing. Business logic resides in dedicated service classes.
- Enhanced Search Core: Database-backed BM25 ranking with freshness and link popularity boosts
- Intelligent Queries: Supports logical operators (AND, OR, NOT), exact phrase matching (""), and automatic synonym expansion (e.g., AI → ML)
- Rich Results: Result highlighting, confidence scores (0-1), match scores (1-10), and query suggestions
- Advanced Filtering: Filter by category, date range (from_date, to_date), and custom sorting (relevance, date_desc)
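To make the query features above concrete, here is a minimal Python sketch of how a query string might be split into operators, quoted phrases, and synonym-expanded terms. It is an illustration only: the `parse_query` function, the token labels, and the one-entry synonym table are assumptions, and the real search core may parse queries quite differently.

```python
import re

# Hypothetical synonym table; the project's actual expansion list is not documented here.
SYNONYMS = {"ai": ["ml"]}
OPERATORS = {"AND", "OR", "NOT"}

# Either a double-quoted phrase or a run of non-space characters.
TOKEN_RE = re.compile(r'"([^"]*)"|(\S+)')


def parse_query(query):
    """Return a list of (kind, value) tokens for a raw query string."""
    tokens = []
    for phrase, word in TOKEN_RE.findall(query):
        if phrase:
            tokens.append(("PHRASE", phrase.lower()))  # exact phrase, kept whole
        elif word in OPERATORS:
            tokens.append(("OP", word))  # operators are matched in upper case only
        elif word:
            term = word.lower()
            tokens.append(("TERM", [term] + SYNONYMS.get(term, [])))
    return tokens
```

For example, `parse_query('AI AND "machine learning" NOT hype')` yields a term expanded to `["ai", "ml"]`, an `AND` operator, the phrase `machine learning`, a `NOT` operator, and the term `hype`.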
Crawler Service
Implements polite, ethical web crawling with robots.txt compliance (via RobotsTxtService), per-domain rate limiting, and comprehensive HTTP validation. Handles 429, 5xx, redirects, and timeouts gracefully. Respects crawl-delay directives.
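The robots.txt and rate-limiting behaviour described above can be approximated in a few lines of Python using the standard library's `urllib.robotparser`. This is a sketch under stated assumptions, not the project's RobotsTxtService: the user-agent string, the `DomainRateLimiter` class, and the rule that the larger of crawl-delay and a 1-second default wins are all hypothetical.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "PrivateSearchBot"  # hypothetical user-agent string


def load_robots(robots_txt):
    """Parse robots.txt content that has already been fetched (and cached)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


class DomainRateLimiter:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, default_delay=1.0, clock=time.monotonic):
        self.default_delay = default_delay
        self.clock = clock
        self.next_allowed = {}

    def wait_time(self, url, robots):
        """Reserve a slot for url and return how many seconds to sleep first."""
        host = urlsplit(url).netloc
        # Respect crawl-delay, but never go faster than the default delay.
        delay = max(float(robots.crawl_delay(USER_AGENT) or 0), self.default_delay)
        now = self.clock()
        ready_at = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = ready_at + delay
        return ready_at - now
```

A caller would check `robots.can_fetch(USER_AGENT, url)` first, then sleep for `wait_time(url, robots)` seconds before issuing the request.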
Parser Service
Extracts structured data from raw HTML including title, canonical URL, meta description, and publish date. Uses UrlNormalizerService for global duplicate detection via SHA256 hashes across all categories.
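As an illustration of SHA256-based duplicate detection, the Python sketch below normalizes a URL and hashes the result. The exact normalization rules used by UrlNormalizerService are not documented here, so the choices below (lowercasing the host, dropping default ports, fragments, trailing slashes, and a few tracking parameters, and sorting the query string) are assumptions.

```python
import hashlib
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking parameters to ignore when comparing URLs.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "fbclid"}
DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize_url(url):
    """Return a canonical form so equivalent URLs compare equal."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"  # keep only non-default ports
    path = parts.path.rstrip("/") or "/"
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in TRACKING_PARAMS
    ]
    query = urlencode(sorted(params))
    return urlunsplit((scheme, host, path, query, ""))  # fragment is dropped


def url_hash(url):
    """SHA256 hex digest of the normalized URL."""
    return hashlib.sha256(normalize_url(url).encode("utf-8")).hexdigest()


def is_duplicate(url, seen_hashes):
    """Check a URL against a shared set of hashes, recording it if new."""
    digest = url_hash(url)
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False
```

Because the set of hashes is shared rather than kept per category, a URL first seen under Technology would be skipped if it later turned up under AI.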
Index Engine Service ⭐ NEW
Builds inverted index with tokenization, stopword removal, and optional stemming. Stores tokens and postings in database for efficient BM25 scoring.
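The sketch below shows, in Python, one way an inverted index with BM25 scoring can work. It keeps postings in memory rather than in a database, omits stemming, and uses a tiny stopword list and the common defaults k1 = 1.2 and b = 0.75; these are all simplifying assumptions rather than the Index Engine Service's actual settings.

```python
import math
import re
from collections import Counter, defaultdict

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on non-alphanumerics, and drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


class InvertedIndex:
    def __init__(self, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}  # doc_id -> number of tokens

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.doc_len[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf

    def search(self, query):
        """Return (doc_id, score) pairs sorted by descending BM25 score."""
        n = len(self.doc_len)
        if n == 0:
            return []
        avg_len = sum(self.doc_len.values()) / n
        scores = defaultdict(float)
        for term in tokenize(query):
            docs = self.postings.get(term, {})
            if not docs:
                continue
            idf = math.log(1 + (n - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                length_norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Storing term frequencies per posting is what lets the score be computed from the index alone, without re-reading the original documents at query time.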
Storage Service
Manages JSON file lifecycle including generation, validation, upload to Google Drive, and integrity verification via checksums. Google Drive serves as optional backup.
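Integrity verification via checksums can be illustrated with a short Python sketch: compute a digest when the JSON file is written, then recompute it on the copy that comes back from storage. The function names and the choice of SHA256 are assumptions; the Storage Service may use a different hash or compare against a checksum reported by Google Drive.

```python
import hashlib
import json


def sha256_of_file(path, chunk_size=65536):
    """Stream the file so large indexes do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_index(records, path):
    """Write records as JSON and return the checksum to compare after upload."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, ensure_ascii=False, sort_keys=True)
    return sha256_of_file(path)


def verify_copy(expected_checksum, downloaded_path):
    """True when the downloaded copy is byte-identical to what was written."""
    return sha256_of_file(downloaded_path) == expected_checksum
```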
Search API
Versioned REST API secured by Laravel Sanctum and Master API Key. Serves search results from database-backed inverted index with BM25 scoring.
Cache Manager
Maintains local cache by downloading and merging ALL relevant index files from Google Drive for each category (legacy support).
The entire system lifecycle is orchestrated via a single master command that runs sequentially:
Crawl → Process → Index → Upload → Cache Refresh
This can be triggered manually via php artisan master:refresh or automatically via the daily scheduler.
Crawl Phase (00:00 - 02:00)
- Queue crawl jobs for seed URLs across all categories
- Respect robots.txt and enforce rate limiting
- Validate HTTP responses and content types
- Store raw HTML temporarily
Parse Phase (02:00 - 03:00)
- Extract structured data from raw HTML
- Normalize URLs and detect duplicates
- Global duplicate check: skip any URL or content hash that already exists in any category
Index Phase (03:00 - 04:00)
- Fetch existing records from Google Drive
- Merge with new local records and deduplicate
- Enforce 30-day age limit (records > 30 days are purged)
- Generate timestamped JSON files if an index for today already exists
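The merge, deduplicate, and purge steps of the index phase might look like the following Python sketch. The record shape is an assumption: each record is a dict with a `url_hash` key and an ISO 8601 `published_at` timestamp, and when two records share a hash the first one seen is kept.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(days=30)


def merge_records(existing, new, now):
    """Combine existing and new records, dropping duplicates and stale entries."""
    cutoff = now - MAX_AGE
    merged = {}
    for record in list(existing) + list(new):
        published = datetime.fromisoformat(record["published_at"])
        if published < cutoff:
            continue  # older than 30 days: purge
        merged.setdefault(record["url_hash"], record)  # first occurrence wins
    return list(merged.values())
```

Processing the existing records before the new ones means an already-indexed record is never replaced by a later duplicate, which matches the "skip if already present" behaviour described for the parse phase.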
Cleanup Phase (04:00 - 04:30)
- Remove temporary crawl data
- Purge old index files
Upload Phase (04:30 - 05:00)
- Upload validated JSON to Google Drive
- Verify upload integrity via checksums
- Skip upload if no new unique records were found
Cache Refresh Phase (05:00 - 05:30)
- Download and merge all valid JSON files from Drive per category
- Update local cache atomically
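An atomic cache update is commonly done by writing to a temporary file in the same directory and then renaming it over the old file. The Python sketch below shows that pattern; the function name and the JSON cache format are assumptions, and the Cache Manager may achieve atomicity another way.

```python
import json
import os
import tempfile


def write_cache_atomically(records, cache_path):
    """Replace cache_path in one step so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(cache_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(records, handle)
            handle.flush()
            os.fsync(handle.fileno())  # make sure the bytes reach the disk
        os.replace(tmp_path, cache_path)  # atomic when both paths share a filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # never leave a half-written temp file behind
        raise
```

If the process dies mid-write, the previous cache file is still intact, so the search API can keep serving the last good data.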
Serve Phase (Always Active)
- API endpoints serve search results from local cache (Sanctum or Master Key required)
- Handle stale or missing data gracefully
- Maximum data age: 30 days
- Minimum records per category: 5 (configurable)
- Index files older than 30 days are automatically purged
- Google Drive is the source of truth
The system supports exactly five categories. This list is immutable:
- Technology - Software, hardware, programming, tech industry news
- Business - Finance, markets, entrepreneurship, corporate news
- AI - Artificial intelligence, machine learning, AI research and applications
- Sports - All sports news, events, and analysis
- Politics - Political news, policy, elections, government
Each category must maintain a minimum record count (default: 5). If this threshold cannot be met, the system logs a failure and does not upload incomplete data.
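The threshold check can be expressed in a few lines. In this Python sketch the lowercase category slugs and the function name are assumptions (only `technology` and `ai` appear as slugs elsewhere in this README), and the default of 5 mirrors the configurable minimum described above.

```python
# Assumed slugs for the five fixed categories.
CATEGORIES = ("technology", "business", "ai", "sports", "politics")


def categories_below_minimum(counts, minimum=5):
    """Return the categories whose record count is under the threshold."""
    return [name for name in CATEGORIES if counts.get(name, 0) < minimum]


def should_upload(counts, minimum=5):
    """Upload only when every category meets the minimum record count."""
    return not categories_below_minimum(counts, minimum)
```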
The system is secured using two primary methods:
- Laravel Sanctum: Used for the Search UI. Users must log in via a secure modal. Tokens are managed via session-based cookies or local storage.
- Master API Key: Used for cross-service authentication. Supplied via the X-API-MASTER-KEY header, an Authorization: Bearer token, or the api_master_key query parameter.
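The three ways of supplying the Master API Key could be checked as in the Python sketch below. It is framework-agnostic and purely illustrative: the function names and the lookup order (dedicated header, then Bearer token, then query parameter) are assumptions. A real handler would also need to tell a Master Key apart from a Sanctum Bearer token.

```python
import hmac


def extract_master_key(headers, query):
    """Look for the key in the header, the Bearer token, then the query string."""
    normalized = {name.lower(): value for name, value in headers.items()}
    if "x-api-master-key" in normalized:
        return normalized["x-api-master-key"]
    authorization = normalized.get("authorization", "")
    if authorization.lower().startswith("bearer "):
        return authorization[len("bearer "):].strip()
    return query.get("api_master_key")


def is_authorized(headers, query, expected_key):
    """Compare in constant time to avoid leaking the key through timing."""
    supplied = extract_master_key(headers, query)
    if not supplied:
        return False
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```

Passing the key in a header is generally preferable to the query parameter, since URLs tend to end up in server logs and browser history.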
- Crawling: Maximum 1 request per second per domain
- API: 60 requests per minute per IP address
- No user tracking or analytics
- No external service dependencies except Google Drive
- All crawled data is publicly available web content
- Google Drive credentials stored in .env file (not committed to version control)
- No production secrets used in local development
JavaScript Heavy Sites
The crawler does not execute JavaScript. Sites that render content dynamically via JavaScript will not be indexed correctly. This is a known tradeoff for performance and simplicity.
Crawl Coverage
With rate limiting and polite crawling, achieving 1000+ records per category per day requires a substantial seed URL list. Initial setup may require manual curation of seed URLs.
Google Drive Dependency
The system is entirely dependent on Google Drive for persistent storage. Google Drive outages will prevent uploads but will not affect search serving from local cache.
No Real Time Updates
The system operates on a daily refresh cycle. Content published between refresh cycles will not be available until the next cycle completes.
Local Development Only
The system is designed for local development and testing. Production deployment would require additional hardening, monitoring, and infrastructure.
Manual Seed URL Management
Seed URLs must be manually curated and maintained. There is no automatic discovery of new sources.
No Failure Notifications
The system logs failures but does not send notifications. Monitoring must be manual or via log aggregation.
Do not use this system if you need:
- Real-time or near-real-time search results
- Indexing of JavaScript-rendered content
- Automatic source discovery
- Production grade reliability and monitoring
- Multi user access with authentication
- Categories beyond the five defined categories
- Data retention beyond 30 days
- Guaranteed minimum record counts (system may fail to meet 1000 record threshold)
This system supports two authentication methods for Google Drive:
Service Account
- Create a Service Account in Google Cloud Console.
- Download the JSON key and save it to storage/app/credentials/service-account.json.
- Share your target Google Drive folder with the Service Account email (Editor access).
- Configure GOOGLE_DRIVE_SERVICE_ACCOUNT_JSON in .env.
OAuth Client
- Create a "Desktop" OAuth Client in Google Cloud Console.
- Save credentials to storage/app/credentials/client_secret.json.
- Run php artisan google-drive:authorize to log in via browser.
- Configure GOOGLE_DRIVE_CLIENT_SECRET_JSON and GOOGLE_DRIVE_TOKEN_JSON in .env.
If you are developing locally and encounter SSL certificate errors (e.g., cURL error 60), add this to your .env:
GOOGLE_DRIVE_VERIFY_SSL=false
Run these commands to set up the environment and database:
cp .env.example .env
php artisan key:generate
touch database/database.sqlite
php artisan vendor:publish --provider="Laravel\Sanctum\SanctumServiceProvider"
php artisan migrate
php artisan db:seed --class=CreateUserSeeder
Run the entire lifecycle sequentially:
php artisan master:refresh
OR
Start a fresh refresh, which wipes existing data first:
php artisan master:refresh --fresh
If you prefer to run phases manually:
Authorize (first time only):
php artisan google-drive:authorize
Crawl: Start the daily crawl for all categories, or a specific one.
# All categories
php artisan crawl:daily
# Specific category
php artisan crawl:category technology
Master Refresh
php artisan master:refresh [--async] [--fresh]
Orchestrates the entire crawl-parse-index-upload-cache cycle. Use --async for background execution, --fresh to wipe existing data.
Schedule Crawls
php artisan crawler:schedule [--reprioritize] [--cleanup]
Schedule URLs for crawling based on priority and freshness. Use --reprioritize to recalculate all URL priorities, --cleanup to remove stale queue entries.
Monitor Crawler Health
php artisan crawler:monitor [--hours=24]
Display a crawler health dashboard with database statistics, performance metrics, and HTTP status distribution.
Daily Crawl
php artisan crawl:daily
Queue crawl jobs for seed URLs across all categories.
Generate Index
php artisan index:generate [--category=technology]
Generate search index from parsed records. Optionally specify category.
Upload to Google Drive
php artisan upload:index
Upload generated indexes to Google Drive (optional backup).
Refresh Local Cache
php artisan cache:refresh
Download and merge all indexes from Google Drive into local cache.
Trigger Refresh via API
curl -X POST http://localhost:8000/api/v1/trigger-refresh \
  -H "Authorization: Bearer YOUR_TOKEN"
Search via API
curl "http://localhost:8000/api/v1/search?q=artificial+intelligence&category=ai" \
  -H "Authorization: Bearer YOUR_TOKEN"
- DEPLOYMENT.md: Detailed setup for OAuth and environment.
- API.md: REST API documentation.
- RULES.md: Core system constraints and category definitions.
This is a private project for personal use. No license is granted for redistribution or commercial use.