A minimal web spider and indexer
USE AT YOUR OWN RISK - possibly unsafe
Currently implemented:
- robots.txt disallow rules (untested)
- PageRank (needs tuning)
Dependencies:
- lynx
- sqlite3
- libsqlite3-dev
- sqlitepipe
Setup:
- fill out seed_data.sql (see sample_seed_data.sql, and the sketch below)
- compile the sqlitepipe extension:
    cd sqlitepipe/ && make
- initialize the schema and seed data:
    sqlite3 pages.db < schema.sql && sqlite3 pages.db < seed_data.sql
- initialize robots.txt rules for a domain (no trailing slash on the URL!):
    ./robo-parse.sh https://example.com | sqlite3 pages.db
- start the crawler:
    ./wrapper.sh
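A minimal sketch of what seed_data.sql might contain, assuming schema.sql defines a pages table with a url column; these names are assumptions, and sample_seed_data.sql shows the real layout:

    -- Hypothetical seed rows: 'pages' and 'url' are illustrative names;
    -- check schema.sql and sample_seed_data.sql for the actual layout.
    INSERT INTO pages (url) VALUES
        ('https://example.com'),
        ('https://example.org');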
TODO:
- more thorough robots.txt testing
- add 'Allow' rule logic
- respect HTTP 429 rate-limit responses
- respect 'Crawl-delay' rules
- consolidate the whitelist query in wrapper.sh with a CTE in crawl.sql (see the sketch below)
- add full-text search
- add Bayesian spam filtering
- tune PageRank's 'alpha' parameter and iteration count (see the sketch below)
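One possible shape for the consolidated whitelist CTE mentioned above, sketched with assumed table and column names (pages, robots_rules, fetched, domain, path_prefix) rather than the actual schema:

    -- Hypothetical: select the next un-fetched URL that no disallow rule matches.
    -- Note that LIKE treats '_' as a wildcard; good enough for a sketch.
    WITH crawlable AS (
        SELECT p.url
        FROM pages AS p
        WHERE p.fetched = 0
          AND NOT EXISTS (
                SELECT 1
                FROM robots_rules AS r
                WHERE p.url LIKE r.domain || r.path_prefix || '%'
          )
    )
    SELECT url FROM crawlable LIMIT 1;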
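For the PageRank tuning item, the sketch below shows what a single damping iteration can look like in SQLite, assuming hypothetical pages(id, rank, rank_prev, out_degree) and links(src, dst) tables; 0.85 stands in for the 'alpha' parameter, and the pair of statements would be repeated for the chosen iteration count:

    -- Snapshot current ranks so the sweep reads a consistent set of values.
    UPDATE pages SET rank_prev = rank;
    -- One PageRank sweep: (1 - alpha)/N plus alpha times each incoming share.
    -- Dangling pages (no outgoing links) contribute nothing in this sketch.
    UPDATE pages
    SET rank = (1.0 - 0.85) / (SELECT COUNT(*) FROM pages)
             + 0.85 * COALESCE((
                   SELECT SUM(src.rank_prev / src.out_degree)
                   FROM links
                   JOIN pages AS src ON src.id = links.src
                   WHERE links.dst = pages.id
               ), 0.0);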