Respect robots.txt when crawling when the option is set to True #42

Description

@indrajithi
  • Option to set respect_robots_txt (default should be True, since honouring robots.txt is a legal obligation in some jurisdictions)
  • Fetch and parse robots.txt (urllib.robotparser will help with parsing; see the first sketch after this list)
  • Create a crawl rule set per domain
  • Check URL permissions before crawling a URL
  • Make sure it works when concurrent workers are fetching different domains (see the second sketch below)
  • Use the rules provided in robots.txt when fetching (e.g. honour the Crawl-delay directive if present, and check the rules before crawling a path)
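
A minimal sketch of the first few items, using the standard-library urllib.robotparser as the issue suggests. The USER_AGENT string, the function names, and the module-level cache are assumptions for illustration, not part of the issue; error handling for an unreachable robots.txt (allow-all vs. deny-all) is a policy decision left out here.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler"  # assumption: whatever User-Agent the crawler sends

_parsers = {}  # one cached RobotFileParser (i.e. rule set) per domain


def _rules_for(url):
    """Fetch and cache robots.txt for the domain that owns `url`."""
    parts = urlparse(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if origin not in _parsers:
        parser = RobotFileParser(origin + "/robots.txt")
        parser.read()  # downloads and parses the file; failure policy omitted
        _parsers[origin] = parser
    return _parsers[origin]


def can_crawl(url, respect_robots_txt=True):
    """Check URL permission before crawling; respects robots.txt by default."""
    if not respect_robots_txt:
        return True
    return _rules_for(url).can_fetch(USER_AGENT, url)
```

For example, `can_crawl("https://example.com/some/path")` fetches https://example.com/robots.txt once, caches the parsed rules, and returns whether that path is allowed for this user agent.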
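A second sketch for the concurrency and Crawl-delay items. The class name, method names, and the lock-then-sleep-outside-the-lock pattern are assumptions about one possible design, not the project's actual implementation; the key property is that a worker waiting out one domain's Crawl-delay never blocks workers fetching other domains.

```python
import threading
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler"  # assumption, as above


class DomainRules:
    """Thread-safe per-domain robots.txt rules plus crawl-delay bookkeeping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._parsers = {}     # origin -> RobotFileParser
        self._last_fetch = {}  # origin -> monotonic timestamp of last request

    def _rules_for(self, origin):
        with self._lock:  # checked under the lock so each robots.txt is read once
            if origin not in self._parsers:
                parser = RobotFileParser(origin + "/robots.txt")
                parser.read()  # note: network I/O under the lock; fine for a sketch
                self._parsers[origin] = parser
            return self._parsers[origin]

    def acquire_slot(self, url):
        """Block until this domain's Crawl-delay has elapsed; return whether
        robots.txt allows fetching `url` at all."""
        parts = urlparse(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        rules = self._rules_for(origin)
        if not rules.can_fetch(USER_AGENT, url):
            return False
        delay = rules.crawl_delay(USER_AGENT) or 0  # None when no Crawl-delay
        while True:
            with self._lock:
                elapsed = time.monotonic() - self._last_fetch.get(origin, 0.0)
                if elapsed >= delay:
                    self._last_fetch[origin] = time.monotonic()
                    return True
                remaining = delay - elapsed
            # Sleep outside the lock so workers on other domains proceed freely.
            time.sleep(remaining)
```

Each worker would call `acquire_slot(url)` before issuing a request: a False return means robots.txt disallows the path, while a True return means the path is allowed and the domain's Crawl-delay has been respected.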
