Respect robots.txt when crawling when the option is set to True #42

Description

@indrajithi
  • Option to set respect_robots_txt (default should be True, since honouring robots.txt is a legal obligation in some jurisdictions)
  • Fetch and parse robots.txt (urllib.robotparser will help with parsing; see the first sketch after this list)
  • Create a crawl rule set per domain
  • Check URL permissions before crawling a URL
  • Make sure it works when concurrent workers are fetching different domains (see the second sketch below)
  • Use the rules provided in robots.txt when fetching (e.g. honour the Crawl-delay directive if present, and check the rules before crawling a path)
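
A minimal sketch of the first few items, using the standard-library urllib.robotparser as the issue suggests. The USER_AGENT string, the function names, and the module-level cache are assumptions for illustration, not part of the issue; error handling for an unreachable robots.txt (allow-all vs. deny-all) is a policy decision left out here.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler"  # assumption: whatever User-Agent the crawler sends

_parsers = {}  # one cached RobotFileParser (i.e. rule set) per domain


def _rules_for(url):
    """Fetch and cache robots.txt for the domain that owns `url`."""
    parts = urlparse(url)
    origin = f"{parts.scheme}://{parts.netloc}"
    if origin not in _parsers:
        parser = RobotFileParser(origin + "/robots.txt")
        parser.read()  # downloads and parses the file; failure policy omitted
        _parsers[origin] = parser
    return _parsers[origin]


def can_crawl(url, respect_robots_txt=True):
    """Check URL permission before crawling; respects robots.txt by default."""
    if not respect_robots_txt:
        return True
    return _rules_for(url).can_fetch(USER_AGENT, url)
```

For example, `can_crawl("https://example.com/some/path")` fetches https://example.com/robots.txt once, caches the parsed rules, and returns whether that path is allowed for this user agent.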
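A second sketch for the concurrency and Crawl-delay items. The class name, method names, and the lock-then-sleep-outside-the-lock pattern are assumptions about one possible design, not the project's actual implementation; the key property is that a worker waiting out one domain's Crawl-delay never blocks workers fetching other domains.

```python
import threading
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "my-crawler"  # assumption, as above


class DomainRules:
    """Thread-safe per-domain robots.txt rules plus crawl-delay bookkeeping."""

    def __init__(self):
        self._lock = threading.Lock()
        self._parsers = {}     # origin -> RobotFileParser
        self._last_fetch = {}  # origin -> monotonic timestamp of last request

    def _rules_for(self, origin):
        with self._lock:  # checked under the lock so each robots.txt is read once
            if origin not in self._parsers:
                parser = RobotFileParser(origin + "/robots.txt")
                parser.read()  # note: network I/O under the lock; fine for a sketch
                self._parsers[origin] = parser
            return self._parsers[origin]

    def acquire_slot(self, url):
        """Block until this domain's Crawl-delay has elapsed; return whether
        robots.txt allows fetching `url` at all."""
        parts = urlparse(url)
        origin = f"{parts.scheme}://{parts.netloc}"
        rules = self._rules_for(origin)
        if not rules.can_fetch(USER_AGENT, url):
            return False
        delay = rules.crawl_delay(USER_AGENT) or 0  # None when no Crawl-delay
        while True:
            with self._lock:
                elapsed = time.monotonic() - self._last_fetch.get(origin, 0.0)
                if elapsed >= delay:
                    self._last_fetch[origin] = time.monotonic()
                    return True
                remaining = delay - elapsed
            # Sleep outside the lock so workers on other domains proceed freely.
            time.sleep(remaining)
```

Each worker would call `acquire_slot(url)` before issuing a request: a False return means robots.txt disallows the path, while a True return means the path is allowed and the domain's Crawl-delay has been respected.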
