A Python web spider library and CLI utility that crawls d-addicts.com for Japanese subtitle links.
pip3 install html5lib -r requirements.txt
from daddicts_spider import DAddictsSpider
all_sub_links = set()
delay_between_requests = 6 # optional arg to DAddictsSpider
take_at_least_n_links = 10 # optional arg to DAddictsSpider
for sub_links in DAddictsSpider(delay_between_requests, take_at_least_n_links):
    print(sub_links)
    all_sub_links |= sub_links
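As an illustrative follow-up (not part of the library itself), the collected links can then be written out one per line; the filename sub_links.txt is only an example:

# Persist the collected subtitle links, one per line (illustrative only).
with open('sub_links.txt', 'w') as f:
    for link in sorted(all_sub_links):
        f.write(link + '\n')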
> ./daddicts_spider.py --help
usage: daddicts_spider.py [-h] [-d DELAY] [-t TAKE | -c CRAWL]
optional arguments:
  -h, --help            show this help message and exit
  -d DELAY, --delay DELAY
                        delay in seconds between HTTP requests
  -t TAKE, --take TAKE  take at least and around 'n' links. Will resume
                        from last point when calling the program again.
  -c CRAWL, --crawl CRAWL
                        crawl 'n' times. Will resume from last point when
                        calling the program again.
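For example, a crawl that waits 6 seconds between HTTP requests and stops after collecting around 10 links, using the --delay and --take flags documented above (run from the project's root directory; output not shown):
> ./daddicts_spider.py --delay 6 --take 10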
pip3 install nose
and then run nosetests in the project's root directory.
Copyright (C) 2017 Carlos C. Fontes.
Licensed under the ISC License.