Description
The /CONTRIBUTING.md is a License Agreement / Code of Conduct to sign. As far as I can see, this very valuable project has no actual tutorial.
I don't have the Python and coding knowledge to fix this documentation issue myself, but I can map out the road so it becomes easier for the next person to do so.
Wanted
Suppose a user wants to add a language such as Catalan from Barcelona (ca, cat: currently missing). What do they need in order to jump in quickly? What should they provide?
- What is the local structure:
- What tools are available:
  - list of available modules
  - API of the key functions
- What input(s): a Python list of URLs?
- What are the classic parts of a crawler function? (see the sketch after this list)
- What output format: raw text? Is HTML fine because an HTML tag stripper is applied afterwards? (see the cleaners example further down)
- An example of easily hackable base code.
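
To make these questions concrete, here is a rough, hypothetical sketch of what a per-language crawler might look like, pieced together from the function signatures listed in the API section below (get_output, fetch_content, cleantext). The language code, URL list, import path, and header format are placeholders and assumptions, not the project's confirmed conventions.

```python
# Hypothetical sketch only: names and conventions are inferred from the
# signatures listed below, not verified against util.py.
from util import cleantext  # adjust the import path to wherever util.py lives

def crawl_ca(crawler):
    """Sketch of a crawler for Catalan ('ca'); the URLs below are placeholders."""
    out = crawler.get_output(language='ca')
    urls = ['https://example.org/article-1.html']  # placeholder list of article URLs
    for url in urls:
        html = crawler.fetch_content(url)    # presumably downloads (and caches) the page
        text = cleantext(html)               # assumed to strip tags and HTML entities
        out.write('# Location: %s\n' % url)  # assumed provenance header, one per document
        out.write(text + '\n')
```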
API (to complete)
Functions defined in util.py, in order of appearance as of 2021/02/26. If you have relevant knowledge, please help fill in a sub-section or even a single item.
Some tools
- daterange(start, end): __
- urlpath(url): __
- urlencode(url): __
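
As a starting point for the blanks above, here is a guess at how these helpers could be used, based purely on their names; the behaviour of daterange and urlpath is assumed, and the archive URL pattern is made up.

```python
# Behaviour is guessed from the names alone, not verified against util.py.
import datetime
from util import daterange, urlpath  # adjust the import path to wherever util.py lives

# daterange presumably yields each date from start to end; a crawler could use
# it to build dated archive URLs such as the placeholder pattern below.
for date in daterange(datetime.date(2021, 1, 1), datetime.date(2021, 1, 7)):
    url = 'https://example.org/archive/%s' % date.isoformat()
    print(urlpath(url))  # presumably the path component, e.g. '/archive/2021-01-01'
```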
Main element
class Crawler(object):
- __init__(self, language, output_dir, cache_dir, crawldelay): __
- get_output(self, language=None): __
- close(self): __
- fetch(self, url, redirections=None, fetch_encoding='utf-8'): __
- fetch_content(self, url, allow_404=False): __
- fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True): __
- is_fetch_allowed_by_robots_txt(self, url): __
- crawl_pngscriptures_org(self, out, language): __
- _find_urls_on_pngscriptures_org(self, language): __
- crawl_abc_net_au(self, out, program_id): __
- crawl_churchio(self, out, bible_id): __
- crawl_aps_dz(self, out, prefix): __
- crawl_sverigesradio(self, out, program_id): __
- crawl_voice_of_america(self, out, host, ignore_ascii=False): __
- set_context(self, context): __
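
To help fill in these blanks, here is a hedged sketch of how the methods above might chain together inside a language crawler. The sitemap URL is a placeholder, the return type of fetch_sitemap is assumed, and the output header is a guess.

```python
# Hedged sketch: how the Crawler methods listed above might fit together.
# The return value of fetch_sitemap (a mapping of page URL to last-modified
# date) is an assumption, as is the '# Location:' header.
from util import cleantext  # adjust the import path to wherever util.py lives

def crawl_from_sitemap(crawler, out):
    sitemap = crawler.fetch_sitemap('https://example.org/sitemap.xml')  # placeholder URL
    for url in sorted(sitemap):
        if not crawler.is_fetch_allowed_by_robots_txt(url):
            continue                         # skip pages disallowed by robots.txt
        html = crawler.fetch_content(url)    # presumably downloads and caches the page
        out.write('# Location: %s\n' % url)  # assumed provenance header
        out.write(cleantext(html) + '\n')    # assumed tag/entity stripping
```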
Some crawlers for multi-language sites
- crawl_bbc_news(crawler, out, urlprefix): __
- crawl_korero_html(crawler, out, project, genre, filepath): __
- write_paragraphs(et, out): __
- crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False): __
- crawl_radio_free_asia(crawler, out, edition, start_year=1998): __
- crawl_sputnik_news(crawler, out, host): __
- crawl_udhr(crawler, out, filename): __
- crawl_voice_of_nigeria(crawler, out, urlprefix): __
- crawl_bibleis(crawler, out, bible): __
- crawl_tipitaka(crawler, out, script): __
- find_wordpress_urls(crawler, site, **kwargs): __
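
As an alternative to the hand-rolled sketch earlier in this issue, a new language can presumably reuse these shared site crawlers instead of writing its own fetching logic. The sketch below is hypothetical; the UDHR filename and the BBC URL prefix are placeholders, not real values, and only make sense if the site actually has content in the language.

```python
# Hypothetical sketch: a new language delegating to the shared site crawlers
# listed above. The filename and URL prefix are placeholders.
from util import crawl_udhr, crawl_bbc_news  # adjust the import path to wherever util.py lives

def crawl_ca(crawler):
    out = crawler.get_output(language='ca')
    crawl_udhr(crawler, out, filename='udhr_placeholder.txt')        # placeholder filename
    crawl_bbc_news(crawler, out, urlprefix='/placeholder-service/')  # placeholder prefix
```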
Some cleaners
- unichar(i): __
- replace_html_entities(html): __
- cleantext(html): __
- clean_paragraphs(html): __
- extract(before, after, html): __
- fixquotes(s): __
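
To answer the output-format question from the list above, these helpers look like the "HTML tag stripper" stage. The example below only guesses at their behaviour from the names; the HTML snippet and the markers passed to extract are made up.

```python
# Behaviour guessed from the names; not verified against util.py.
from util import cleantext, extract  # adjust the import path to wherever util.py lives

html = '<html><body><div id="main"><p>Bon dia &amp; benvinguts</p></div></body></html>'
body = extract('<div id="main">', '</div>', html)  # presumably the text between the two markers
print(cleantext(body))                             # presumably 'Bon dia & benvinguts'
```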
A shorter way to do this
In-code comments can do a lot, and so can pointers to wisely chosen sections. If you have the required know-how, please add comments to a chosen, existing crawler and point to it as an in-code tutorial.