Improve readme documentation on how to provide a new crawler #80

@hugolpz

Description

The /CONTRIBUTING.md is a license agreement / code of conduct to sign. As far as I can see, this very valuable project has no actual tutorial.

I don't have the Python and coding knowledge to fix this documentation issue myself, but I can map the road so it becomes easier for the next person to do so.

Wanted

Suppose a user wants to add a language such as Catalan from Barcelona (ca, cat: currently missing). What do they need to jump in quickly? What should they provide?

  • What is the local structure?
    • util.py: stores functions used by crawlers for multiple languages.
    • main.py: stores the 1000+ crawler calls and runs them all.
    • crawl_{iso}.py: stores a language-specific corpus's source URLs and processing functions.
  • What tools are available?
  • What input(s)? A Python list of URLs?
  • What are the classic parts of a crawler function?
  • What output format? Raw text? Is HTML fine because an HTML tag stripper is applied afterwards?
  • An example of easily hackable base code.
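On the last point, a hypothetical crawl_ca.py might look like the sketch below. The method names (get_output, fetch_content) come from the API list later in this issue, but their exact signatures are assumptions; the URLs are placeholders, and a stub Crawler is included so the sketch runs standalone.

```python
import re

# Hypothetical crawl_ca.py following the pattern described above:
# a list of language-specific source URLs plus a crawl() entry point.
# Signatures are inferred from the API list in this issue, not verified.

CATALAN_URLS = [
    'https://example.org/ca/article-1.html',   # placeholder sources
    'https://example.org/ca/article-2.html',
]

def crawl(crawler):
    out = crawler.get_output(language='ca')       # per-language output
    for url in CATALAN_URLS:
        html_doc = crawler.fetch_content(url)     # cached, rate-limited fetch
        text = re.sub(r'<[^>]+>', ' ', html_doc)  # stand-in for cleantext()
        out.write('# Location: %s\n%s\n' % (url, text.strip()))

# Minimal stub so the sketch runs without the real project:
class StubCrawler:
    def __init__(self):
        self.out = []
    def get_output(self, language=None):
        return self
    def fetch_content(self, url, allow_404=False):
        return '<p>Hola món</p>'
    def write(self, s):
        self.out.append(s)

stub = StubCrawler()
crawl(stub)
```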

API (to complete)

Functions defined in util.py, in order of appearance as of 2021/02/26. If you have relevant knowledge, please help fill in a sub-section or a single item.

Some tools

  • daterange(start, end): __
  • urlpath(url): __
  • urlencode(url): __
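As an example of the kind of documentation wanted here, a hedged guess at what daterange(start, end) likely does: yield each day from start to end, the usual helper for crawling date-based news archives. Whether the end date is inclusive is an assumption, not verified against the project's implementation.

```python
import datetime

# Hedged sketch of util.py's daterange(start, end); the inclusive end
# date is an assumption made for illustration only.
def daterange(start, end):
    day = start
    while day <= end:
        yield day
        day += datetime.timedelta(days=1)

days = list(daterange(datetime.date(2021, 2, 24), datetime.date(2021, 2, 26)))
```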

Main element

  • class Crawler(object):
    • __init__(self, language, output_dir, cache_dir, crawldelay): __
    • get_output(self, language=None): __
    • close(self): __
    • fetch(self, url, redirections=None, fetch_encoding='utf-8'): __
    • fetch_content(self, url, allow_404=False): __
    • fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True): __
    • is_fetch_allowed_by_robots_txt(self, url): __
    • crawl_pngscriptures_org(self, out, language): __
    • _find_urls_on_pngscriptures_org(self, language): __
    • crawl_abc_net_au(self, out, program_id): __
    • crawl_churchio(self, out, bible_id): __
    • crawl_aps_dz(self, out, prefix): __
    • crawl_sverigesradio(self, out, program_id): __
    • crawl_voice_of_america(self, out, host, ignore_ascii=False): __
    • set_context(self, context): __
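Read together, these methods suggest a lifecycle: construct a Crawler, check robots.txt, fetch, close. The sketch below stubs the class so it runs standalone; the constructor arguments and method semantics are assumptions inferred from the list above, not the project's real implementation.

```python
# Stub Crawler mimicking the method list above; the real class
# presumably does caching, rate limiting (crawldelay), and robots.txt
# parsing, none of which is reproduced here.
class Crawler:
    def __init__(self, language, output_dir, cache_dir, crawldelay):
        self.language = language
        self.crawldelay = crawldelay
        self.closed = False
    def is_fetch_allowed_by_robots_txt(self, url):
        return True  # stub: the real method consults the site's robots.txt
    def fetch_content(self, url, allow_404=False):
        return '<p>contingut</p>'  # stub: the real method fetches and caches
    def close(self):
        self.closed = True  # presumably flushes per-language output files

crawler = Crawler('ca', './corpus', './cache', crawldelay=15)
doc = None
try:
    url = 'https://example.org/ca/'
    if crawler.is_fetch_allowed_by_robots_txt(url):
        doc = crawler.fetch_content(url)
finally:
    crawler.close()
```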

Some crawlers for multi-language sites

  • crawl_bbc_news(crawler, out, urlprefix): __
  • crawl_korero_html(crawler, out, project, genre, filepath): __
  • write_paragraphs(et, out): __
  • crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False): __
  • crawl_radio_free_asia(crawler, out, edition, start_year=1998): __
  • crawl_sputnik_news(crawler, out, host): __
  • crawl_udhr(crawler, out, filename): __
  • crawl_voice_of_nigeria(crawler, out, urlprefix): __
  • crawl_bibleis(crawler, out, bible): __
  • crawl_tipitaka(crawler, out, script): __
  • find_wordpress_urls(crawler, site, **kwargs): __
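If a new language's sources are already covered by one of these shared crawlers, the per-language entry can presumably just delegate to it. A stubbed illustration: the UDHR filename for Catalan is an unverified guess, and the helper bodies below stand in for util.py so the sketch runs standalone.

```python
# Hypothetical wiring for Catalan reusing the shared crawl_udhr()
# listed above. The filename 'udhr_cat.txt' is an unverified guess.
def crawl_ca(crawler, out):
    crawl_udhr(crawler, out, filename='udhr_cat.txt')

# --- stubs standing in for the real util.py helpers ---
def crawl_udhr(crawler, out, filename):
    out.append(crawler.fetch_content('https://unicode.org/udhr/d/' + filename))

class StubCrawler:
    def fetch_content(self, url, allow_404=False):
        return 'Tots els éssers humans neixen lliures i iguals en dignitat i en drets.'

out = []
crawl_ca(StubCrawler(), out)
```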

Some cleaners

  • unichar(i): __
  • replace_html_entities(html): __
  • cleantext(html): __
  • clean_paragraphs(html): __
  • extract(before, after, html): __
  • fixquotes(s): __
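A simplified illustration of what cleaners such as replace_html_entities() and cleantext() presumably do: drop HTML tags, decode entities, collapse whitespace. The project's real functions are certainly more thorough; this is a hedged sketch, not their actual implementation.

```python
import html
import re

# Simplified stand-in for the cleaners listed above; not the project's code.
def simple_cleantext(markup):
    text = re.sub(r'<[^>]+>', ' ', markup)     # drop HTML tags ("balises")
    text = html.unescape(text)                 # &amp; -> &, &nbsp; -> no-break space
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace

print(simple_cleantext('<p>Bon&nbsp;dia, &amp; benvinguts!</p>'))
# -> Bon dia, & benvinguts!
```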

A shorter way to do so

In-code comments can do a lot, and so can pointing to wisely chosen sections. If you have the required know-how, please add comments to a chosen, existing crawler and point to it as an in-code tutorial.

@sffc, @brawer: could anyone help with that?
