Improve readme documentation on how to provide a new crawler #80

@hugolpz

Description

The /CONTRIBUTING.md is a license agreement / code of conduct to sign. As far as I can see, this very valuable project has no actual tutorial.

I don't have the Python and coding knowledge to fix this documentation issue myself, but I can map the road so it becomes easier for the next person to do so.

Wanted

Suppose a user wants to add a language such as Catalan from Barcelona (ca, cat: currently missing). What do they need to jump in quickly? What should they provide?

  • What is the local structure?
    • util.py: stores functions used by crawlers for multiple languages.
    • main.py: stores the 1000+ crawler calls and runs them all.
    • crawl_{iso}.py: stores a language-specific corpus's source URLs and processing functions.
  • What tools are available?
  • What input(s)? A Python list of URLs?
  • What are the classic parts of a crawler function?
  • What output format? Raw text? Is HTML fine because an HTML tag stripper is applied afterwards?
  • An example of easily hackable base code.
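On the last point, a hypothetical crawl_ca.py might look like the sketch below. The method names (get_output, fetch_content) come from the API list later in this issue, but their exact signatures are assumptions; the URLs are placeholders, and a stub Crawler is included so the sketch runs standalone.

```python
import re

# Hypothetical crawl_ca.py following the pattern described above:
# a list of language-specific source URLs plus a crawl() entry point.
# Signatures are inferred from the API list in this issue, not verified.

CATALAN_URLS = [
    'https://example.org/ca/article-1.html',   # placeholder sources
    'https://example.org/ca/article-2.html',
]

def crawl(crawler):
    out = crawler.get_output(language='ca')       # per-language output
    for url in CATALAN_URLS:
        html_doc = crawler.fetch_content(url)     # cached, rate-limited fetch
        text = re.sub(r'<[^>]+>', ' ', html_doc)  # stand-in for cleantext()
        out.write('# Location: %s\n%s\n' % (url, text.strip()))

# Minimal stub so the sketch runs without the real project:
class StubCrawler:
    def __init__(self):
        self.out = []
    def get_output(self, language=None):
        return self
    def fetch_content(self, url, allow_404=False):
        return '<p>Hola món</p>'
    def write(self, s):
        self.out.append(s)

stub = StubCrawler()
crawl(stub)
```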

API (to complete)

Functions defined in util.py, in order of appearance as of 2021/02/26. If you have relevant knowledge, please help fill in a sub-section or a single item.

Some tools

  • daterange(start, end): __
  • urlpath(url): __
  • urlencode(url): __
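As an example of the kind of documentation wanted here, a hedged guess at what daterange(start, end) likely does: yield each day from start to end, the usual helper for crawling date-based news archives. Whether the end date is inclusive is an assumption, not verified against the project's implementation.

```python
import datetime

# Hedged sketch of util.py's daterange(start, end); the inclusive end
# date is an assumption made for illustration only.
def daterange(start, end):
    day = start
    while day <= end:
        yield day
        day += datetime.timedelta(days=1)

days = list(daterange(datetime.date(2021, 2, 24), datetime.date(2021, 2, 26)))
```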

Main element

  • class Crawler(object):
    • __init__(self, language, output_dir, cache_dir, crawldelay): __
    • get_output(self, language=None): __
    • close(self): __
    • fetch(self, url, redirections=None, fetch_encoding='utf-8'): __
    • fetch_content(self, url, allow_404=False): __
    • fetch_sitemap(self, url, processed=set(), subsitemap_filter=lambda x: True): __
    • is_fetch_allowed_by_robots_txt(self, url): __
    • crawl_pngscriptures_org(self, out, language): __
    • _find_urls_on_pngscriptures_org(self, language): __
    • crawl_abc_net_au(self, out, program_id): __
    • crawl_churchio(self, out, bible_id): __
    • crawl_aps_dz(self, out, prefix): __
    • crawl_sverigesradio(self, out, program_id): __
    • crawl_voice_of_america(self, out, host, ignore_ascii=False): __
    • set_context(self, context): __
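Read together, these methods suggest a lifecycle: construct a Crawler, check robots.txt, fetch, close. The sketch below stubs the class so it runs standalone; the constructor arguments and method semantics are assumptions inferred from the list above, not the project's real implementation.

```python
# Stub Crawler mimicking the method list above; the real class
# presumably does caching, rate limiting (crawldelay), and robots.txt
# parsing, none of which is reproduced here.
class Crawler:
    def __init__(self, language, output_dir, cache_dir, crawldelay):
        self.language = language
        self.crawldelay = crawldelay
        self.closed = False
    def is_fetch_allowed_by_robots_txt(self, url):
        return True  # stub: the real method consults the site's robots.txt
    def fetch_content(self, url, allow_404=False):
        return '<p>contingut</p>'  # stub: the real method fetches and caches
    def close(self):
        self.closed = True  # presumably flushes per-language output files

crawler = Crawler('ca', './corpus', './cache', crawldelay=15)
doc = None
try:
    url = 'https://example.org/ca/'
    if crawler.is_fetch_allowed_by_robots_txt(url):
        doc = crawler.fetch_content(url)
finally:
    crawler.close()
```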

Some crawlers for multi-language sites

  • crawl_bbc_news(crawler, out, urlprefix): __
  • crawl_korero_html(crawler, out, project, genre, filepath): __
  • write_paragraphs(et, out): __
  • crawl_deutsche_welle(crawler, out, prefix, need_percent_in_url=False): __
  • crawl_radio_free_asia(crawler, out, edition, start_year=1998): __
  • crawl_sputnik_news(crawler, out, host): __
  • crawl_udhr(crawler, out, filename): __
  • crawl_voice_of_nigeria(crawler, out, urlprefix): __
  • crawl_bibleis(crawler, out, bible): __
  • crawl_tipitaka(crawler, out, script): __
  • find_wordpress_urls(crawler, site, **kwargs): __
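If a new language's sources are already covered by one of these shared crawlers, the per-language entry can presumably just delegate to it. A stubbed illustration: the UDHR filename for Catalan is an unverified guess, and the helper bodies below stand in for util.py so the sketch runs standalone.

```python
# Hypothetical wiring for Catalan reusing the shared crawl_udhr()
# listed above. The filename 'udhr_cat.txt' is an unverified guess.
def crawl_ca(crawler, out):
    crawl_udhr(crawler, out, filename='udhr_cat.txt')

# --- stubs standing in for the real util.py helpers ---
def crawl_udhr(crawler, out, filename):
    out.append(crawler.fetch_content('https://unicode.org/udhr/d/' + filename))

class StubCrawler:
    def fetch_content(self, url, allow_404=False):
        return 'Tots els éssers humans neixen lliures i iguals en dignitat i en drets.'

out = []
crawl_ca(StubCrawler(), out)
```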

Some cleaners

  • unichar(i): __
  • replace_html_entities(html): __
  • cleantext(html): __
  • clean_paragraphs(html): __
  • extract(before, after, html): __
  • fixquotes(s): __
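A simplified illustration of what cleaners such as replace_html_entities() and cleantext() presumably do: drop HTML tags, decode entities, collapse whitespace. The project's real functions are certainly more thorough; this is a hedged sketch, not their actual implementation.

```python
import html
import re

# Simplified stand-in for the cleaners listed above; not the project's code.
def simple_cleantext(markup):
    text = re.sub(r'<[^>]+>', ' ', markup)     # drop HTML tags ("balises")
    text = html.unescape(text)                 # &amp; -> &, &nbsp; -> no-break space
    return re.sub(r'\s+', ' ', text).strip()   # collapse runs of whitespace

print(simple_cleantext('<p>Bon&nbsp;dia, &amp; benvinguts!</p>'))
# -> Bon dia, & benvinguts!
```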

A shorter way to do so

In-code comments can do a lot, and so can pointing to wisely chosen sections. If you have the required know-how, please add comments to a chosen, existing crawler and point to it as an in-code tutorial.

@sffc, @brawer: could anyone help with that?
