Just a little project playing with Common Crawl.
The objective of this project is to find all hyperlinks in a Common Crawl archive and record the origin/destination pairs in a database.
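As a rough illustration (not the project's actual code), here is what extracting (origin, destination) pairs from a single page could look like. It assumes BeautifulSoup is used for HTML parsing, and `extract_links` is a hypothetical helper name, not a function from this repository.

```python
# Minimal sketch: given a page's URL and its HTML body, yield
# (link_origin, link_destination) pairs for every hyperlink found.
from urllib.parse import urljoin

from bs4 import BeautifulSoup  # assumed parser; the real scripts may differ


def extract_links(page_url, html):
    """Yield (origin, destination) pairs for each <a href> in the page."""
    soup = BeautifulSoup(html, "html.parser")
    for anchor in soup.find_all("a", href=True):
        # Resolve relative hrefs against the page URL so destinations are absolute.
        yield page_url, urljoin(page_url, anchor["href"])


if __name__ == "__main__":
    sample = '<a href="/about">About</a> <a href="https://example.org">Ext</a>'
    for origin, destination in extract_links("https://example.com/index.html", sample):
        print(origin, "->", destination)
```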
In the `src` directory of this project, you will find:
- `common_crawl_downloader.py`: a script that downloads a Common Crawl partition into the `common-crawl-project/raw/` directory. `awscli` must be set up for this script to work.
- `create_db.py`: a script that creates a SQLite database in the `common-crawl-project/db/` directory, with a `links` table containing `link_origin`, `link_destination`, `source_archive`, and `is_valid` columns (see the schema sketch after this list).
- `count_domains.py`: reads all partitions in `common-crawl-project/raw`, parses 10,000 web pages, and writes their links to the `links` SQLite table. This script is single-threaded (see the sketch after this list).
- `count_domains_distributed.py`: the distributed version of `count_domains.py` that runs on Spark and processes all found partitions entirely (instead of only 10,000 pages), as sketched at the end of this section.
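For reference, a minimal sketch of the `links` table described above, using Python's built-in `sqlite3` module. The column types and the database filename are assumptions, not necessarily what `create_db.py` produces.

```python
# Sketch of the schema described for create_db.py; types and the exact
# database filename are assumptions.
import os
import sqlite3

DB_DIR = "common-crawl-project/db"
DB_PATH = os.path.join(DB_DIR, "links.db")  # assumed filename

os.makedirs(DB_DIR, exist_ok=True)
conn = sqlite3.connect(DB_PATH)
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS links (
        link_origin      TEXT,
        link_destination TEXT,
        source_archive   TEXT,
        is_valid         INTEGER
    )
    """
)
conn.commit()
conn.close()
```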
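The single-threaded pass could look roughly like the sketch below. It assumes the raw partitions are gzipped WARC files and that `warcio` and BeautifulSoup are used; those choices, the file paths, and the `is_valid` handling are assumptions rather than the actual `count_domains.py`.

```python
# Sketch of a single-threaded pass over downloaded WARC partitions, capped at
# 10,000 pages as described above. Not the repository's actual code.
import glob
import sqlite3
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from warcio.archiveiterator import ArchiveIterator

PAGE_LIMIT = 10_000
conn = sqlite3.connect("common-crawl-project/db/links.db")  # assumed path

pages_seen = 0
for archive_path in glob.glob("common-crawl-project/raw/*.warc.gz"):
    with open(archive_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            origin = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            rows = [
                # is_valid is set to 1 here as a placeholder; the real script
                # presumably applies its own validity check.
                (origin, urljoin(origin, a["href"]), archive_path, 1)
                for a in soup.find_all("a", href=True)
            ]
            conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?)", rows)
            pages_seen += 1
            if pages_seen >= PAGE_LIMIT:
                break
    conn.commit()
    if pages_seen >= PAGE_LIMIT:
        break
conn.close()
```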
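And a rough sketch of the Spark variant: parallelize the list of partitions and flatMap the same extraction over all of them. The libraries (pyspark, warcio, BeautifulSoup), the paths, and the Parquet output (SQLite is an awkward sink for Spark, so loading the result into the `links` table would be a separate step) are all assumptions, not the repository's actual approach. It also assumes the workers can read the archive paths, e.g. via shared storage.

```python
# Sketch of the distributed variant: parallelize the list of WARC partitions
# and extract links with a flatMap. Not the repository's actual code.
import glob
from urllib.parse import urljoin

from pyspark.sql import SparkSession


def links_in_archive(archive_path):
    """Yield (origin, destination, archive, is_valid) rows for one WARC file."""
    from bs4 import BeautifulSoup                  # imported on the worker
    from warcio.archiveiterator import ArchiveIterator

    with open(archive_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            origin = record.rec_headers.get_header("WARC-Target-URI")
            soup = BeautifulSoup(record.content_stream().read(), "html.parser")
            for a in soup.find_all("a", href=True):
                yield (origin, urljoin(origin, a["href"]), archive_path, 1)


spark = SparkSession.builder.appName("count_domains_distributed").getOrCreate()
archives = glob.glob("common-crawl-project/raw/*.warc.gz")

rows = spark.sparkContext.parallelize(archives).flatMap(links_in_archive)
df = rows.toDF(["link_origin", "link_destination", "source_archive", "is_valid"])

# Write the link pairs as Parquet; importing them into the SQLite links table
# would be done afterwards.
df.write.mode("overwrite").parquet("common-crawl-project/db/links_parquet")
spark.stop()
```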