generate-warcinfo-index

For each crawl, generate a Parquet file with the following fields:

warcinfo_id
warc_filename

The make all-warcinfo step runs one extractor per crawl. On the first run, the first crawl extraction to finish took 1h 35m and the last took 6h 56m.

A copy of the actual index can be found at rf:/home/cc-pds/warcinfo-id.parquet

How to query

See the test code in test_pandas.py and test_duck.py for query examples.
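As a rough illustration of what a lookup might look like with pandas, here is a minimal sketch. It is not taken from test_pandas.py. The helper name `lookup_warc_filename` is hypothetical, the rows are made-up stand-ins, and it assumes each warcinfo_id maps to a single warc_filename.

```python
import pandas as pd


def lookup_warc_filename(index: pd.DataFrame, warcinfo_id: str):
    """Return the warc_filename for a warcinfo_id, or None if it is absent."""
    matches = index.loc[index["warcinfo_id"] == warcinfo_id, "warc_filename"]
    return matches.iloc[0] if not matches.empty else None


# Made-up stand-in rows. With the real index you would load it instead, e.g.
# index = pd.read_parquet("warcinfo-id.parquet")
index = pd.DataFrame(
    {
        "warcinfo_id": ["<urn:uuid:00000000-0000-0000-0000-000000000001>"],
        "warc_filename": ["example-00000.warc.gz"],
    }
)

found = lookup_warc_filename(index, "<urn:uuid:00000000-0000-0000-0000-000000000001>")
missing = lookup_warc_filename(index, "<urn:uuid:does-not-exist>")
```

Whether the real index stores the ID with its angle brackets is an assumption here; check a few rows before relying on an exact string match.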

Updating the index

The code uses smart_open() to read only the initial part of every WARC file and extract the first record, which should be the warcinfo record.
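The idea of reading only the first record can be sketched with the standard library alone. This is not the code in make-warcinfo-index.py: the function name `parse_warcinfo_headers` is hypothetical, the sample record is synthetic with made-up values, and it assumes the index fields come from the WARC-Record-ID and WARC-Filename headers. In practice the `prefix` bytes would come from a partial read of the remote file.

```python
import gzip
import zlib


def parse_warcinfo_headers(prefix: bytes) -> dict:
    """Decompress the first gzip member in `prefix` and return its WARC headers."""
    # wbits=31 makes zlib expect a gzip wrapper. A decompressobj stops at the
    # end of the first member and tolerates truncated input after it.
    data = zlib.decompressobj(wbits=31).decompress(prefix)
    head, sep, _ = data.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("header block incomplete; read more bytes")
    lines = head.decode("utf-8").split("\r\n")
    if not lines[0].startswith("WARC/"):
        raise ValueError("not a WARC record")
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    if headers.get("WARC-Type") != "warcinfo":
        raise ValueError("first record is not a warcinfo record")
    return headers


# Synthetic warcinfo record with made-up values.
body = b"software: example\r\n"
header_lines = [
    b"WARC/1.0",
    b"WARC-Type: warcinfo",
    b"WARC-Record-ID: <urn:uuid:00000000-0000-0000-0000-000000000001>",
    b"WARC-Filename: example-00000.warc.gz",
    b"Content-Type: application/warc-fields",
    b"Content-Length: %d" % len(body),
]
record = b"\r\n".join(header_lines) + b"\r\n\r\n" + body + b"\r\n\r\n"

# First gzip member in full, followed by a truncated second member, which is
# roughly what a partial download of a per-record-gzipped WARC looks like.
prefix = gzip.compress(record) + gzip.compress(b"next record")[:10]

headers = parse_warcinfo_headers(prefix)
row = {
    "warcinfo_id": headers["WARC-Record-ID"],
    "warc_filename": headers["WARC-Filename"],
}
```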

The code does not re-download anything it has already fetched, and it runs in parallel, one extractor per crawl. Each extractor needs only about 3% of a core, but network latency can stretch a single crawl to as long as 7 hours. When many crawls run in parallel, the slowest one can take much longer than the fastest.

To update the index for all crawls, run:

make collinfo
make all-crawls
make all-warcinfo
make parquet
make test
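The two behaviours described above, skipping work that is already done and running one extractor per crawl in parallel, can be sketched as follows. This is an illustration only, not the mechanism the Makefile uses. The names `extract_crawl`, `extract_all`, and `fake_fetch` are hypothetical, and threads are chosen here because the work is network-bound rather than CPU-bound.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_crawl(crawl, paths, done, fetch_warcinfo):
    """Index one crawl, skipping any WARC path already present in `done`."""
    rows = []
    for path in paths:
        if path in done:
            continue  # indexed on an earlier run, so no download is needed
        rows.append((fetch_warcinfo(path), path))
    return crawl, rows


def extract_all(crawls, done, fetch_warcinfo):
    """Run one extractor per crawl concurrently and collect the new rows."""
    with ThreadPoolExecutor(max_workers=max(1, len(crawls))) as pool:
        futures = [
            pool.submit(extract_crawl, crawl, paths, done, fetch_warcinfo)
            for crawl, paths in crawls.items()
        ]
        return dict(future.result() for future in futures)


# Stand-in for the network fetch; records which paths were actually requested.
calls = []


def fake_fetch(path):
    calls.append(path)
    return f"id-for-{path}"


crawls = {
    "CRAWL-A": ["a/1.warc.gz", "a/2.warc.gz"],
    "CRAWL-B": ["b/1.warc.gz"],
}
done = {"a/1.warc.gz"}  # pretend this file was indexed on a previous run
new_rows = extract_all(crawls, done, fake_fetch)
```

Because each worker spends most of its time waiting on the network, total wall-clock time is set by the slowest crawl, which matches the timing spread reported above.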

To add a single new crawl, set the CRAWL variable in the Makefile to that crawl, then run:

make one-paths
make one-warcinfo
make parquet
make test

Install

If you are happy with the result, copy the index into place:

make install
