code402/warc-benchmark

warc-benchmark

This repository acts as a Hello World for working with WARC files.

Each subfolder contains an implementation that fetches a WARC file and searches every capture from a .com domain for matches of a regex that detects YouTube links.
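As a rough illustration of that task, here is a minimal, standard-library-only Python sketch. It is not the code from the subfolders. The regex `YOUTUBE_RE` and the helper names `iter_records`, `grep_com_captures` and `grep_gzipped_warc` are assumptions made up for this example. It also assumes the WARC file has already been downloaded into memory as gzip-compressed bytes, and it only understands the simple record layout (a `WARC/` version line, headers, a blank line, then `Content-Length` bytes of payload).

```python
import gzip
import io
import re
from urllib.parse import urlsplit

# Hypothetical pattern for illustration; the repository's actual regex may differ.
YOUTUBE_RE = re.compile(rb"https?://(?:www\.)?youtube\.com/watch\?v=[\w-]{11}")


def iter_records(stream):
    """Yield (headers, body) for each record in an uncompressed WARC stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # blank line ends the header block
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read exactly the payload so the next readline() lands on the separator.
        body = stream.read(int(headers.get("content-length", 0)))
        yield headers, body


def grep_com_captures(stream, pattern=YOUTUBE_RE):
    """Yield (url, link) pairs for response records captured from .com hosts."""
    for headers, body in iter_records(stream):
        if headers.get("warc-type") != "response":
            continue  # ignore request, metadata and warcinfo records
        url = headers.get("warc-target-uri", "")
        host = urlsplit(url).hostname or ""
        if not host.endswith(".com"):
            continue
        for link in pattern.findall(body):
            yield url, link.decode("ascii")


def grep_gzipped_warc(data):
    """Search gzip-compressed WARC bytes; multi-member gzip is read transparently."""
    with gzip.GzipFile(fileobj=io.BytesIO(data)) as stream:
        return list(grep_com_captures(stream))
```

Matching on raw bytes sidesteps character decoding altogether, in keeping with the simplifications the README mentions. A real implementation would more likely rely on a dedicated WARC library than on this hand-rolled parser.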

See also the blog post.

This is not bulletproof, production-ready code: I/O retries, resource cleanup, and robust character decoding are omitted to keep the focus on the WARC handling.

About

Sample code to grep Common Crawl WARC files in Go, Java, Node and Python.
