some quick hacks using the Common Crawl dataset:
links in metadata: an example of using Hadoop streaming with a Python script to extract links from the metadata set
finding names: a quick overview of the textdata set and a simple NLTK app for extracting noun phrases (again using Python streaming)
url status codes: how to run Java MapReduce over the metadata set to extract URLs and the status codes the crawler received when crawling them
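
to give a feel for the streaming approach used for link extraction, here is a minimal sketch of a mapper, not the actual script from this repo. it assumes each input line is a url, a tab, then a JSON metadata record, and that links sit under content -> links as objects with an href field. that layout and the names extract_links and map_lines are assumptions made for illustration.

```python
import json
import sys


def extract_links(metadata_json):
    """Yield href values from one metadata record.

    Assumed layout (hypothetical): {"content": {"links": [{"href": ...}]}}
    """
    try:
        record = json.loads(metadata_json)
    except ValueError:
        return  # skip records that are not valid JSON
    if not isinstance(record, dict):
        return
    content = record.get("content")
    if not isinstance(content, dict):
        return
    links = content.get("links")
    if not isinstance(links, list):
        return
    for link in links:
        if isinstance(link, dict) and link.get("href"):
            yield link["href"]


def map_lines(lines):
    """Turn 'url<TAB>json' lines into 'url<TAB>href' output lines."""
    for line in lines:
        url, sep, metadata_json = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab, so not a key/value line
        for href in extract_links(metadata_json):
            yield "%s\t%s" % (url, href)


if __name__ == "__main__":
    # Hadoop streaming feeds records on stdin and collects stdout.
    for out in map_lines(sys.stdin):
        print(out)
```

malformed records are skipped rather than raising, since one bad line should not fail a whole streaming task.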