Fully processed StumbleUpon data extracted from the Wayback Machine, for an article.
- `/data-parsed/parsed-cleaned.csv`: Final deduplicated extracted data.
- `/data-parsed/parsed.csv`: Data before deduplication.
- `/data-raw/`: Output of `waybackpack`, organised by timestamp and URL.
- `/samples/`: Examples of the downloaded HTML, an individual StumbleUpon link, and the resulting CSV data.
- `/url-analysis/`: The raw URLs from `parsed-cleaned.csv`, plus their status codes using `vl`.
- `clean_stumbleupon_metadata.py`: Tool to deduplicate a CSV by `id` field (converts `parsed.csv` into `parsed-cleaned.csv`).
- `extract_stumbleupon_metadata.py`: Tool to extract the contents of downloaded StumbleUpon pages (converts `data-raw` contents into `parsed.csv`).
- `analyse_stumbleupon_metadata.py`: Miscellaneous code to analyse the parsed data. This changes as required; full scripts are available in the original article.
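The extraction step can be sketched roughly as below. The actual selectors and output fields used by `extract_stumbleupon_metadata.py` are not shown in this README, so the `trackSU` class name, the HTML fragment, and the field names here are assumptions for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a downloaded StumbleUpon snapshot; the real
# markup and class names may differ.
html = """
<ul>
  <li><a class="trackSU" href="http://example.com/1">First page</a></li>
  <li><a class="trackSU" href="http://example.com/2">Second page</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect one row per link: the target URL and its visible title.
rows = [
    {"url": a["href"], "title": a.get_text(strip=True)}
    for a in soup.select("a.trackSU")
]
print(rows[0]["url"])
```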
To recreate the final output (`parsed-cleaned.csv`):

- Install Python dependencies: `pip install beautifulsoup4 lxml pandas`
- Run the Wayback Machine download script: `waybackpack http://www.stumbleupon.com/discover/toprated/ -d "/Projects/StumbleUpon-extract/data-raw"`
- Run the parsing script: `python extract_stumbleupon_metadata.py`
- Run the deduplication script: `python clean_stumbleupon_metadata.py`
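The final deduplication step amounts to dropping rows that share an `id`. A minimal sketch with pandas, assuming the column is literally named `id` and keeping the first occurrence (the toy rows below are illustrative, not real data):

```python
import pandas as pd

# Toy stand-in for parsed.csv: two rows share id=1.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "url": ["http://example.com/a", "http://example.com/a", "http://example.com/b"],
})

# Drop duplicate ids, keeping the first row seen for each.
deduped = df.drop_duplicates(subset="id", keep="first")
print(len(deduped))  # 2
```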