This is a summer scholar project by Dylan Kumar at Victoria University of Wellington, New Zealand. Supervisors: Jean-Grégoire Bernard & Markus Luczak-Roesch.
Description of Project: The objective of this summer scholar project is to assemble a comprehensive dataset on the emergence and spread of blockchain technology since 2008. Blockchain technology is the distributed ledger system that underlies Bitcoin, a decentralized virtual currency that has gone from a concept in an anonymous 2008 white paper to a market capitalization of roughly US$46 billion today. Despite its dodgy origins on the dark web, blockchain technology has spawned an impressive ecosystem of investments from start-ups, well-regarded corporations, and governments. The dataset, assembled from the digital footprints of blockchain technology communities, investment databases, media archives, the web, and social media (Twitter, Wikipedia), will provide the empirical basis for a research programme on the social acceptance and diffusion of blockchain technology as an innovation.
Using this dataset, a programme of studies will be conducted to answer the broader research question that motivates this project: how do socially contested innovations gain legitimacy and spread over time in institutional fields? We will target top-ranked journals in information systems, entrepreneurship, or the organizational sciences, depending on each study's objectives and theoretical relevance. One study is planned to identify the frames employed by field actors to discuss blockchain technology over time, and to track the critical junctures related to changes in the frames' content and prominence. In another study, we will identify the legitimation tactics employed by blockchain entrepreneurs and examine how successful these tactics are in attracting visibility and securing resources. Finally, subsets of the dataset could eventually be used in the classroom in big-data-related INFO, MIM, & MBusAn courses to demonstrate computational analysis techniques.
Note: the GitHub scraper using Scrapy and Selenium is deprecated; use the GitHub_API scraper instead (in the github folder). The Factiva scraper does not work fully; "fact2" is the name of its main script.
The forum scraper for bitcointalk.org is in bitcoin_talk_forum_scraper, whilst all other forum scrapers are in forum_scraper.
The individual scripts live in the spiders folder, e.g. blockchains_web_scraping/forum_scraper/forum_scraper/spiders/ contains the scripts for the different forums.
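For orientation, the layout is roughly as follows (only the folders mentioned above are shown; this sketch assumes bitcoin_talk_forum_scraper sits alongside forum_scraper and that the project follows the standard Scrapy layout):

    blockchains_web_scraping/
        bitcoin_talk_forum_scraper/        <- scraper for bitcointalk.org
        forum_scraper/
            scrapy.cfg
            forum_scraper/
                spiders/                   <- one script per forum (e.g. the "eth" spider)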
Please note that the minimum value that data goes to on the time series graphs (found in the time series folder) is 1, not 0. I'm not sure why this is; if anyone knows, feel free to message me.
Scripts are meant to be used together: e.g. the time_series_twitter graph script reads CSV files in the same format as the CSV files you get after running the twitterAPI scraper.
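As a rough illustration of how the two fit together, here is a minimal sketch that reads a scraped CSV and plots a daily count; the file name and the created_at column are assumptions, so check them against the actual output of the twitterAPI scraper:

    # Minimal sketch: plot tweets per day from a scraped CSV.
    # 'tweets.csv' and the 'created_at' column are assumed names.
    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.read_csv("tweets.csv", parse_dates=["created_at"])
    daily = df.set_index("created_at").resample("D").size()

    daily.plot()
    plt.xlabel("Date")
    plt.ylabel("Tweets per day")
    plt.show()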
To run a Python script, type python nameofscript.py. If you want a process to keep running on Linux in the background after the terminal is closed, use nohup, e.g. nohup python nameofscript.py & disown (see https://unix.stackexchange.com/questions/3886/difference-between-nohup-disown-and).
You can make Scrapy scrape the page/forum you want by calling the name specified in the script; in Scrapy, the 'name' variable holds the name you call when running. e.g. run the command (without angle brackets) <scrapy crawl "eth"> to crawl https://forum.ethereum.org/. This is because the name variable in that spider's file is set to "eth".
Note in the script we have: <name = "eth">
Make sure you move (cd) into the directory with the scrapy.cfg file, or else Scrapy won't be able to find the spider.
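To make the mapping concrete, here is a minimal sketch of what the top of such a spider might look like; only name = "eth" is taken from the actual script, while the class name and parse logic are illustrative:

    # Illustrative Scrapy spider header -- 'scrapy crawl eth' looks up
    # the spider whose 'name' attribute is "eth".
    import scrapy

    class EthForumSpider(scrapy.Spider):
        name = "eth"  # matches the command: scrapy crawl eth
        start_urls = ["https://forum.ethereum.org/"]

        def parse(self, response):
            # The real parsing logic lives in the repo's spider scripts.
            yield {"title": response.css("title::text").get()}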
I recommend using Anaconda to set up a virtual environment and calling the commands from there (useful as it also provides a bash terminal for Windows users). To install a package in this virtual environment, search for the package on the Anaconda site and run the command shown,
e.g. for requests: https://anaconda.org/anaconda/requests
run the command: <conda install -c anaconda requests> in your terminal to install it.
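Putting it together, a typical first-time setup might look like this (the environment name and Python version are arbitrary choices, not requirements of the repo):

    conda create -n blockchain_scraping python=3.7   # create the environment
    conda activate blockchain_scraping               # switch into it
    conda install -c anaconda requests               # install packages as above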
Use GitHub_API_Scraper_Original.py if you want to scrape information about one repo (and its forks). GitHub_API_Scraper_Build 1.1.py is an extension of the original, with added methods that scrape either every repo from a user or every repo on a specific topic. Read the comments in the files for how to edit them.
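As a rough idea of the kind of call these scripts make, here is a minimal, hypothetical sketch that lists one repo's forks via the public GitHub REST API; it is not the repo's actual code, and the owner/repo values are placeholders:

    # Hypothetical sketch: list forks of a repo via the GitHub REST API.
    # OWNER/REPO are placeholders; unauthenticated requests are rate-limited.
    import requests

    OWNER, REPO = "bitcoin", "bitcoin"  # placeholder repository
    url = f"https://api.github.com/repos/{OWNER}/{REPO}/forks"

    resp = requests.get(url, params={"per_page": 100})
    resp.raise_for_status()

    for fork in resp.json():
        print(fork["full_name"], fork["created_at"])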