Data and code for "Fast Data Applications with Spark and Python"
This repository contains materials for the District Data Labs course "Fast Data Applications with Spark and Python". Note that this repository is updated regularly, so please check back for changes!
Data can be downloaded from Dropbox at the following links:
- Download for workshop
- Download when needed
In this section we will (eventually) attribute all data sources and materials used in this course.
- The DBLP data set is a partial snippet of authorship records from the DBLP Computer Science Bibliography.
- Lahman591 is a dataset of baseball statistics that we downloaded from a Hortonworks Hive example.
- The Ontime Flights dataset is wrangled heavily from data published by the US DOT Bureau of Transportation Statistics.
- War and Peace is from Project Gutenberg.
- The Wine Quality data set is from the UCI Machine Learning Repository.
- The World Bank data set is from .
- The Olympics data set is generated from DBpedia; we obtained it from Amol Deshpande at the University of Maryland.
- The NBA Players dataset was generated by Tony Ojeda from statistics on the web.
- Publications, Twitter Meta, and Web Clicks are all from CNets Web 2014 Data Science Challenge.
The image used at the top of this README, "Data Center" by Andrew Tseng, is licensed under CC BY-NC-SA 2.0.
