# [GitHub Graph](http://githubgraph.com/)
Work in Progress
- [Introduction](README.md#1-introduction)
- [AWS Clusters](README.md#2-aws-clusters)
- [Data Pipeline](README.md#3-data-pipeline)
- [API](README.md#4-api)
- [Front End](README.md#5-front-end)
- [Presentation](README.md#6-presentation)
## 1. Introduction
GitHub Graph is a tool that helps developers stay up to date with the GitHub trends they are interested in. It uses [GitHub Archive](https://www.githubarchive.org/) and the [GitHub API](https://developer.github.com/v3/users/) as data sources to serve the application. Essentially, it is a big data pipeline focused on answering: "For the users I follow, which repositories do those users follow and contribute to?"
![Example graph query](flask/static/img/graph.png)
[GitHub Graph](http://githubgraph.com/) also shows the top 10 weekly trending repositories based on the number of stars.
![Weekly trending repos](flask/static/img/trending.png)
- [GitHub Archive](https://www.githubarchive.org/): [Ilya Grigorik](https://www.igvita.com/) started the GitHub Archive project to record the public GitHub timeline, archive it, and make it easily accessible for further analysis. It has a very simple API for collecting data at hourly granularity. GitHub Graph pulled 850+ GB of data from this source, spanning December 2011 to the present. These are clean, nested JSON responses with one event per line; example events are 'WatchEvent', 'ForkEvent', and 'CommitCommentEvent'.
![Public timeline](flask/static/img/publictimeline.png)
For example, to collect activity for all of January 2015 from [GitHub Archive](https://www.githubarchive.org/):
```sh
$ wget http://data.githubarchive.org/2015-01-{01..31}-{0..23}.json.gz
```
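Each downloaded file holds one JSON event per line, so an hour of data can be inspected with nothing more than the standard library (a minimal sketch; the file name is the first hour fetched above):
```python
# Minimal sketch: read one hour of archive data, one JSON event per line.
import gzip
import json

with gzip.open("2015-01-01-0.json.gz", "rt") as f:
    for line in f:
        event = json.loads(line)
        print(event["type"])  # e.g. WatchEvent, ForkEvent, CommitCommentEvent
```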
- [GitHub API](https://developer.github.com/v3/users/): GitHub Graph has 12M+ GitHub usernames with their IDs, collected from the API's https://api.github.com/users endpoint. Using these usernames, it also collects who each user follows via the https://api.github.com/users/username/following endpoint; there are over 4M such records in the database at present. Moreover, if a queried username is not in the database, records for that username are fetched from the API in real time, stored in the database, and used to generate results.
GitHub's API has a rate limit of 5,000 calls/hour per token, and around 25 access tokens were used to collect data. Thanks to my fellow Fellows at Insight and friends in India.
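A minimal sketch of such a collection loop is below. The TOKENS list is a hypothetical placeholder, not the project's actual producer code; the pagination follows GitHub's documented `since` parameter on the /users endpoint.
```python
# Sketch: walk GitHub's paginated /users endpoint, rotating access tokens
# to spread requests across the 5000 calls/hour/token rate limit.
import itertools
import requests

TOKENS = ["token-1", "token-2"]  # hypothetical OAuth access tokens
token_cycle = itertools.cycle(TOKENS)

def fetch_users(since_id=0):
    """Yield (id, login) pairs from https://api.github.com/users."""
    while True:
        headers = {"Authorization": "token {}".format(next(token_cycle))}
        resp = requests.get("https://api.github.com/users",
                            params={"since": since_id}, headers=headers)
        resp.raise_for_status()
        users = resp.json()
        if not users:  # empty page means the whole user list has been walked
            return
        for user in users:
            yield user["id"], user["login"]
        since_id = users[-1]["id"]  # /users pages by the last user id seen
```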
## 2. AWS Clusters
[GitHub Graph](http://githubgraph.com/) is powered by three clusters on AWS:
- 4 m3.large producers collecting data from GitHub's API and sending messages to [Kafka](http://kafka.apache.org/)
- 6 m3.xlarge nodes for the [Spark](https://spark.apache.org/) cluster
- 5 m3.large nodes for [HDFS](http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), [Kafka](http://kafka.apache.org/), Cassandra, and [Zookeeper](https://zookeeper.apache.org/)
![Clusters](/flask/static/img/clusters.png)
## 3. Data Pipeline
![Pipeline](/flask/static/img/pipeline.png)
The data from GitHub Archive is stored on HDFS with 4 data nodes and 1 name node.
The producers collect data from GitHub's API and send messages to Kafka. These messages are consumed using [Camus](https://github.com/linkedin/camus), a tool built by [LinkedIn](https://www.linkedin.com/) that is essentially a distributed consumer running a MapReduce job underneath to pull messages from [Kafka](http://kafka.apache.org/) and save them to HDFS.
Camus is great for the Kafka-to-HDFS pipeline: it keeps track of the last offset consumed for each topic, and it lets you whitelist and blacklist topics so that only a subset of topics is consumed. It also compresses the data before saving it to [HDFS](http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html), which reduces the space used by an order of magnitude. Camus is easy to set up and worth the time spent; one important caveat is that the Camus jar and log4j.xml must be on Hadoop's classpath for Camus to run.
Spark SQL is used for all the batch processing. For the data from [GitHub Archive](https://www.githubarchive.org/), events like 'WatchEvent', 'ForkEvent', and 'CommitCommentEvent' are filtered in, since each indicates that a user has either contributed to a repository or is following one.
The data from GitHub Archive has an inconsistent schema across years. The batch process filters it down to a two-column schema: a user and the list of all repositories that they have contributed to or follow. Below is an example of the 2015 and 2012 schemas and the resulting two-column schema, followed by a sketch of the batch step.
![Inconsistent schema](/flask/static/img/inconsistent%20schema.png)
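A rough PySpark sketch of that batch step, written against the 2015-era field layout (`actor.login`, `repo.name`); older years need different field paths, which is exactly the inconsistency shown above. Paths and names here are illustrative, not the project's actual job:
```python
# Illustrative batch job: keep follow/contribute events and collapse them
# into the two-column (user, repos) schema described above.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, collect_set

spark = SparkSession.builder.appName("github-graph-batch").getOrCreate()

events = spark.read.json("hdfs:///github-archive/2015-*.json.gz")  # assumed path
follows = events.filter(
    col("type").isin("WatchEvent", "ForkEvent", "CommitCommentEvent"))

user_repos = (follows
              .select(col("actor.login").alias("user"),
                      col("repo.name").alias("repo"))
              .distinct()
              .groupBy("user")
              .agg(collect_set("repo").alias("repos")))
```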
The two-column schema from the batch process is used to create a table in Cassandra.
![Userrepo table](/flask/static/img/userrepo.png)
GitHub's API returns many fields in its JSON responses for usernames and IDs that are not important for the application. The batch process extracts just the login names and IDs from the raw responses, reducing 12GB of raw data for 12M+ users to 450MB, which is compressed further to 122MB when consumed by Camus.
![Users table](/flask/static/img/users.png)
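A sketch of that reduction in the same PySpark style (the HDFS paths are assumptions):
```python
# Illustrative: keep only the id and login fields from the raw /users responses.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("github-users-reduce").getOrCreate()

raw = spark.read.json("hdfs:///github-api/users/*")                 # assumed input
raw.select("id", "login").write.json("hdfs:///github-graph/users")  # assumed output
```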
These usernames are used to pull records of who each user is following, creating the Cassandra table below.
![Userfollowing table](/flask/static/img/userfollowing.png)
Weekly trends are learned from 'WatchEvent's in GitHub Archive, producing a table of repo names and their corresponding watch counts; a sketch follows.
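Computing the counts could look like this (a sketch over an assumed week of archive files, not the project's actual job):
```python
# Illustrative: top 10 repos by WatchEvent count over one week of archive data.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, desc

spark = SparkSession.builder.appName("github-weekly-trends").getOrCreate()

week = spark.read.json("hdfs:///github-archive/2015-06-0[1-7]-*.json.gz")
top10 = (week.filter(col("type") == "WatchEvent")
             .groupBy(col("repo.name").alias("reponame"))
             .agg(count("*").alias("watch_count"))
             .orderBy(desc("watch_count"))
             .limit(10))
```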
Cassandra is used to save the batch results and serve the front end. Three main tables serve the application (a schema sketch follows the list):
- Userrepo: key is the username; value is the list of repos the user follows or has contributed to.
- Userfollow: key is the username; value is the list of usernames the user follows.
- Weeklytrends: key is the repo name; value is the watch count for the past week.
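A hypothetical CQL rendering of these tables via the DataStax Python driver; the real column names and replication settings may differ:
```python
# Hypothetical schema for the three tables described above.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()  # assumed Cassandra seed node
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS githubgraph
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}""")
session.set_keyspace("githubgraph")

session.execute("""
    CREATE TABLE IF NOT EXISTS userrepo (
        username text PRIMARY KEY,
        repos list<text>)""")
session.execute("""
    CREATE TABLE IF NOT EXISTS userfollow (
        username text PRIMARY KEY,
        following list<text>)""")
session.execute("""
    CREATE TABLE IF NOT EXISTS weeklytrends (
        reponame text PRIMARY KEY,
        watch_count int)""")
```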
## 4. API
The API has the following endpoints:
- Trending: `http://52.8.127.252:5000/api/trending`
- Graph: `http://52.8.127.252:5000/api/graph/username/year`
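A minimal Flask sketch of these two endpoints; the handler bodies are placeholders for the Cassandra lookups described above, not the deployed code:
```python
# Skeleton of the two API endpoints; real handlers would read from the
# userrepo, userfollow, and weeklytrends tables.
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/trending")
def trending():
    # Placeholder: would return the top 10 rows from weeklytrends.
    return jsonify(trending=[])

@app.route("/api/graph/<username>/<int:year>")
def graph(username, year):
    # Placeholder: would walk userfollow -> userrepo for this user and year.
    return jsonify(user=username, year=year, graph=[])

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```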
## 5. Front End
Flask is used for the web app and D3 to visualize the results.
## 6. Presentation
The presentation for [GitHub Graph](http://githubgraph.com/) can be found here.
I encourage you to play with GitHub Graph and explore the trends you are interested in.