All generated output files are hosted on Dropbox.
A web interface for viewing the time series graphs generated for repositories can be found in my Ninja-viewer repository.
To produce a tool that queries githubarchive.org and the Github API, and generates longitudinal social network data and other time series for specified Github repositories. It will use a command line interface to execute queries.
- Ruby - 1.9+ - the programming language used to develop the tool
- BigQuery - BigQuery allows us to query all of the information stored by githubarchive.org. The bq command line tool is used for querying.
- Github API - The Github API allows us to retrieve the author of each commit, since extended commit data is not available on BigQuery/githubarchive.org.
- igraph - Uses the igraph gem, which requires the igraph C library. The newest version of the C library doesn't work; use 0.5.4 and install the gem with: `gem install igraph -- --with-igraph-include=/usr/local/include/igraph --with-igraph-lib=/usr/local/lib`
The output needs to be readable by the R packages igraph and RSiena. The most complete format the two have in common is GraphML, which the igraph Ruby gem can write. Snapshots are written one per file; for example, scanning the last 12 months of rubinius will output rubinius_rubinius_0 ... rubinius_rubinius_12. This is the format recommended for RSiena.
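For illustration, a minimal sketch of how the per-snapshot files could be written, assuming each snapshot has already been built as an IGraph instance (`snapshots` is a hypothetical array of them; `write_graph_graphml` is the igraph gem's GraphML writer):

```ruby
require 'igraph'

owner, repo = 'rubinius', 'rubinius'

# One GraphML file per snapshot: rubinius_rubinius_0, rubinius_rubinius_1, ...
snapshots.each_with_index do |graph, i|
  File.open("#{owner}_#{repo}_#{i}", 'w') do |f|
    graph.write_graph_graphml(f)
  end
end
```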
The aggregator queries three main contexts:
- Commits
- Pull Requests
- Issues
Edges are formed between developers using their interactions within these contexts. The following are the edges created:
- Commenting on a commit - Committer -- Commenter
- Commenting on a pull request - Pull Submitter -- Commenter
- Closing a pull request - Pull Submitter -- Closer
- Closing an issue - Issue Submitter -- Closer
- Commenting on an issue - Issue Submitter -- Commenter
We will utilize the following data sources:
- Github API
- Githubarchive.org data on BigQuery
BigQuery will be the primary data source, and most data will be pulled from there. The Github API will be used to retrieve information on commits, primarily the user who made each commit, since commit data is not available on githubarchive.org.
For simplicity, the initial draft will use an undirected graph and all edges will be considered the same, without differentiating based on event. If there are multiple connections between nodes, the weight of the edge will just increase for each connection.
The initial draft will also only have developers as nodes of the graph. If necessary, this can be changed for a later stage, allowing artifacts such as files, pull requests, issues, and commits to be considered nodes, at which point there will also be edges created for developers submitting any of said artifacts.
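As a rough sketch of this scheme (developer logins as nodes, one undirected weighted edge per pair; all names here are illustrative, not the tool's actual internals):

```ruby
# Accumulate undirected, weighted edges between developers. An edge between
# A and B is stored under a canonical (sorted) key so direction is ignored,
# and its weight increases by one for each interaction.
edge_weights = Hash.new(0)

def add_interaction(edge_weights, dev_a, dev_b)
  return if dev_a.nil? || dev_b.nil? || dev_a == dev_b  # skip self-loops (an assumption)
  edge_weights[[dev_a, dev_b].sort] += 1
end

add_interaction(edge_weights, 'alice', 'bob')  # e.g. alice commented on bob's commit
add_interaction(edge_weights, 'bob', 'alice')  # same undirected edge, weight is now 2
```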
For BigQuery, to save time and money on data processed, we will first pull the top 100 repositories (the number can vary). From there we pull only the columns we need, restricted to those repositories, and store the result as its own dataset in BigQuery. This dataset is only 140MB, which dramatically reduces costs, as storage is $0.12 per GB per month and queries cost $0.035 per GB processed. Updating the dataset processes ~16GB of data.
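A rough sketch of how that reduced dataset could be materialized with the bq tool, driven from Ruby. The githubarchive timeline table name, the destination table name, and `top_repos` are all assumptions/placeholders, and created_at is included on the assumption that timestamps are needed for the monthly snapshots:

```ruby
# Build a small working table containing only the columns we need, restricted
# to the selected repositories, so later queries process far less data.
# `top_repos` is a hypothetical array of "owner/name" strings.
repo_list = top_repos.map { |r| "'#{r}'" }.join(', ')

sql = <<-SQL
  SELECT actor, payload_action, type, payload_commit, payload_number,
         url, repository_name, repository_owner, created_at
  FROM [githubarchive:github.timeline]
  WHERE CONCAT(repository_owner, '/', repository_name) IN (#{repo_list})
SQL

# --destination_table stores the result in our own dataset instead of just
# returning it, so the full ~16GB scan is only paid once per update.
system('bq', 'query', '--destination_table=mydataset.top_repo_events', sql)
```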
When we query the BigQuery data, we want to limit our requests to specific event types, and we only need information from certain fields:
Field | Description | Used For |
---|---|---|
actor | The user involved in this event | Gives the name of a node |
payload_action | Specifies what action was performed during this event | Used to identify opened/closed on issues and pull requests |
type | What the github event was | Differentiates types of events so we can handle them differently |
payload_commit | SHA of the commit for this comment | Used with the Github API to retrieve the commit for this comment
payload_number | The number that identifies this PR or issue | Used to match opened/closed PRs and issues
url | The URL for this event | Used for retrieving the payload_number for pull request comments, which don’t have it listed |
repository_name | Name of repository | Necessary because we retrieve information on repositories one at a time |
repository_owner | Owner of repository | Same as above
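As an illustration, a query against that reduced dataset for a single repository, restricted to the event types handled below (the dataset/table name is the same placeholder as above, and the type strings should be checked against the actual values in the timeline data):

```ruby
# Pull only the fields listed above, for one repository and only the event
# types we know how to process.
event_types = %w[CommitCommentEvent IssueEvent PullRequestEvent
                 IssueCommentEvent PullRequestReviewCommentEvent]
type_list   = event_types.map { |t| "'#{t}'" }.join(', ')

sql = <<-SQL
  SELECT actor, payload_action, type, payload_commit, payload_number, url
  FROM [mydataset.top_repo_events]
  WHERE repository_owner = 'rubinius'
    AND repository_name  = 'rubinius'
    AND type IN (#{type_list})
SQL

system('bq', '--format=csv', 'query', sql)
```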
Processing each event:
Below I will outline the steps necessary for processing each event.
CommitCommentEvent:
Go through all of them and collect the commit SHAs (payload_commit). Make a unique list of these, then fetch them all using the github_api gem and group them by SHA (this prevents fetching the same commit multiple times).
Now go through the events again and, for each one, make an edge between the “actor” and the commit author (which we retrieved from the API).
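A rough sketch of that flow. Here `events` stands for the rows pulled from BigQuery as hashes keyed by column name (an assumption about the internal representation), `fetch_commit_author` is a hypothetical wrapper around the github_api gem call, and `add_interaction`/`edge_weights` come from the sketch above:

```ruby
# Resolve each unique commit SHA to its author once, then create
# commenter--committer edges from the comment events.
commit_comments = events.select { |e| e['type'] == 'CommitCommentEvent' }

authors_by_sha = commit_comments
  .map { |e| e['payload_commit'] }
  .compact.uniq
  .each_with_object({}) { |sha, map| map[sha] = fetch_commit_author(sha) }

commit_comments.each do |e|
  add_interaction(edge_weights, e['actor'], authors_by_sha[e['payload_commit']])
end
```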
IssueEvent & PullRequestEvent:
Group them by opened/closed. For every closed event, generate an edge between its actor and the actor of the ‘opened’ event with the matching payload_number.
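A sketch of that matching step, keyed by payload_number:

```ruby
# Index the 'opened' events by issue/PR number, then connect each 'closed'
# event's actor to the opener of the same number.
issue_pr_events = events.select { |e| %w[IssueEvent PullRequestEvent].include?(e['type']) }

openers_by_number = issue_pr_events
  .select { |e| e['payload_action'] == 'opened' }
  .each_with_object({}) { |e, map| map[e['payload_number'].to_s] = e['actor'] }

issue_pr_events.select { |e| e['payload_action'] == 'closed' }.each do |e|
  add_interaction(edge_weights, e['actor'], openers_by_number[e['payload_number'].to_s])
end
```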
IssueCommentEvent:
For each one, generate an edge between the “actor” and the issue/PR owner. (A hashmap of owners is built in the step above, keyed by payload_number, using the “actor” of each ‘opened’ event.)
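The equivalent step for issue comments, reusing openers_by_number:

```ruby
# Issue comments link the commenter to whoever opened the issue or PR.
events.select { |e| e['type'] == 'IssueCommentEvent' }.each do |e|
  add_interaction(edge_weights, e['actor'], openers_by_number[e['payload_number'].to_s])
end
```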
PullRequestReviewCommentEvent:
Pull the PR number out of the URL. Use the information we already collected from PullRequestEvent to find the ‘opened’ event with that PR number. Generate an edge between the two actors.
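Finally, review comments, where the PR number is recovered from the URL (the exact URL shape assumed here is illustrative):

```ruby
# Review comments don't carry payload_number, so pull the number out of the
# event URL (e.g. ".../pull/1234#discussion_r56") and match it to the opener.
events.select { |e| e['type'] == 'PullRequestReviewCommentEvent' }.each do |e|
  number = e['url'].to_s[%r{/pull/(\d+)}, 1]
  add_interaction(edge_weights, e['actor'], openers_by_number[number.to_s])
end
```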
- Derive a graph for the rubinius/rubinius repository going back as far as the data allows. Only need to measure “strength of interactions” as a summary measure. One snapshot per month. DONE
- Be able to find the 100/250/500 largest repositories by forks at any given point in time, and then derive “strength of interactions” graphs for each of those repos at monthly snapshots. DONE
- Extract event streams with timestamps for the 100/250/500 repos selected in bullet 2. DONE
- Extract time series of forks, total community members, “pull requesters”, and committers monthly for all the repositories selected in bullet 2
- Add extra detail to edges - separate edges for measuring the strength of commit-based, pull-request-based, and issue-based interactions - but also keep a summary measure
- Extract directed/non-directed networks
- Implement all the flags below
- Implement artifacts as nodes (creating artifact-actor networks)
The scraper needs to be able to take the following flags:
- Static/dynamic network - on/off switch for whether a single network should be generated, or several snapshots over time
- Repositories to query (e.g. rubinius/rubinius)
- Time period to query (e.g. 2011-2012)
- Time unit for snapshots (days, months, quarters, half-years, or years)
- Level of granularity of relationships (either 1 relationship per context, or reduced to “interaction”, i.e. a single relationship for all interactions regardless of context)
- Directed or non-directed graph (i.e., either relationships are non-directed, as in “we are working on the same bug”, or they are directed, as in “A commented on B’s commit” or “A merged B’s pull request”)
- Strength of relationship on/off switch - either all relationships of the same type are equally valued, or they are evaluated by strength. I.e., if A and B interact frequently, their relationship will be strong (e.g. +1 for each interaction).
- Nodes - determines what counts as a node: artifacts, developers, or artifacts/developers
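Purely as an illustration of this interface, the flags could be wired up with OptionParser along these lines; none of the option names below are final, they simply mirror the list above:

```ruby
require 'optparse'

options = { dynamic: false, directed: false, weighted: false,
            granularity: 'interaction', nodes: 'developers' }

OptionParser.new do |o|
  o.on('--repo REPO', 'Repository to query, e.g. rubinius/rubinius')      { |v| options[:repo] = v }
  o.on('--period PERIOD', 'Time period to query, e.g. 2011-2012')         { |v| options[:period] = v }
  o.on('--unit UNIT', %w[day month quarter half-year year],
       'Time unit for snapshots')                                         { |v| options[:unit] = v }
  o.on('--[no-]dynamic', 'Snapshots over time instead of a single network') { |v| options[:dynamic] = v }
  o.on('--[no-]directed', 'Directed instead of non-directed relationships') { |v| options[:directed] = v }
  o.on('--[no-]weighted', 'Weight relationships by strength')               { |v| options[:weighted] = v }
  o.on('--granularity LEVEL', %w[context interaction],
       'One relationship per context, or a single "interaction" relationship') { |v| options[:granularity] = v }
  o.on('--nodes KIND', %w[developers artifacts both],
       'What counts as a node')                                            { |v| options[:nodes] = v }
end.parse!
```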