To turn this fact into a number, I'm using [Sorensen-Dice](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
similarity coefficient:

```
                        number_of_shared_stars(A, B)
similarity(A, B) = ---------------------------------------
                   number_of_stars(A) + number_of_stars(B)
```
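
For illustration, here is how the coefficient above could be computed for two sets of stargazers (a toy sketch, not code from this project):

``` js
// similarity(A, B) = shared_stars(A, B) / (stars(A) + stars(B)), as defined above.
function similarity(starsA, starsB) {
  let shared = 0;
  for (const user of starsA) {
    if (starsB.has(user)) shared += 1;
  }
  return shared / (starsA.size + starsB.size);
}

// Two repositories sharing 2 of their stargazers:
const a = new Set(['alice', 'bob', 'carol']);
const b = new Set(['bob', 'carol', 'dave', 'eve']);
console.log(similarity(a, b)); // 2 / (3 + 4) ≈ 0.286
```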

"Developers who gave star to this repository, also gave star to ..." metric
Expand All @@ -35,15 +35,15 @@ Node as the most relevant.
[GitHub Archive](http://www.githubarchive.org/) provides gigabytes of data from
GitHub. We can query it using [Google's BigQuery API](https://bigquery.cloud.google.com).

For example, this query:

``` sql
SELECT repository_url, actor_attributes_login
FROM [githubarchive:github.timeline]
WHERE type='WatchEvent'
```

It gives us a list of repositories, along with the users who gave them stars:

```
| Row | repository_url | actor_attributes_login |
...
```

By iteratively processing each record we can calculate the number of stars for each
project. We can also find how many shared stars each project has with every
other project. But... The dataset is huge. Today (Nov 30, 2014) there are 25M
watch events, produced by more than 1.8M unique users and given to more than
1.2M unique repositories. We need to reduce the dataset:

``` sql
SELECT repository_url, actor_attributes_login
...
GROUP EACH BY repository_url, actor_attributes_login;
```

## Why do we require at least 2 stars?

Since we are using the [Sorensen-Dice similarity coefficient](https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient),
users who gave only 1 star in total can be excluded from the "shared stars" metric.
In theory this will slightly skew the similarity coefficient and make two projects look more
similar than they should be, but in practice the results seem to be helpful enough. This
than 500 stars.

This query reduces the dataset from 25M to 16M records.

# Storing the data

We downloaded the dataset and stored it in a CSV file for further processing.

To calculate similarity we need to be able to quickly answer two questions:

1. Who gave stars to `project A`?
2. Which projects were starred by `User B`?

Storing this in a hash-like data structure would give us O(1) time to answer
both of these questions.

The naive solution of keeping everything in an in-process hash (using either C++ or node)
turned out to be extremely inefficient. My processes exceeded the 8GB RAM limit
and started killing my laptop with constant swapping.
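
For illustration, the in-process approach boils down to two inverted indexes (a toy sketch with made-up names, not the actual code):

``` js
// repository_url -> Set of logins who starred it
const starsByRepo = new Map();
// login -> Set of repository_urls that login starred
const reposByUser = new Map();

function addStar(repositoryUrl, login) {
  if (!starsByRepo.has(repositoryUrl)) starsByRepo.set(repositoryUrl, new Set());
  starsByRepo.get(repositoryUrl).add(login);

  if (!reposByUser.has(login)) reposByUser.set(login, new Set());
  reposByUser.get(login).add(repositoryUrl);
}
```

Both questions become O(1) lookups, but as described above, holding both maps for 16M rows blew past the RAM limit.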

Maybe I should save it into a local database?

I tried to use [neo4j](http://neo4j.com/) but it failed with an out-of-memory exception
during CSV import.

The next and last stop was [redis](http://redis.io/). An absolutely beautiful piece of
software. It swallowed 16M rows without blinking an eye. RAM stayed within a sane 3GB
range, and disk utilization was only 700MB.
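
The same two lookups map naturally onto redis sets. A minimal sketch, assuming the `ioredis` client and made-up key names (not necessarily how ghindex lays out its keys):

``` js
const Redis = require('ioredis');
const redis = new Redis(); // connects to localhost:6379 by default

// One set per repository and one set per user.
async function addStar(repositoryUrl, login) {
  await redis.sadd('stars:' + repositoryUrl, login);
  await redis.sadd('starred:' + login, repositoryUrl);
}

// 1. Who gave stars to project A?
const whoStarred = repo => redis.smembers('stars:' + repo);
// 2. Which projects were starred by user B?
const whatStarred = user => redis.smembers('starred:' + user);
```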

# Building recommendations

The recommendation database is created by these [~200 lines of code](https://github.com/anvaka/ghindex/blob/master/recommend.js).
There is a lot of asynchronous code in there, hidden behind promises.

In a nutshell, this is what it does:

```
1. Find all repositories with more than 150 stars.
2. For each repository, find the users who gave it a star.
   For each of those users, find which other projects they starred.
   For each of those projects, increase the number of shared stars.
3. Produce the similarity coefficient.
```
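
Ignoring the asynchronous plumbing, the core of that loop might look roughly like this (hypothetical helpers, not the actual recommend.js):

``` js
// whoStarred(repo) -> Set of logins, whatStarred(login) -> Set of repos,
// starCount(repo) -> total number of stars; backed by the redis sets above.
function recommendationsFor(repo, whoStarred, whatStarred, starCount) {
  const sharedStars = new Map(); // other repo -> number of shared stargazers

  for (const user of whoStarred(repo)) {
    for (const other of whatStarred(user)) {
      if (other === repo) continue;
      sharedStars.set(other, (sharedStars.get(other) || 0) + 1);
    }
  }

  const scored = [];
  for (const [other, shared] of sharedStars) {
    scored.push({
      repo: other,
      similarity: shared / (starCount(repo) + starCount(other))
    });
  }

  // Most similar projects first.
  return scored.sort((a, b) => b.similarity - a.similarity);
}
```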

Final results are saved to disk, and then uploaded to S3, so that the [frontend](http://www.yasiv.com/github/)
can immediately get them.

# Final Notes

It takes ~2 hours 20 minutes to construct recommendations for the 15K most popular
GitHub projects. It takes another 40 minutes to prepare and download the data from
BigQuery.

My previous approach, where I had to manually index GitHub via GitHub's API,
took approximately 5 days to build the index, and one more day to calculate
recommendations.

GitHub Archive is awesome; Redis is awesome too! Maybe the next step will be improving
the results with content-based recommendations, e.g. we could index source code and
find similarity based on the AST. Anyway, please let me know what you think.

# License

MIT
