Creates a GitHub index for discovering similar repositories. You can see a working website here: Gazing Stargazers (last indexed Jun 9 - Jun 13, 2014).
To index popular repositories (> 200 stars):
node repoIndexer.js --tokens="COMMA_SEPARATED_LIST_OF_GITHUB_TOKENS" > allrepo.json
This will save a JSON stream of repositories with >= 200 stars into the file allrepo.json.
If you think something is going wrong, you can enable logging by setting the ENABLE_LOG environment variable:
ENABLE_LOG=1 node repoIndexer.js --tokens=...
Note: COMMA_SEPARATED_LIST_OF_GITHUB_TOKENS can be just a single token, which you can create on the Personal access tokens page.
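To make the shape of the --tokens value concrete, here is a minimal sketch of splitting such a string into individual tokens. `parseTokens` is a hypothetical helper written for illustration; how the indexer itself parses this argument is not shown here.

```javascript
// Hypothetical helper, not taken from the indexer: turns a comma-separated
// string into an array of tokens. A single token yields a one-element array.
function parseTokens(value) {
  return String(value)
    .split(',')
    .map((token) => token.trim())
    .filter((token) => token.length > 0); // drop empty entries such as "a,,b"
}

console.log(parseTokens('TOKEN_A, TOKEN_B').length); // 2
console.log(parseTokens('TOKEN_A').length); // 1
```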
The second step in building the index of recommendations is to gather followers (stargazers) of popular repositories. To do so, run:
node followersIndex.js allrepo.json ./db/followers --tokens="COMMA_SEPARATED_LIST_OF_GITHUB_TOKENS"
This will create a new leveldb database followers inside the db folder. The database will include all repositories from allrepo.json along with the users who gave them a star.
The last indexing step is to collect all repositories starred by the users found in step 2. To do so, run:
node starsIndexer.js ./db/followers ./db/stars --tokens="COMMA_SEPARATED_LIST_OF_GITHUB_TOKENS"
This will read all unique followers from the followers database ./db/followers, constructed in step 2, and will output results into a database called ./db/stars. Each record in ./db/stars will have the user name as a key and the starred repositories as a value.
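The record layout described above can be pictured with an in-memory stand-in. This is only a sketch: it assumes values are arrays of "owner/name" strings (the actual on-disk encoding is not described here), the sample data is made up, and `invert` is a hypothetical helper showing how a user-to-repositories mapping can be turned back into repository-to-users.

```javascript
// Illustrative in-memory stand-in for ./db/stars: user name -> starred repos.
// Assumption: values are arrays of "owner/name" strings; sample data is fake.
const stars = new Map([
  ['userA', ['owner1/repoX', 'owner2/repoY']],
  ['userB', ['owner1/repoX']],
]);

// Hypothetical helper: inverts the mapping to repository -> set of users.
function invert(starsByUser) {
  const usersByRepo = new Map();
  for (const [user, repos] of starsByUser) {
    for (const repo of repos) {
      if (!usersByRepo.has(repo)) usersByRepo.set(repo, new Set());
      usersByRepo.get(repo).add(user);
    }
  }
  return usersByRepo;
}

console.log(invert(stars).get('owner1/repoX').size); // 2
```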
This is the most time-consuming step. As of June 2014, GitHub had 13,000+ repositories with more than 200 stars. This translates to 600,000+ unique users who gave stars to popular repositories.
Even though the majority of users starred fewer than 100 projects, we still need to make at least one request per user to fetch their stars, i.e. more than 600,000 requests to GitHub.
GitHub's current rate limit is 5,000 requests per hour, so indexing with one token takes 600,000/5,000 = 120 hours of work.
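The estimate above can be reproduced, and extended to several tokens, with a few lines of JavaScript. This is a rough sketch: `estimateHours` is a hypothetical helper, and it assumes every token has its own independent hourly budget (for example, tokens that belong to different accounts), which may not hold in practice.

```javascript
// Hypothetical helper, not part of the indexer: estimates the hours needed
// to make `requests` API calls with `tokenCount` tokens.
// Assumption: each token has its own independent hourly request budget.
function estimateHours(requests, tokenCount, requestsPerHour = 5000) {
  if (tokenCount < 1) throw new Error('at least one token is required');
  return requests / (requestsPerHour * tokenCount);
}

console.log(estimateHours(600000, 1)); // 120
console.log(estimateHours(600000, 4)); // 30
```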
The good news is that this indexer can be interrupted and resumed at any time.
Now that we have all popular repositories with their stargazers, let's construct the recommendations database:
node constructRecommendations.js ./db/followers ./db/stars
This will read the followers database produced in step 2 and the stars database produced in step 3. Results will be stored in two folders:
out - will contain username/repoName.json files, one for each repository with more than 200 stars
projects - will contain only projects.json. This file is an array of all indexed repositories.
Normally this will be enough. I also upload the results to an S3 bucket, which is then used by the yasiv frontend: http://www.yasiv.com/github/
If I could summarize this entire project in one line of code, it would be:
var index = 100 * 2 * sharedStarsCount/(analyzeRatio * (theirStarsCount + ourStarsCount));
This is a variation of the Sørensen–Dice coefficient. While it gives a pretty decent starting point for project discovery on GitHub, it is not perfect. If you want to help me improve the recommendations, please feel free to reach out: open a PR, tweet, or email me.
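To show how the one-liner behaves on concrete input, here is a small self-contained sketch. `countShared` and `similarityIndex` are hypothetical names, the stargazer lists are made up, and `analyzeRatio` is not defined in this text, so it is assumed here to be a positive scaling factor that defaults to 1. With that default, identical stargazer sets score 100 and disjoint sets score 0.

```javascript
// Counts stargazers that two repositories have in common.
function countShared(ourStargazers, theirStargazers) {
  const theirs = new Set(theirStargazers);
  let shared = 0;
  for (const user of new Set(ourStargazers)) {
    if (theirs.has(user)) shared += 1;
  }
  return shared;
}

// Same arithmetic as the one-liner above. `analyzeRatio` is an assumption:
// treated as a positive scaling factor, defaulting to 1.
function similarityIndex(ourStargazers, theirStargazers, analyzeRatio = 1) {
  const ourStarsCount = new Set(ourStargazers).size;
  const theirStarsCount = new Set(theirStargazers).size;
  if (ourStarsCount + theirStarsCount === 0) return 0; // avoid division by zero
  const sharedStarsCount = countShared(ourStargazers, theirStargazers);
  return 100 * 2 * sharedStarsCount / (analyzeRatio * (theirStarsCount + ourStarsCount));
}

const a = ['alice', 'bob', 'carol'];
const b = ['bob', 'carol', 'dave', 'erin'];
console.log(similarityIndex(a, b)); // 100 * 2 * 2 / 7, roughly 57.14
console.log(similarityIndex(a, a)); // 100
```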
PS: If you are working for GitHub, can you please make this feature part of GitHub?
MIT