
Problem: don't know how and where to slice up search results to stay within 1000 limit #22

Open
ioioio8888 opened this issue Jun 25, 2019 · 5 comments

@ioioio8888
Member

For now, it is fetching the top 1000 repos sorted by stars.

Since the GitHub search API returns at most 1,000 results per query, and the rate limit for authenticated users is 30 requests per minute, we need to define a proper filter rule when searching repos: filter out "useless" repos and narrow each search down to at most 1,000 results.
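For reference, a minimal sketch of the paging arithmetic behind the 1,000-result cap (the helper name is hypothetical, not code from the repo; the per-page and page limits are the search API's):

```python
import math

# Hypothetical helper: how many requests one search costs, given the
# total_count the search API reports. GitHub caps search results at
# 10 pages of 100 results each, i.e. 1,000 results per query.
def search_pages(total_count, per_page=100, max_pages=10):
    return min(math.ceil(total_count / per_page), max_pages)
```

E.g. a coin with 250 matching repos costs 3 requests; anything above 1,000 matches is capped at 10 requests and the remainder is unreachable within that one query.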

@samuelralak
Member

samuelralak commented Jun 25, 2019

@ioioio8888 what would be the criteria for useless repos?

@gsovereignty
Member

gsovereignty commented Jun 25, 2019

@ioioio8888
A request for 10 pages, right? So 10 requests per coin?

Most coins will (I guess) have very few pages of repos, with only a handful of coins having very many. It should theoretically follow a Pareto distribution (so this will be a very interesting experiment to see if that holds in this situation). If it does follow a Pareto distribution and we use 2,000 coins (for example), the counts should drop off quickly in the tail.

So even with Bitcoin at 32k repos, the total number of requests might average out to something like 5 per coin.

@samuelralak

Since Bitcoin has like 50 billion repos, we need to narrow the search results (for a fair comparison). The GitHub API docs have all the specifications of course, but you can filter to return only results with >x stars (and some other cool things), so we have to decide what's most relevant so that the search can be narrowed down by (for example) requiring 2+ stars.

But based on my comment above this might not be an issue.

@ioioio8888
Member Author

ioioio8888 commented Jun 25, 2019

@gazhayes @samuelralak
1 request fetches 1 page (max 100 repos per page), and one search allows at most 10 pages (1,000 repos total, 10 requests).

For example, searching for repos containing "bitcoin" with stars >= 2:
the total number of matching repos is 6,498.
However, one search can only return 1,000 of them, so we can't get the remaining 5,498 repos.
We have to narrow the result down to within 1,000 repos and split it into different searches (e.g. stars = 2, 3, 4, ...) in order to get all the repos.

If I narrow down to stars = 2, the total is 1,468 repos, so 468 repos are still ignored.
Even if stars = 2 were fine, it uses up 10 requests, and we'd need to do the same for stars = 3, 4, 5, 6, 7, ...
Each of those consumes up to 10 requests, and when to stop is very hard to determine.

For example, if we slice it into 10 searches (stars = 2, 3, 4, 5, 6, 7, 8, 9, 10, >= 11), that takes up to 100 requests per coin.
If there are 2k coins, we have to run the same searches for each coin, since we don't know in advance whether a coin has a large number of repos or not.
In total, that is around 200,000 calls.

We can call the search API 30 times per minute, i.e. 1,800 times an hour and 43,200 times a day, which is clearly not enough.

So we may need a way to tell whether a coin has a large number of repos (like Bitcoin or ETH), which requires slicing the query, or whether it can be finished in one search.
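To put numbers on the comment above, here is a sketch of the per-star slicing and its worst-case request budget. The slice boundaries and helper names are illustrative assumptions, not a decided design:

```python
# Hypothetical slicing: one query per star count from 2..10, plus a
# final stars:>=11 bucket -- 10 sliced searches per coin.
def sliced_queries(coin, star_cut=11):
    queries = [f"{coin} stars:{s}" for s in range(2, star_cut)]
    queries.append(f"{coin} stars:>={star_cut}")
    return queries

# Worst-case request budget: every slice needs the full 10 pages.
def worst_case_requests(n_coins, n_slices=10, pages_per_slice=10):
    return n_coins * n_slices * pages_per_slice
```

At 30 requests/min (43,200/day), 200,000 calls would take around 5 days for one full pass over 2,000 coins, which matches the "clearly not enough" conclusion.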

@gsovereignty
Member

gsovereignty commented Jun 26, 2019

The API will return the total number of results, right? So we can use some logic there to decide whether slicing is needed (and even where to slice)? I strongly suspect no more than 10 coins will actually need slicing.

For now, you could just ignore the large ones like BTC and build everything you need for the coins that don't have more than 10 pages; I or @samuelralak can claim this issue when we are ready to work out the slicing.

@gsovereignty gsovereignty changed the title Problem: Repo data does not contain all of the related repo Problem: don't know how and where to slice up search results to stay within 1000 limit Jun 26, 2019
@ioioio8888
Member Author

The logic now in gitBloq works for coins that have fewer than 1,000 repos. If a coin has more, it only gets the first 1,000.
I think the total count can be used to decide whether slicing is needed.
As for where to slice, since we don't know how the repos are distributed by star count, we may need some logic to find the slice points.
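One possible "where to slice" heuristic, sketched under the assumption that the `total_count` of a `stars:>=s` query is non-increasing as `s` grows: binary-search the smallest star threshold whose result set fits under the 1,000 limit. `count_at_least` is a hypothetical wrapper around one search-API call, not an existing function:

```python
def find_star_threshold(count_at_least, limit=1000, max_stars=100_000):
    """Smallest s such that a 'stars:>=s' query returns <= limit results.

    count_at_least(s) is assumed to issue one search request and return
    the reported total_count; it must be non-increasing in s.
    """
    lo, hi = 0, max_stars
    while lo < hi:
        mid = (lo + hi) // 2
        if count_at_least(mid) <= limit:
            hi = mid  # fits within the limit: try a lower threshold
        else:
            lo = mid + 1  # too many results: raise the threshold
    return lo
```

Everything at or above the threshold then fits in one search; the range below it still has to be sliced further, but the expensive region is located with roughly 17 requests (log2 of 100,000) instead of one probe per star value.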

@samuelralak samuelralak self-assigned this Jun 27, 2019