Problem: don't know how and where to slice up search results to stay within 1000 limit #22
Comments
@ioioio8888 what would be the criteria for useless repos?
@ioioio8888 Most coins will (I guess) have very few pages of repos, with only a handful of coins having very many. It should theoretically follow a Pareto distribution (so this will be a very interesting experiment to see whether that holds in this situation). If it does follow a Pareto distribution and we use 2000 coins (for example), the number should drop off logarithmically(?). So even with Bitcoin at 32k, the total number of requests might be something like 5 per coin on average. Since Bitcoin has such a huge number of repos, we need to narrow the search results (for a fair comparison). The GitHub API docs have the full specification, but you can filter to return only results with more than x stars (and some more cool things), so we have to decide what's most relevant and narrow the search accordingly, for example by requiring 2+ stars. But based on my comment above this might not be an issue.
@gazhayes @samuelralak For example, take a search for repos containing "bitcoin" with stars >= 2. Even if I narrow it down to stars = 2 exactly, the total number of repos is 1468, so 468 repos are still being ignored. If we slice it into 10 searches (stars = 2, 3, 4, 5, 6, 7, 8, 9, 10, and >= 11), that takes up to 100 requests per coin (10 slices of up to 10 pages each). The search API allows 30 requests/min, so we can call it 1800 times an hour, or 43200 times a day, which is clearly not enough. So we need a way to tell whether a coin has a large number of repos, like Bitcoin or ETH, and requires slicing the query, or whether it is one of the coins that can be finished in one request.
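To make the arithmetic above easier to check, here is a minimal Python sketch of the request budget. It assumes the figures quoted in this thread (30 search requests per minute, a 1000-result cap) plus a page size of 100; the function names are hypothetical and not part of gitBloq.

```python
import math

# Figures quoted in this thread; the page size of 100 is an assumption.
SEARCH_REQUESTS_PER_MINUTE = 30
RESULTS_PER_PAGE = 100
MAX_RESULTS_PER_QUERY = 1000


def requests_per_slice(total_count: int) -> int:
    """Pages needed to read one query, capped at the reachable 1000 results."""
    reachable = min(total_count, MAX_RESULTS_PER_QUERY)
    # Even an empty result costs one request to learn that it is empty.
    return max(1, math.ceil(reachable / RESULTS_PER_PAGE))


def daily_budget() -> int:
    """Search requests available per day at the quoted rate limit."""
    return SEARCH_REQUESTS_PER_MINUTE * 60 * 24


def coins_per_day(requests_per_coin: int) -> int:
    """How many coins fit in one day if each needs this many requests."""
    return daily_budget() // requests_per_coin


if __name__ == "__main__":
    print(daily_budget())            # requests per day
    print(requests_per_slice(1468))  # pages readable for the stars = 2 slice
    print(coins_per_day(100))        # coins per day at 100 requests each
```

At 100 requests per coin the daily budget covers a few hundred coins, which is why slicing every coin the same way does not scale to 2000 coins.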
The API returns the total number of results, right? So we can use some logic there to decide whether slicing is needed (and even where to slice). I strongly suspect no more than 10 coins will actually need slicing. For now, you could just ignore the large ones like BTC and build everything you need for the coins that don't have more than 10 pages; @samuelralak or I can claim this issue when we are ready to work out the slicing.
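One way this "decide from the total" idea could look is sketched below in Python. It bisects a star range until each slice reports at most 1000 results. Everything here is an assumption for illustration: `count_fn` is a hypothetical callback that returns the total result count for an inclusive star range (in practice each call would cost one search request), and the upper star bound has to be chosen by the caller.

```python
from typing import Callable, List, Tuple

RESULT_LIMIT = 1000  # search result cap discussed in this issue


def plan_star_slices(
    count_fn: Callable[[int, int], int],
    lo: int,
    hi: int,
    limit: int = RESULT_LIMIT,
) -> List[Tuple[int, int]]:
    """Split the inclusive star range [lo, hi] into slices of <= limit results.

    A single star value that still exceeds the limit is returned as-is,
    because it cannot be split any further by stars alone.
    """
    total = count_fn(lo, hi)
    if total == 0:
        return []
    if total <= limit or lo == hi:
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return (plan_star_slices(count_fn, lo, mid, limit)
            + plan_star_slices(count_fn, mid + 1, hi, limit))


def star_query(term: str, lo: int, hi: int) -> str:
    """Build a search query string restricted to a star range."""
    qualifier = f"stars:{lo}" if lo == hi else f"stars:{lo}..{hi}"
    return f"{term} {qualifier}"
```

Coins with fewer than 1000 results come back as a single slice, so they cost only the one counting request. A slice such as stars = 2 with 1468 results stays over the limit and would need a second qualifier (repository creation date, for example) to be split further.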
The logic now in gitBloq works for coins that have fewer than 1000 repos. If a coin has more than 1000, it fetches only 1000 repos.
For now, it is fetching the top 1000 repos sorted by stars.
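For readers who want to picture the "top 1000 sorted by stars" behaviour, the following Python sketch builds the ten page URLs such a fetch would need. It is not gitBloq's actual code: the function name is hypothetical, a page size of 100 is assumed, and no request is sent.

```python
from typing import List
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/repositories"


def top_repo_page_urls(term: str, per_page: int = 100, cap: int = 1000) -> List[str]:
    """Return one URL per page, covering the first `cap` results by stars."""
    pages = cap // per_page
    return [
        SEARCH_URL + "?" + urlencode({
            "q": term,
            "sort": "stars",    # most-starred repos first
            "order": "desc",
            "per_page": per_page,
            "page": page,
        })
        for page in range(1, pages + 1)
    ]


if __name__ == "__main__":
    for url in top_repo_page_urls("bitcoin")[:2]:
        print(url)
```

Sorting by stars means that anything cut off by the 1000-result cap is the least-starred tail, which is the part most likely to count as "useless" under a star-based filter.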
Since the GitHub search API returns at most 1000 results per query, and the rate limit for an authenticated user is 30 requests per minute, we need to define a proper filter rule when searching repos, one that filters out "useless" repos and narrows the results to within 1000 repos per search.