-
Notifications
You must be signed in to change notification settings - Fork 8
List endpoints #32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
List endpoints #32
Conversation
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other than the comments maybe we could extract the logic of doing a query, check if the item exists, and insert it if it doesn't. But this is just an idea for eventually future improvements.
cmd/ghsync/main.go
Outdated
| var app = cli.New("ghsync", version, build, "GitHub metadata sync") | ||
|
|
||
| func main() { | ||
| app.AddCommand(&subcmd.ShallowCommand{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe clearer if renamed to ShallowSyncCommand? I'd have:
ShallowSyncCommandandSyncCommand, orShallowSyncCommandandDeepSyncCommand, orShallowCommandandDeepCommandas maybe theSyncis implicit in the project purpose itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, done in 46f4e26
cmd/ghsync/subcmd/common.go
Outdated
| T http.RoundTripper | ||
| } | ||
|
|
||
| func (t *RemoveHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please explain the reason for removing these headers?
I was also reading the doc, and it says:
// RoundTrip should not modify the request, except for // consuming and closing the Request's Body.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know the reason. I didn't pay any attention to this, I just blindly copy-pasted from sync.go.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
most probably if we don't remove these headers requests won't hit the cache. Our cache lib checks for headers as well. (though I didn't really check in details just saying according to my prior knowledge)
shallow/issue.go
Outdated
| } | ||
|
|
||
| for _, i := range issues { | ||
| if i.IsPullRequest() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess that s.client.Issues.ListByRepo that returns both issues and prs, so is the output of this actually a superset (no pun intended) of the output returned by s.client.PullRequests.ListByRepo? Given that the API calls are the most time-consuming operations, can we just remove this check and work with both issues and prs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was thread on slack about this.
I thought about it, but the kallax table stores *github.PullRequest, which is not the same as *github.Issue. A PR contains a few more fields than an Issue, see the models for both, and https://developer.github.com/v3/pulls/#list-pull-requests & https://developer.github.com/v3/issues/#list-issues-for-a-repository.
I guess we could create an empty PR object and fill the fields that we do have from the Issue, since many are shared. But some others would be missing, for example things like BaseRepositoryName.
If we assume this data will only be consumed by us, and we make sure our charts do not use any of the PR specific fields, it would work... for now. But it would leave us in a very unstable situation.
If we want to improve this in any way, I would maybe do this:
- store all Issues (including PRs) in the issues table
- edit the issues model to include the
PullRequestLinksfield - edit our dashboards to use the
issuestable and separate issues from PRs but the emptyPullRequestLinkscolumn.
But in any case I think this is better to be considered as an improvement for the future, since it's not an obvious change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree about improving this later but if I remember right the objects returned from list endpoint aren't complete anyway. To get ALL the data we have to call GET for each issue/pr. So the reasoning about "unstable situation" isn't really correct.
For example there is no comments field in PR list, only in the response for single PR. Refs:
https://developer.github.com/v3/pulls/#list-pull-requests
https://developer.github.com/v3/pulls/#get-a-single-pull-request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened an issue to continue this conversation: #40
shallow/repository.go
Outdated
| return err | ||
| } | ||
|
|
||
| for _, r := range repositories { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how much this may impact on memory consumption, but it would be better to avoid storing the repos in an array and just call s.doRepo for each repo for each page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That was my first approach.
Then I changed it to retrieve all repositories first for this reason:
Imagine you have 101 repos.
GET the first page, and start processing one by one their issues and PRs.
Meanwhile a repo gets deleted, and the total number of repos is 100.
After a while we finish processing all the repos in the first page, and we GET page 2. But now github will say that there is no page 2, and we missed the processing of 1 repo.
Yes, this is unlikely, but that was my reasoning.
If we think the tradeoff of storing everything in memory is too big, I can change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
memory consumption here compare to a lot of other stuff we do is almost nothing, I wouldn't worry about it.
But I see a problem in the reasoning. The case when the last repo is deleted is even more unlike than deleting from somewhere in the middle. So we will get error anyway if something got deleted during import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened an issue to continue the conversation: #42
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
shallow/issue.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's a good idea to wrap ALL the issues in a transaction.
There can be thousands of them so they won't be committed for quite a long time. (on wip branch for prettier/prettier it was taking minutes to download all of them) So when UI is open db is still empty and charts are ugly showing nulls.
I would better commit in batches by 100 for example.
Though maybe I'm missing some other case when batches can cause problems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shallow/issue.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
improvement for the future. We need to save github ids instead of relying on owner and repo. It would help with renames and maybe something else in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shallow/issue.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are doing hundreds to thousands of requests. The change of 5xx errors from github or any network problems is VERY high. Maybe we can keep it as is in this PR but we really need to improve it. Better before the release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shallow/repository.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please change the order PRs first and then Issues: most of the charts rely on prs table only, better to have data for them first. I understand that it doesn't really solve any problem but still, in some cases, it would improve UX and for us, it does matter which one to call fist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created issues for my comments for future improvements. Good to go as soon as PR and Issues are reordered (because it's super small change)
Fix #30.
Now there are 2 sub commands,
deepfor the previous code, andshallowfor the new endpoints.The download time for
src-dis around 15m, and it takes around 500 API calls.I chose to exit early when any error is found. The other alternative would be to log the error and continue, but I thought the errors we will find will probably be a broken internet connectivity, or the DB is down. In any of these, logging the errors and continue will just flood the logs with error messages and not really continue to do any work.
I will submit another incremental PR (#34) to get rid of all the
FindOneDB calls for each individual issue and PR. But the time does not really improve, it seems most of the time is spent on the API calls.