Skip to content

Conversation

@carlosms
Copy link
Contributor

@carlosms carlosms commented Jun 18, 2019

Fix #30.

Now there are 2 sub commands, deep for the previous code, and shallow for the new endpoints.
The download time for src-d is around 15m, and it takes around 500 API calls.

I chose to exit early when any error is found. The other alternative would be to log the error and continue, but I thought the errors we will find will probably be a broken internet connectivity, or the DB is down. In any of these, logging the errors and continue will just flood the logs with error messages and not really continue to do any work.

I will submit another incremental PR (#34) to get rid of all the FindOne DB calls for each individual issue and PR. But the time does not really improve, it seems most of the time is spent on the API calls.

carlosms added 5 commits June 17, 2019 17:47
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
@carlosms carlosms changed the title [WIP] List endpoints List endpoints Jun 19, 2019
@carlosms carlosms requested a review from a team June 19, 2019 09:05
Copy link
Contributor

@se7entyse7en se7entyse7en left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than the comments maybe we could extract the logic of doing a query, check if the item exists, and insert it if it doesn't. But this is just an idea for eventually future improvements.

var app = cli.New("ghsync", version, build, "GitHub metadata sync")

func main() {
app.AddCommand(&subcmd.ShallowCommand{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe clearer if renamed to ShallowSyncCommand? I'd have:

  • ShallowSyncCommand and SyncCommand, or
  • ShallowSyncCommand and DeepSyncCommand, or
  • ShallowCommand and DeepCommand as maybe the Sync is implicit in the project purpose itself.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, done in 46f4e26

T http.RoundTripper
}

func (t *RemoveHeaderTransport) RoundTrip(req *http.Request) (*http.Response, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please explain the reason for removing these headers?

I was also reading the doc, and it says:

   // RoundTrip should not modify the request, except for
   // consuming and closing the Request's Body.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know the reason. I didn't pay any attention to this, I just blindly copy-pasted from sync.go.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most probably if we don't remove these headers requests won't hit the cache. Our cache lib checks for headers as well. (though I didn't really check in details just saying according to my prior knowledge)

shallow/issue.go Outdated
}

for _, i := range issues {
if i.IsPullRequest() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that s.client.Issues.ListByRepo that returns both issues and prs, so is the output of this actually a superset (no pun intended) of the output returned by s.client.PullRequests.ListByRepo? Given that the API calls are the most time-consuming operations, can we just remove this check and work with both issues and prs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was thread on slack about this.

I thought about it, but the kallax table stores *github.PullRequest, which is not the same as *github.Issue. A PR contains a few more fields than an Issue, see the models for both, and https://developer.github.com/v3/pulls/#list-pull-requests & https://developer.github.com/v3/issues/#list-issues-for-a-repository.

I guess we could create an empty PR object and fill the fields that we do have from the Issue, since many are shared. But some others would be missing, for example things like BaseRepositoryName.
If we assume this data will only be consumed by us, and we make sure our charts do not use any of the PR specific fields, it would work... for now. But it would leave us in a very unstable situation.

If we want to improve this in any way, I would maybe do this:

  • store all Issues (including PRs) in the issues table
  • edit the issues model to include the PullRequestLinks field
  • edit our dashboards to use the issues table and separate issues from PRs but the empty PullRequestLinks column.

But in any case I think this is better to be considered as an improvement for the future, since it's not an obvious change.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about improving this later but if I remember right the objects returned from list endpoint aren't complete anyway. To get ALL the data we have to call GET for each issue/pr. So the reasoning about "unstable situation" isn't really correct.
For example there is no comments field in PR list, only in the response for single PR. Refs:
https://developer.github.com/v3/pulls/#list-pull-requests
https://developer.github.com/v3/pulls/#get-a-single-pull-request

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened an issue to continue this conversation: #40

return err
}

for _, r := range repositories {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how much this may impact on memory consumption, but it would be better to avoid storing the repos in an array and just call s.doRepo for each repo for each page.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That was my first approach.

Then I changed it to retrieve all repositories first for this reason:
Imagine you have 101 repos.
GET the first page, and start processing one by one their issues and PRs.
Meanwhile a repo gets deleted, and the total number of repos is 100.
After a while we finish processing all the repos in the first page, and we GET page 2. But now github will say that there is no page 2, and we missed the processing of 1 repo.

Yes, this is unlikely, but that was my reasoning.
If we think the tradeoff of storing everything in memory is too big, I can change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

memory consumption here compare to a lot of other stuff we do is almost nothing, I wouldn't worry about it.
But I see a problem in the reasoning. The case when the last repo is deleted is even more unlike than deleting from somewhere in the middle. So we will get error anyway if something got deleted during import.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#41

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened an issue to continue the conversation: #42

Signed-off-by: Carlos Martín <carlos.martin.sanchez@gmail.com>
shallow/issue.go Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it's a good idea to wrap ALL the issues in a transaction.
There can be thousands of them so they won't be committed for quite a long time. (on wip branch for prettier/prettier it was taking minutes to download all of them) So when UI is open db is still empty and charts are ugly showing nulls.
I would better commit in batches by 100 for example.
Though maybe I'm missing some other case when batches can cause problems.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#39

shallow/issue.go Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

improvement for the future. We need to save github ids instead of relying on owner and repo. It would help with renames and maybe something else in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#38

shallow/issue.go Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are doing hundreds to thousands of requests. The change of 5xx errors from github or any network problems is VERY high. Maybe we can keep it as is in this PR but we really need to improve it. Better before the release.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#37

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please change the order PRs first and then Issues: most of the charts rely on prs table only, better to have data for them first. I understand that it doesn't really solve any problem but still, in some cases, it would improve UX and for us, it does matter which one to call fist.

Copy link
Contributor

@smacker smacker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created issues for my comments for future improvements. Good to go as soon as PR and Issues are reordered (because it's super small change)

@smacker smacker merged commit 9874cca into src-d:master Jun 19, 2019
@carlosms carlosms deleted the list-endpoints branch June 19, 2019 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use list endpoints for faster download times

3 participants