Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MM-60412] Fix potentially stuck init process #827

Merged
merged 2 commits into from
Oct 14, 2024
Merged

Conversation

streamer45
Copy link
Contributor

@streamer45 streamer45 commented Oct 10, 2024

Summary

The high concurrency of the init process could cause some permission errors when fetching channels for a previously joined team. Since we try to fetch the channels right after joining a team, this could naturally happen as a result of replica lag.

However, certain app nodes could consistently return 403 as somehow the model.Session.TeamMembers would be not in sync with the real state (user joined all teams). I couldn't fully understand why that is happening, so it remains something to potentially investigate separately.

The fix on this side is pretty naive: we merely force the init to happen on the first app server, preventing any of the above from happening.

Note: This is technically a breaking change since it requires an updated ltagent binary for the deployment step to work. I'm not sure how we want to handle this. We should probably merge this PR when we prepare the next release to minimize the chance of this causing issues. Let me know.

Ticket Link

https://mattermost.atlassian.net/browse/MM-60412

@streamer45 streamer45 added 2: Dev Review Requires review by a core committer Do Not Merge Should not be merged until this label is removed labels Oct 10, 2024
@streamer45 streamer45 self-assigned this Oct 10, 2024
@agnivade
Copy link
Member

If this could be a replica lag issue, how does pointing to a single server fix this?

@streamer45
Copy link
Contributor Author

If this could be a replica lag issue, how does pointing to a single server fix this?

@agnivade It's not replica lag since we repeat that action in a loop, so eventually, it would pass. Instead, it consistently errors on some nodes, pointing to a cache issue.

@agnivade
Copy link
Member

Understood. I previously thought this was related to something after deployment, hence suggested the cache invalidation solution. But IIUC, this is happening somewhere in between the genController process correct?

@streamer45
Copy link
Contributor Author

Understood. I previously thought this was related to something after deployment, hence suggested the cache invalidation solution. But IIUC, this is happening somewhere in between the genController process correct?

Yeah, it gets messed up during execution. With loadtest reset, we restart all the app servers so caches are clean prior to the init, but still, it reproduces.

Copy link
Member

@agarciamontoro agarciamontoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a request on the name of the flag, but otherwise it looks good. Thanks for taking a look!

@@ -65,11 +65,20 @@ func RunInitCmdF(cmd *cobra.Command, args []string) error {
return err
}

siteURLOverride, err := cmd.Flags().GetString("site-url")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename the flag to server-url? Or override-server-url? I ask because there's a SiteURL config already in the deployer, and it's already a bit confusing.

@agarciamontoro
Copy link
Member

Note: This is technically a breaking change since it requires an updated ltagent binary for the deployment step to work. I'm not sure how we want to handle this. We should probably merge this PR when we prepare the next release to minimize the chance of this causing issues. Let me know.

@streamer45 It's not the first time we merge breaking changes. We don't have a strict policy on this at this point, although we should at some point (I'm thinking of config changes for example, that are the most annoying for users). I'm ok merging this right before the next release (which I was planning to cut today!). I can wait for this to be merged and cut the branch right after.

@streamer45
Copy link
Contributor Author

Note: This is technically a breaking change since it requires an updated ltagent binary for the deployment step to work. I'm not sure how we want to handle this. We should probably merge this PR when we prepare the next release to minimize the chance of this causing issues. Let me know.

@streamer45 It's not the first time we merge breaking changes. We don't have a strict policy on this at this point, although we should at some point (I'm thinking of config changes for example, that are the most annoying for users). I'm ok merging this right before the next release (which I was planning to cut today!). I can wait for this to be merged and cut the branch right after.

I'll leave it to you, feel free to merge whenever makes sense.

Copy link
Member

@agarciamontoro agarciamontoro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@streamer45 streamer45 added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer Do Not Merge Should not be merged until this label is removed labels Oct 14, 2024
@agarciamontoro agarciamontoro merged commit 147f26e into master Oct 14, 2024
1 check passed
@agarciamontoro agarciamontoro deleted the MM-60412 branch October 14, 2024 15:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
4: Reviews Complete All reviewers have approved the pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants