-
Notifications
You must be signed in to change notification settings - Fork 43
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[MM-60412] Fix potentially stuck init process #827
Conversation
If this could be a replica lag issue, how does pointing to a single server fix this? |
@agnivade It's not replica lag since we repeat that action in a loop, so eventually, it would pass. Instead, it consistently errors on some nodes, pointing to a cache issue. |
Understood. I previously thought this was related to something after deployment, hence suggested the cache invalidation solution. But IIUC, this is happening somewhere in between the genController process correct? |
Yeah, it gets messed up during execution. With |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a request on the name of the flag, but otherwise it looks good. Thanks for taking a look!
cmd/ltagent/init.go
Outdated
@@ -65,11 +65,20 @@ func RunInitCmdF(cmd *cobra.Command, args []string) error { | |||
return err | |||
} | |||
|
|||
siteURLOverride, err := cmd.Flags().GetString("site-url") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we rename the flag to server-url
? Or override-server-url
? I ask because there's a SiteURL config already in the deployer, and it's already a bit confusing.
@streamer45 It's not the first time we merge breaking changes. We don't have a strict policy on this at this point, although we should at some point (I'm thinking of config changes for example, that are the most annoying for users). I'm ok merging this right before the next release (which I was planning to cut today!). I can wait for this to be merged and cut the branch right after. |
I'll leave it to you, feel free to merge whenever makes sense. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
Summary
The high concurrency of the init process could cause some permission errors when fetching channels for a previously joined team. Since we try to fetch the channels right after joining a team, this could naturally happen as a result of replica lag.
However, certain app nodes could consistently return 403 as somehow the
model.Session.TeamMembers
would be not in sync with the real state (user joined all teams). I couldn't fully understand why that is happening, so it remains something to potentially investigate separately.The fix on this side is pretty naive: we merely force the init to happen on the first app server, preventing any of the above from happening.
Note: This is technically a breaking change since it requires an updated
ltagent
binary for the deployment step to work. I'm not sure how we want to handle this. We should probably merge this PR when we prepare the next release to minimize the chance of this causing issues. Let me know.Ticket Link
https://mattermost.atlassian.net/browse/MM-60412