-
-
Notifications
You must be signed in to change notification settings - Fork 955
Separate backoff strategies for peer registration vs login operations #4835
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Separate backoff strategies for peer registration vs login operations #4835
Conversation
|
Note Other AI code review bot(s) detectedCodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review. WalkthroughAdds grouped retry/backoff constants and internal helpers to select registration vs login backoff strategies; detects registration requests; updates login retry flow to use the chosen backoff and distinguishes Cancelled vs DeadlineExceeded errors for retry behavior. Changes
Sequence Diagram(s)sequenceDiagram
participant Caller as gRPC caller
participant GRPC as management gRPC client
participant Detector as isRegistrationRequest
participant Selector as Backoff selector
participant Backoff as Backoff engine
Caller->>GRPC: Call Login RPC with req
GRPC->>Detector: isRegistrationRequest(req)?
alt registration
Detector-->>Selector: registration
Selector-->>Backoff: newRegistrationBackoff(ctx)
else login
Detector-->>Selector: login
Selector-->>Backoff: newLoginBackoff(ctx)
end
rect `#f2f8ff`
Note over Backoff: If error == Canceled → stop\nIf error == DeadlineExceeded → retry (debug log)
end
Backoff-->>GRPC: retry attempts / final error
GRPC-->>Caller: return response / error
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Poem
Pre-merge checks and finishing touches✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
📜 Recent review detailsConfiguration used: CodeRabbit UI Review profile: CHILL Plan: Pro 📒 Files selected for processing (1)
🧰 Additional context used🧬 Code graph analysis (1)shared/management/client/grpc.go (3)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (13)
🔇 Additional comments (5)
Tip 📝 Customizable high-level summaries are now available in beta!You can now customize how CodeRabbit generates the high-level summary in your pull requests — including its content, structure, tone, and formatting.
Example instruction:
Note: This feature is currently in beta for Pro-tier users, and pricing will be announced later. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR separates backoff strategies for peer registration and login operations by introducing distinct timeout configurations. Previously, both operations used a single 10-second backoff timeout, but registration operations (involving database writes and IdP validation) require longer timeouts than login operations.
Key changes:
- Auto-detects operation type by checking for SetupKey or JwtToken in the request
- Implements separate backoff configurations: 180s timeout for registration (5s initial, 30s max interval) vs 45s for login (3s initial, 15s max interval)
- Adds helper functions for creating backoff strategies while maintaining backward compatibility
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| if s.Code() == codes.DeadlineExceeded { | ||
| log.Debugf("login timeout, retrying...") | ||
| return err | ||
| } |
Copilot
AI
Nov 21, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The log message 'login timeout, retrying...' doesn't distinguish between registration and login operations. Since these now have different timeout configurations, the message should indicate which operation timed out to aid debugging. Consider using a message like 'operation timeout, retrying...' or including the operation type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
shared/management/client/grpc.go (3)
325-349: Consider usingNewExponentialBackOff+ opts instead of struct literalsThe two helpers mirror
defaultBackoffby instantiatingExponentialBackOffvia struct literal and passing it directly toWithContext. The backoff/v4 docs state thatResetshould be called before using anExponentialBackOff; usingNewExponentialBackOffwithWithInitialInterval/WithMaxInterval/WithMaxElapsedTime(or explicitly callingReset()once after construction) better aligns with that contract and is more robust to internal field changes. (pkg.go.dev)You could keep the same constants but build the policies like this:
-func newRegistrationBackoff(ctx context.Context) backoff.BackOff { - return backoff.WithContext(&backoff.ExponentialBackOff{ - InitialInterval: RegistrationRetryInterval, - RandomizationFactor: backoff.DefaultRandomizationFactor, - Multiplier: backoff.DefaultMultiplier, - MaxInterval: RegistrationRetryMaxDelay, - MaxElapsedTime: RegistrationRetryTimeout, - Stop: backoff.Stop, - Clock: backoff.SystemClock, - }, ctx) -} +func newRegistrationBackoff(ctx context.Context) backoff.BackOff { + b := backoff.NewExponentialBackOff( + backoff.WithInitialInterval(RegistrationRetryInterval), + backoff.WithMaxInterval(RegistrationRetryMaxDelay), + backoff.WithMaxElapsedTime(RegistrationRetryTimeout), + ) + return backoff.WithContext(b, ctx) +} @@ -func newLoginBackoff(ctx context.Context) backoff.BackOff { - return backoff.WithContext(&backoff.ExponentialBackOff{ - InitialInterval: LoginRetryInterval, - RandomizationFactor: backoff.DefaultRandomizationFactor, - Multiplier: backoff.DefaultMultiplier, - MaxInterval: LoginRetryMaxDelay, - MaxElapsedTime: LoginRetryTimeout, - Stop: backoff.Stop, - Clock: backoff.SystemClock, - }, ctx) -} +func newLoginBackoff(ctx context.Context) backoff.BackOff { + b := backoff.NewExponentialBackOff( + backoff.WithInitialInterval(LoginRetryInterval), + backoff.WithMaxInterval(LoginRetryMaxDelay), + backoff.WithMaxElapsedTime(LoginRetryTimeout), + ) + return backoff.WithContext(b, ctx) +}This keeps the behavior but relies on the library’s initialization path rather than on its struct layout.
351-355: Registration detection viaSetupKey/JwtTokenis clear but becomes the contractUsing
SetupKeyorJwtTokenpresence to classify aLoginRequestas a registration matches howRegistervsLoginpopulate the request, so the split backoff will behave as expected for current call sites. Just be aware this effectively codifies “any flow that sets eitherSetupKeyorJwtTokenis treated as registration” — future callers that happen to populate those fields will automatically inherit the longer registration retry window.
379-386: Clarify desired retry behavior forCanceledand use of parent context inloginInside
operation, bothcodes.Canceledandcodes.DeadlineExceededare returned as plain errors, sobackoff.Retrywill keep retrying them until the chosen backoff (registration or login) stops orc.ctxis canceled; all other status codes and non‑gRPC errors become permanent viabackoff.Permanent(err). If the intent is forCanceled(e.g., user‑driven cancellation) to stop retries, you may want to wrap it withbackoff.Permanent(err)as well or gate onc.ctxexplicitly before retrying.Also,
mgmCtxis still derived fromcontext.Background(), so cancelingc.ctxstops scheduling new retries but does not abort an in‑flightLogincall until its ownConnectTimeoutelapses. If you want immediate cancellation behavior on shutdown, consider usingc.ctxas the parent here:- mgmCtx, cancel := context.WithTimeout(context.Background(), ConnectTimeout) + mgmCtx, cancel := context.WithTimeout(c.ctx, ConnectTimeout)That would tie per‑RPC cancellation to the client’s lifecycle while still respecting the new backoff windows.
Also applies to: 394-402
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
shared/management/client/grpc.go(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
client/proto/daemon.pb.go (3)
LoginRequest(234-292)LoginRequest(305-305)LoginRequest(320-322)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Client / Unit (386)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
- GitHub Check: iOS / Build
- GitHub Check: Android / Build
- GitHub Check: release_ui_darwin
- GitHub Check: release
- GitHub Check: Linux
- GitHub Check: Darwin
- GitHub Check: Windows
- GitHub Check: JS / Lint
- GitHub Check: Client / Unit
🔇 Additional comments (1)
shared/management/client/grpc.go (1)
28-38: Timeout split between registration and login looks reasonable; verify overall UXThe separate retry windows (180s for registration, 45s for login) align with the stated goals of “slow to give up on registration, fast to fail login”. Just double‑check in practice that, combined with the 10s per‑RPC
ConnectTimeoutand exponential backoff, the worst‑case wall‑clock behavior still fits CLI/daemon expectations (e.g., not surprising users with overly long waits on repeated failures).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
shared/management/client/grpc.go (1)
379-402: LGTM! Correct error handling and backoff selection logic.The enhanced error handling properly distinguishes between retryable errors (Canceled, DeadlineExceeded) and permanent errors. The conditional backoff strategy selection based on request type is correctly implemented.
The context usage pattern is well-designed: using
c.ctxfor the backoff respects the client's lifecycle, while usingcontext.Background()for individual retry attempts ensures each gets a fresh timeout.Optional improvement: The log message at line 384 could indicate whether it's a registration or login timeout to aid debugging, as noted in the previous review. However, this is a minor observability enhancement.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
shared/management/client/grpc.go(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
client/proto/daemon.pb.go (3)
LoginRequest(234-292)LoginRequest(305-305)LoginRequest(320-322)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (17)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Client / Unit (386)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: release_ui_darwin
- GitHub Check: release
- GitHub Check: Client / Unit
- GitHub Check: Android / Build
- GitHub Check: Linux
- GitHub Check: Client / Unit
- GitHub Check: Windows
- GitHub Check: Darwin
- GitHub Check: JS / Lint
🔇 Additional comments (4)
shared/management/client/grpc.go (4)
28-38: LGTM! Well-organized backoff constants.The separate backoff strategies are appropriate: registration operations get more generous timeouts (180s total, 5s initial, 30s max) compared to login operations (45s total, 3s initial, 15s max). The grouping in a single const block improves readability.
325-336: LGTM! Correct backoff implementation.The registration backoff helper is properly configured with exponential backoff parameters and context integration for cancellation support.
338-349: LGTM! Correct backoff implementation.The login backoff helper is properly configured with exponential backoff parameters and context integration for cancellation support.
351-355: LGTM! Detection logic is correct.The function correctly identifies registration requests by checking for either SetupKey or JwtToken presence. The comment clarification from the previous review has been addressed appropriately.
4b04d0a to
f50db3c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
shared/management/client/grpc.go (1)
383-386: Enhance log message to distinguish operation type.Since registration and login now have different timeout configurations (180s vs 45s), the debug message should indicate which operation timed out to aid troubleshooting.
Apply this diff:
+ operationType := "login" + if isRegistrationRequest(req) { + operationType = "registration" + } if s.Code() == codes.DeadlineExceeded { - log.Debugf("login timeout, retrying...") + log.Debugf("%s timeout, retrying...", operationType) return err }
🧹 Nitpick comments (1)
shared/management/client/grpc.go (1)
28-38: Consider adding documentation for retry constants.While the constant names are self-explanatory, brief comments explaining the rationale for these specific timeout values would improve maintainability. For example, why registration needs 180s vs login's 45s (database writes, external IdP validation, etc.).
Example:
const ( ConnectTimeout = 10 * time.Second + // Registration retry constants - longer timeouts to accommodate database writes, + // external IdP validation, and group calculations RegistrationRetryTimeout = 180 * time.Second RegistrationRetryInterval = 5 * time.Second RegistrationRetryMaxDelay = 30 * time.Second + // Login retry constants - shorter timeouts for faster failure feedback LoginRetryTimeout = 45 * time.Second LoginRetryInterval = 3 * time.Second LoginRetryMaxDelay = 15 * time.Second )
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
shared/management/client/grpc.go(3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/management/client/grpc.go (1)
client/proto/daemon.pb.go (3)
LoginRequest(234-292)LoginRequest(305-305)LoginRequest(320-322)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Signal / Unit (386)
- GitHub Check: Client / Unit (386)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Android / Build
- GitHub Check: iOS / Build
- GitHub Check: Client / Unit
- GitHub Check: release_ui_darwin
- GitHub Check: release
- GitHub Check: release_ui
- GitHub Check: JS / Lint
- GitHub Check: Client / Unit
- GitHub Check: Windows
- GitHub Check: Linux
- GitHub Check: Darwin
🔇 Additional comments (1)
shared/management/client/grpc.go (1)
394-401: Excellent dynamic backoff strategy selection.The conditional logic cleanly separates registration from login backoff strategies based on request introspection, avoiding the need for additional boolean parameters. This maintains backward compatibility while achieving the PR's objectives.
| if s, ok := gstatus.FromError(err); ok { | ||
| if s.Code() == codes.Canceled { | ||
| return err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider treating Canceled errors as permanent.
The current code retries codes.Canceled errors, but cancellation typically indicates intentional termination. In handleStream (lines 177-179), Canceled is treated as a normal exit. Should the same approach apply here?
If cancellation should stop retries immediately, apply this diff:
if s, ok := gstatus.FromError(err); ok {
if s.Code() == codes.Canceled {
- return err
+ return backoff.Permanent(err)
}
if s.Code() == codes.DeadlineExceeded {📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if s, ok := gstatus.FromError(err); ok { | |
| if s.Code() == codes.Canceled { | |
| return err | |
| if s, ok := gstatus.FromError(err); ok { | |
| if s.Code() == codes.Canceled { | |
| return backoff.Permanent(err) |
🤖 Prompt for AI Agents
In shared/management/client/grpc.go around lines 379-381, the code currently
allows canceled RPCs to be retried; treat codes.Canceled as a permanent/terminal
error by returning immediately and preventing any retry logic. Update the error
handling so that when gstatus.FromError(err) yields s.Code() == codes.Canceled
you exit the retry loop and return the error (or set the retry flag to false) so
no further retry attempts are made.
The previous implementation used a single 10 second backoff for both registration and login operations. However, these are fundamentally different operations with different timeout requirements: Registration (new peer creation) legitimately needs longer timeouts because it involves database writes, external IdP validation, and group membership calculations. Login (existing peer authentication) is faster and should fail quickly to provide actionable feedback to users. This change: - Auto-detects operation type by checking for SetupKey or JwtToken in the request - Uses 180s timeout for registration (5s initial, 30s max interval) - Uses 45s timeout for login (3s initial, 15s max interval) - Extracts backoff creation into helper functions to keep login() complexity low - Maintains backward compatibility - no API changes to public methods The auto-detection approach eliminates the need for boolean parameters and keeps the login() function simple while providing appropriate timeouts for each scenario.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
f50db3c to
89271f2
Compare
|



The previous implementation used a single 10 second backoff for both registration and login operations, default GRPC timeout. However, these are fundamentally different operations with different timeout requirements:
Registration (new peer creation) legitimately needs longer timeouts because it involves database writes, external IdP validation, and group membership calculations. Login (existing peer authentication) is faster and should fail quickly to provide actionable feedback to users.
Describe your changes
This change:
The auto-detection approach eliminates the need for boolean parameters and keeps the login() function simple while providing appropriate timeouts for each scenario.
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.