auth: de-dupe inflight requests #21801
Conversation
Force-pushed from 8adc065 to afffdad
Huh! Any theories about what? Weird that the latencies double from 512 to 1024 and again from 1024 to 2048, but then hold relatively steady after that.
Force-pushed from 617dfd0 to 0fc48ca
Yeah, it is quite strange. While looking at traces I did see that once we hit 512 the Coordinator starts to thrash between. There is likely also something else, maybe a periodic process, that is contributing. Using tokio-console or tracing would help us identify it, but I haven't had the chance to dig into it yet.
```diff
 pub enum Error {
     #[error(transparent)]
     InvalidPasswordFormat(#[from] AppPasswordParseError),
     #[error("invalid token format: {0}")]
     InvalidTokenFormat(#[from] jsonwebtoken::errors::Error),
     #[error("authentication token exchange failed: {0}")]
-    ReqwestError(#[from] reqwest::Error),
+    ReqwestError(Arc<reqwest::Error>),
```
These are being Arc'd to allow for sending the same error to all inflight requests and avoid a bunch of clones?
I wrapped some error types in `Arc`s because, surprisingly, they can't be `Clone`-d. I was thinking of wrapping the entire `Error` in an `Arc`, but that changes the external API, which I didn't really want to do.
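The pattern being discussed can be sketched in isolation. This is a minimal, illustrative example: `std::io::Error` stands in for `reqwest::Error` (neither implements `Clone`), and the `Error` enum here is a made-up stand-in for the crate's real one.

```rust
use std::sync::Arc;

// std::io::Error, like reqwest::Error, does not implement Clone, so
// wrapping the inner error in an Arc lets the outer enum derive Clone
// and be sent cheaply to every in-flight waiter.
#[derive(Clone, Debug)]
enum Error {
    Request(Arc<std::io::Error>),
}

fn main() {
    let inner = Arc::new(std::io::Error::new(
        std::io::ErrorKind::Other,
        "exchange failed",
    ));
    let err = Error::Request(Arc::clone(&inner));
    // Cloning copies only the Arc pointer; all handles share one io::Error.
    let for_waiter = err.clone();
    assert_eq!(Arc::strong_count(&inner), 3);
    println!("{for_waiter:?}");
}
```

Cloning the whole enum stays cheap no matter how large the wrapped error is, at the cost of callers seeing an `Arc` in the variant.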
```rust
/// Makes a request to the provided URL, possibly de-duping by attaching a listener to an
/// already in-flight request.
async fn make_request<Req, Resp>(&self, url: String, req: Req) -> Result<Resp, Error>
```
This is only used as `&url`, so could it be made a `&'static str`? That should avoid needing to `.to_string()` the URLs, skip some allocations, and the `'static` lifetime should make it safe to use in the task.
The URL is provided to us dynamically as a command line arg in envd, so it can't be a `&'static str`, unfortunately 😢
src/frontegg-auth/src/client.rs (Outdated)
```rust
use anyhow::Context;
use mz_ore::collections::HashMap;
use tokio::sync::oneshot;
```
Done!
```rust
// Tell all of our waiters about the result.
let response = result.map(|r| r.into_response());
for tx in waiters {
    let _ = tx.send(response.clone());
}
```
Thoughts on Arc'ing the response to avoid so many clones?
Yeah, I thought about wrapping the response and error in `Arc`s, but that would require us to return `Arc<Response>` to callers, which I didn't think was worth it because the callers were all using owned versions of the structs, so we'd need to clone them anyway. Happy to change this if you want me to.
This PR updates our Frontegg Auth client to take ownership of refreshing tokens, spawning a single task to manage the refresh for a group of de-duplicated requests.

In #21801 we started de-duplicating authentication requests. This had the unintended consequence of breaking refresh tokens in some cases. A refresh token is returned from Frontegg upon successful auth and is used to extend the expiration of a JWT. The refresh tokens are single use, though. When we started de-duplicating requests we ended up with multiple sessions holding the same refresh token, and all of them trying to use it. What ended up happening is that a single session was able to refresh and all of the others were disconnected.

This PR updates the logic in our Frontegg client to continue using the same de-duplication strategy (i.e. an `Arc<Mutex<HashMap<...>>>`), but we now spawn a task to automatically refresh tokens and give the caller a channel to which we push updates. It also adds a few metrics around Frontegg Auth.

I tested this PR manually; adding tests is still a TODO, but I wanted to get this PR up for early feedback if possible. Also note, my first attempt at solving this made everything task based and got rid of the `Arc<Mutex<HashMap<...>>>`. I think that is a cleaner approach, but I wasn't able to get it working with our existing integration tests, which are half synchronous and half asynchronous.

### Motivation

Fixes: https://github.com/MaterializeInc/materialize/issues/22096

### Tips for reviewer

This PR is split into four commits, which might make it easier to review:

1. Refactor, moving `ExternalUserMetadata` into `mz_repr`.
2. Updates to the `mz_frontegg_auth` crate itself; this is the **most important commit**!
3. Update tests and callers of Authentication types to use the new API.
4. Add tests to exercise duplicate sessions being refreshed together.

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly considered.
- [ ] This PR has an associated up-to-date [design doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md), is a design doc ([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)), or is sufficiently small to not require a design.
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T` mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md) (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label ([example](MaterializeInc/cloud#5021)).
- [x] This PR includes the following [user-facing behavior changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
  - Fixes a bug where sessions using the same app password that start immediately after one another could get invalidated early.
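The refresh flow described above can be sketched with plain threads and channels standing in for tokio tasks; `refresh` and the token strings are made-up stand-ins for the Frontegg exchange, not the crate's actual API:

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

// Made-up stand-in for the Frontegg exchange: consumes the single-use
// refresh token and returns a (new JWT, next refresh token) pair.
fn refresh(refresh_token: &str) -> (String, String) {
    (
        format!("jwt-after-{refresh_token}"),
        format!("{refresh_token}-next"),
    )
}

fn main() {
    // Each de-duplicated session holds only a Receiver for token updates.
    let mut senders = Vec::new();
    let mut receivers: Vec<Receiver<String>> = Vec::new();
    for _ in 0..3 {
        let (tx, rx) = channel();
        senders.push(tx);
        receivers.push(rx);
    }

    // A single refresher task owns the single-use refresh token for the
    // whole group, so exactly one refresh is attempted and every session
    // is pushed the resulting JWT, instead of each session racing to
    // spend the same token.
    let refresher = thread::spawn(move || {
        let (jwt, _next_refresh_token) = refresh("rt-0");
        for tx in senders {
            let _ = tx.send(jwt.clone());
        }
    });
    refresher.join().unwrap();

    for rx in receivers {
        assert_eq!(rx.recv().unwrap(), "jwt-after-rt-0");
    }
}
```

The key property is that the refresh token never leaves the single owning task, which is what prevents the "one session refreshes, the rest get disconnected" failure described above.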
This PR updates the `Client` we use to make Frontegg Auth requests to de-dupe inflight requests. Specifically, when a request is made we check whether a request to that endpoint with those arguments is already inflight; if so, we do not issue a second request and instead attach a listener/waiter to the already inflight request.

### Motivation
Helps improve: https://github.com/MaterializeInc/database-issues/issues/6537
Frontegg documents the API we use for getting auth tokens as accepting 100 requests per second. As we attempt to scale to supporting thousands of concurrent connection requests (per user), we hit Frontegg's request limit.
With this change + #21783 we have the following latencies when opening concurrent connections:
Something is still happening when we reach 512 connections, but this is about a 10x improvement over the current state of the world.
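The de-dupe strategy described above can be sketched with std types (threads and `mpsc` channels standing in for tokio tasks and `oneshot` channels; the key, response strings, and `make_request` shape here are illustrative, not the crate's real API):

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;
use std::sync::mpsc::{channel, Receiver, Sender};
use std::sync::{Arc, Mutex};
use std::thread;

// Keyed by endpoint + request arguments; each in-flight request tracks
// the list of waiters to notify when it completes.
type Inflight = Arc<Mutex<HashMap<String, Vec<Sender<String>>>>>;

fn make_request(inflight: &Inflight, key: &str) -> Receiver<String> {
    let (tx, rx) = channel();
    let mut map = inflight.lock().unwrap();
    match map.entry(key.to_string()) {
        // A request for this key is already in flight: just attach a waiter.
        Entry::Occupied(mut entry) => entry.get_mut().push(tx),
        // First caller: register ourselves and issue the actual request.
        Entry::Vacant(entry) => {
            entry.insert(vec![tx]);
            let inflight = Arc::clone(inflight);
            let key = key.to_string();
            thread::spawn(move || {
                // Pretend HTTP call; the real client awaits reqwest here.
                let response = format!("response-for-{key}");
                // Remove the entry so later calls start fresh, then tell
                // every waiter about the result.
                let waiters = inflight.lock().unwrap().remove(&key).unwrap();
                for waiter in waiters {
                    let _ = waiter.send(response.clone());
                }
            });
        }
    }
    rx
}

fn main() {
    let inflight: Inflight = Arc::new(Mutex::new(HashMap::new()));
    let receivers: Vec<_> = (0..3)
        .map(|_| make_request(&inflight, "token-exchange"))
        .collect();
    for rx in receivers {
        assert_eq!(rx.recv().unwrap(), "response-for-token-exchange");
    }
}
```

Because both registration and removal happen under the same lock, a waiter is either attached before the broadcast (and gets the shared response) or arrives after removal and starts a fresh request; either way, at most one request per key is outstanding at a time.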
### Checklist

- [ ] If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.