
auth: de-dupe inflight requests #21801

Merged
merged 3 commits into from
Sep 20, 2023

Conversation

ParkMyCar
Member

@ParkMyCar ParkMyCar commented Sep 18, 2023

This PR updates the Client we use to make Frontegg Auth requests to de-dupe inflight requests. Specifically, when a request is made we check whether a request to that endpoint with the same arguments is already inflight; if so, we do not issue a second request, and instead attach a listener/waiter to the already inflight one.
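A minimal sketch of the de-duplication idea, using hypothetical names and std-only synchronous primitives (the real client is async, uses tokio, and keys requests on endpoint plus arguments):

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified client: one waiter list per request key.
struct DedupClient {
    inflight: Arc<Mutex<HashMap<String, Vec<Sender<String>>>>>,
}

impl DedupClient {
    fn new() -> Self {
        DedupClient {
            inflight: Arc::new(Mutex::new(HashMap::new())),
        }
    }

    /// Issue a request for `key`, or attach a waiter to an already
    /// inflight request for the same key.
    fn request(&self, key: &str) -> String {
        let (tx, rx) = channel();
        let leader = {
            let mut inflight = self.inflight.lock().unwrap();
            match inflight.get_mut(key) {
                // A request for this key is already inflight: just wait.
                Some(waiters) => {
                    waiters.push(tx);
                    false
                }
                // We're first: register a waiter list and do the work.
                None => {
                    inflight.insert(key.to_string(), vec![tx]);
                    true
                }
            }
        };
        if leader {
            // Stand-in for the actual network call.
            let response = format!("response for {key}");
            // De-register the key and fan the result out to every
            // waiter (including ourselves).
            let waiters = self.inflight.lock().unwrap().remove(key).unwrap();
            for waiter in waiters {
                let _ = waiter.send(response.clone());
            }
        }
        rx.recv().unwrap()
    }
}
```

The key point is that the inflight map is checked and updated under one lock, so concurrent callers with the same key can never both become the leader.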

Motivation

Helps improve: https://github.com/MaterializeInc/database-issues/issues/6537

Frontegg documents the API we use for getting auth tokens as accepting 100 requests per second. As we attempt to scale to supporting thousands of concurrent connection requests (per user), we hit Frontegg's request limit.

With this change + #21783 we have the following latencies when opening concurrent connections:

| num requests | p50 | p90 | p99 |
| --- | --- | --- | --- |
| 32 | 34ms | 371ms | 495ms |
| 64 | 25ms | 285ms | 367ms |
| 128 | 31ms | 189ms | 331ms |
| 256 | 66ms | 565ms | 660ms |
| 512 | 4044ms | 4828ms | 4977ms |
| 1024 | 9114ms | 9880ms | 10038ms |
| 2048 | 20031ms | 20784ms | 20931ms |
| 4096 | 21550ms | 22269ms | 22424ms |
| 8192 | 23174ms | 24440ms | 24571ms |

Something is still happening when we reach 512 connections, but this is about a 10x improvement over the current state of the world.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/a

@ParkMyCar ParkMyCar force-pushed the auth/dedupe-requests branch 2 times, most recently from 8adc065 to afffdad on September 19, 2023 20:38
@ParkMyCar ParkMyCar marked this pull request as ready for review September 19, 2023 20:49
@ParkMyCar ParkMyCar requested a review from a team as a code owner September 19, 2023 20:49
@benesch
Member

benesch commented Sep 20, 2023

> Something is still happening when we reach 512 connections, but this is about a 10x improvement over the current state of the world.

Huh! Any theories about what? Weird that the latencies double from 512 to 1024 and again from 1024 to 2048, but then hold relatively steady after that.

@ParkMyCar
Member Author

> Huh! Any theories about what? Weird that the latencies double from 512 to 1024 and again from 1024 to 2048, but then hold relatively steady after that.

Yeah it is quite strange. While looking at traces I did see once we hit 512 the Coordinator starts to thrash between Startup, GroupCommitInitiate, and Execute commands. Adding artificial latency to GroupCommitInitiate helped a bit because we started batching more. Once we got up to >2048 connections we started thrashing less because we had more Startup messages to handle at once, which resulted in us batching more group commits, which resulted in the Execute (SELECT 1 statements) coming in batches. I filed https://github.com/MaterializeInc/database-issues/issues/6552 to do some refactoring of group commits.

There likely is also something else, maybe a periodic process, that is contributing. Using tokio-console or tracing would help us identify it, but I haven't had the chance to dig into it yet.

```diff
 pub enum Error {
     #[error(transparent)]
     InvalidPasswordFormat(#[from] AppPasswordParseError),
     #[error("invalid token format: {0}")]
     InvalidTokenFormat(#[from] jsonwebtoken::errors::Error),
     #[error("authentication token exchange failed: {0}")]
-    ReqwestError(#[from] reqwest::Error),
+    ReqwestError(Arc<reqwest::Error>),
```
Contributor

These are being Arc'd to allow for sending the same error to all inflight requests and avoid a bunch of clones?

Member Author

I wrapped some error types in Arcs because they surprisingly can't be Clone-d. I was thinking of wrapping the entire Error in an Arc, but that changes the external API, which I didn't really want to do.
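The pattern being discussed can be sketched like this, with a hypothetical `TransportError` standing in for `reqwest::Error` (which does not implement `Clone`); wrapping the inner error in an `Arc` makes the enclosing enum cheaply cloneable, so one failure can be fanned out to every waiter:

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical stand-in for a non-Clone error like `reqwest::Error`.
#[derive(Debug)]
struct TransportError(String);

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transport error: {}", self.0)
    }
}

/// Because `Arc<T>: Clone` regardless of `T`, the enum can derive Clone
/// even though the wrapped error itself cannot.
#[derive(Debug, Clone)]
enum Error {
    Reqwest(Arc<TransportError>),
}
```

Cloning the `Error` only bumps the reference count; all waiters share the same underlying error value.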


```rust
/// Makes a request to the provided URL, possibly de-duping by attaching a listener to an
/// already in-flight request.
async fn make_request<Req, Resp>(&self, url: String, req: Req) -> Result<Resp, Error>
```
Contributor

This is only used as `&url`, so could it be a `&'static str`? That would avoid needing to `.to_string()` the URLs, skipping some allocations, and the `'static` lifetime would make it safe to use in the spawned task.

Member Author

The URL is provided to us dynamically as a command line arg in envd so it can't be a &'static str unfortunately 😢

```rust
use anyhow::Context;
use mz_ore::collections::HashMap;
use tokio::sync::oneshot;
```

Contributor

Suggested change

Member Author

Done!

```rust
// Tell all of our waiters about the result.
let response = result.map(|r| r.into_response());
for tx in waiters {
    let _ = tx.send(response.clone());
}
```
Contributor

Thoughts on Arc'ing the response to avoid so many clones?

Member Author

Yeah, I thought about wrapping the response and error in Arcs, but that would require us to return `Arc<Response>` to callers, which I didn't think was worth it because the callers were all using owned versions of the structs, so we'd need to clone them anyway. Happy to change this if you want me to.

@ParkMyCar ParkMyCar merged commit 02bf079 into MaterializeInc:main Sep 20, 2023
ParkMyCar added a commit that referenced this pull request Oct 5, 2023
This PR updates our Frontegg Auth client to take ownership of refreshing
tokens, and spawning a single task to manage the refresh for a group of
de-duplicated requests.

In #21801 we started de-duplicating authentication requests. This had
the unintended consequence of breaking refresh tokens in some cases. A
refresh token is returned from Frontegg upon successful auth and used to
extend the expiration of a JWT. The refresh tokens are single use
though. When we started de-duplicating requests we ended up with
multiple sessions having the same refresh token, and all of them trying
to use it. What ended up happening is a single session was able to
refresh and all of the others were disconnected.

This PR updates the logic in our Frontegg client to continue using the
same de-duplication strategy (i.e. an `Arc<Mutex<HashMap<...>>>`) but
now we spawn a task to automatically refresh tokens and give the caller
a channel which we push updates to.
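The shape of that design can be sketched with std-only primitives (hypothetical names; the real implementation is a tokio task and real Frontegg token exchanges): a single background task owns the single-use refresh token and fans fresh JWTs out over a channel, so no two sessions ever race to consume the same refresh token.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: one task owns the refresh token and pushes
/// each new JWT to the channel handed back to the sessions.
fn spawn_refresh_task(mut refresh_token: u64, rounds: u32) -> Receiver<String> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        for _ in 0..rounds {
            // Pretend exchange: each refresh consumes the old single-use
            // token and yields a new one plus a fresh JWT.
            refresh_token += 1;
            let jwt = format!("jwt-{refresh_token}");
            if tx.send(jwt).is_err() {
                break; // all sessions hung up; stop refreshing
            }
            thread::sleep(Duration::from_millis(1));
        }
    });
    rx
}
```

Because the refresh token never leaves the task, the "multiple sessions holding the same single-use token" failure mode described above cannot occur.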

It also adds a few metrics around Frontegg Auth.

I tested this PR manually; adding tests is still a TODO, but I wanted to get this PR up for early feedback. Also note, my first attempt at solving this made everything task based and got rid of the `Arc<Mutex<HashMap<...>>>`. I think that is a cleaner approach, but I wasn't able to get it working with our existing integration tests, which are half synchronous and half asynchronous.

### Motivation

Fixes: https://github.com/MaterializeInc/materialize/issues/22096

### Tips for reviewer

This PR is split into four commits, which might make it easier to
review:

1. Refactor, moving `ExternalUserMetadata` into `mz_repr`.
2. Updates to the `mz_frontegg_auth` crate itself, this is the **most
important commit**!
3. Update tests and callers of Authentication types to use the new API
4. Add tests to exercise duplicate sessions being refreshed together

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [x] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
- Fixes a bug where sessions using the same app password that start
immediately after one another could get invalidated early.
@ParkMyCar ParkMyCar deleted the auth/dedupe-requests branch October 16, 2023 18:47