
auth: de-dupe inflight requests #21801

Merged
merged 3 commits into from
Sep 20, 2023

Conversation

ParkMyCar
Member

@ParkMyCar ParkMyCar commented Sep 18, 2023

This PR updates the Client we use to make Frontegg Auth requests to de-dupe inflight requests. Specifically, when a request is made we check whether a request to that endpoint with the same arguments is already inflight; if so, we do not issue a second request, and instead attach a listener/waiter to the already inflight one.
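A minimal sketch of the de-duplication idea, using hypothetical names and std-only synchronous primitives (the real client is async, uses tokio, and keys requests on endpoint plus arguments):

```rust
use std::collections::HashMap;
use std::sync::mpsc::{channel, Sender};
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified client: one waiter list per request key.
struct DedupClient {
    inflight: Arc<Mutex<HashMap<String, Vec<Sender<String>>>>>,
}

impl DedupClient {
    fn new() -> Self {
        DedupClient {
            inflight: Arc::new(Mutex::new(HashMap::new())),
        }
    }

    /// Issue a request for `key`, or attach a waiter to an already
    /// inflight request for the same key.
    fn request(&self, key: &str) -> String {
        let (tx, rx) = channel();
        let leader = {
            let mut inflight = self.inflight.lock().unwrap();
            match inflight.get_mut(key) {
                // A request for this key is already inflight: just wait.
                Some(waiters) => {
                    waiters.push(tx);
                    false
                }
                // We're first: register a waiter list and do the work.
                None => {
                    inflight.insert(key.to_string(), vec![tx]);
                    true
                }
            }
        };
        if leader {
            // Stand-in for the actual network call.
            let response = format!("response for {key}");
            // De-register the key and fan the result out to every
            // waiter (including ourselves).
            let waiters = self.inflight.lock().unwrap().remove(key).unwrap();
            for waiter in waiters {
                let _ = waiter.send(response.clone());
            }
        }
        rx.recv().unwrap()
    }
}
```

The key point is that the inflight map is checked and updated under one lock, so concurrent callers with the same key can never both become the leader.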

Motivation

Helps improve: https://github.com/MaterializeInc/database-issues/issues/6537

Frontegg documents the API we use for getting auth tokens as accepting 100 requests per second. As we attempt to scale to supporting thousands of concurrent connection requests (per user), we hit Frontegg's request limit.

With this change + #21783 we have the following latencies when opening concurrent connections:

| num requests | p50 | p90 | p99 |
| --- | --- | --- | --- |
| 32 | 34ms | 371ms | 495ms |
| 64 | 25ms | 285ms | 367ms |
| 128 | 31ms | 189ms | 331ms |
| 256 | 66ms | 565ms | 660ms |
| 512 | 4044ms | 4828ms | 4977ms |
| 1024 | 9114ms | 9880ms | 10038ms |
| 2048 | 20031ms | 20784ms | 20931ms |
| 4096 | 21550ms | 22269ms | 22424ms |
| 8192 | 23174ms | 24440ms | 24571ms |

Something is still happening when we reach 512 connections, but this is about a 10x improvement over the current state of the world.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • N/a

@ParkMyCar ParkMyCar force-pushed the auth/dedupe-requests branch 2 times, most recently from 8adc065 to afffdad on September 19, 2023 20:38
@ParkMyCar ParkMyCar marked this pull request as ready for review September 19, 2023 20:49
@ParkMyCar ParkMyCar requested a review from a team as a code owner September 19, 2023 20:49
@benesch
Member

benesch commented Sep 20, 2023

> Something is still happening when we reach 512 connections, but this is about a 10x improvement over the current state of the world.

Huh! Any theories about what? Weird that the latencies double from 512 to 1024 and again from 1024 to 2048, but then hold relatively steady after that.

@ParkMyCar
Member Author

> Huh! Any theories about what? Weird that the latencies double from 512 to 1024 and again from 1024 to 2048, but then hold relatively steady after that.

Yeah it is quite strange. While looking at traces I did see once we hit 512 the Coordinator starts to thrash between Startup, GroupCommitInitiate, and Execute commands. Adding artificial latency to GroupCommitInitiate helped a bit because we started batching more. Once we got up to >2048 connections we started thrashing less because we had more Startup messages to handle at once, which resulted in us batching more group commits, which resulted in the Execute (SELECT 1 statements) coming in batches. I filed https://github.com/MaterializeInc/database-issues/issues/6552 to do some refactoring of group commits.

There likely is also something else, maybe a periodic process, that is contributing. Using tokio-console or tracing would help us identify it, but I haven't had the chance to dig into it yet.

```diff
 pub enum Error {
     #[error(transparent)]
     InvalidPasswordFormat(#[from] AppPasswordParseError),
     #[error("invalid token format: {0}")]
     InvalidTokenFormat(#[from] jsonwebtoken::errors::Error),
     #[error("authentication token exchange failed: {0}")]
-    ReqwestError(#[from] reqwest::Error),
+    ReqwestError(Arc<reqwest::Error>),
```
Contributor

These are being Arc'd to allow for sending the same error to all inflight requests and avoid a bunch of clones?

Member Author

I wrapped some error types in Arcs because they surprisingly can't be Clone-d. I was thinking of wrapping the entire Error in an Arc, but that changes the external API, which I didn't really want to do.
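The pattern being discussed can be sketched like this, with a hypothetical `TransportError` standing in for `reqwest::Error` (which does not implement `Clone`); wrapping the inner error in an `Arc` makes the enclosing enum cheaply cloneable, so one failure can be fanned out to every waiter:

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical stand-in for a non-Clone error like `reqwest::Error`.
#[derive(Debug)]
struct TransportError(String);

impl fmt::Display for TransportError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "transport error: {}", self.0)
    }
}

/// Because `Arc<T>: Clone` regardless of `T`, the enum can derive Clone
/// even though the wrapped error itself cannot.
#[derive(Debug, Clone)]
enum Error {
    Reqwest(Arc<TransportError>),
}
```

Cloning the `Error` only bumps the reference count; all waiters share the same underlying error value.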


```rust
/// Makes a request to the provided URL, possibly de-duping by attaching a listener to an
/// already in-flight request.
async fn make_request<Req, Resp>(&self, url: String, req: Req) -> Result<Resp, Error>
```
Contributor

This is only used as `&url`, so could it be a `&'static str`? That would avoid needing to `.to_string()` the URLs, skipping some allocations, and the `'static` lifetime would make it safe to use in the spawned task.

Member Author

The URL is provided to us dynamically as a command line arg in envd so it can't be a &'static str unfortunately 😢

```rust
use anyhow::Context;
use mz_ore::collections::HashMap;
use tokio::sync::oneshot;
```

Contributor

Suggested change

Member Author

Done!

```rust
// Tell all of our waiters about the result.
let response = result.map(|r| r.into_response());
for tx in waiters {
    let _ = tx.send(response.clone());
}
```
Contributor

Thoughts on Arc'ing the response to avoid so many clones?

Member Author

Yeah, I thought about wrapping the response and error in Arcs, but that would require us to return `Arc<Response>` to callers, which I didn't think was worth it because the callers were all using owned versions of the structs, so we'd need to clone them anyway. Happy to change this if you want me to.

@ParkMyCar ParkMyCar merged commit 02bf079 into MaterializeInc:main Sep 20, 2023
ParkMyCar added a commit that referenced this pull request Oct 5, 2023
This PR updates our Frontegg Auth client to take ownership of refreshing
tokens, and spawning a single task to manage the refresh for a group of
de-duplicated requests.

In #21801 we started de-duplicating authentication requests. This had
the unintended consequence of breaking refresh tokens in some cases. A
refresh token is returned from Frontegg upon successful auth and used to
extend the expiration of a JWT. The refresh tokens are single use
though. When we started de-duplicating requests we ended up with
multiple sessions having the same refresh token, and all of them trying
to use it. What ended up happening is a single session was able to
refresh and all of the others were disconnected.

This PR updates the logic in our Frontegg client to continue using the
same de-duplication strategy (i.e. an `Arc<Mutex<HashMap<...>>>`) but
now we spawn a task to automatically refresh tokens and give the caller
a channel which we push updates to.
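The shape of that design can be sketched with std-only primitives (hypothetical names; the real implementation is a tokio task and real Frontegg token exchanges): a single background task owns the single-use refresh token and fans fresh JWTs out over a channel, so no two sessions ever race to consume the same refresh token.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;
use std::time::Duration;

/// Hypothetical sketch: one task owns the refresh token and pushes
/// each new JWT to the channel handed back to the sessions.
fn spawn_refresh_task(mut refresh_token: u64, rounds: u32) -> Receiver<String> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        for _ in 0..rounds {
            // Pretend exchange: each refresh consumes the old single-use
            // token and yields a new one plus a fresh JWT.
            refresh_token += 1;
            let jwt = format!("jwt-{refresh_token}");
            if tx.send(jwt).is_err() {
                break; // all sessions hung up; stop refreshing
            }
            thread::sleep(Duration::from_millis(1));
        }
    });
    rx
}
```

Because the refresh token never leaves the task, the "multiple sessions holding the same single-use token" failure mode described above cannot occur.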

It also adds a few metrics around Frontegg Auth.

I tested this PR manually; adding tests is still a TODO, but I wanted to get this PR up for early feedback. Also note, my first attempt at solving this made everything task based and got rid of the `Arc<Mutex<HashMap<...>>>`. I think that is a cleaner approach, but I wasn't able to get it working with our existing integration tests, which are half synchronous and half asynchronous.

### Motivation

Fixes: https://github.com/MaterializeInc/materialize/issues/22096

### Tips for reviewer

This PR is split into four commits, which might make it easier to
review:

1. Refactor, moving `ExternalUserMetadata` into `mz_repr`.
2. Updates to the `mz_frontegg_auth` crate itself, this is the **most
important commit**!
3. Update tests and callers of Authentication types to use the new API
4. Add tests to exercise duplicate sessions being refreshed together

### Checklist

- [ ] This PR has adequate test coverage / QA involvement has been duly
considered.
- [ ] This PR has an associated up-to-date [design
doc](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/README.md),
is a design doc
([template](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/design/00000000_template.md)),
or is sufficiently small to not require a design.
  <!-- Reference the design in the description. -->
- [ ] If this PR evolves [an existing `$T ⇔ Proto$T`
mapping](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/command-and-response-binary-encoding.md)
(possibly in a backwards-incompatible way), then it is tagged with a
`T-proto` label.
- [ ] If this PR will require changes to cloud orchestration or tests,
there is a companion cloud PR to account for those changes that is
tagged with the release-blocker label
([example](MaterializeInc/cloud#5021)).
<!-- Ask in #team-cloud on Slack if you need help preparing the cloud
PR. -->
- [x] This PR includes the following [user-facing behavior
changes](https://github.com/MaterializeInc/materialize/blob/main/doc/developer/guide-changes.md#what-changes-require-a-release-note):
- Fixes a bug where sessions using the same app password that start
immediately after one another could get invalidated early.
@ParkMyCar ParkMyCar deleted the auth/dedupe-requests branch October 16, 2023 18:47