-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker] Fix the reason label of authentication metrics #20030
[fix][broker] Fix the reason label of authentication metrics #20030
Conversation
|
||
try { | ||
if (users.get(userId) == null) { | ||
throw new AuthenticationException(msg); | ||
throw new PulsarAuthenticationException(msg, errorCode); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO, use ErrorCode.BASIC_INVALID_AUTH_DATA
directly will be more clear. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I'll fix this.
@michaeljmarshall would you mind to taking a look here? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change looks good me.
And I have added the release labels. Since it's a metrics changes that might impact the Pulsar operators, maybe not, I don't know someone will add alert rules or something based on the long reason message. But it still worth to start a discussion for cherry-picking the changes to release branches.
And please also update the documentation after the RP get merged. |
/pulsarbot run-failure-checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on this @gaoran10, I just noticed this problem last week. It's definitely one we need to fix.
My primary concern is with the PulsarAuthenticationException
class because it isn't really extensible for custom implementations of the AuthenticationProvider
interface.
I also wonder if it would be better to add a new class that makes it so we can avoid each implementation from needing to increment the metrics on its own. As long as a provider throws a PulsarAuthenticationException
, we could increment a metric in the broker/proxy/function worker/websocket proxy.
Let me know what you think. Thanks!
@@ -157,8 +158,7 @@ public String authenticate(AuthenticationDataSource authData) throws Authenticat | |||
AuthenticationMetrics.authenticateSuccess(getClass().getSimpleName(), getAuthMethodName()); | |||
return role; | |||
} catch (AuthenticationException exception) { | |||
AuthenticationMetrics.authenticateFailure(getClass().getSimpleName(), getAuthMethodName(), | |||
exception.getMessage()); | |||
AuthenticationMetrics.authenticateFailure(getClass().getSimpleName(), getAuthMethodName(), exception); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be valuable to increment the metric outside of the implementations of the AuthenticationProvider
so that all custom implementations can integrate with this feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I think we can calculate the metric in the AuthenticationService
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems it's hard to find a unified entrance to increment the authentication metric. Maybe we can try to improve this in another PR. WDYT
public enum ErrorCode { | ||
UNKNOWN, | ||
PROVIDER_LIST_AUTH_REQUIRED, | ||
BASIC_INVALID_TOKEN, | ||
BASIC_INVALID_AUTH_DATA, | ||
AUTHZ_NO_CLIENT, | ||
AUTHZ_NO_TOKEN, | ||
AUTHZ_NO_PUBLIC_KEY, | ||
AUTHZ_DOMAIN_MISMATCH, | ||
AUTHZ_INVALID_TOKEN, | ||
TLS_NO_CERTS, | ||
TLS_NO_CN, // cn: common name | ||
TOKEN_INVALID_HEADER, | ||
TOKEN_NO_AUTH_DATA, | ||
TOKEN_EMPTY_TOKEN, | ||
TOKEN_INVALID_TOKEN, | ||
TOKEN_INVALID_AUDIENCES, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My primary concern with relying on this enum is that it is not easily extended. For example, I just merged the OIDC Provider with the following enum:
Lines 25 to 38 in bafecb2
public enum AuthenticationExceptionCode { | |
UNSUPPORTED_ISSUER, | |
UNSUPPORTED_ALGORITHM, | |
ISSUER_MISMATCH, | |
ALGORITHM_MISMATCH, | |
INVALID_PUBLIC_KEY, | |
ERROR_RETRIEVING_PROVIDER_METADATA, | |
ERROR_RETRIEVING_PUBLIC_KEY, | |
ERROR_DECODING_JWT, | |
ERROR_VERIFYING_JWT, | |
ERROR_VERIFYING_JWT_SIGNATURE, | |
INVALID_JWT_CLAIM, | |
EXPIRED_JWT, | |
} |
From your list and from mine, my primary conclusion is that the types of errors are often provider specific, so a static enum that cannot be extended feels restrictive.
I agree that it is extremely important to limit the cardinality of these metrics, but I also think we want to make it easily extensible by users.
I think an alternative solution could work by making PulsarAuthenticationException
abstract and letting each provider implement their own exception. Then, the provider itself could have its own enum, and when the framework calls getErrorCode
, the implementation will convert its own enum into a string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea! I'll change this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add an abstract class BaseAuthenticationException
, which extends AuthenticationException
and has an abstract method public abstract Enum<?> getErrorCode()
, I think developers can custom their exceptions based on the BaseAuthenticationException
, the custom exception can provide the custom error code. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My primary concern is that ErrorCode
has many enums that are specific to each provider and are not generally useful. What do you think about the following class?
public class PulsarAuthenticationException extends AuthenticationException {
private final String errorCode;
// Warning, the errorLabel must not have high cardinality
public PulsarAuthenticationException(String message, String errorLabel) {
super(message);
this.errorLabel = errorLabel;
}
public String getErrorLabel() {
return errorLabel;
}
}
I think it is reasonable to expect maintainers to know that the errorLabel
must not be high cardinality, and this way, we don't have enumerations that combine unrelated labels. If a provider wants to use an enum for its own tracking, that would be perfectly reasonable, and it would probably work just like the AuthenticationProviderOpenID
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the PulsarAuthenticationException
, and only added some error codes for every authentication provider, like AuthenticationProviderOpenID
. PTAL
public static void authenticateFailure(String providerName, String authMethod, | ||
AuthenticationException exception) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because this is a generic metrics class meant to be useful to all authentication provider implementations, I think we should provide an easy way to provide custom "reasons". I respect that it is easy to provide a reason that is high cardinality, but I think we should expect each implementation to handle those details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I add an abstract class BaseAuthenticationException
. PTAL
@michaeljmarshall PTAL |
/pulsarbot run-failure-checks |
e434b54
to
6612405
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes @gaoran10. I left a couple comments.
@@ -89,7 +89,7 @@ public class AuthenticationProviderOpenID implements AuthenticationProvider { | |||
private static final String SIMPLE_NAME = AuthenticationProviderOpenID.class.getSimpleName(); | |||
|
|||
// Must match the value used by the OAuth2 Client Plugin. | |||
private static final String AUTH_METHOD_NAME = "token"; | |||
private static final String AUTH_METHOD_NAME = "openid"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs to be token
in order to support the AuthenticationProviderList
. Note that the oauth2 client plugin uses token
as well. Also, it is valid to use the token authentication data provider in cases where the oauth2 token is fetched for you, like the k8s service account token.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I revert this change.
try { | ||
// Get Token | ||
String token; | ||
token = getToken(authData); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to increment the INVALID_AUTH_DATA
error in the getToken
method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The method getToken
is static, some methods use it and catch exceptions to record failure metrics, such as OpenID provider, if we increment the INVALID_AUTH_DATA
error in the getToken
method, this will generate repeat failure records.
OpenID provider
CompletableFuture<DecodedJWT> authenticateTokenAsync(AuthenticationDataSource authData) {
String token;
try {
token = AuthenticationProviderToken.getToken(authData);
} catch (AuthenticationException e) {
incrementFailureMetric(AuthenticationExceptionCode.ERROR_DECODING_JWT);
return CompletableFuture.failedFuture(e);
}
return authenticateToken(token)
.whenComplete((jwt, e) -> {
if (jwt != null) {
AuthenticationMetrics.authenticateSuccess(getClass().getSimpleName(), getAuthMethodName());
}
// Failure metrics are incremented within methods above
});
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank for your explanation.
@@ -447,7 +447,7 @@ DecodedJWT verifyJWT(PublicKey publicKey, | |||
} | |||
|
|||
static void incrementFailureMetric(AuthenticationExceptionCode code) { | |||
AuthenticationMetrics.authenticateFailure(SIMPLE_NAME, "token", code.toString()); | |||
AuthenticationMetrics.authenticateFailure(SIMPLE_NAME, "openid", code); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@michaeljmarshall I changed the method name label to openid
, what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is necessary because the SIMPLE_NAME
is the class's name, which has openid
in it. I wonder if we really even need the auth method if we also have the class name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I changed to use the AUTH_METHOD_NAME
(token). Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing my feedback @gaoran10. Would you mind getting the build to pass? Then I'll re-review it. Thanks!
/pulsarbot rerun-failure-checks |
@michaeljmarshall when changes are requested, the build in upstream won't run at all. I'll dismiss the "changes requested" to get the tests running. |
/pulsarbot rerun-failure-checks |
I should have just added the "ready-to-test" label to the PR instead of dismissing the review. I forgot that it's the option that I should have used. |
/pulsarbot rerun-failure-checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Codecov Report
@@ Coverage Diff @@
## master #20030 +/- ##
=============================================
+ Coverage 33.17% 72.92% +39.74%
- Complexity 12236 31967 +19731
=============================================
Files 1499 1868 +369
Lines 114413 138590 +24177
Branches 12431 15246 +2815
=============================================
+ Hits 37962 101071 +63109
+ Misses 71499 29461 -42038
- Partials 4952 8058 +3106
Flags with carried forward coverage won't be shown. Click here to find out more.
|
(cherry picked from commit 2b515ff)
(cherry picked from commit 2b515ff)
(cherry picked from commit 2b515ff)
(cherry picked from commit 2b515ff)
As discussed on the mailing list https://lists.apache.org/thread/w4jzk27qhtosgsz7l9bmhf1t7o9mxjhp, there is no plan to release 2.9.6, so I am going to remove the release/2.9.6 label |
Motivation
The
reason
label of authentication metrics is a potential infinity string.The PR #19758 also tries to fix this problem.
Use an enum ErrorCode as the authentication failure reason.
Modifications
Add an enum class
ErrorCode
for Pulsar authentication, it indicates the authentication failure reason.Add a new exception
PulsarAuthenticationException
which contains the error code.Verifying this change
Change the test to match this change.
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete
Matching PR in forked repository
PR in forked repository: gaoran10#26