Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix the reason label of authentication metrics #20030

Merged
merged 6 commits into from
May 5, 2023

Conversation

gaoran10
Copy link
Contributor

@gaoran10 gaoran10 commented Apr 6, 2023

Motivation

The reason label of authentication metrics is a potential infinity string.
The PR #19758 also tries to fix this problem.

Use an enum ErrorCode as the authentication failure reason.

Modifications

Add an enum class ErrorCode for Pulsar authentication, it indicates the authentication failure reason.
Add a new exception PulsarAuthenticationException which contains the error code.

Verifying this change

Change the test to match this change.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: gaoran10#26

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Apr 6, 2023
@gaoran10 gaoran10 self-assigned this Apr 6, 2023

try {
if (users.get(userId) == null) {
throw new AuthenticationException(msg);
throw new PulsarAuthenticationException(msg, errorCode);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, use ErrorCode.BASIC_INVALID_AUTH_DATA directly will be more clear. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll fix this.

@mattisonchao
Copy link
Member

@michaeljmarshall would you mind to taking a look here?

@codelipenghui codelipenghui added the release/important-notice The changes which are important should be mentioned in the release note label Apr 10, 2023
@codelipenghui codelipenghui added this to the 3.0.0 milestone Apr 10, 2023
@codelipenghui codelipenghui added release/2.11.2 release/2.10.5 release/2.9.6 doc-required Your PR changes impact docs and you will update later. labels Apr 10, 2023
@github-actions github-actions bot removed the doc-required Your PR changes impact docs and you will update later. label Apr 10, 2023
Copy link
Contributor

@codelipenghui codelipenghui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change looks good me.
And I have added the release labels. Since it's a metrics changes that might impact the Pulsar operators, maybe not, I don't know someone will add alert rules or something based on the long reason message. But it still worth to start a discussion for cherry-picking the changes to release branches.

@codelipenghui
Copy link
Contributor

And please also update the documentation after the RP get merged.

@codelipenghui
Copy link
Contributor

/pulsarbot run-failure-checks

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this @gaoran10, I just noticed this problem last week. It's definitely one we need to fix.

My primary concern is with the PulsarAuthenticationException class because it isn't really extensible for custom implementations of the AuthenticationProvider interface.

I also wonder if it would be better to add a new class that makes it so we can avoid each implementation from needing to increment the metrics on its own. As long as a provider throws a PulsarAuthenticationException, we could increment a metric in the broker/proxy/function worker/websocket proxy.

Let me know what you think. Thanks!

@@ -157,8 +158,7 @@ public String authenticate(AuthenticationDataSource authData) throws Authenticat
AuthenticationMetrics.authenticateSuccess(getClass().getSimpleName(), getAuthMethodName());
return role;
} catch (AuthenticationException exception) {
AuthenticationMetrics.authenticateFailure(getClass().getSimpleName(), getAuthMethodName(),
exception.getMessage());
AuthenticationMetrics.authenticateFailure(getClass().getSimpleName(), getAuthMethodName(), exception);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be valuable to increment the metric outside of the implementations of the AuthenticationProvider so that all custom implementations can integrate with this feature.

Copy link
Contributor Author

@gaoran10 gaoran10 Apr 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I think we can calculate the metric in the AuthenticationService.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it's hard to find a unified entrance to increment the authentication metric. Maybe we can try to improve this in another PR. WDYT

Comment on lines 29 to 46
public enum ErrorCode {
UNKNOWN,
PROVIDER_LIST_AUTH_REQUIRED,
BASIC_INVALID_TOKEN,
BASIC_INVALID_AUTH_DATA,
AUTHZ_NO_CLIENT,
AUTHZ_NO_TOKEN,
AUTHZ_NO_PUBLIC_KEY,
AUTHZ_DOMAIN_MISMATCH,
AUTHZ_INVALID_TOKEN,
TLS_NO_CERTS,
TLS_NO_CN, // cn: common name
TOKEN_INVALID_HEADER,
TOKEN_NO_AUTH_DATA,
TOKEN_EMPTY_TOKEN,
TOKEN_INVALID_TOKEN,
TOKEN_INVALID_AUDIENCES,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My primary concern with relying on this enum is that it is not easily extended. For example, I just merged the OIDC Provider with the following enum:

public enum AuthenticationExceptionCode {
UNSUPPORTED_ISSUER,
UNSUPPORTED_ALGORITHM,
ISSUER_MISMATCH,
ALGORITHM_MISMATCH,
INVALID_PUBLIC_KEY,
ERROR_RETRIEVING_PROVIDER_METADATA,
ERROR_RETRIEVING_PUBLIC_KEY,
ERROR_DECODING_JWT,
ERROR_VERIFYING_JWT,
ERROR_VERIFYING_JWT_SIGNATURE,
INVALID_JWT_CLAIM,
EXPIRED_JWT,
}

From your list and from mine, my primary conclusion is that the types of errors are often provider specific, so a static enum that cannot be extended feels restrictive.

I agree that it is extremely important to limit the cardinality of these metrics, but I also think we want to make it easily extensible by users.

I think an alternative solution could work by making PulsarAuthenticationException abstract and letting each provider implement their own exception. Then, the provider itself could have its own enum, and when the framework calls getErrorCode, the implementation will convert its own enum into a string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea! I'll change this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add an abstract class BaseAuthenticationException, which extends AuthenticationException and has an abstract method public abstract Enum<?> getErrorCode(), I think developers can custom their exceptions based on the BaseAuthenticationException, the custom exception can provide the custom error code. PTAL

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My primary concern is that ErrorCode has many enums that are specific to each provider and are not generally useful. What do you think about the following class?

    public class PulsarAuthenticationException extends AuthenticationException {

        private final String errorCode;

        // Warning, the errorLabel must not have high cardinality
        public PulsarAuthenticationException(String message, String errorLabel) {
            super(message);
            this.errorLabel = errorLabel;
        }

        public String getErrorLabel() {
            return errorLabel;
        }

    }

I think it is reasonable to expect maintainers to know that the errorLabel must not be high cardinality, and this way, we don't have enumerations that combine unrelated labels. If a provider wants to use an enum for its own tracking, that would be perfectly reasonable, and it would probably work just like the AuthenticationProviderOpenID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the PulsarAuthenticationException, and only added some error codes for every authentication provider, like AuthenticationProviderOpenID . PTAL

Comment on lines 63 to 64
public static void authenticateFailure(String providerName, String authMethod,
AuthenticationException exception) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because this is a generic metrics class meant to be useful to all authentication provider implementations, I think we should provide an easy way to provide custom "reasons". I respect that it is easy to provide a reason that is high cardinality, but I think we should expect each implementation to handle those details.

Copy link
Contributor Author

@gaoran10 gaoran10 Apr 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I add an abstract class BaseAuthenticationException. PTAL

@cbornet cbornet changed the title [fix][metrics] Fix the reason label of authentication metrics [fix][broker] Fix the reason label of authentication metrics Apr 11, 2023
@cbornet cbornet modified the milestones: 3.0.0, 3.1.0 Apr 11, 2023
@gaoran10
Copy link
Contributor Author

@michaeljmarshall PTAL

@gaoran10
Copy link
Contributor Author

/pulsarbot run-failure-checks

@gaoran10 gaoran10 force-pushed the fix-authentication-reason-label branch from e434b54 to 6612405 Compare April 24, 2023 02:47
Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes @gaoran10. I left a couple comments.

@@ -89,7 +89,7 @@ public class AuthenticationProviderOpenID implements AuthenticationProvider {
private static final String SIMPLE_NAME = AuthenticationProviderOpenID.class.getSimpleName();

// Must match the value used by the OAuth2 Client Plugin.
private static final String AUTH_METHOD_NAME = "token";
private static final String AUTH_METHOD_NAME = "openid";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be token in order to support the AuthenticationProviderList. Note that the oauth2 client plugin uses token as well. Also, it is valid to use the token authentication data provider in cases where the oauth2 token is fetched for you, like the k8s service account token.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revert this change.

try {
// Get Token
String token;
token = getToken(authData);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to increment the INVALID_AUTH_DATA error in the getToken method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method getToken is static, some methods use it and catch exceptions to record failure metrics, such as OpenID provider, if we increment the INVALID_AUTH_DATA error in the getToken method, this will generate repeat failure records.

OpenID provider

    CompletableFuture<DecodedJWT> authenticateTokenAsync(AuthenticationDataSource authData) {
        String token;
        try {
            token = AuthenticationProviderToken.getToken(authData);
        } catch (AuthenticationException e) {
            incrementFailureMetric(AuthenticationExceptionCode.ERROR_DECODING_JWT);
            return CompletableFuture.failedFuture(e);
        }
        return authenticateToken(token)
                .whenComplete((jwt, e) -> {
                    if (jwt != null) {
                        AuthenticationMetrics.authenticateSuccess(getClass().getSimpleName(), getAuthMethodName());
                    }
                    // Failure metrics are incremented within methods above
                });
    }

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank for your explanation.

@@ -447,7 +447,7 @@ DecodedJWT verifyJWT(PublicKey publicKey,
}

static void incrementFailureMetric(AuthenticationExceptionCode code) {
AuthenticationMetrics.authenticateFailure(SIMPLE_NAME, "token", code.toString());
AuthenticationMetrics.authenticateFailure(SIMPLE_NAME, "openid", code);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaeljmarshall I changed the method name label to openid, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary because the SIMPLE_NAME is the class's name, which has openid in it. I wonder if we really even need the auth method if we also have the class name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I changed to use the AUTH_METHOD_NAME(token). Thanks.

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my feedback @gaoran10. Would you mind getting the build to pass? Then I'll re-review it. Thanks!

@gaoran10 gaoran10 closed this May 4, 2023
@gaoran10 gaoran10 reopened this May 4, 2023
@gaoran10
Copy link
Contributor Author

gaoran10 commented May 4, 2023

/pulsarbot rerun-failure-checks

@lhotari
Copy link
Member

lhotari commented May 4, 2023

Thanks for addressing my feedback @gaoran10. Would you mind getting the build to pass? Then I'll re-review it. Thanks!

@michaeljmarshall when changes are requested, the build in upstream won't run at all. I'll dismiss the "changes requested" to get the tests running.

@lhotari lhotari dismissed michaeljmarshall’s stale review May 4, 2023 07:48

to run CI pipeline

@lhotari
Copy link
Member

lhotari commented May 4, 2023

/pulsarbot rerun-failure-checks

@lhotari
Copy link
Member

lhotari commented May 4, 2023

Thanks for addressing my feedback @gaoran10. Would you mind getting the build to pass? Then I'll re-review it. Thanks!

@michaeljmarshall when changes are requested, the build in upstream won't run at all. I'll dismiss the "changes requested" to get the tests running.

I should have just added the "ready-to-test" label to the PR instead of dismissing the review. I forgot that it's the option that I should have used.

@michaeljmarshall
Copy link
Member

/pulsarbot rerun-failure-checks

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov-commenter
Copy link

Codecov Report

Merging #20030 (613f2c6) into master (9b72302) will increase coverage by 39.74%.
The diff coverage is 67.60%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #20030       +/-   ##
=============================================
+ Coverage     33.17%   72.92%   +39.74%     
- Complexity    12236    31967    +19731     
=============================================
  Files          1499     1868      +369     
  Lines        114413   138590    +24177     
  Branches      12431    15246     +2815     
=============================================
+ Hits          37962   101071    +63109     
+ Misses        71499    29461    -42038     
- Partials       4952     8058     +3106     
Flag Coverage Δ
inttests 24.17% <0.00%> (?)
systests 24.77% <21.81%> (?)
unittests 72.21% <67.60%> (+39.03%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...e/pulsar/broker/authentication/oidc/JwksCache.java 56.62% <0.00%> (ø)
...thentication/oidc/OpenIDProviderMetadataCache.java 78.48% <0.00%> (ø)
...sar/broker/authorization/AuthorizationService.java 57.20% <0.00%> (+47.20%) ⬆️
...er/authentication/AuthenticationProviderBasic.java 59.09% <50.00%> (+59.09%) ⬆️
...r/authentication/AuthenticationProviderAthenz.java 68.57% <68.75%> (ø)
...ker/authentication/AuthenticationProviderList.java 48.66% <71.42%> (+48.66%) ⬆️
...er/authentication/AuthenticationProviderToken.java 92.35% <80.00%> (+84.89%) ⬆️
...oker/authentication/AuthenticationProviderTls.java 83.33% <87.50%> (+29.16%) ⬆️
...hentication/oidc/AuthenticationProviderOpenID.java 75.86% <100.00%> (ø)
.../broker/authentication/AuthenticationProvider.java 72.72% <100.00%> (+32.72%) ⬆️
... and 1 more

... and 1535 files with indirect coverage changes

@gaoran10 gaoran10 merged commit 2b515ff into apache:master May 5, 2023
poorbarcode pushed a commit that referenced this pull request May 7, 2023
gaoran10 added a commit that referenced this pull request May 8, 2023
gaoran10 added a commit that referenced this pull request May 9, 2023
gaoran10 added a commit that referenced this pull request May 10, 2023
@gaoran10 gaoran10 added the cherry-picked/branch-2.9 Archived: 2.9 is end of life label May 11, 2023
nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request May 11, 2023
@michaeljmarshall
Copy link
Member

As discussed on the mailing list https://lists.apache.org/thread/w4jzk27qhtosgsz7l9bmhf1t7o9mxjhp, there is no plan to release 2.9.6, so I am going to remove the release/2.9.6 label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants