Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Capture CassandraLWT error and log/bump metrics for it. #4888

Merged
merged 7 commits into from
Jul 18, 2022

Conversation

ZackLK
Copy link
Contributor

@ZackLK ZackLK commented Jun 29, 2022

What changed?
Capturing a specific error from Cassandra by error code and substring, then publishing metrics & logging when we find it.

Why?

How did you test it?
Hasn't been tested yet - looking for feedback on how to unit test this that would provide value, or we can test in staging.

Potential risks

Release notes

Documentation Changes

@ZackLK ZackLK requested review from emrahs and a team June 29, 2022 23:41
@@ -76,6 +76,7 @@ type (
IsTimeoutError(error) bool
IsNotFoundError(error) bool
IsThrottlingError(error) bool
IsCassandraLWTError(error) bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cassandra-specific func in the nosql interface doesn't look good. Can you give this a more generic name so that it looks less ugly for non-Cassandra nosql plugins? Perhaps something like IsDBUnavailableError.

@@ -266,6 +266,12 @@ type (
Msg string
}

// CassandraLWTError is returned when we get UNAVAILABLE error code from Cassandra with message containing
// "Cannot perform LWT operation"
CassandraLWTError struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should keep this file DB-agnostic, so I'd prefer to use more generic names like DBUnavailableError which could be meaningful for plugins such as MongoDB.

Also see my similar comment in common/persistence/nosql/nosqlplugin/interfaces.go.

@ZackLK
Copy link
Contributor Author

ZackLK commented Jul 11, 2022

I've renamed the places you mentioned. Couple of questions:

  • Do we want to rename the metric too?
  • Do we want to change the logic to not check the substring of "Cannot perform LWT operation" and instead just use the 0x1000 (UNAVAILABLE) code? Or are there reasons not to do this, like it might pick up other UNAVAILABLE return codes?

@ZackLK ZackLK requested a review from emrahs July 11, 2022 21:45
@davidporter-id-au
Copy link
Contributor

@ZackLK RE the metrics, I would, it'd be super confusing to have a metric called cassandra-error-XYZ and have it being emitted from some other random database. Particularly as we may query across clusters

@ZackLK ZackLK requested a review from a team July 18, 2022 16:32
@ZackLK ZackLK enabled auto-merge (squash) July 18, 2022 17:10
@ZackLK ZackLK merged commit b0d1f06 into master Jul 18, 2022
@ZackLK ZackLK deleted the cassandra_error_metric branch July 18, 2022 18:17
@coveralls
Copy link

Pull Request Test Coverage Report for Build 01821249-0ba6-47c9-b80d-4969a19af023

  • 0 of 28 (0.0%) changed or added relevant lines in 6 files are covered.
  • 47 unchanged lines in 4 files lost coverage.
  • Overall coverage increased (+0.06%) to 57.742%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/persistence/dataManagerInterfaces.go 0 3 0.0%
common/persistence/nosql/nosqlplugin/cassandra/db.go 0 3 0.0%
common/persistence/nosql/nosqlplugin/mongodb/error.go 0 3 0.0%
common/persistence/persistenceMetricClients.go 0 4 0.0%
common/persistence/nosql/utils.go 0 5 0.0%
common/persistence/nosql/nosqlplugin/cassandra/gocql/public/client.go 0 10 0.0%
Files with Coverage Reduction New Missed Lines %
common/persistence/nosql/nosqlplugin/cassandra/workflow.go 3 56.69%
service/history/task/transfer_standby_task_executor.go 4 86.75%
common/persistence/nosql/nosqlplugin/cassandra/workflowParsingUtils.go 20 79.16%
service/history/task/task_util.go 20 70.57%
Totals Coverage Status
Change from base Build 01820554-8b82-4cbc-907b-cc3a1d8e9ad8: 0.06%
Covered Lines: 83839
Relevant Lines: 145196

💛 - Coveralls

abhishekj720 pushed a commit to abhishekj720/cadence that referenced this pull request Jul 21, 2022
Capture DB Unavailable errors and log/bump metrics for it.
Comment on lines +52 to +57
if errChecker.IsDBUnavailableError(err) {
return &p.DBUnavailableError{
Msg: fmt.Sprintf("#{operation} operation failed. Error: #{err}"),
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

worth fixing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants