-
Notifications
You must be signed in to change notification settings - Fork 241
[metrics] [dataexchange] [networkmap] DXConnect Callbacks for Node Identity Check Metrics #1652
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[metrics] [dataexchange] [networkmap] DXConnect Callbacks for Node Identity Check Metrics #1652
Conversation
2f8bd7f to
69cf0c9
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1652 +/- ##
========================================
Coverage 99.95% 99.95%
========================================
Files 337 339 +2
Lines 29632 29782 +150
========================================
+ Hits 29620 29770 +150
Misses 8 8
Partials 4 4 ☔ View full report in Codecov by Sentry. |
peterbroadhurst
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great bit of work @onelapahead 👍
I have popped a bunch of questions in there for you to go through before we close out on this if that's ok. Some might not require changes.
The most important is regarding safety of introducing a local node DID lookup on various paths.
internal/dataexchange/ffdx/ffdx.go
Outdated
|
|
||
| // if this occurs, it is either a misconfigured / broken DX or likely a DX that is compatible from an API perspective | ||
| // but does not have the same peer info as the HTTPS mTLS DX | ||
| if dxPeer.GetString("cert") == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand why this result is ending up with newline characters in here. Might come back and try and understand by looking at the code, as it seems like a smell that we're doing string/JSON processing wrong somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our APIs as JSON return/expect these certs to have the literal \n characters in them, otherwise the strings are not valid for JSON blobs. So its definitely smelly and awkward, but thats how they have always worked.
We've never had FF aware code of the cert field in the profile up until now, but it reflects whats returned by the DX plugin's /api/v1/id endpoint and as to remain compatible with that. Whenever you're PATCHing the node's identity during a cert rotation you have to take a PEM bundle and do a replace of the newlines as a result.
In the future, we should be handling embedded PEM/file blobs in JSON objects as base64 encoded strings to avoid this awkwardness.
internal/dataexchange/ffdx/ffdx.go
Outdated
| if err != nil { | ||
| return time.Time{}, fmt.Errorf("failed to parse non-certificate within bundle: %v", err) | ||
| } | ||
| if leafCert == nil || cert.NotAfter.Before(leafCert.NotAfter) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this code is asserting that order of multiple certificates in the chain is always with the most leaf-like as the last certificate?
Don't have in my mind how this package is generated used, but interested to know if that's something that all cloud automations could/should be required to do.
Would we simply not want the earliest expiry in the package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would we simply not want the earliest expiry in the package?
we definitely want that, and thats what this code is doing. Calling it leafCert is misleading. But the NotAfter / Before are time operations, so we're essentially looking for the cert in the stack who's NotAfter is the soonest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've renamed the vars, left some comments, and updated one of the tests to reverse the order of certs in a bundle to ensure the soonest expiry is always chosen and that's clearer thats what the code is intended to do
e223208 to
b91652d
Compare
…de Identity Check Metrics Signed-off-by: hfuss <hayden.fuss@kaleido.io> renaming dxconnect callback since it isnt an event persay Signed-off-by: hfuss <hayden.fuss@kaleido.io> fix linter and tests Signed-off-by: hfuss <hayden.fuss@kaleido.io> register node sets metric too Signed-off-by: hfuss <hayden.fuss@kaleido.io> cleaned up code and tests Signed-off-by: hfuss <hayden.fuss@kaleido.io> getting test coverage up Signed-off-by: hfuss <hayden.fuss@kaleido.io> nearly all test coverage Signed-off-by: hfuss <hayden.fuss@kaleido.io> 100% coverage Signed-off-by: hfuss <hayden.fuss@kaleido.io> checking nodeidentitystatus on confirmed identity changes and orchestrator startup Signed-off-by: hfuss <hayden.fuss@kaleido.io> nearly full coverage Signed-off-by: hfuss <hayden.fuss@kaleido.io> 100% coverage, again Signed-off-by: hfuss <hayden.fuss@kaleido.io> trigger test Signed-off-by: hfuss <hayden.fuss@kaleido.io> PR feedback Signed-off-by: hfuss <hayden.fuss@kaleido.io> extend timeout to 45s Signed-off-by: hfuss <hayden.fuss@kaleido.io>
b91652d to
a5cb23a
Compare
EnriqueL8
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this contribution @onelapahead! A few TODOs in there if you can resolve in a follow up
| return nil, i18n.NewError(ctx, coremsgs.Msg404NoResult) | ||
| } | ||
|
|
||
| // TODO is this right ? code below assumes this is true and errors otherwise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is right as it's one of the things that can be updated.
| registry.MustRegister(NodeIdentityDXCertExpiryGauge) | ||
| } | ||
|
|
||
| // TODO should this type live elsewhere ?? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this should be exposed in the /status/multiparty API we recently added https://hyperledger.github.io/firefly/latest/swagger/#/Default%20Namespace/getStatusMultiparty
Proposed changes
Private messaging can be easily disrupted if the identity broadcasted DX certificates are out-of-sync with the certificates actually in-use by a DX plugin. This can be true for both the OSS plugins and other proprietary DX plugin implementations. We've documented this procedure, and are considering further automating this procedure within FireFly or auxiliary tooling in future releases.
For now, FireFly has all the means to monitor this and make it known to the administrators running it via metrics and logs - so that they can use their judgement on whether to update a profile on-chain, or rollback their DX certs, etc.
This PR adds a new
CheckNodeIdentityStatusto theNetworkMap, aDXConnectcallback to the DX plugin, and aCheckNodeIdentityStatuson the DX plugin. These methods and callbacks allow for either the networkmap or the definition handlers to check the status whenever either the DX plugin changes its cert, or the admins change the node's identity state. The DX plugin itself is namespace-less and has no concept of what local network nodes might be using it and what their config is, so it has to be provided a node identity from a namespace resource (networkmap, definition handlers) to reconcile it with its own expectations of what "status" is, relative to the DX's configuration.The following cases are covered to ensure the metrics below are as accurate as possible:
DXConnectcallback, theOrchestratorcan be notified of the new connection, and ask theNetworkMapto do the status checks. Often, the first connection is during namespace initialization, and that fails bc the namespace isn't started, see case 2.Orcehstrator, it calls the networkmap'sCheckNodeIdentityStatusto ensure its up-to-date since the first call failed while the namespace was started.New metrics
ff_multiparty_node_identity_dx_mismatchandff_multiparty_node_identity_dx_expiry_epochare added so that theNetworkMapcan then publish these status findings as gauges that can be easily monitoring by a timeseries and alerting system.Additional changes
profileis not set on aPATCH /identitites/{iid}as we found the code currently assumes aprofileobject is always provided and if its not that causes a nil error / connection abruptly closednslabel so our users can monitoring the different metrics from the perspective of the various namespaces they may be running. NOTE: for Prometheus and other timeseries monitoring systems this could greatly increase the cardinality of the metrics being produced if there are a lot of namespaces running. But this is necessary to be able to identify if message, etc. are failing or pending for a particular namespace vs. another w/o inspecting logs.Types of changes
Please make sure to follow these points
< issue name >egAdded links in the documentation.Screenshots (If Applicable)
TODO
Other Information
TODO