Skip to content

Conversation

@onelapahead
Copy link
Contributor

@onelapahead onelapahead commented Feb 28, 2025

Proposed changes

Private messaging can be easily disrupted if the identity broadcasted DX certificates are out-of-sync with the certificates actually in-use by a DX plugin. This can be true for both the OSS plugins and other proprietary DX plugin implementations. We've documented this procedure, and are considering further automating this procedure within FireFly or auxiliary tooling in future releases.

For now, FireFly has all the means to monitor this and make it known to the administrators running it via metrics and logs - so that they can use their judgement on whether to update a profile on-chain, or rollback their DX certs, etc.

This PR adds a new CheckNodeIdentityStatus to the NetworkMap, a DXConnect callback to the DX plugin, and a CheckNodeIdentityStatus on the DX plugin. These methods and callbacks allow for either the networkmap or the definition handlers to check the status whenever either the DX plugin changes its cert, or the admins change the node's identity state. The DX plugin itself is namespace-less and has no concept of what local network nodes might be using it and what their config is, so it has to be provided a node identity from a namespace resource (networkmap, definition handlers) to reconcile it with its own expectations of what "status" is, relative to the DX's configuration.

The following cases are covered to ensure the metrics below are as accurate as possible:

  1. Whenever a DX reconnects, either due to FireFly starting up or DX starting up or network issues, FireFly sees if its needs to re-initialize the DX. On this new connection is a great moment to ask DX for its cert and compare it to what was broadcasted, additionally the DX cert can be examined to see when a soonest expiry of its cert chain is for further monitoring. Using the DXConnect callback, the Orchestrator can be notified of the new connection, and ask the NetworkMap to do the status checks. Often, the first connection is during namespace initialization, and that fails bc the namespace isn't started, see case 2.
  2. When starting, the Orcehstrator, it calls the networkmap's CheckNodeIdentityStatus to ensure its up-to-date since the first call failed while the namespace was started.
  3. Whenever an identity is claimed, we check to see if its the local node identity i.e registration, and if so call the status check.
  4. Whenever an identity is updated, we check to see if its the local node identity i.e. PATCH profile of a new cert, and if so call the status check.

New metrics ff_multiparty_node_identity_dx_mismatch and ff_multiparty_node_identity_dx_expiry_epoch are added so that the NetworkMap can then publish these status findings as gauges that can be easily monitoring by a timeseries and alerting system.

Additional changes

  • returns a 400 if profile is not set on a PATCH /identitites/{iid} as we found the code currently assumes a profile object is always provided and if its not that causes a nil error / connection abruptly closed
  • all existing metrics are given a ns label so our users can monitoring the different metrics from the perspective of the various namespaces they may be running. NOTE: for Prometheus and other timeseries monitoring systems this could greatly increase the cardinality of the metrics being produced if there are a lot of namespaces running. But this is necessary to be able to identify if message, etc. are failing or pending for a particular namespace vs. another w/o inspecting logs.

Types of changes

  • Bug fix
  • New feature added
  • Documentation Update
  • New metrics added / updated

Please make sure to follow these points

  • I have read the contributing guidelines.
  • I have performed a self-review of my own code or work.
  • I have commented my code, particularly in hard-to-understand areas.
  • My changes generates no new warnings.
  • My Pull Request title is in format < issue name > eg Added links in the documentation.
  • I have added tests that prove my fix is effective or that my feature works.
  • My changes have sufficient code coverage (unit, integration, e2e tests).

Screenshots (If Applicable)

TODO

Other Information

TODO

@codecov
Copy link

codecov bot commented Feb 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.95%. Comparing base (7f9364d) to head (a5cb23a).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #1652    +/-   ##
========================================
  Coverage   99.95%   99.95%            
========================================
  Files         337      339     +2     
  Lines       29632    29782   +150     
========================================
+ Hits        29620    29770   +150     
  Misses          8        8            
  Partials        4        4            

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@onelapahead onelapahead marked this pull request as ready for review March 3, 2025 15:10
@onelapahead onelapahead requested a review from a team as a code owner March 3, 2025 15:10
Copy link
Contributor

@peterbroadhurst peterbroadhurst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great bit of work @onelapahead 👍

I have popped a bunch of questions in there for you to go through before we close out on this if that's ok. Some might not require changes.

The most important is regarding safety of introducing a local node DID lookup on various paths.


// if this occurs, it is either a misconfigured / broken DX or likely a DX that is compatible from an API perspective
// but does not have the same peer info as the HTTPS mTLS DX
if dxPeer.GetString("cert") == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why this result is ending up with newline characters in here. Might come back and try and understand by looking at the code, as it seems like a smell that we're doing string/JSON processing wrong somewhere.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our APIs as JSON return/expect these certs to have the literal \n characters in them, otherwise the strings are not valid for JSON blobs. So its definitely smelly and awkward, but thats how they have always worked.

We've never had FF aware code of the cert field in the profile up until now, but it reflects whats returned by the DX plugin's /api/v1/id endpoint and as to remain compatible with that. Whenever you're PATCHing the node's identity during a cert rotation you have to take a PEM bundle and do a replace of the newlines as a result.

In the future, we should be handling embedded PEM/file blobs in JSON objects as base64 encoded strings to avoid this awkwardness.

if err != nil {
return time.Time{}, fmt.Errorf("failed to parse non-certificate within bundle: %v", err)
}
if leafCert == nil || cert.NotAfter.Before(leafCert.NotAfter) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this code is asserting that order of multiple certificates in the chain is always with the most leaf-like as the last certificate?

Don't have in my mind how this package is generated used, but interested to know if that's something that all cloud automations could/should be required to do.

Would we simply not want the earliest expiry in the package?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we simply not want the earliest expiry in the package?

we definitely want that, and thats what this code is doing. Calling it leafCert is misleading. But the NotAfter / Before are time operations, so we're essentially looking for the cert in the stack who's NotAfter is the soonest.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed the vars, left some comments, and updated one of the tests to reverse the order of certs in a bundle to ensure the soonest expiry is always chosen and that's clearer thats what the code is intended to do

@onelapahead onelapahead force-pushed the certmisatch-gauages branch 2 times, most recently from e223208 to b91652d Compare March 3, 2025 21:42
…de Identity Check Metrics

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

renaming dxconnect callback since it isnt an event persay

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

fix linter and tests

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

register node sets metric too

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

cleaned up code and tests

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

getting test coverage up

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

nearly all test coverage

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

100% coverage

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

checking nodeidentitystatus on confirmed identity changes and orchestrator startup

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

nearly full coverage

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

100% coverage, again

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

trigger test

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

PR feedback

Signed-off-by: hfuss <hayden.fuss@kaleido.io>

extend timeout to 45s

Signed-off-by: hfuss <hayden.fuss@kaleido.io>
@peterbroadhurst peterbroadhurst merged commit e486b2e into hyperledger:main Mar 4, 2025
19 checks passed
@onelapahead onelapahead deleted the certmisatch-gauages branch March 4, 2025 13:40
Copy link
Contributor

@EnriqueL8 EnriqueL8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this contribution @onelapahead! A few TODOs in there if you can resolve in a follow up

return nil, i18n.NewError(ctx, coremsgs.Msg404NoResult)
}

// TODO is this right ? code below assumes this is true and errors otherwise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this is right as it's one of the things that can be updated.

registry.MustRegister(NodeIdentityDXCertExpiryGauge)
}

// TODO should this type live elsewhere ??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if this should be exposed in the /status/multiparty API we recently added https://hyperledger.github.io/firefly/latest/swagger/#/Default%20Namespace/getStatusMultiparty

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants