Skip to content

Handle fatal errors in receiver state machine#1060

Merged
spacebear21 merged 7 commits intopayjoin:masterfrom
spacebear21:handle-error-typestate
Oct 1, 2025
Merged

Handle fatal errors in receiver state machine#1060
spacebear21 merged 7 commits intopayjoin:masterfrom
spacebear21:handle-error-typestate

Conversation

@spacebear21
Copy link
Collaborator

@spacebear21 spacebear21 commented Sep 11, 2025

This brings fatal error handling into the receiver state machine.

Supersedes #1045

TODOs:

  • HasError::process_error_response should terminally close the session if the response is successfuly, or record a fatal error and move to terminal state if another error request should not be attempted (e.g. expired session or bad reply key).
  • handle_fatal_reject shouldn't close sessions, instead it should only close after an error response has been processed.
  • SessionEvent::TerminalFailure holds a JsonReply because they are serializable, but this results in a hacky workaroundfor recording Expired sessions since those shouldn't actually be replied to.
  • Remove SessionHistory::extract_err_req
  • Implement session.fail() and session.cancel() to allow receiver to explicitly close a session. Maybe fail() should take a Error parameter and transition to HasReplyableError state if appropriate? see discussion here Fix apply_fee_range error handling #1108 (review)
Pull Request Checklist

Please confirm the following before requesting review:

Copy link
Collaborator

@arminsabouri arminsabouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approach ACK.
Just had some high level questions. In general this is moving in the right direction

/// Represents a fatal rejection of a state transition.
/// When this error occurs, the session must be closed and cannot be resumed.
pub struct RejectFatal<Event, Err>(Event, Err);
pub struct RejectFatal<Event, Err>(pub(crate) Event, pub(crate) Err);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this pub(crate)'d for tests?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, specifically to match on the rejection details

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would pub(self) work?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any decision on whether pub(self) or pub(crate) is better here?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is a nit.

Comment on lines 996 to 1008
&mut self,
&mut self, // NOTE: does this need to be &mut? e.g. extract_err_req just clones the inner keys
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely easier to deal with if not mut. I believe the concern is &mut self.session_context.ohttp_keys is supposed to get mutated if it ohttp_encapsulate ratchets instead of using randomness so that the same OHTTP ephemeral key doesn't get used twice which would allow a relay to correlate requests since they'd be encrypted with the same key & nonce (base nonce + seq).

I'd like @nothingmuch to confirm.

@DanGould DanGould mentioned this pull request Sep 12, 2025
2 tasks
@spacebear21 spacebear21 force-pushed the handle-error-typestate branch 2 times, most recently from 141b916 to 620c509 Compare September 15, 2025 17:06
@arminsabouri
Copy link
Collaborator

From our devcall:
SessionEvent::Close should take a Reason as an argument. Instead of the reason being derived from previous session event.

We should consider adding a pub abort and private close(). In general it should be a hard to be in limbo where we close but dont push an event or vice-versa

@arminsabouri arminsabouri mentioned this pull request Sep 17, 2025
2 tasks
@spacebear21 spacebear21 mentioned this pull request Sep 17, 2025
2 tasks
@spacebear21 spacebear21 force-pushed the handle-error-typestate branch 2 times, most recently from 9e52f2e to 779e172 Compare September 18, 2025 02:07
@coveralls
Copy link
Collaborator

coveralls commented Sep 18, 2025

Pull Request Test Coverage Report for Build 18170784561

Details

  • 210 of 268 (78.36%) changed or added relevant lines in 4 files are covered.
  • 8 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.2%) to 84.561%

Changes Missing Coverage Covered Lines Changed/Added Lines %
payjoin/src/core/persist.rs 44 51 86.27%
payjoin-cli/src/app/v2/mod.rs 1 14 7.14%
payjoin/src/core/receive/v2/mod.rs 101 139 72.66%
Files with Coverage Reduction New Missed Lines %
payjoin/src/core/receive/v2/mod.rs 2 90.43%
payjoin-cli/src/app/v2/mod.rs 3 60.57%
payjoin/src/core/persist.rs 3 94.94%
Totals Coverage Status
Change from base Build 18168561417: -0.2%
Covered Lines: 8654
Relevant Lines: 10234

💛 - Coveralls

@spacebear21
Copy link
Collaborator Author

Re-requesting review because I believe all the main logic changes are in place and the overall design shouldn't change much unless something clearly wrong comes up in review. Leaving in draft status because some cleanup is needed (e.g. extract_err_req from SessionHistory and remove pub process_err_res pure function since it should only be called in HasReplyableError), and it will need additional testing.

Copy link
Collaborator

@arminsabouri arminsabouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial ApporachACK.
I mostly had highlevel questions. Mainly around:

  • The necessity for a session event trait
  • How the typestate progresses to the reply error state

/// Represents a fatal rejection of a state transition.
/// When this error occurs, the session must be closed and cannot be resumed.
pub struct RejectFatal<Event, Err>(Event, Err);
pub struct RejectFatal<Event, Err>(pub(crate) Event, pub(crate) Err);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would pub(self) work?

@spacebear21 spacebear21 mentioned this pull request Sep 29, 2025
@DanGould DanGould mentioned this pull request Sep 29, 2025
2 tasks
@spacebear21 spacebear21 force-pushed the handle-error-typestate branch from fa845c8 to ca5da9c Compare September 30, 2025 01:30
@spacebear21 spacebear21 marked this pull request as ready for review September 30, 2025 14:39
@spacebear21
Copy link
Collaborator Author

spacebear21 commented Sep 30, 2025

The biggest change since the last round of review is ca5da9c which introduces a new state transition method for replyable errors resulting in the HasReplyableError typestate. This approach replaces the SessionEventTrait approach taken previously, so instead of introspecting on the SessionEvent the caller just calls the appropriate transition method. I used Claude Code opus+sonnet to plan and write the majority of the code in that commit.

@spacebear21 spacebear21 mentioned this pull request Sep 30, 2025
2 tasks
Copy link
Collaborator

@benalleng benalleng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack on ca5da9c, Just a few more general clarifying questions. As well the terminal_error TODO can be ignored as we are getting rid of that method entirely, but there still remains the added TODO in the test_err_response in the integrations tests, I'm assuming that can safely be handled after 1.0 in a followup

@spacebear21 spacebear21 force-pushed the handle-error-typestate branch from ca5da9c to dcfe2f4 Compare September 30, 2025 18:10
@spacebear21
Copy link
Collaborator Author

there still remains the added TODO in the test_err_response in the integrations tests, I'm assuming that can safely be handled after 1.0 in a followup

Yeah this should be getting addressed in #1114

Copy link
Collaborator

@arminsabouri arminsabouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CodeACK dcfe2f4
Just had some questions and noted follow up items 🚀

Perhaps we first review and merge #1114, resolve the TODO in the integration test so we can get more e2e coverage on the changes in this PR.

/// Represents a fatal rejection of a state transition.
/// When this error occurs, the session must be closed and cannot be resumed.
pub struct RejectFatal<Event, Err>(Event, Err);
pub struct RejectFatal<Event, Err>(pub(crate) Event, pub(crate) Err);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this is a nit.

pub struct RejectTransient<Err>(pub(crate) Err);
/// Represents a replyable error that transitions to an error state but keeps the session open.
/// When this error occurs, the session transitions to the ErrorState.
pub struct RejectReplyableError<Event, ErrorState, Err>(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to follow up with some unit testing for this

Copy link
Contributor

@DanGould DanGould left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general I think this works. Wild how few changes actually end up in payjoin-cli's implementation after all this internal jostling. Overall it seems like a much more realistic abstraction of reality. It does seem that tests could be a little tighter if InternalPersistedError could be matched on.

I'm also still a bit bugged out by that name since Transient InternalPersistedErrors are not Persisted. They're PersistErrors which arise from persistence attempts, not errors that are persisted by their nature.

Comment on lines +695 to 699
async fn handle_error(
&self,
ohttp_relay: &payjoin::Url,
session_history: &SessionHistory,
session: Receiver<HasReplyableError>,
persister: &ReceiverPersister,
) -> Result<()> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other than an Err(anyhow::Error) being propagated if post_request or process_error_response fails, are there circumstances where we want to close the session in this function? for example, if process_error_response fails? Or if the post_request fails for one of a few enumerated reasons as a follow up with the session.fail() implementation?

.check_broadcast_suitability(None, |_| Ok(false))
.save(&persister)
.expect_err("Unbroadcastable should error");
// NOTE: it would be good to assert against the internal error type but InternalPersistedError is private
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

InternalPersistedError is pub(crate).

Would making the internal field in PersistedError pub(super) (to the core scope) address this concern?

#[derive(Debug)]
pub struct PersistedError<
    ApiError: std::error::Error,
    StorageError: std::error::Error,
    ErrorState: fmt::Debug = (),
>(pub(super) InternalPersistedError<ApiError, StorageError, ErrorState>);

Comment on lines -633 to -637
events: vec![
SessionEvent::Created(SHARED_CONTEXT.clone()),
SessionEvent::CheckedBroadcastSuitability(),
SessionEvent::SessionInvalid(mock_err.0.clone(), Some(mock_err.1.clone())),
],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change the static vec! definition to mut & .push() instead? nit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I don't think I changed that; this old test got deleted and the new test_session_transient_error is being displayed side by side but it's unrelated. I think I just copied to mut events approach from a different test, we seem to mix and match already.

DanGould and others added 7 commits October 1, 2025 12:38
This session event holds the JSON representation of the error reply to
the sender that should be attempted by the receiver, as specified in
BIP78.

Co-authored-by: spacebear <git@spacebear.dev>
This enforces proper handling within the receiver state machine, such
that they must attempt replying to the sender with an error response
before the session can be considered closed.

Co-authored-by: DanGould <d@ngould.dev>
`TerminalFailure` was never a real state in the receiver state machine.
Instead, a SessionEvent::Closed simply results in an empty `()` next
state.
Previously extract_err_req was exposed as a public method on
SessionHistory, and process_err_res as a pure function, for implementers
to manually process error responses.

Now this can and should be done via the HasReplyableError typestate, so
these functions are no longer needed.
Only a subset of fatal errors should result in immediate closure of
areceiver session. In many cases an attempt should first be made
torespond to the sender with an error as specified in BIP78.

Instead of scrutinizing various error types to determine whether
itshould be replyable or not, the callsite can simply call the
appropriate state transition method: `replyable_error()` for replyable
errors or `fatal()` for non-replyable ones. For replyable errors, the
resulting HasReplyableError typestate can be obtained from the resulting
`PersistedError` with the `error_state()` method.
@spacebear21 spacebear21 force-pushed the handle-error-typestate branch from dcfe2f4 to 9cdb1eb Compare October 1, 2025 17:45
@spacebear21
Copy link
Collaborator Author

spacebear21 commented Oct 1, 2025

Latest push is a simple rebase on the latest master. There is an outstanding task in the PR description for the fail()/cancel() methods that I will address in a follow-up.

Copy link
Collaborator

@arminsabouri arminsabouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-Ack 9cdb1eb

@spacebear21 spacebear21 merged commit 4d82e53 into payjoin:master Oct 1, 2025
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants