Skip to content

Conversation

jinliu9508
Copy link
Contributor

@jinliu9508 jinliu9508 commented Jan 31, 2025

Description

One Line Summary

This PR fixes an issue where subscriptions do not update correctly and results in a 400 error when calling Login. It also handles this bad state when migrating from any previous version.

Details

Motivation

There is an edge case where the app successfully creates a user for the first time, but the request remains in the queue due to the PostCreateDelay. When this happens, all future Login calls return a 400 error, and any subscription updates fail. This PR prevents this scenario by correctly updating the operation queue and handling the bad state during migration.

Scope

  • OperationRepo: No longer holds up operation processing before the PostCreateDelay. Instead, the waiting time for PostCreateDelay is now integrated into delayBeforeNextExecution at the end of execution.
  • LoginUserOperationExecutor: Now detects if an anonymous Login request is made without other operations. When this happens, it returns a FAIL_NORETRY and drops the duplicated Login request.

Testing

Manual testing

###Steps to reproduce
This bug can happen strictly after CreateUser is successful for the first time but the app is force-closed before the 5 seconds PostCreateDelay has reached. One way to reproduce this issue is:

  1. Fresh install the app.
  2. Right after the subscription is created or receiving a 200 code from the Create User request, quickly force close the app within 5 seconds.
  3. Re-open the app, observe that a 400 error code is received and any updates like subscribing is not updated in the dashboard.

###New Install
Step 1-2 same as above
3. Re-open the app, wait for the server response and observe that Login with the external ID is successful. Confirmed that no error is logged and the fix works.

###Update From the Broken Version
Step 1-2 same as above
3. upgrade to this version without uninstalling the previous version or clearing the storage, observe that Login with the externalID is successful without error.

Affected code checklist

  • Notifications
    • Display
    • Open
    • Push Processing
    • Confirm Deliveries
  • Outcomes
  • Sessions
  • In-App Messaging
  • REST API requests
  • Public API changes

Checklist

Overview

  • I have filled out all REQUIRED sections above
  • PR does one thing
    • If it is hard to explain how any codes changes are related to each other then it most likely needs to be more than one PR
  • Any Public API changes are explained in the PR details and conform to existing APIs

Testing

  • I have included test coverage for these changes, or explained why they are not needed
  • All automated tests pass, or I explained why that is not possible
  • I have personally tested this on my device, or explained why that is not possible

Final pass

  • Code is as readable as possible.
    • Simplify with less code, followed by splitting up code into well named functions and variables, followed by adding comments to the code.
  • I have reviewed this PR myself, ensuring it meets each checklist item
    • WIP (Work In Progress) is ok, but explain what is still in progress and what you would like feedback on. Start the PR title with "WIP" to indicate this.

This change is Reviewable

Copy link
Member

@jkasten2 jkasten2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a test too

// Login operation being processed alone will surely be rejected, so we need to add an existing
// external ID to this legacy operation and have it picked up by SetAlias.
if (operations.isEmpty() && _identityModelStore.model.externalId != null) {
loginUserOp.externalId = _identityModelStore.model.externalId
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could there be an issue with properties being set on the wrong user?
Example:

  1. Login(A)
  2. setTag(A)
  3. Login(B)
  4. setTag(B)

Could this cause User B to get both tag A and B, when it was intended for User A to get tag A and User B to get Tag B?

Copy link
Contributor Author

@jinliu9508 jinliu9508 Feb 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tested this scenario and this is how the manual test is setup:

  1. Fresh install on main branch, right after the anonymous user is created but before the post create delay, programmatically call Login(A), setTag(A), Login(B), setTag(B) altogether and close the app.
  2. Switch to this branch and update the app.
  3. Run the app and observe in the dashboard that both users are created successfully and each with the correct tag.

After analyzing the log closely, the code in this if statement will actually never hit for the scenario. OperationRepo will process the operations in this order:

  1. Create subscription
  2. Login anonymously
  3. Refresh user (anonymous)
  4. Login A AND transfer subscription
  5. Set Alias for A but got 409
  6. Refresh user (A)
  7. Set Tag for A
  8. Login B AND transfer subscription
  9. Set Tag for B

Since operations like setting tags does not go with the Login call in one request and the handler for 409 conflict will take care of the IDs, this will not cause the user to be assigned with the wrong tag.

Further more, I tested out that this bug only happen when Login is interrupted with no other operations like set tag or another Login within a few seconds after app starts up.

@jinliu9508 jinliu9508 force-pushed the subscription-not-updated-due-to-400-error branch 4 times, most recently from 8c75690 to ea56a21 Compare February 3, 2025 23:43
Comment on lines 71 to 75
// Login operation being processed alone will surely be rejected, so we need to drop the request
if (operations.isEmpty()) {
return ExecutionResponse(ExecutionResult.FAIL_NORETRY)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If loginUserOp.externalId is set then the operation is still valid, we probably shouldn't drop that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have updated the PR description, code, and test cases so that Login with external ID is no long be dropped. Only anonymous Login that is not grouped with any other operations will be dropped.

@jinliu9508 jinliu9508 force-pushed the subscription-not-updated-due-to-400-error branch from ea56a21 to efa596f Compare February 4, 2025 00:45
@jinliu9508 jinliu9508 requested a review from jkasten2 February 4, 2025 00:55
@jinliu9508 jinliu9508 force-pushed the subscription-not-updated-due-to-400-error branch 3 times, most recently from ddafc37 to 20951a6 Compare February 4, 2025 04:38
val delayFor = max(delayForOnRetries, retryAfterSecondsNonNull * 1_000)
val delayFor = max(delayForOnRetries, max(retryAfterSecondsNonNull * 1_000, postCreateDelay))
if (delayFor < 1) return
Logging.error("Operations being delay for: $delayFor ms")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging.error("Operations being delay for: $delayFor ms")

Following up from above, especially since we also log this delay as an error, since it represents a "failure response" fundamentally.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The log will not be related to the postCreateDelay anymore since it is moved to a separate logic.

// Handle a bad state that can happen in User Model 5.1.27 or earlier versions that old Login
// request is not removed after processing if app is force-closed within the PostCreateDelay.
// Anonymous Login being processed alone will surely be rejected, so we need to drop the request
if (operations.isEmpty() && loginUserOp.externalId == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should log an error.

The isEmpty() is probably sufficient for this edge case. Technically this should be a check for the existence of a create/transfer subscription operation, as another user update operation, for example, is not sufficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking to prevent dropping the login request with an externalId. However, even login with an externalId will usually require a transfer subscription so theoretically these will have the same result. Do you think we should allow login operation with an externalId that has no other operations to be processed?

Copy link
Contributor

@nan-li nan-li Feb 4, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah we should keep login with external ID regardless of subscriptions, since generally the SDK can recover in that state.
I meant for the check itself: operations.isEmpty() && loginUserOp.externalId == null vs something like operations.doesNotContainSub() && loginUserOp.externalId == null

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, with operations.doesNotContainSub() && loginUserOp.externalId == null. This way the validation matches what the backend does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a change that checks if there exists any CreateSubscription or TransferSubscription operation. Will this be sufficient?

}
}

test("ensure anonymous login with no other operations will fail with FAIL_NORETRY to release the queue from a bad state") {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think explanations like "to release the queue from a bad state" should not be part of the test name, it should simply state behavior.
This tests the upgrade fix only, we should have a test that repros the original problematic behavior of not removing successful operations from the repo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! I have fixed the explanation and created a new test case in OperationRepoTests to test this behavior.


/**
* Wait which ever is longer, retryAfterSeconds returned by the server,
* Wait which ever is longer, post create delay, retryAfterSeconds returned by the server,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PostCreateDelay should probably be a separate delay from the original intention of this method, which is to generate a delay in failure scenarios, since it is operating on the retryWaiter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the delays should be interceptable because if both postCreateDelay and retryDelay are set to 5 seconds, the total wait time should just be 5 seconds instead of 10 seconds if wait for them separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We shouldn't confound the waiter with the retryWaiter just because in this particular scenario, the result feels the same. The retryWaiter is meant for retryable errors and can be forced when consent changes.

Both are not going to run: (1) The postCreateDelay is after a create call is successful and the response has ID translations, (2) The retries count and retryAfterSeconds that is original to this method is for failures. Either the operations are successful and contain translation IDs or it is not and is retryable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, the waiters are different. I separated the logic in the fixup commit. Both waiters can be waiting together for different duration, but operations processing will be resumed as soon as both waiter are awake. In this case the stall time is the same as the maximum wait time among them.

@jinliu9508 jinliu9508 force-pushed the subscription-not-updated-due-to-400-error branch 3 times, most recently from 2c04789 to e5b8eb3 Compare February 4, 2025 20:06
@jinliu9508 jinliu9508 requested a review from nan-li February 4, 2025 20:08
@jinliu9508 jinliu9508 requested a review from jkasten2 February 4, 2025 20:32
* however other parts of the system don't currently account
* for this so this is not safe to do.
*/
suspend fun delayForPostCreate(hasIDTranslation: Boolean) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think this method should be conditionally called, instead of letting this method check if it should delay using the passed in argument.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I have made that change in the fixup commit.

@jinliu9508 jinliu9508 force-pushed the subscription-not-updated-due-to-400-error branch from 4f1368d to 5272900 Compare February 4, 2025 20:57
@jinliu9508 jinliu9508 requested a review from nan-li February 4, 2025 21:17
Copy link
Contributor

@nan-li nan-li left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can clean up commit history, and reword the PR title?

@jinliu9508 jinliu9508 changed the title Fix: bug causes the subscription not updated correctly with a 400 error Fix: Anonymous Login request not cleared if app is forced close within 5 seconds on a new install Feb 4, 2025
@jinliu9508 jinliu9508 force-pushed the subscription-not-updated-due-to-400-error branch from 5272900 to 47afee3 Compare February 4, 2025 23:11
@jinliu9508
Copy link
Contributor Author

Title is updated and commits are cleaned up.

@jinliu9508 jinliu9508 merged commit 9d1d91c into main Feb 4, 2025
2 checks passed
@jinliu9508 jinliu9508 deleted the subscription-not-updated-due-to-400-error branch February 4, 2025 23:23
@jinliu9508 jinliu9508 mentioned this pull request Feb 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants