Conversation
A bit of tighter scoping / more context on the TimeOfUseServiceTests, please.
On the other one, how did you identify it as an indexing concern? We've been tracking an issue with intermittent errors when trying to use profiles shortly after creation... that one's looking like replication lag between read replicas. Might that also be what's going on here?
```java
private void cleanUpPrivateTou(TimeOfUseGroup grp) {
    touService.deletePrivateTimeOfUseGroup(grp.getLseId(), grp.getTouGroupId());
    try {
        // Check if the TOU group has been deleted
```
Add context for someone reading the code without knowledge of this issue: "Try to delete the private TOU only if it still exists (sometimes it might be already deleted)"
```java
        touService.deletePrivateTimeOfUseGroup(grp.getLseId(), grp.getTouGroupId());
    }
} catch (GenabilityException e) {
    if (!e.getMessage().contains("404")) {
```
You have two calls here; which one throws the GenabilityException? Wrap the try/catch around just that one: you generally want try/catches scoped as tightly as possible.
Also, explain why a message containing 404 should be ignored (didn't you already check that it exists?)
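To make the suggestion concrete, here is a minimal sketch of the tighter scoping. Note this is an illustration, not the PR's actual code: `TouServiceStub` and this `GenabilityException` are stand-in types I made up so the snippet is self-contained, and I'm assuming `deletePrivateTimeOfUseGroup` is the call that can throw.

```java
// Stand-in for the SDK's exception type (hypothetical, for illustration only).
class GenabilityException extends RuntimeException {
    GenabilityException(String message) { super(message); }
}

// Stand-in service that simulates the server returning 404 for an
// already-deleted group (hypothetical, for illustration only).
class TouServiceStub {
    void deletePrivateTimeOfUseGroup(long lseId, long touGroupId) {
        throw new GenabilityException("HTTP 404: time-of-use group not found");
    }
}

public class CleanupSketch {
    public static void main(String[] args) {
        TouServiceStub touService = new TouServiceStub();
        long lseId = 1L;
        long touGroupId = 42L;
        // Wrap ONLY the call that can throw, so unrelated failures
        // elsewhere in cleanup are not silently swallowed.
        try {
            touService.deletePrivateTimeOfUseGroup(lseId, touGroupId);
        } catch (GenabilityException e) {
            // A 404 means the group is already gone, which is the state
            // cleanup wants anyway; any other error should still surface.
            if (!e.getMessage().contains("404")) {
                throw e;
            }
            System.out.println("already deleted, ignoring 404");
        }
    }
}
```

The design point is that a broad try/catch around two calls makes it ambiguous which one actually failed, and can mask a real error in the first call.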
```java
}

try {
    // Add a delay to allow for indexing
```
Huh. How did you figure out it's an indexing concern? (Can you add a comment or a note on the PR?)
How confident are you that 2000ms is enough? If it's just a guess / subject to additional tuning, it's helpful for future developers to, e.g., add a comment to the line that errors, like: "Does this line fail with xxxException? Consider tuning the sleep above to give it more time..."
Good question - I assumed it was indexing or a consistency issue because the test creates accounts and immediately searches for them using a name filter. I added the `Thread.sleep(2000)` to give Mongo time to catch up, which fixed the flakiness locally, but it's definitely just a best guess. It's kind of tricky to test this since the failures were intermittent.
```java
// Could be due to indexing delay or read-replica sync delay.
// 2000ms is a best-guess based on some testing; if this test continues to fail, consider increasing this delay.
try {
    Thread.sleep(2000);
```
2 seconds is a high value; replica lag would typically be under 100ms. I wonder whether 500ms or 200ms would be better starting values.
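One option that sidesteps the tuning question is to poll with a short interval up to a generous timeout, rather than sleeping a fixed 2 seconds every run. A minimal sketch, with hypothetical names (`awaitTrue` is not part of the SDK; libraries like Awaitility provide a polished version of the same idea):

```java
import java.util.function.BooleanSupplier;

public class AwaitSketch {
    // Poll `condition` every `intervalMs` until it holds or `timeoutMs` elapses.
    // A short interval keeps the happy path fast (tens of milliseconds when
    // replication is quick) while the timeout still tolerates multi-second lag.
    static boolean awaitTrue(BooleanSupplier condition, long intervalMs, long timeoutMs)
            throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (System.currentTimeMillis() < deadline) {
            if (condition.getAsBoolean()) {
                return true;
            }
            Thread.sleep(intervalMs);
        }
        return condition.getAsBoolean();
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical stand-in for "the search index has caught up":
        // here the condition becomes true roughly 250ms after start.
        long start = System.currentTimeMillis();
        boolean ok = awaitTrue(() -> System.currentTimeMillis() - start > 250, 100, 5000);
        System.out.println(ok ? "condition met" : "timed out");
    }
}
```

In the test, the condition would be something like "the name-filtered search returns the account just created", so the test waits exactly as long as it needs to and no longer.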
```java
try {
    touService.deletePrivateTimeOfUseGroup(grp.getLseId(), grp.getTouGroupId());
} catch (GenabilityException e) {
    // It's possible the TOU group was deleted after the check but before the delete call.
```
My concern is that we are retrieving/creating the TOU group by a generic test name but deleting TOUs by integer ID. This opens us up to a race condition with other simultaneous tests. I'm not sure whether this is a real issue with the GitHub job runner, but why else would the TOU group be deleted before the cleanup?
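If the generic test name is the root of the collision, one common fix is to suffix each fixture name with a per-run token so concurrent jobs never see (or clean up) each other's groups. A minimal sketch; `uniqueTouGroupName` and the base name are made up for illustration:

```java
import java.util.UUID;

public class UniqueNameSketch {
    // Suffixing the fixture name with a random per-run token means two
    // concurrent CI jobs never create, find, or delete each other's TOU groups.
    static String uniqueTouGroupName(String base) {
        return base + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        String a = uniqueTouGroupName("sdk-test-tou");
        String b = uniqueTouGroupName("sdk-test-tou");
        // Two calls in the same process still produce distinct names.
        System.out.println(a.equals(b) ? "collision" : "unique");
    }
}
```

With unique names, the name-based lookup and the ID-based delete both resolve to a resource only this test run could have created, which would rule out cross-job interference as the cause of the 404.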
Intermittent failure: SDK > TimeOfUseServiceTests.addPrivateTouGroupWorksCorrectly
- The test itself passes (adding the TOU group works)
- The failure occurs during cleanup, with a 404 error when trying to delete the TOU group
- The resource was getting deleted by another process, or there's a race condition

Intermittent failure: SDK > AccountServiceTests.testPaginatedAccountList:139