Session: Fixes NotFound/ReadSessionNotAvailable (404/1002) errors due to inconsistencies on internal caches on collection-recreate scenario for query-only workloads #3119

@FabianMeiswinkel FabianMeiswinkel commented Mar 29, 2022

Pull Request Template

Description

SCENARIO

  • An app uses the .NET SDK to run queries against Container_1 or Container_2
  • Every 24 hours the app switches which of the two containers it reads from
  • Before the query app switched to a container, that container was deleted, re-created under the same name, and repopulated by a Spark job running in a different process
  • The customer regularly saw 404/1002 (NotFound/ReadSessionNotAvailable) errors

ROOT CAUSE

  • SessionContainer and CollectionCache each hold their own dictionary for the CollectionName to CollectionRid lookup
  • In some places where a stale collection name was identified, only one of the two dictionaries was updated
  • Therefore, it was possible that the CollectionCache was updated, so the CollectionRid on the DocumentServiceRequest was populated correctly (for the new container), while the SessionContainer still mapped the container name to the old CollectionRid and as such still found the SessionToken captured from the old container
  • The request could then carry a stale LSN. If the LSN captured on the old container was lower than the current LSN on the new container, this would not be an issue. If the LSN captured on the old container was higher than the latest LSN on the new container, all subsequent queries would fail with 404/1002
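The mismatch described above can be illustrated with a toy model. This is only a sketch under stated assumptions: the plain dictionaries, the `build_request` and `outcome` helpers, and the RID and LSN values are hypothetical stand-ins, not the SDK's actual SessionContainer or CollectionCache code.

```python
def build_request(name, collection_cache, session_names, session_lsns):
    """Resolve a container name through both (hypothetical) caches."""
    token_rid = session_names.get(name)
    return {
        # RID placed on the request, taken from the collection cache.
        "collection_rid": collection_cache[name],
        # RID the session cache still associates with the name.
        "token_rid": token_rid,
        # LSN of the session token found for that RID, if any.
        "session_lsn": session_lsns.get(token_rid),
    }


def outcome(request, current_lsn):
    """Classify the request against the current LSN of the new container."""
    lsn = request["session_lsn"]
    if lsn is not None and lsn > current_lsn:
        return "404/1002"
    if request["token_rid"] not in (None, request["collection_rid"]):
        return "ok, but token came from the old container"
    return "ok"


# Both caches agree before the container is re-created.
collection_cache = {"Container_1": "rid-old"}
session_names = {"Container_1": "rid-old"}
session_lsns = {"rid-old": 500}

# The container is re-created; only the collection cache is updated.
collection_cache["Container_1"] = "rid-new"
request = build_request("Container_1", collection_cache, session_names, session_lsns)

print(outcome(request, current_lsn=20))   # old LSN is higher than the new container's
print(outcome(request, current_lsn=900))  # old LSN is lower than the new container's
```

In this model the request pairs the new RID with a session token looked up through the old RID, which is the inconsistency the bullets describe. Updating both dictionaries together removes the mismatch.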

IMPACT

  • This would only permanently leave the CosmosClient in a bad state when all of these conditions apply:
    • Session consistency is used
    • Containers are being deleted and re-created
    • No point operations are used with the same CosmosClient instance (for point operations the RenameCollectionAwareClientRetryPolicy would have recovered the client instance because the session cache would have been purged)
    • A CollectionCache refresh happens after the re-creation without also updating the SessionContainer (for example via Container.GetFeedRanges, which refreshed the CollectionCache but not the SessionContainer). If both caches are still stale, a query would trigger a 410/1000 (Gone/NameCacheIsStale), for which the retry policy would have purged the session container and collection cache.
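The difference between the recoverable and the stuck case can be sketched as follows. This is a simplified, hypothetical model: `ToyClient`, its methods, and the RID and LSN values are assumptions made for illustration and do not mirror the SDK's real classes or retry policies.

```python
GONE_NAME_CACHE_STALE = "410/1000"
READ_SESSION_NOT_AVAILABLE = "404/1002"
OK = "200"


class ToyClient:
    """Hypothetical model of one client instance with two name-to-RID maps."""

    def __init__(self, backend):
        self.backend = backend       # name -> {"rid": str, "lsn": int}
        self.collection_cache = {}   # name -> rid
        self.session_names = {}      # name -> rid (session cache's own map)
        self.session_lsns = {}       # rid -> last LSN seen

    def warm_up(self, name):
        entry = self.backend[name]
        self.collection_cache[name] = entry["rid"]
        self.session_names[name] = entry["rid"]
        self.session_lsns[entry["rid"]] = entry["lsn"]

    def refresh_collection_cache_only(self, name):
        # Stand-in for a call that refreshes one cache but not the other.
        self.collection_cache[name] = self.backend[name]["rid"]

    def _send(self, name):
        entry = self.backend[name]
        if self.collection_cache[name] != entry["rid"]:
            return GONE_NAME_CACHE_STALE
        token_rid = self.session_names.get(name)
        if self.session_lsns.get(token_rid, 0) > entry["lsn"]:
            return READ_SESSION_NOT_AVAILABLE
        return OK

    def query(self, name):
        status = self._send(name)
        if status == GONE_NAME_CACHE_STALE:
            # Retry path: purge both caches, then try once more.
            old_rid = self.session_names.pop(name, None)
            self.session_lsns.pop(old_rid, None)
            self.collection_cache[name] = self.backend[name]["rid"]
            status = self._send(name)
        return status


def run(refresh_collection_cache_first):
    backend = {"Container_1": {"rid": "rid-old", "lsn": 500}}
    client = ToyClient(backend)
    client.warm_up("Container_1")
    backend["Container_1"] = {"rid": "rid-new", "lsn": 20}  # delete + re-create
    if refresh_collection_cache_first:
        client.refresh_collection_cache_only("Container_1")
    return client.query("Container_1")


print(run(refresh_collection_cache_first=False))  # both caches stale
print(run(refresh_collection_cache_first=True))   # only the session cache stale
```

When both maps are stale, the model hits 410/1000 first and the purge repairs both caches. When only the collection map was refreshed, no 410/1000 is raised, so nothing triggers the purge and the query keeps failing with 404/1002.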

TEST COVERAGE

  • Without this PR this test was consistently failing, either due to 404/1002 or because the requested session token was lower than the latest session token captured on the new container. The latter means a possible risk of returning outdated data and, in theory, violating read-your-own-writes semantics, because the requested session token was the last session token seen on the old container but was sent to the backend with the CollectionRid of the new container, which might have had further updates.

Type of change

  • [x] Bug fix (non-breaking change which fixes an issue)
  • [ ] New feature (non-breaking change which adds functionality)
  • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • [ ] This change requires a documentation update

Closing issues

To automatically close an issue: closes #IssueNumber

…up in SessionContainer and CollectionCache getting out-of-sync
@FabianMeiswinkel FabianMeiswinkel changed the title Fixes 404/1002 due to Dictionaries used for CollectionName->RID lookup in SessionContainer and CollectionCache getting out-of-sync Session: Fixes 404/1002 due to inconsistencies on internal caches on collection-recreate scenario for query workloads Mar 29, 2022
@FabianMeiswinkel FabianMeiswinkel changed the title Session: Fixes 404/1002 due to inconsistencies on internal caches on collection-recreate scenario for query workloads Session: Fixes NotFound/ReadSessionNotAvailable (404/1002) errors due to inconsistencies on internal caches on collection-recreate scenario for query workloads Mar 29, 2022
@FabianMeiswinkel FabianMeiswinkel changed the title Session: Fixes NotFound/ReadSessionNotAvailable (404/1002) errors due to inconsistencies on internal caches on collection-recreate scenario for query workloads Session: Fixes NotFound/ReadSessionNotAvailable (404/1002) errors due to inconsistencies on internal caches on collection-recreate scenario for query-only workloads Mar 29, 2022
@FabianMeiswinkel FabianMeiswinkel merged commit 4be9054 into Azure:master Mar 29, 2022