
[Diagnostics]Add ContinuationToken for pkRanges request in Diagnostics #3167

Closed

Description

Add the continuation token (CT) of the pkRanges request to the diagnostics to help investigate the following exceptions/scenarios.

Request failed at:

{""Id"":""PointOperationStatistics"",""ActivityId"":""e035d1b0-7646-4a09-97cc-002534d7b4c4"",""ResponseTimeUtc"":""2022-04-28T17:18:49.2581143Z"",""StatusCode"":404,""SubStatusCode"":0,""RequestCharge"":0,""RequestUri"":""dbs/usersettings/colls/usersettings"",""ErrorMessage"":""Microsoft.Azure.Documents.NotFoundException: Entity with the specified id does not exist in the system. More info: https://aka.ms/cosmosdb-tsg-not-found\r\nActivityId: e035d1b0-7646-4a09-97cc-002534d7b4c4, Microsoft.Azure.Cosmos.Tracing.TraceData.ClientSideRequestStatisticsTraceDatum, Windows/10.0.17763 cosmos-netstandard-sdk/3.19.3\r\n   at Microsoft.Azure.Cosmos.AddressResolver.<ResolveAddressesAndIdentityAsync>d__12.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Cosmos.AddressResolver.<ResolveAsync>d__9.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Cosmos.Routing.GlobalAddressResolver.<ResolveAsync>d__14.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.AddressSelector.<ResolveAddressesAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task 
task)\r\n   at Microsoft.Azure.Documents.AddressSelector.<ResolveAllTransportAddressUriAsync>d__3.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.StoreReader.<ReadMultipleReplicasInternalAsync>d__12.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.StoreReader.<ReadMultipleReplicaAsync>d__10.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.ConsistencyReader.<ReadSessionAsync>d__13.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)\r\n   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetryAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()\r\n   at Microsoft.Azure.Documents.ShouldRetryResult.ThrowIfDoneTrying(ExceptionDispatchInfo capturedException)\r\n   at Microsoft.Azure.Documents.BackoffRetryUtility`1.<ExecuteRetryAsync>d__5.MoveNext()\r\n--- End of stack trace from previous location where exception was thrown ---\r\n   at 

Request flow is:

We will use pkRangeId 125 as an example. When a customer uses direct mode, the SDK needs a few critical pieces of information from the gateway to find which server to send a request to: one is the pkRanges list, the other is the addresses for a given partition.

1. At some point, the SDK gets the pkRanges list from the gateway, which includes 125.
2. A split happens for pkRange 125.
3. A request comes in. The SDK uses the existing pkRanges from its cache to resolve which partition the request should be routed to, and the result is 125.
4. The SDK tries to get the address list for pkRange 125 from the gateway. The gateway encounters a ServiceFabricNotFoundException because the service has been deleted as part of the split process, so it returns an empty list.
5. Because the list is empty, the SDK tries to refresh its internal state, including fetching the latest pkRanges changes from the gateway. However, the SDK gets a NotModified result back from the gateway.
6. Steps #3 and #4 are repeated, and then a NotFoundException is returned.
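
The request flow above can be sketched as a small simulation. This is only an illustration under stated assumptions: `FakeGateway`, `FakeSdk`, the retry count of 2, the token value `ct-1`, and the replica names are all hypothetical and are not SDK or gateway APIs. The gateway stub always answers NotModified to mirror the unexplained behavior described here, not to claim that this is how the real gateway behaves.

```python
from dataclasses import dataclass, field

NOT_MODIFIED = 304  # HTTP status code for "Not Modified"


class NotFoundError(Exception):
    """Stand-in for the NotFoundException that the request ends with."""


@dataclass
class FakeGateway:
    """Hypothetical gateway stub, not the real gateway API."""
    addresses: dict  # pkRangeId -> replica addresses still being served

    def get_addresses(self, pk_range_id):
        # The parent range's service was deleted by the split, so the
        # lookup finds nothing and an empty list comes back.
        return self.addresses.get(pk_range_id, [])

    def read_pk_ranges(self, continuation_token):
        # Models the behavior under investigation: NotModified is returned
        # even though the range has split, so the token does not advance.
        return NOT_MODIFIED, None, continuation_token


@dataclass
class FakeSdk:
    gateway: FakeGateway
    cached_ranges: list
    continuation_token: str
    log: list = field(default_factory=list)

    def send(self, max_attempts=2):
        for attempt in range(1, max_attempts + 1):
            # Resolve the target range from the (possibly stale) cache.
            # Simplification: every request maps to the first cached range.
            pk_range_id = self.cached_ranges[0]
            addresses = self.gateway.get_addresses(pk_range_id)
            self.log.append(f"attempt {attempt}: addresses for {pk_range_id} = {addresses}")
            if addresses:
                return addresses[0]
            # Empty list: refresh pkRanges using the stored continuation token.
            status, ranges, token = self.gateway.read_pk_ranges(self.continuation_token)
            self.log.append(
                f"attempt {attempt}: pkRanges refresh with CT "
                f"{self.continuation_token!r} -> {status}"
            )
            if status != NOT_MODIFIED:
                self.cached_ranges, self.continuation_token = ranges, token
        raise NotFoundError(f"no addresses for pkRange {pk_range_id}")


# pkRange 125 has split into 126 and 127, but the SDK cache still holds 125.
gateway = FakeGateway(addresses={"126": ["replica-a"], "127": ["replica-b"]})
sdk = FakeSdk(gateway=gateway, cached_ranges=["125"], continuation_token="ct-1")
try:
    sdk.send()
    outcome = "ok"
except NotFoundError:
    outcome = "NotFound"

print(outcome)
for line in sdk.log:
    print(line)
```

In this sketch the cache never changes because every refresh returns NotModified, so both attempts resolve to 125 and the request fails. The log lines show why the continuation token matters: without it in the diagnostics, there is no way to tell which version of the pkRanges list the SDK asked the gateway to compare against.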

Because some information is missing from the current diagnostics, we cannot reason about what updates were made to the pkRanges cache on the client side, why we are getting NotModified from the gateway, or why the SDK tried to get addresses for pkRange 125 again (instead of the new child ranges).

Based on the investigation above, two pieces of information would be helpful for future investigations:

  1. ContinuationToken for pkRanges
  2. For change feed pkRanges request, log related changes -- [Diagnostics]Log pkRanges change #3178
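
As a rough illustration of the two items above, a pkRanges diagnostics entry could carry the continuation tokens and a summary of the cache change. Every field name here (`RequestContinuationToken`, `ResponseContinuationToken`, `AddedPkRangeIds`, `RemovedPkRangeIds`), the `Id` value, and the helper `pk_ranges_trace_entry` are assumptions made for this sketch; they are not the SDK's actual diagnostics schema.

```python
import json


def pk_ranges_trace_entry(request_ct, response_ct, status_code, previous_ids, current_ids):
    """Build a hypothetical diagnostics record for one pkRanges refresh."""
    previous, current = set(previous_ids), set(current_ids)
    return {
        "Id": "PkRangesRequest",  # hypothetical entry name
        "StatusCode": status_code,
        # Item 1: the token sent with the request and the one returned.
        "RequestContinuationToken": request_ct,
        "ResponseContinuationToken": response_ct,
        # Item 2: what the refresh changed in the client-side cache.
        "AddedPkRangeIds": sorted(current - previous),
        "RemovedPkRangeIds": sorted(previous - current),
    }


# A refresh that returned NotModified (304): token and cache are unchanged.
unchanged = pk_ranges_trace_entry("ct-1", "ct-1", 304, ["124", "125"], ["124", "125"])

# A refresh that observed the split of 125 into 126 and 127.
after_split = pk_ranges_trace_entry("ct-1", "ct-2", 200, ["124", "125"], ["124", "126", "127"])

print(json.dumps(unchanged, indent=2))
print(json.dumps(after_split, indent=2))
```

With entries like these, a NotModified response paired with an unchanged token would point at the gateway side, while a changed token with no recorded cache update would point at the client side.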
Labels: Diagnostics, feature-request
