
[tests] Increase efcore timeout for cosmos in CosmosEndToEnd #5464

Merged: 1 commit into dotnet:main on Aug 30, 2024

Conversation

@radical (Member) commented on Aug 27, 2024:

.. playground app.

The /ef request times out after 65 seconds, which is the default timeout for EF Core Cosmos requests. Bumping it to 120 seconds makes the test run more stable.

Fixes issue: #5415
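(For context, a minimal sketch of how such a timeout bump can be expressed through the EF Core Cosmos provider's RequestTimeout option. This is an illustration, not the exact diff in this PR; the connection string is a placeholder, whereas in the Aspire playground the connection comes from the "cosmos" resource.)

using Microsoft.EntityFrameworkCore;

// Sketch only: raise the EF Core Cosmos request timeout to 120 seconds via
// CosmosDbContextOptionsBuilder.RequestTimeout. "<emulator-connection-string>"
// is a placeholder; "ef" is the database name used by the playground.
var options = new DbContextOptionsBuilder<TestCosmosContext>()
    .UseCosmos(
        "<emulator-connection-string>",
        "ef",
        cosmosOptions => cosmosOptions.RequestTimeout(TimeSpan.FromSeconds(120)))
    .Options;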


@radical (Member, Author) commented on Aug 28, 2024:

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@radical added the area-app-model and testing ☑️ labels and removed the area-codeflow label on Aug 28, 2024
@radical (Member, Author) commented on Aug 29, 2024:

/azp run


Azure Pipelines successfully started running 1 pipeline(s).

@radical marked this pull request as ready for review on August 29, 2024 18:06
@@ -9,7 +9,10 @@

 builder.AddServiceDefaults();
 builder.AddAzureCosmosClient("cosmos");
-builder.AddCosmosDbContext<TestCosmosContext>("cosmos", "ef");
+builder.AddCosmosDbContext<TestCosmosContext>("cosmos", "ef", configureDbContextOptions =>
Member:

Should we do this, or should we wait longer in the tests? Can't we wait until the Cosmos container says "ready"?

@radical (Member, Author):

We do wait for Cosmos to report ready. And we invoke other endpoints that work:

 Calling endpoint 'http://localhost:36563/alive for resource 'api' in app 'CosmosEndToEnd.AppHost'
 Calling endpoint 'http://localhost:36563/health for resource 'api' in app 'CosmosEndToEnd.AppHost'
 Calling endpoint 'http://localhost:36563/ for resource 'api' in app 'CosmosEndToEnd.AppHost'

That last one is:

app.MapGet("/", async (CosmosClient cosmosClient) =>
{
    var db = (await cosmosClient.CreateDatabaseIfNotExistsAsync("db")).Database;
    var container = (await db.CreateContainerIfNotExistsAsync("entries", "/Id")).Container;

    // Add an entry to the database on each request.
    var newEntry = new Entry() { Id = Guid.NewGuid().ToString() };
    await container.CreateItemAsync(newEntry);

    var entries = new List<Entry>();
    var iterator = container.GetItemQueryIterator<Entry>(requestOptions: new QueryRequestOptions() { MaxItemCount = 5 });

    var batchCount = 0;
    while (iterator.HasMoreResults)
    {
        batchCount++;
        var batch = await iterator.ReadNextAsync();
        foreach (var entry in batch)
        {
            entries.Add(entry);
        }
    }

    return new
    {
        batchCount = batchCount,
        totalEntries = entries.Count,
        entries = entries
    };
});

So I would expect /ef to work fine too, but it often times out, as in this log:

 Calling endpoint 'http://localhost:36563/ef for resource 'api' in app 'CosmosEndToEnd.AppHost'

And the exceptions logged before the retries look like this:

A transient exception occurred during execution. The operation will be retried after 1081ms.
Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: RequestTimeout (408); Substatus: 0; ActivityId: 9e083133-9535-45ce-851a-713e40630608; Reason: (GatewayStoreClient Request Timeout. Start Time UTC:08/28/2024 22:25:38; Total Duration:65000.8637 Ms; Request Timeout 65000 Ms; Http Client Timeout:65000 Ms; Activity id: 9e083133-9535-45ce-851a-713e40630608;);
---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
---> System.Threading.Tasks.TaskCanceledException: The operation was canceled.
---> System.IO.IOException: Unable to read data from the transport connection: Operation canceled.
---> System.Net.Sockets.SocketException (125): Operation canceled
 --- End of inner exception stack trace ---
 at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
 at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource<System.Int32>.GetResult(Int16 token)
 at System.Net.Security.SslStream.EnsureFullTlsFrameAsync[TIOAdapter](CancellationToken cancellationToken, Int32 estimatedSize)
 at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
 at System.Net.Security.SslStream.ReadAsyncInternal[TIOAdapter](Memory`1 buffer, CancellationToken cancellationToken)
 at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
 at System.Net.Http.HttpConnection.InitialFillAsync(Boolean async)
  at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
  --- End of inner exception stack trace ---
  at System.Net.Http.HttpConnection.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
  at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
  at System.Net.Http.DiagnosticsHandler.SendAsyncCore(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
  at System.Net.Http.Metrics.MetricsHandler.SendAsyncWithMetrics(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
  at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
  at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
  --- End of inner exception stack trace ---
  at System.Net.Http.HttpClient.HandleFailure(Exception e, Boolean telemetryStarted, HttpResponseMessage response, CancellationTokenSource cts, CancellationToken cancellationToken, CancellationTokenSource pendingRequestsCts)
  at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
  at Microsoft.Azure.Cosmos.CosmosHttpClientCore.ExecuteHttpHelperAsync(HttpRequestMessage requestMessage, ResourceType resourceType, CancellationToken cancellationToken)
  at Microsoft.Azure.Cosmos.CosmosHttpClientCore.SendHttpHelperAsync(Func`1 createRequestMessageAsync, ResourceType resourceType, HttpTimeoutPolicy timeoutPolicy, IClientSideRequestStatistics clientSideRequestStatistics, CancellationToken cancellationToken)
  at Microsoft.Azure.Cosmos.GatewayStoreClient.InvokeAsync(DocumentServiceRequest request, ResourceType resourceType, Uri physicalAddress, CancellationToken cancellationToken)
  at Microsoft.Azure.Cosmos.GatewayStoreModel.ProcessMessageAsync(DocumentServiceRequest request, CancellationToken cancellationToken)
  at Microsoft.Azure.Cosmos.Handlers.TransportHandler.ProcessMessageAsync(RequestMessage request, CancellationToken cancellationToken)
  at Microsoft.Azure.Cosmos.Handlers.TransportHandler.SendAsync(RequestMessage request, CancellationToken cancellationToken)
  --- End of inner exception stack trace ---
  at Microsoft.Azure.Cosmos.ResponseMessage.EnsureSuccessStatusCode()
  at Microsoft.EntityFrameworkCore.Cosmos.Storage.Internal.CosmosClientWrapper.CreateContainerIfNotExistsOnceAsync(DbContext _, ValueTuple`2 parametersTuple, CancellationToken cancellationToken)
  at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.<>c__DisplayClass30_0`2.<<ExecuteAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
  at Microsoft.EntityFrameworkCore.Storage.ExecutionStrategy.ExecuteImplementationAsync[TState,TResult](Func`4 operation, Func`4 verifySucceeded, TState state, CancellationToken cancellationToken)
       --- Cosmos Diagnostics ---{"Summary":{"GatewayCalls":{"(0, 0)":1}},"name":"CreateContainerStreamAsync","start datetime":"2024-08-28T22:25:38.876Z","duration in milliseconds":65001.6493,"data":{"Client Configuration":{"Client Created Time Utc":"2024-08-28T22:24:33.6750368Z","MachineId":"vmId:d0263488-b5f7-4868-9f1d-8b403582d46c","VM Region":"westus","NumberOfClientsCreated":2,"NumberOfActiveClients":2,"ConnectionMode":"Gateway","User Agent":"cosmos-netstandard-sdk/3.42.0|2|X64|Ubuntu 22.04.4 LTS|.NET 8.0.7|DN| Microsoft.EntityFrameworkCore.Cosmos/8.0.8","ConnectionConfig":{"gw":"(cps:50, urto:6, p:False, httpf: False)","rntbd":"(cto: 5, icto: -1, mrpc: 30, mcpe: 65535, erd: True, pr: ReuseUnicastPort)","other":"(ed:True, be:False)"},"ConsistencyConfig":"(consistency: NotSet, prgns:[], apprgn: )","ProcessorCount":2}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RequestInvokerHandler","duration in milliseconds":65001.5201,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.DiagnosticsHandler","duration in milliseconds":65001.4947,"data":{"System Info":{"systemHistory":[{"dateUtc":"2024-08-28T22:25:50.7661661Z","cpu":6.727,"memory":3280836.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.292,"availableThreads":32766,"minThreads":2,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-08-28T22:26:00.7664554Z","cpu":6.593,"memory":3280888.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0488,"availableThreads":32766,"minThreads":2,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-08-28T22:26:10.7667479Z","cpu":15.760,"memory":3279748.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.232,"availableThreads":32766,"minThreads":2,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-08-28T22:26:20.7673058Z","cpu":7.456,"memory":3280048.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.2036,"availableThreads":32766,"minThreads":2,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-08-28T22:26:30.7676397Z","cpu":7.280,"memory":3280100.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.0502,"availableThreads":32766,"minThreads":2,"maxThreads":32767},"numberOfOpenTcpConnection":0},{"dateUtc":"2024-08-28T22:26:40.7679509Z","cpu":6.448,"memory":3279920.000,"threadInfo":{"isThreadStarving":"False","threadWaitIntervalInMs":0.2664,"availableThreads":32766,"minThreads":2,"maxThreads":32767},"numberOfOpenTcpConnection":0}]}},"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TelemetryHandler","duration in milliseconds":65001.4701,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RetryHandler","duration in milliseconds":65001.4639,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.RouterHandler","duration in milliseconds":65001.4264,"children":[{"name":"Microsoft.Azure.Cosmos.Handlers.TransportHandler","duration in milliseconds":65001.4223,"children":[{"name":"Microsoft.Azure.Cosmos.GatewayStoreModel Transport Request","duration in milliseconds":65001.2505,"data":{"Client Side Request 
Stats":{"Id":"AggregatedClientSideRequestStatistics","ContactedReplicas":[],"RegionsContacted":[],"FailedReplicas":[],"AddressResolutionStatistics":[],"StoreResponseStatistics":[],"HttpResponseStats":[{"StartTimeUTC":"2024-08-28T22:25:38.8768352Z","DurationInMs":65000.7969,"RequestUri":"https://127.0.0.1:38213/dbs/ef/colls","ResourceType":"Collection","HttpMethod":"POST","ActivityId":"9e083133-9535-45ce-851a-713e40630608","ExceptionType":"System.Threading.Tasks.TaskCanceledException","ExceptionMessage":"The operation was canceled."}]}}}]}]}]}]}]}]}]}

Note the 'Request Timeout 65000 Ms' in the CosmosException. Looking at the trace, my understanding is that this is the timeout from EF Core, which does seem to use a 65-second default timeout for Cosmos.

I'm not sure why it takes longer for the /ef case.
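(For readers without the playground source: a hypothetical reconstruction of what the /ef endpoint roughly does, inferred from the CosmosClientWrapper.CreateContainerIfNotExistsOnceAsync frame in the trace above. The Entries DbSet and the EntityFrameworkEntry type are assumed names, not necessarily the actual playground code; ToListAsync needs using Microsoft.EntityFrameworkCore.)

// Hypothetical sketch of the /ef endpoint; the actual playground code may differ.
app.MapGet("/ef", async (TestCosmosContext context) =>
{
    // EnsureCreatedAsync provisions the database/container through the EF Core Cosmos
    // provider (CreateContainerIfNotExistsOnceAsync in the trace), which is the call
    // that runs into the 65-second request timeout.
    await context.Database.EnsureCreatedAsync();

    // Write one entry per request, then read everything back.
    context.Entries.Add(new EntityFrameworkEntry { Id = Guid.NewGuid().ToString() });
    await context.SaveChangesAsync();

    return await context.Entries.ToListAsync();
});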

Member:

If the emulator is running, waiting over 1 minute to execute a query seems crazy. I'm not sure waiting for another minute is going to help.

@roji @AndriySvyryd @ajcvickers - Is there anything special about the EF cosmos client here?

@radical (Member, Author):

I'm not sure why, but it does seem to help. I ran the same test 6 times on CI in a single run, and across two such runs it didn't fail. I can try more iterations too.

It would be great to find out if that is reliable or not, and why this takes longer.

@AndriySvyryd (Member) commented on Aug 29, 2024:

"Is there anything special about the EF cosmos client here?"

There shouldn't be, see https://github.com/dotnet/efcore/blob/main/src/EFCore.Cosmos/Storage/Internal/SingletonCosmosClientWrapper.cs#L32

From https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/troubleshoot-dotnet-sdk-request-timeout?tabs=cpu-new:

  • Is there high CPU utilization?
  • Is there high degree of concurrency?
  • Is it a large request or response?
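(As the SingletonCosmosClientWrapper link above suggests, the provider ultimately builds an ordinary CosmosClient from the options given to UseCosmos, so it is subject to the same SDK timeouts as a hand-built client. A rough sketch of that idea follows; it is not the actual EF Core source, the connection string is a placeholder, and the option values are illustrative.)

using Microsoft.Azure.Cosmos;

// Rough sketch, not the EF Core source: the provider's singleton wrapper
// constructs a plain CosmosClient from CosmosClientOptions.
var clientOptions = new CosmosClientOptions
{
    ConnectionMode = ConnectionMode.Gateway,     // the diagnostics above show Gateway mode
    RequestTimeout = TimeSpan.FromSeconds(120),  // illustrative; matches the bump described in this PR
};

using var client = new CosmosClient("<emulator-connection-string>", clientOptions);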

@radical (Member, Author) commented on Aug 30, 2024:

/azp run

@eerhardt (Member) left a comment:

Let's make this change and see if that fixes the stability of these tests. I hit this one myself earlier today.


Azure Pipelines successfully started running 1 pipeline(s).

@radical merged commit dbdec1c into dotnet:main on Aug 30, 2024
11 checks passed
@radical deleted the playground-cosmos-ef branch on August 30, 2024 20:20
The github-actions bot locked and limited conversation to collaborators on Sep 30, 2024