Investigate Failing Test: Lucene.Net.Replicator.Http.HttpReplicatorTest::TestBasic() #363

NightOwl888 · 2020-10-06T15:34:53Z

This test failure began occurring when we set up the nightly build.

Note that the failure could have something to do with changing the nightly limits of Rarely() and AtLeast() which were changed to try to make the nightly tests run in under 1 hour.

Error message
System.Threading.Tasks.TaskCanceledException : A task was canceled.


Stack trace
   at Microsoft.AspNetCore.TestHost.ResponseStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
   at System.IO.Stream.CopyToAsyncInternal(Stream destination, Int32 bufferSize, CancellationToken cancellationToken)
   at System.Net.Http.HttpContent.LoadIntoBufferAsyncCore(Task serializeToStreamTask, MemoryStream tempBuffer)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Lucene.Net.Replicator.Http.HttpClientBase.ExecuteGet(String request, String[] parameters) in D:\a\1\s\src\Lucene.Net.Replicator\Http\HttpClientBase.cs:line 229
   at Lucene.Net.Replicator.Http.HttpReplicator.CheckForUpdate(String currentVersion) in D:\a\1\s\src\Lucene.Net.Replicator\Http\HttpReplicator.cs:line 71
   at Lucene.Net.Replicator.ReplicationClient.DoUpdate() in D:\a\1\s\src\Lucene.Net.Replicator\ReplicationClient.cs:line 228
   at Lucene.Net.Replicator.ReplicationClient.UpdateNow() in D:\a\1\s\src\Lucene.Net.Replicator\ReplicationClient.cs:line 472
   at Lucene.Net.Replicator.Http.HttpReplicatorTest.TestBasic() in D:\a\1\s\src\Lucene.Net.Tests.Replicator\Http\HttpReplicatorTest.cs:line 103

The text was updated successfully, but these errors were encountered:

…pache#363 and apache#269

…363 and #269

jeme · 2021-05-05T17:28:24Z

I have not had a lucene environment up and running for a while (sadly) - is this reproduceable locally?

NightOwl888 · 2021-05-05T22:05:26Z

Unfortunately, even with the [Repeat] or [FindFirstFailingSeed] attributes I was unable to get this to fail on my local (Windows 10) system, but admittedly I gave up rather easily after only a couple of attempts. Perhaps doing a Debug build on Azure DevOps, downloading the testbinaries assets, and then doing a local dotnet test while attaching a debugger is a way of reproducing it, but that hasn't been attempted yet.

This might have something to do with the fact that we have upgraded Microsoft.AspNetCore.TestHost to 2.0.0 when we dropped support for .NET Standard 1.x, or it might just be something we missed because the tests weren't run frequently before the nightly build was set up.

Off topic: It isn't clear whether the ReplicationClient.ReplicationThread is working as was expected, being that in Lucene it was passing ThreadInterruptedException from the background thread to the calling thread. Would reverting this to a ThreadJob subclass (going back to a more Java-like design) break anything substantial? ThreadJob takes care of propagating the first exception from the background thread to the calling thread.

jeme · 2021-05-14T18:29:32Z

Ok, that is unfortunate as that would likely make it easier to fix.
I have been trying to go through the code to formalize a theory, but so far I have not been able to come up with something good, but this mapping "ASYNC" code to old style blocking code so I am wondering if that can cause trouble, after all it depends on where and in what the code runs how tasks are handled. (That is one thing I really think was a shitty choice by the .NET framework designers... regardless of them wanting to make it "Easier" to write async code in form based apps... And I am wondering if that is a thing that might hit us here...)

As far as goes for the threading, I choose the design that is in place as that does not allocate a dedicated thread which would be against general recommendations, as well as not allowing workitem exceptions to bring down the whole process (which is a difference to JAVA).

But I can't see that it would break anything to switch to the ThreadJob class as such. We could also do the same in the current implementation, catching the exception that halted the executing thread and then rethrowing that on a call to "StopUpdateThread".

Threading and Exception in Java and .NET is very different, so it can be hard to be certain to ensure the Java behavior fully, so i think it's worth looking critically on the "Java Design" in such areas as we can hardly avoid differences anyways, then if it can be implemented in a .NET way in a "isolated" way, I would prefer that. But I have not hard feelings otherwise.

StopUpdateThread is never called as far as I can tell in this case though so regardless of one or the other, I think that could be a discusison that could be taken in another issue.

NightOwl888 · 2021-10-23T22:18:35Z

This seems to be a known issue with Microsoft and there is a lot of discussion about it. However, I haven't yet tried any of the proposed solutions.

…t from 1000 ms to 100 s. Renamed DEFAULT_CONNECTION_TIMEOUT to DEFAULT_TIMEOUT and ConnectionTimeout property to Timeout to match HttpClient. Fixes apache#363.

NightOwl888 · 2021-11-06T18:01:43Z

@jeme

I just submitted some patches for Replicator in #534, including the fix for this issue and 2 other tests that were failing more rarely and sometimes deadlocking.

During this process, I ended up going through and updating the AspNetCore tests to follow a more modern approach (with a conditional compile for the .NET Core 3.1 and lower tests). And of course, it was a bit cumbersome because of the lack of async/await support, as you mentioned. It seems like there should be a way to get there, but more analysis is needed to know how deep that rabbit hole goes. It would be nice to support async/await for the AspNetCore users of replicator, and if we do provide a solution it would be safer to provide it before the 4.8.0 release to avoid breaking any APIs. That said, it isn't a critical feature so we could live without it if need be.

But it doesn't look like it is a contributor to any real problem with Replicator. Without the async support, we are probably blocking other parallel tasks from occurring. I had to add a configuration setting to the context to allow blocking IO code to run because in .NET 5 it will throw an InvalidOperationException to remind you that it is bad to run synchronous code when you can run async.

But I can't see that it would break anything to switch to the ThreadJob class as such. We could also do the same in the current implementation, catching the exception that halted the executing thread and then rethrowing that on a call to "StopUpdateThread".

The tests that were overriding HandleUpdateException and throwing within that method, which is called within an unhandled catch block. So, any exceptions that were thrown in the tests were causing the thread to terminate. I suspect this was causing the failures and possibly contributing to the deadlocks, but I tested this very thoroughly, and there are no longer any failures after moving to ThreadJob.

You are right, it could have been done in the current implementation, but this does end up being less code to maintain by subclassing ThreadJob. Maybe there is a way we can still convert ThreadJob to use a ThreadPool, and it is probably something that is worth looking into being that it is used frequently across the codebase.

…t from 1000 ms to 100 s. Renamed DEFAULT_CONNECTION_TIMEOUT to DEFAULT_TIMEOUT and ConnectionTimeout property to Timeout to match HttpClient. Fixes #363.

NightOwl888 added up-for-grabs This issue is open to be worked on by anyone help-wanted Extra attention is needed testability test-failure pri:high labels Oct 6, 2020

NightOwl888 added this to the 4.8.0 milestone Oct 6, 2020

NightOwl888 added a commit to NightOwl888/lucenenet that referenced this issue Nov 20, 2020

Added the AwaitsFixAttribute to the known failing tests described in a…

b45f89a

…pache#363 and apache#269

NightOwl888 added a commit to NightOwl888/lucenenet that referenced this issue Nov 20, 2020

Added the AwaitsFixAttribute to the known failing tests described in a…

8919f13

…pache#363 and apache#269

NightOwl888 added a commit to NightOwl888/lucenenet that referenced this issue Nov 21, 2020

Added the AwaitsFixAttribute to the known failing tests described in a…

7e6eb91

…pache#363 and apache#269

NightOwl888 added a commit that referenced this issue Nov 21, 2020

Added the AwaitsFixAttribute to the known failing tests described in #…

aa37f7d

…363 and #269

NightOwl888 removed up-for-grabs This issue is open to be worked on by anyone help-wanted Extra attention is needed labels Nov 6, 2021

NightOwl888 mentioned this issue Nov 6, 2021

Lucene.Net.Replicator Bug Fixes (Fixes #363) #534

Merged

NightOwl888 closed this as completed in #534 Nov 7, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Investigate Failing Test: Lucene.Net.Replicator.Http.HttpReplicatorTest::TestBasic() #363

Investigate Failing Test: Lucene.Net.Replicator.Http.HttpReplicatorTest::TestBasic() #363

NightOwl888 commented Oct 6, 2020

jeme commented May 5, 2021

NightOwl888 commented May 5, 2021

jeme commented May 14, 2021

NightOwl888 commented Oct 23, 2021

NightOwl888 commented Nov 6, 2021

Investigate Failing Test: Lucene.Net.Replicator.Http.HttpReplicatorTest::TestBasic() #363

Investigate Failing Test: Lucene.Net.Replicator.Http.HttpReplicatorTest::TestBasic() #363

Comments

NightOwl888 commented Oct 6, 2020

jeme commented May 5, 2021

NightOwl888 commented May 5, 2021

jeme commented May 14, 2021

NightOwl888 commented Oct 23, 2021

NightOwl888 commented Nov 6, 2021