Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Investigate Failing Test: Lucene.Net.Replicator.Http.HttpReplicatorTest::TestBasic() #363

Closed
NightOwl888 opened this issue Oct 6, 2020 · 5 comments · Fixed by #534
Closed

Comments

@NightOwl888
Copy link
Contributor

This test failure began occurring when we set up the nightly build.

Note that the failure could have something to do with changing the nightly limits of Rarely() and AtLeast() which were changed to try to make the nightly tests run in under 1 hour.

Error message
System.Threading.Tasks.TaskCanceledException : A task was canceled.


Stack trace
   at Microsoft.AspNetCore.TestHost.ResponseStream.ReadAsync(Byte[] buffer, Int32 offset, Int32 count, CancellationToken cancellationToken)
   at System.IO.Stream.CopyToAsyncInternal(Stream destination, Int32 bufferSize, CancellationToken cancellationToken)
   at System.Net.Http.HttpContent.LoadIntoBufferAsyncCore(Task serializeToStreamTask, MemoryStream tempBuffer)
   at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
   at Lucene.Net.Replicator.Http.HttpClientBase.ExecuteGet(String request, String[] parameters) in D:\a\1\s\src\Lucene.Net.Replicator\Http\HttpClientBase.cs:line 229
   at Lucene.Net.Replicator.Http.HttpReplicator.CheckForUpdate(String currentVersion) in D:\a\1\s\src\Lucene.Net.Replicator\Http\HttpReplicator.cs:line 71
   at Lucene.Net.Replicator.ReplicationClient.DoUpdate() in D:\a\1\s\src\Lucene.Net.Replicator\ReplicationClient.cs:line 228
   at Lucene.Net.Replicator.ReplicationClient.UpdateNow() in D:\a\1\s\src\Lucene.Net.Replicator\ReplicationClient.cs:line 472
   at Lucene.Net.Replicator.Http.HttpReplicatorTest.TestBasic() in D:\a\1\s\src\Lucene.Net.Tests.Replicator\Http\HttpReplicatorTest.cs:line 103
@jeme
Copy link
Contributor

jeme commented May 5, 2021

I have not had a lucene environment up and running for a while (sadly) - is this reproduceable locally?

@NightOwl888
Copy link
Contributor Author

Unfortunately, even with the [Repeat] or [FindFirstFailingSeed] attributes I was unable to get this to fail on my local (Windows 10) system, but admittedly I gave up rather easily after only a couple of attempts. Perhaps doing a Debug build on Azure DevOps, downloading the testbinaries assets, and then doing a local dotnet test while attaching a debugger is a way of reproducing it, but that hasn't been attempted yet.

This might have something to do with the fact that we have upgraded Microsoft.AspNetCore.TestHost to 2.0.0 when we dropped support for .NET Standard 1.x, or it might just be something we missed because the tests weren't run frequently before the nightly build was set up.

Off topic: It isn't clear whether the ReplicationClient.ReplicationThread is working as was expected, being that in Lucene it was passing ThreadInterruptedException from the background thread to the calling thread. Would reverting this to a ThreadJob subclass (going back to a more Java-like design) break anything substantial? ThreadJob takes care of propagating the first exception from the background thread to the calling thread.

@jeme
Copy link
Contributor

jeme commented May 14, 2021

Ok, that is unfortunate as that would likely make it easier to fix.
I have been trying to go through the code to formalize a theory, but so far I have not been able to come up with something good, but this mapping "ASYNC" code to old style blocking code so I am wondering if that can cause trouble, after all it depends on where and in what the code runs how tasks are handled. (That is one thing I really think was a shitty choice by the .NET framework designers... regardless of them wanting to make it "Easier" to write async code in form based apps... And I am wondering if that is a thing that might hit us here...)

As far as goes for the threading, I choose the design that is in place as that does not allocate a dedicated thread which would be against general recommendations, as well as not allowing workitem exceptions to bring down the whole process (which is a difference to JAVA).

But I can't see that it would break anything to switch to the ThreadJob class as such. We could also do the same in the current implementation, catching the exception that halted the executing thread and then rethrowing that on a call to "StopUpdateThread".

Threading and Exception in Java and .NET is very different, so it can be hard to be certain to ensure the Java behavior fully, so i think it's worth looking critically on the "Java Design" in such areas as we can hardly avoid differences anyways, then if it can be implemented in a .NET way in a "isolated" way, I would prefer that. But I have not hard feelings otherwise.

StopUpdateThread is never called as far as I can tell in this case though so regardless of one or the other, I think that could be a discusison that could be taken in another issue.

@NightOwl888
Copy link
Contributor Author

This seems to be a known issue with Microsoft and there is a lot of discussion about it. However, I haven't yet tried any of the proposed solutions.

@NightOwl888 NightOwl888 removed up-for-grabs This issue is open to be worked on by anyone help-wanted Extra attention is needed labels Nov 6, 2021
NightOwl888 added a commit to NightOwl888/lucenenet that referenced this issue Nov 6, 2021
…t from 1000 ms to 100 s. Renamed DEFAULT_CONNECTION_TIMEOUT to DEFAULT_TIMEOUT and ConnectionTimeout property to Timeout to match HttpClient. Fixes apache#363.
@NightOwl888
Copy link
Contributor Author

@jeme

I just submitted some patches for Replicator in #534, including the fix for this issue and 2 other tests that were failing more rarely and sometimes deadlocking.

During this process, I ended up going through and updating the AspNetCore tests to follow a more modern approach (with a conditional compile for the .NET Core 3.1 and lower tests). And of course, it was a bit cumbersome because of the lack of async/await support, as you mentioned. It seems like there should be a way to get there, but more analysis is needed to know how deep that rabbit hole goes. It would be nice to support async/await for the AspNetCore users of replicator, and if we do provide a solution it would be safer to provide it before the 4.8.0 release to avoid breaking any APIs. That said, it isn't a critical feature so we could live without it if need be.

But it doesn't look like it is a contributor to any real problem with Replicator. Without the async support, we are probably blocking other parallel tasks from occurring. I had to add a configuration setting to the context to allow blocking IO code to run because in .NET 5 it will throw an InvalidOperationException to remind you that it is bad to run synchronous code when you can run async.

But I can't see that it would break anything to switch to the ThreadJob class as such. We could also do the same in the current implementation, catching the exception that halted the executing thread and then rethrowing that on a call to "StopUpdateThread".

The tests that were overriding HandleUpdateException and throwing within that method, which is called within an unhandled catch block. So, any exceptions that were thrown in the tests were causing the thread to terminate. I suspect this was causing the failures and possibly contributing to the deadlocks, but I tested this very thoroughly, and there are no longer any failures after moving to ThreadJob.

You are right, it could have been done in the current implementation, but this does end up being less code to maintain by subclassing ThreadJob. Maybe there is a way we can still convert ThreadJob to use a ThreadPool, and it is probably something that is worth looking into being that it is used frequently across the codebase.

NightOwl888 added a commit that referenced this issue Nov 7, 2021
…t from 1000 ms to 100 s. Renamed DEFAULT_CONNECTION_TIMEOUT to DEFAULT_TIMEOUT and ConnectionTimeout property to Timeout to match HttpClient. Fixes #363.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants