-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HystrixCollapser completes without emitting #955
Comments
Looks similar to #815 |
@amitcse Are you able to provide implementations of those collapsers that would work in a unit test? |
@mattrjacobs I don't see the |
I think I removed this code when I fixed the previous
|
Thanks. I will try this and let you know if I am able to find something. |
@mattrjacobs Sorry for the late reply. I do see Stacktrace:
In this failure, |
@amitcse That does look like a real issue. Thanks for the repro. I'll re-add this unit test and figure out whats going on |
Hi @mattrjacobs , I was able to reproduce this issue of Also, I would like point out that the behaviour I am seeing is very similar to #257 . |
Interesting. Thanks for the details. I'll see if I can write a test that does more concurrent work and make the failure more likely. |
@mattrjacobs Were you able to reproduce with concurrent test ? If you could give me any example of concurrent test, I would also try to write the test ? |
I haven't had time yet, but this is on my list for today. Will let you know |
@amitcse I just added a test that spins up On my local machine, with Can you take a look at those tests, and see if there's anything you can think to add? |
Thanks @mattrjacobs for the test. I ran the test for 50000 trials with
|
OK, I'm investigating this now. |
@mattrjacobs Any update on this ? |
Not yet. I haven't been able to replicate |
@mattrjacobs Were you able to reproduce ? Did you try the the number of trials and threads that I mentioned above ? I am able to reproduce it very easily on my machine. Can you also please check if this issue maybe occurring due to the race condition mentioned in issue 257 ? |
I can reproduce the failure above, but the root cause is a timeout on the underlying HystrixObservableCommand. My suspicion is that that eventually happens when a command starts executing, then doesn't get to run again due to bad luck from the thread-scheduler. This manifests as a command exception, which shows up to the caller as missing data (due to the The issue you referenced (#257) was fixed in RxJava 0.18.2 and picked up in Hystrix 1.3.6. So I'm confident that issue has been resolved. There may be a different concurrency issue somewhere, but I still haven't been able to reproduce that. Since you started the issue with a concrete example of something failing, can you share that? |
@mattrjacobs If this issue is because of HystrixObservableCommand timeout, shouldn't we get a timeout exception? |
The failure that I am seeing is same as the concurrent test failure. There are no exception stack trace which I can share. The collapser doesn’t emit anything and the data is simply missed. We can only know this by an explicit check. |
@amitcse You're correct on the timeout bit. I just found a way to more reliably replicate this - by running the HystrixObservableCommand on a non-shared I/O thread, I can usually get an error in the first 400 trials. Am digging in further now |
I can get the non-delivery of values to go away reliably if I "speed up" the collapser. I tested 1000 trials of m=1000,n=500 many times. I discovered a fairly large contention point in the unit test. Each collapser adds a callback to the same Archaius property, which uses a CopyOnWriteSet under the hood. This was dramatically slowing down later trials of the unit test. When I switched away from Archaius and just used non-configurable collapser properties, the issue was not reproducible. I think the Archaius slowness is a separate issue, and well worth figuring out. I think the problem with collapsers is something around an unexpected code path when things get extremely latent. I'll keep debugging that angle. |
@mattrjacobs Any update on finding the bug ? |
@mattrjacobs These changes will help to surface the error, instead of silently missing data as happening currently. But, the issue needs to be fixed. I frequently see this error in production, and to overcome this we had to implement a retry command layer, so that failures can be masked to client. How is it being used in Netflix prod environment ? Is this issue not observed ? |
If you could include an example of code that's not receiving values, that would be helpful. We do use collapsing heavily in production, and we've never observed this issue ourselves. |
My production code is very similar to the test you wrote, only that actual network calls are being made. Did the test didn't help much ? |
This unit test doesn't fail for me (nor in CI), so I think the problem you're seeing must be different |
Are the tests not failing after you switched from Archaius and used non-configurable collapsers, or it never failed ? I was under the impression that the same tests are failing for you too and you are debugging that angle when things are latent. Anyway, lets merge this code so that I'll be able get the stack traces after these changes you made. Hopefully, those will help us to get to the root cause. Meanwhile, I'll see if I can write something more than those tests that may help us isolate the issue. |
Yes, this code is already merged in. It's still using Archaius, just caching the properties setter, so not as many callbacks get used. As a result, no synthetic slowdowns are introduced. These tests are not failing for me |
Closing due to inactivity. Please re-open if there's more to discuss |
Hi, I am facing a weird issue wherein the observable returned by
HystrixCollapser
completes without emitting any item. Since I am usingHystrixCollapser.toObservable()
inside to do composition using RxJava operator, this is leading to items being missed from the data flow.On debugging I have found that the
HystrixCollapser
do call theHystrixCommand.run()
andmapResponsetoRequests
function is also called.I am unable to debug this more because this behaviour is completely random.
For reference, my rx observable composition is something like this,
Since any of collapser2/3/4 completes without emitting, this lead to zip being called and corresponding related result goes missing from the final list.
The text was updated successfully, but these errors were encountered: