Fix stale notification when eviction raced with an update #144

ben-manes · 2017-02-18T05:50:47Z

[SOLR-10141](https://issues.apache.org/jira/browse/SOLR-10141) (thanks Yonik!)

As an optimization, an update is allowed to bypass the hash map and
synchronize on the read entry directly. In this block it checks
liveliness, performs the mutation, and notifies the writer. This avoids
more expensive computations through the map.

Previously, an eviction was performed in a computation to remove the
entry and notify the writer, or resurrect. Inside the computation the
entry was not synchronized on, and that was done only after it was
removed from the table. The removal listener was notified with the
value initially read at the start of this method.

This allowed an update to modify the value while (or after) the entry
was being removed from the hash table. This led to notifying the writer
and removal listener with the stale value. Because the writer must
be called exclusively with the mutation, this computation must use
a synchronized guard. Otherwise we might have preferred to re-read
the value when notifying the listener. This adds a slight penalty on
eviction (usually async) while allowing `put` to still be fast (but it
may block a little more often).
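
For readers unfamiliar with the code, here is a minimal sketch of the guard described above. It is not the actual Caffeine implementation: `EvictionRaceSketch`, `Node`, `alive`, `update`, and `evict` are hypothetical names, and the writer and removal listener are reduced to comments and a return value. It only illustrates holding the entry's monitor inside the map computation, so the value handed to the listener cannot be stale.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model of the race; not Caffeine's real classes.
class EvictionRaceSketch {
  /** Hypothetical entry whose fields are guarded by the entry's own monitor. */
  static final class Node {
    String value;
    boolean alive = true;

    Node(String value) {
      this.value = value;
    }
  }

  final ConcurrentHashMap<String, Node> data = new ConcurrentHashMap<>();

  /** Fast-path update: skips the map computation and locks the entry read from the map. */
  boolean update(String key, String newValue) {
    Node node = data.get(key);
    if (node == null) {
      return false;
    }
    synchronized (node) {
      if (!node.alive) {
        // The entry was evicted after we read it; a real cache would fall back to the map.
        return false;
      }
      node.value = newValue;
      // A writer would be notified here, while the lock is still held.
      return true;
    }
  }

  /** Eviction: removes the mapping in a computation while holding the entry's lock. */
  String evict(String key) {
    String[] notified = new String[1];
    data.computeIfPresent(key, (k, node) -> {
      synchronized (node) {
        // Read under the lock, so a concurrent update cannot make this value stale.
        notified[0] = node.value;
        node.alive = false;
        // A writer would be told about the removal here.
      }
      return null; // returning null removes the mapping
    });
    // The value a removal listener would receive, or null if nothing was mapped.
    return notified[0];
  }
}
```

In this sketch an update that wins the entry lock first is always reflected in the value returned by `evict`, and an update that loses sees `alive == false` and reports failure instead of silently mutating a removed entry.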

`putSlow` was removed as it is no longer needed. It was a
computation-based write that was safe from this issue. But it was only
used when the new weight was zero, as that update race would cause an
incorrect eviction. Now that the primary path is safe from this race,
it is unnecessary.

@yonik please review (thanks for finding this!)
@johnou (I think you mentioned that you may have observed this)

coveralls · 2017-02-18T06:27:48Z

Coverage decreased (-0.1%) to 93.757% when pulling f2dd9f0 on solr into c5cc2d4 on master.

yonik

Changes look good Ben, thanks for the quick fix!

johnou

Thanks for the heads up!

ben-manes · 2017-02-18T22:51:13Z

Hey @johnou, you might be interested in this slide deck. I didn't know your email so I couldn't ping you about it, but it seemed to be in your area of interest. It was for a private talk at a nearby startup a week ago.


johnou · 2017-02-18T23:15:49Z

@ben-manes you bet I am, always welcome to reach out to me at johno.crawford@gmail.com

ben-manes · 2017-02-19T05:13:41Z

I fixed a similar bug when clearing the cache using `invalidateAll()` or `asMap().clear()`. I added a simpler version of @yonik's test that fails prior to the changes. I've also added unit tests that rely on the implementation's explicit lock ordering to deterministically validate the eviction and clear bugs.

I've run all the tests locally and will release after the CI confirms.

ben-manes · 2017-02-19T07:00:08Z

Released 2.4.0

johnou · 2017-02-20T15:31:48Z

@ben-manes did you ever hear back from Doug about the tickless hierarchical timing wheel in Java?

ben-manes · 2017-02-20T16:14:38Z

When I brought it up he said it was on his backlog too, but I haven't talked with him about it since. I think jdk9 has a dedicated scheduler thread for the common fjp. Perhaps if it's overloaded he'll add a ticking one.

I would like to add that feature by a jdk9 release, since I'll bump up the version and fix API quirks. I don't think it's hard, just requires a little motivation.

ben-manes force-pushed the solr branch 2 times, most recently from 41376d6 to f2dd9f0 Compare February 18, 2017 06:01

yonik approved these changes Feb 18, 2017

johnou approved these changes Feb 18, 2017

ben-manes force-pushed the solr branch from f2dd9f0 to 83b47d1 Compare February 18, 2017 22:54

ben-manes merged commit 83b47d1 into master Feb 18, 2017

ben-manes deleted the solr branch February 18, 2017 22:55
