Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix stale notification when eviction raced with an update #144

Merged
merged 1 commit into from
Feb 18, 2017
Merged

Conversation

ben-manes
Copy link
Owner

@ben-manes ben-manes commented Feb 18, 2017

SOLR-10141 (thanks Yonik!)

As an optimization, an update is allowed to bypass the hash map and
synchronize on the read entry directly. In this block it checks
liveliness, perform the mutation, and notifies the writer. This avoids
more expensive computations through the map.

Previously, an eviction was performed in a computation to remove the
entry and notify the writer, or resurrect. Inside the computation the
entry was not synchronized on, and that was done only after it was
removed from the table. The removal listener was notified with the
value initially read at the start of this method.

This allowed an update to modify the value while (or after) the entry
was removing it from the hash table. This led to notifying the writer
and removal listener with the stale value. Because the writer must
be called exclusively with the mutation, this computation must use
a synchronized guard. Otherwise we might have preferred to re-read
the value when notifying the listener. This adds a slight penalty on
eviction (async) while allowing put to still be fast (but may block).

putSlow was removed as not longer needed. It was a computation-based
write that was safe from this issue. But it was only used when the new
weight was zero, as that update race would cause an incorrect eviction.
Now that the primary path is safe from this race, its unnecessary.

@yonik please review (thanks for finding this!)
@johnou (I think you mentioned that you may have observed this)

@ben-manes ben-manes force-pushed the solr branch 2 times, most recently from 41376d6 to f2dd9f0 Compare February 18, 2017 06:01
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 93.757% when pulling f2dd9f0 on solr into c5cc2d4 on master.

Copy link

@yonik yonik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good Ben, thanks for the quick fix!

Copy link

@johnou johnou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads up!

@ben-manes
Copy link
Owner Author

Hey @johnou, you might be interested in this slide deck. I didn't know your email so couldn't ping you about it, but seemed to be in your area of interest. It was for a private talk at a startup nearby a week ago.

[SOLR-10141](https://issues.apache.org/jira/browse/SOLR-10141) (thanks Yonik!)

As an optimization, an update is allowed to bypass the hash map and
synchronize on the read entry directly. In this block it checks
liveliness, performs the mutation, and notifies the writer. This avoids
more expensive computations through the map.

Previously, an eviction was performed in a computation to remove the
entry and notify the writer, or resurrect. Inside the computation the
entry was not synchronized on and that was done only after it was
removed from the table. The removal listener was notified with the
value initially read at the start of this method.

This allowed an update to modify the value while (or after) the entry
was being removed from the hash table. This led to notifying the writer
and removal listener with the stale value. Because the writer must
be called exclusively with the mutation, this computation must use
a synchronized guard. Otherwise we might have preferred to re-read
the value when notifying the listener. This adds a slight penalty on
eviction (usually async) while allowing `put` to still be fast (but
may block a little more often).

`putSlow` was removed as it is no longer needed. It was a computation-
based write that was safe from this issue. But it was only used when the
new weight was zero, as that update race would cause an incorrect
eviction. Now that the primary path is safe from this race, it is
unnecessary.
@johnou
Copy link

johnou commented Feb 18, 2017

@ben-manes you bet I am, always welcome to reach out to me at johno.crawford@gmail.com

@ben-manes
Copy link
Owner Author

I fixed a similar bug in when clearing the cache usinginvalidateAll() or asMap().clear(). I added a simpler version of @yonik's test that fails prior to the changes. I've also added unit tests that rely on explicit lock ordering of the implementation to deterministically validate the eviction and clear bugs.

I've run all the tests locally and will release after the CI confirms.

@ben-manes
Copy link
Owner Author

Released 2.4.0

@johnou
Copy link

johnou commented Feb 20, 2017

@ben-manes did you ever hear back from Doug about the tickless hierarchical timing wheel in Java?

@ben-manes
Copy link
Owner Author

When I brought it up he said it was on his backlog too, but I haven't talked with him about it since. I think jdk9 has a dedicated scheduler thread for the common fjp. Perhaps if its overloaded he'll add a ticking one.

I would like to add that feature by a jdk9 release, since I'll bump up the version and fix API quirks. I don't think it's hard, just requires a little motivation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants