
incorrect locking in FinagleChirper Cache #164

Closed
tkrodriguez opened this issue Jun 25, 2019 · 5 comments
Labels
bug Something isn't working

@tkrodriguez
Collaborator

We'd been seeing frequent hangs in finagle-chirper, and investigation indicated that the Cache HashMap was getting corrupted, with DefaultEntry buckets whose next pointer points at themselves. The cache update is syntactically within a lock.synchronized block, but since it happens inside a future callback it is not executed while that lock is held. Adding a lock.wait(1) before the cache update shows that the lock isn't held when the cache is updated:

index 12b6739..86c8ccf 100644
--- a/benchmarks/twitter-finagle/src/main/scala/org/renaissance/twitter/finagle/FinagleChirper.scala
+++ b/benchmarks/twitter-finagle/src/main/scala/org/renaissance/twitter/finagle/FinagleChirper.scala
@@ -242,6 +242,11 @@ class FinagleChirper extends RenaissanceBenchmark {
           val request = Request(Method.Get, "/api/feed?username=" + username)
           val responseFuture = service(request)
           for (response <- responseFuture) yield {
+            try {
+              lock.wait(1);
+            } catch {
+              case e: InterruptedException => e.printStackTrace();
+            }
             cache(username) = response.content
             response
           }

which throws this:

Jun 25, 2019 10:09:25 AM com.twitter.finagle.util.DefaultMonitor logWithRemoteInfo
WARNING: Exception propagated to the default monitor (upstream address: /127.0.0.1:54922, downstream address: n/a, label: ).
java.lang.IllegalMonitorStateException
	at java.lang.Object.wait(Native Method)
	at org.renaissance.twitter.finagle.FinagleChirper$Cache$$anonfun$apply$12.apply(FinagleChirper.scala:246)
	at org.renaissance.twitter.finagle.FinagleChirper$Cache$$anonfun$apply$12.apply(FinagleChirper.scala:244)
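The underlying behavior can be reproduced outside Finagle. Below is a minimal, hypothetical sketch (names like LockNotHeldDemo are made up; it uses scala.concurrent.Future rather than Twitter's com.twitter.util.Future, but both run callbacks on an executor): the for-yield body only *registers* a callback, so by the time it runs, the enclosing synchronized block has already released the lock.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object LockNotHeldDemo {
  val lock = new AnyRef

  // Returns whether the future callback's thread held `lock` when it ran.
  def callbackHeldLock(): Boolean = {
    val work = Promise[Int]()
    val held = Promise[Boolean]()

    lock.synchronized {
      // This for-yield only registers a callback on the future;
      // its body runs later, on an executor thread.
      for (x <- work.future) yield {
        held.success(Thread.holdsLock(lock))
        x
      }
    } // `lock` is released here, before the callback can run

    work.success(1)
    Await.result(held.future, 5.seconds)
  }
}
```

Calling `LockNotHeldDemo.callbackHeldLock()` returns `false`, matching the `IllegalMonitorStateException` seen above: `Object.wait` throws precisely because the calling thread does not own the monitor.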
@farquet farquet added the bug Something isn't working label Jun 26, 2019
@farquet farquet added this to the 1.0.0 milestone Jun 26, 2019
@farquet
Collaborator

farquet commented Jun 26, 2019

Thanks @tkrodriguez for the investigation!
I think synchronizing only the count increment and changing val cache = new mutable.HashMap[String, Buf] into val cache = new concurrent.TrieMap[String, Buf] should work, since [1]:

A concurrent hash-trie or TrieMap is a concurrent thread-safe lock-free implementation of a hash array mapped trie. It is used to implement the concurrent map abstraction.

I'll wait for @axel22 to comment though, since some Prokopec guy seems to be behind the concurrent.TrieMap Scala library. [2] ;)

[1] https://www.scala-lang.org/api/2.12.2/scala/collection/concurrent/TrieMap.html
[2] http://lampwww.epfl.ch/~prokopec/ctries-snapshot.pdf
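The proposed change amounts to swapping the map implementation. A hedged sketch of the idea (CacheSketch is a made-up name, and Array[Byte] stands in for com.twitter.io.Buf, which isn't available here):

```scala
import scala.collection.concurrent.TrieMap

object CacheSketch {
  // Stand-in for com.twitter.io.Buf, just for this sketch.
  type Buf = Array[Byte]

  // Before: val cache = new mutable.HashMap[String, Buf]  (not thread-safe)
  // After:  a lock-free concurrent map, safe to update from future callbacks
  // without holding any external lock.
  val cache = new TrieMap[String, Buf]

  def update(username: String, content: Buf): Unit =
    cache(username) = content // atomic insert/replace

  def lookup(username: String): Option[Buf] =
    cache.get(username)
}
```

With this, the callback's `cache(username) = response.content` would be safe as written, and only the count increment would still need the lock.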

@tkrodriguez
Collaborator Author

Maybe Twitter has already fixed this issue in their version of the code? It might be nice to stay in sync with them, though I don't think it matters much here. Using a concurrent data structure instead of a synchronized one might affect throughput, though.

@farquet
Collaborator

farquet commented Jun 27, 2019

Well, the hashmap being corrupted by this missing lock is in our benchmark code, not in Finagle's. What kind of fix would you expect on the Finagle side?
I don't think a data structure change would change the behavior significantly, since the hot code is probably somewhere else, but it might hit throughput a bit indeed (which is not a problem IMO). But keeping the hashmap with a proper lock is also an option.
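The "keep the hashmap with a proper lock" option would mean acquiring the lock inside the callback itself, on the thread that actually performs the write, rather than around the code that merely registers it. A self-contained, hypothetical sketch (LockedCacheSketch and fetchAndCache are made-up names, using scala.concurrent.Future in place of Twitter's future):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.collection.mutable

object LockedCacheSketch {
  val lock  = new AnyRef
  val cache = new mutable.HashMap[String, String] // guarded by `lock`

  // The synchronized block is inside the yield body, so it runs on the
  // callback's own thread, where the cache write actually happens.
  def fetchAndCache(username: String, fetch: () => Future[String]): Future[String] =
    for (content <- fetch()) yield {
      lock.synchronized { cache(username) = content }
      content
    }
}
```

Every other access to `cache` would have to take the same lock; that is the discipline the original code intended but lost across the future boundary.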

@tkrodriguez
Collaborator Author

OK, I wasn't clear where the boundary between our code and their code was. Since it's our harness code, we can do whatever we want. From that perspective a lock-free implementation might be best, but I don't have strong opinions otherwise.

@axel22
Member

axel22 commented Jun 28, 2019

Thanks for the catch and the great analysis.
This is a frequent error with future-related code that interacts with actors or services.
The code within the for expression is a callback handler for the future, and it will indeed not serialize with the code belonging to the service.
This is a generally lacking aspect of most future implementations - they don't integrate well with other sources of concurrency. Here, in particular, they should run in the same scheduler as the Service object.

I think that a concurrent hash map is the way to go. TrieMap could be one solution, yes.
In my opinion, updates are infrequent and will not cause contention (also, they are on different keys).
