
incorrect locking in FinagleChirper Cache #164

Closed
tkrodriguez opened this issue Jun 25, 2019 · 5 comments
Labels
bug Something isn't working

@tkrodriguez
Collaborator

We'd been seeing frequent hangs in finagle-chirper, and investigation indicated that the Cache HashMap was getting corrupted, with DefaultEntry buckets whose next pointer points at themselves. The cache update is syntactically within a lock.synchronized block, but since it happens inside a future callback it is not executed while that lock is held. Adding a lock.wait(1) before the cache update shows that the lock isn't held when the cache is updated:

index 12b6739..86c8ccf 100644
--- a/benchmarks/twitter-finagle/src/main/scala/org/renaissance/twitter/finagle/FinagleChirper.scala
+++ b/benchmarks/twitter-finagle/src/main/scala/org/renaissance/twitter/finagle/FinagleChirper.scala
@@ -242,6 +242,11 @@ class FinagleChirper extends RenaissanceBenchmark {
           val request = Request(Method.Get, "/api/feed?username=" + username)
           val responseFuture = service(request)
           for (response <- responseFuture) yield {
+            try {
+              lock.wait(1);
+            } catch {
+              case e: InterruptedException => e.printStackTrace();
+            }
             cache(username) = response.content
             response
           }

which throws this:

Jun 25, 2019 10:09:25 AM com.twitter.finagle.util.DefaultMonitor logWithRemoteInfo
WARNING: Exception propagated to the default monitor (upstream address: /127.0.0.1:54922, downstream address: n/a, label: ).
java.lang.IllegalMonitorStateException
	at java.lang.Object.wait(Native Method)
	at org.renaissance.twitter.finagle.FinagleChirper$Cache$$anonfun$apply$12.apply(FinagleChirper.scala:246)
	at org.renaissance.twitter.finagle.FinagleChirper$Cache$$anonfun$apply$12.apply(FinagleChirper.scala:244)
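The underlying behavior can be reproduced outside Finagle. Below is a minimal, hypothetical sketch (names like LockNotHeldDemo are made up; it uses scala.concurrent.Future rather than Twitter's com.twitter.util.Future, but both run callbacks on an executor): the for-yield body only *registers* a callback, so by the time it runs, the enclosing synchronized block has already released the lock.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object LockNotHeldDemo {
  val lock = new AnyRef

  // Returns whether the future callback's thread held `lock` when it ran.
  def callbackHeldLock(): Boolean = {
    val work = Promise[Int]()
    val held = Promise[Boolean]()

    lock.synchronized {
      // This for-yield only registers a callback on the future;
      // its body runs later, on an executor thread.
      for (x <- work.future) yield {
        held.success(Thread.holdsLock(lock))
        x
      }
    } // `lock` is released here, before the callback can run

    work.success(1)
    Await.result(held.future, 5.seconds)
  }
}
```

Calling `LockNotHeldDemo.callbackHeldLock()` returns `false`, matching the `IllegalMonitorStateException` seen above: `Object.wait` throws precisely because the calling thread does not own the monitor.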
@farquet farquet added the bug Something isn't working label Jun 26, 2019
@farquet farquet added this to the 1.0.0 milestone Jun 26, 2019
@farquet
Collaborator

farquet commented Jun 26, 2019

Thanks @tkrodriguez for the investigation!
I think synchronizing only the count increment and changing val cache = new mutable.HashMap[String, Buf] into val cache = new concurrent.TrieMap[String, Buf] should work, since [1]:

A concurrent hash-trie or TrieMap is a concurrent thread-safe lock-free implementation of a hash array mapped trie. It is used to implement the concurrent map abstraction.

I'll wait for @axel22 to comment though, since some Prokopec guy seems to be behind the concurrent.TrieMap Scala library. [2] ;)

[1] https://www.scala-lang.org/api/2.12.2/scala/collection/concurrent/TrieMap.html
[2] http://lampwww.epfl.ch/~prokopec/ctries-snapshot.pdf
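The proposed change amounts to swapping the map implementation. A hedged sketch of the idea (CacheSketch is a made-up name, and Array[Byte] stands in for com.twitter.io.Buf, which isn't available here):

```scala
import scala.collection.concurrent.TrieMap

object CacheSketch {
  // Stand-in for com.twitter.io.Buf, just for this sketch.
  type Buf = Array[Byte]

  // Before: val cache = new mutable.HashMap[String, Buf]  (not thread-safe)
  // After:  a lock-free concurrent map, safe to update from future callbacks
  // without holding any external lock.
  val cache = new TrieMap[String, Buf]

  def update(username: String, content: Buf): Unit =
    cache(username) = content // atomic insert/replace

  def lookup(username: String): Option[Buf] =
    cache.get(username)
}
```

With this, the callback's `cache(username) = response.content` would be safe as written, and only the count increment would still need the lock.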

@tkrodriguez
Collaborator Author

Maybe Twitter has already fixed this issue in their version of the code? It might be nice to stay in sync with them, though I don't think it matters much here. Using a concurrent data structure instead of a synchronized one might affect throughput, though.

@farquet
Collaborator

farquet commented Jun 27, 2019

Well, the hashmap being corrupted by this missing lock is in our benchmark code, not in Finagle's. What kind of fix would you expect on the Finagle side?
I don't think a data structure change would change the behavior significantly, since the hot code is probably somewhere else, but it might hit throughput a bit indeed (which is not a problem IMO). But keeping the hashmap with a proper lock is also an option.
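The "keep the hashmap with a proper lock" option would mean acquiring the lock inside the callback itself, on the thread that actually performs the write, rather than around the code that merely registers it. A self-contained, hypothetical sketch (LockedCacheSketch and fetchAndCache are made-up names, using scala.concurrent.Future in place of Twitter's future):

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.collection.mutable

object LockedCacheSketch {
  val lock  = new AnyRef
  val cache = new mutable.HashMap[String, String] // guarded by `lock`

  // The synchronized block is inside the yield body, so it runs on the
  // callback's own thread, where the cache write actually happens.
  def fetchAndCache(username: String, fetch: () => Future[String]): Future[String] =
    for (content <- fetch()) yield {
      lock.synchronized { cache(username) = content }
      content
    }
}
```

Every other access to `cache` would have to take the same lock; that is the discipline the original code intended but lost across the future boundary.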

@tkrodriguez
Collaborator Author

OK, I wasn't clear where the boundary between our code and their code was. Since it's our harness code, we can do whatever we want. From that perspective a lock-free implementation might be best, but I don't have strong opinions otherwise.

@axel22
Member

axel22 commented Jun 28, 2019

Thanks for the catch and the great analysis.
This is a frequent error with future-related code that interacts with actors or services.
The code within the for expression is a callback handler for the future, and it will indeed not serialize with the code belonging to the service.
This is a generally lacking aspect of most future implementations - they don't integrate well with other sources of concurrency. Here, in particular, they should run in the same scheduler as the Service object.

I think that a concurrent hash map is the way to go. TrieMap could be one solution, yes.
In my opinion, updates are infrequent and will not cause contention (also, they are on different keys).
