Parallelize the Analyzer #4897


Merged
merged 13 commits into from Aug 28, 2023

Conversation

@gzm0 gzm0 commented Aug 19, 2023

Leads to almost 2x speedup of "Compute Reachability".

> d %>% group_by(op, variant) %>% summarize(median(t_ns))
`summarise()` has grouped output by 'op'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups:   op [2]
  op                            variant  `median(t_ns)`
  <fct>                         <fct>             <dbl>
1 Linker: Compute reachability  main         323617749 
2 Linker: Compute reachability  parallel     163542533 
3 Linker: Compute reachability  pr           165204962.
4 Refiner: Compute reachability main         255269676 
5 Refiner: Compute reachability parallel     170701844 
6 Refiner: Compute reachability pr           169508056.

@gzm0 gzm0 requested a review from sjrd August 19, 2023 17:31
@gzm0 gzm0 commented Aug 19, 2023

[plot: benchmark timings for the main / parallel / pr variants]

main: baseline
parallel: the original version without the last commit in this PR
pr: this PR.

_classInfos.get(className) match {
case None =>
val loading = new LoadingClass(className)
private final class ClassLoader(implicit ec: ExecutionContext) {
gzm0 (Contributor, Author):
I originally isolated this into its own class to make _classInfos more isolated. Now I'm not so sure anymore it's really an improvement. WDYT?

sjrd (Member):
I'm not sure either. It doesn't seem substantially better or worse either way. What I do find a bit problematic is that this change combined in the same commit that makes it thread-safe makes things quite a bit blurry to me. It's unclear what has actually changed in terms of thread-safety. Is it just _classInfos becoming a TrieMap? If yes, can we put that in a separate commit?

gzm0 (Contributor, Author):
> Is it just _classInfos becoming a TrieMap?

Ish, yes. But doing this also implies moving the actual loading code from LoadingClass to its own method (doLoad) inside the ClassLoader. Essentially, we use _classInfos as a synchronization point to choose the thread that is supposed to trigger class loading. Before this commit, that was part of the constructor of LoadingClass.

> If yes, can we put that in a separate commit?

Not trivially, but I can attempt to split it into smaller refactoring steps if you think it would be helpful.

I have attempted to suggest an ordering in this comment, but it is non-trivial: with thread-safety in place, we can also lean on futures more, which keeps other code paths (notably the new, distinct code paths for lookupClassForLinking and lookupClass) manageable (without futures, they would need additional helpers).
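The synchronization-point idea described above can be sketched roughly as follows. This is an illustrative reconstruction, not the PR's actual code: `ClassLoaderSketch`, `doLoad`, and the `String` payload are simplified stand-ins.

```scala
import scala.collection.concurrent.TrieMap
import scala.concurrent.{ExecutionContext, Future, Promise}

// Rough sketch of the pattern: the concurrent map is the synchronization
// point; whichever thread wins putIfAbsent triggers the actual load.
final class ClassLoaderSketch(implicit ec: ExecutionContext) {
  private final class LoadingClass {
    val promise: Promise[String] = Promise()
  }

  private val _classInfos = TrieMap.empty[String, LoadingClass]

  def lookupClass(className: String): Future[String] = {
    val fresh = new LoadingClass
    _classInfos.putIfAbsent(className, fresh) match {
      case None =>
        // This thread won the race and is responsible for the load.
        fresh.promise.completeWith(doLoad(className))
        fresh.promise.future
      case Some(existing) =>
        // Someone else is (or was) loading; just wait for their result.
        existing.promise.future
    }
  }

  // Placeholder for the real class-loading work.
  private def doLoad(className: String): Future[String] =
    Future(s"info for $className")
}
```

Note how the loading work itself lives in `doLoad` rather than in the `LoadingClass` constructor, matching the refactoring described above.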

sjrd (Member):
OK, I see. Let's keep it as is, then. It's not that hard to make sense of.

@gzm0 gzm0 commented Aug 19, 2023

Raw benchmark data

logger-timings.csv

@gzm0 gzm0 force-pushed the parallel-linker branch from c420e09 to 1f37da9 Compare August 20, 2023 07:42
var methodsCalledLog: List[MethodName] = Nil
private val _instantiatedSubclasses = new AtomicReference[List[ClassInfo]](Nil)

private val methodsCalledLog = new AtomicReference[List[MethodName]](Nil)
gzm0 (Contributor, Author):
Note: strictly, methodsCalledLog.toSet == dispatchCalledFrom.keySet always holds. However, we need to be able to snapshot this in subclassInstantiated. TrieMap supports snapshotting, but mutable.Map doesn't.

So the options are:

  • Keep this.
  • Switch to AtomicReference[immutable.Map[MethodName, AtomicReference[List[From]]]] (I have concerns about key contention on the map here).
  • Do some type hackery so we can get "snapshots" also on normal mutable maps in JS.
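The AtomicReference-based log that the first option keeps can be sketched like this (`CallLog` is an illustrative name, not the PR's code): writers CAS-prepend onto an immutable List, so snapshotting is just reading the current reference.

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.annotation.tailrec

// Append-only, lock-free log: writers compare-and-swap a new head onto an
// immutable List; readers snapshot by simply reading the reference.
final class CallLog[A] {
  private val ref = new AtomicReference[List[A]](Nil)

  @tailrec
  final def add(a: A): Unit = {
    val old = ref.get()
    if (!ref.compareAndSet(old, a :: old))
      add(a) // lost the race against another writer; retry on the new head
  }

  // O(1) and always consistent: the List is immutable.
  def snapshot(): List[A] = ref.get()
}
```

The snapshot is in reverse insertion order, which is fine for set-like uses such as the invariant above.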

sjrd (Member):
IMO the current option is best, but a comment wouldn't be superfluous.

@gzm0 gzm0 force-pushed the parallel-linker branch from 1f37da9 to bd9066e Compare August 20, 2023 08:00
def join(): Future[Unit] = {
tryDoWork()
promise.future
fut.map(onSuccess).onComplete {
gzm0 (Contributor, Author):
I'm wondering if I should add a commit removing the continuation passing style. IIUC we wouldn't make any more futures, the code would just be more idiomatic.

sjrd (Member):
Perhaps, but it can probably equally be done as a follow-up PR, no?

@gzm0 gzm0 commented Aug 20, 2023

I have run the benchmarks with IR checking (or more notably ClassChecking) disabled as well:

logger-timings.csv
[plot: benchmark timings with IR checking disabled]

  • I cannot explain why main-nocheck appears slower than main in the linker pass.
  • It is expected that disabling class checking in the first linker pass doesn't change anything in the incremental setting: since the underlying IR files do not change, we simply bypass it (here).
  • It is expected that disabling checking in the refiner pass improves performance: Since we cannot version the full class anymore at this point (it is "stitched together"), we have to re-run ClassDef checking even in incremental passes.

(isScalaClass || isJSClass || isNativeJSClass)) {
isInstantiated = true
if ((isScalaClass || isJSClass || isNativeJSClass) &&
!_isInstantiated.getAndSet(true)) {
gzm0 (Contributor, Author):
Scalastyle fails on this line, claiming "Boolean expression can be simplified".

Relevant code:
https://github.com/scalastyle/scalastyle/blob/v1.0.0/src/main/scala/org/scalastyle/scalariform/SimplifyBooleanExpressionChecker.scala

I do not understand what is happening. From how I read the relevant code, it should trigger on the following operators being used with literal boolean values: ! != == && ||.

sjrd (Member):
It does seem nonsensical. But perhaps that's opportunity to make the side-effect a bit more separated from the rest of the condition:

if (!(isScalaClass || isJSClass || isNativeJSClass)) {
  // Ignore
  // TODO: shouldn't this be a linking error instead?
} else if (!_isInstantiated.getAndSet(true)) {

gzm0 (Contributor, Author):
Note: Switching to infix notation for getAndSet or method notation for && makes the error go away. But maybe we should just deactivate the check?

sjrd (Member):
I'd rather deactivate the check than arbitrarily switch notation.

@sjrd sjrd left a comment (Member)

Looks good overall. Great job. I have some minor comments here and there.

assert(queue.isEmpty)
private def taskDone(): Unit = {
if (pending.decrementAndGet() == 0) {
if (pending.compareAndSet(0, -1))
sjrd (Member):
The second one can be a direct set(-1), can't it? There's only one thread that will receive 0 as result of the decrementAndGet().

gzm0 (Contributor, Author):

Yes, indeed. It actually points to a weakness in our current design (which I have documented as a comment). IMO we should still proceed, but I should try and work towards removing the work tracker entirely.
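For reference, the pattern under discussion looks roughly like this (a simplified `WorkTrackerSketch`, not the PR's exact code). Since exactly one thread can observe the count reaching 0, the follow-up can be a plain `set(-1)` rather than a `compareAndSet`:

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.{Future, Promise}

// Simplified work tracker: counts outstanding tasks and completes a
// promise once the last one finishes.
final class WorkTrackerSketch {
  private val pending = new AtomicInteger(0)
  private val promise = Promise[Unit]()

  def taskStarted(): Unit = pending.incrementAndGet()

  def taskDone(): Unit = {
    if (pending.decrementAndGet() == 0) {
      // Only one thread can see the count hit 0, so no CAS is needed.
      pending.set(-1) // poison the counter: no tasks may start anymore
      promise.success(())
    }
  }

  def join(): Future[Unit] = promise.future
}
```

The poison value makes a late `taskStarted()` visible as a bug (the counter would go to 0, not 1), which hints at the design weakness mentioned above.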

@gzm0 gzm0 force-pushed the parallel-linker branch from bd9066e to 4621cbb Compare August 23, 2023 04:40
@sjrd sjrd commented Aug 23, 2023

One more thing I just thought about: is there any way we can honor linkerConfig.parallel here? In other words, if parallel is false, use a single thread for running all the (non-IO?) tasks? It may help diagnosing issues in the future if anything comes up.

import scala.collection.concurrent.TrieMap

private[analyzer] object Platform {
def emptyThreadSafeMap[K, V]: mutable.Map[K, V] = TrieMap.empty
Comment:
IIRC, TrieMap will not compact removed entries, so its memory usage will be larger than ConcurrentHashMap's, but ConcurrentHashMap has different trade-offs.

sjrd (Member):
We don't remove anything from these maps, so that doesn't seem like a relevant concern.

gzm0 (Contributor, Author):

FWIW: TrieMap supports a lock-free, "opportunistic" getOrElseUpdate (it might compute the value multiple times, but not if it is already in the map). ConcurrentHashMap's computeIfAbsent doesn't offer that. This was the primary motivation for using TrieMap.

@He-Pin He-Pin commented Aug 26, 2023

I always use a get to check before invoking computeIfAbsent, and the bug you mentioned about ConcurrentHashMap was fixed in JDK 9.

gzm0 (Contributor, Author):

I'm not sure what you mean by "the bug I mentioned". The JavaDoc is very clear that computeIfAbsent can be blocking:

> The entire method invocation is performed atomically, so the function is applied at most once per key. Some attempted update operations on this map by other threads may be blocked while computation is in progress, so the computation should be short and simple, and must not attempt to update any other mappings of this map.

While this can be desirable sometimes, it's not what we want here: We do not need the guarantee that the method invocation is performed atomically, but we do want the operation to remain lock-free.
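The difference can be illustrated with TrieMap's getOrElseUpdate. This is a standalone sketch (`evaluations` and `cachedLength` are illustrative names): the counter only demonstrates that the thunk is evaluated at most once per key when uncontended, while under contention it may run speculatively more than once.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.concurrent.TrieMap

val evaluations = new AtomicInteger(0)
val cache = TrieMap.empty[String, Int]

// getOrElseUpdate on TrieMap is lock-free: under contention the thunk may
// run more than once (only one result wins), but no lock is held while it
// runs, unlike ConcurrentHashMap.computeIfAbsent.
def cachedLength(key: String): Int =
  cache.getOrElseUpdate(key, {
    evaluations.incrementAndGet() // counts how often the thunk actually ran
    key.length
  })
```

Single-threaded, the first `cachedLength("abc")` evaluates the thunk once and the second call hits the cache.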

@gzm0 gzm0 force-pushed the parallel-linker branch from 4621cbb to 7494019 Compare August 26, 2023 07:27
gzm0 added 2 commits August 26, 2023 09:37
  op                            variant  `median(t_ns)`
  <fct>                         <fct>             <dbl>
1 Linker: Compute reachability  main          323617749
2 Linker: Compute reachability  parallel      163542533
3 Refiner: Compute reachability main          255269676
4 Refiner: Compute reachability parallel      170701844
Now that we can run in parallel, we can use full future power.
@gzm0 gzm0 force-pushed the parallel-linker branch from 7494019 to a28fa78 Compare August 26, 2023 07:38
@gzm0 gzm0 commented Aug 26, 2023

> One more thing I just thought about: is there any way we can honor linkerConfig.parallel here? In other words, if parallel is false, use a single thread for running all the (non-IO?) tasks? It may help diagnosing issues in the future if anything comes up.

That is an excellent point. But I'm not sure we can do so trivially.

Options I see:

  • Keep the old code (with workQueue) around.

    • Might block us from further improvements in the Analyzer in the future (e.g. rely on futures more).
  • Build our own executor that has its own queue and uses the calling thread (before waiting for workTracker.future) to process things.

    This might be tricky to get right without deadlocks (and is likely quite a bit of code).

  • Ignore the passed execution context and create our own one (with https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/Executors.html#newSingleThreadScheduledExecutor--).

    Simplest solution, but spawns a new, unnecessary thread (then again, with parallel = false, performance isn't the goal). Needs a bit of cross-platform wiring, but likely not too bad.

  • Do not honor linkerConfig.parallel.

    Status quo. OK as long as things work, but if they don't having the debug option would indeed be nice.

WDYT? I'm leaning towards option 3. Note that with 2/3, we have the choice of running I/O on that single executor thread as well. Which we probably should for simplicity?
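On the JVM, option 3 could look roughly like this. This is a sketch under the assumption that the analyzer is handed an ExecutionContext; `analyzerContext` is a hypothetical name, and real code would also need to shut the executor down and wire this cross-platform:

```scala
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext

// Sketch: if parallel is disabled, ignore the caller's ExecutionContext and
// run everything (including IO) on one dedicated thread.
// NOTE: the single-thread executor must eventually be shut down; that
// lifecycle management is omitted here.
def analyzerContext(parallel: Boolean, default: ExecutionContext): ExecutionContext =
  if (parallel) default
  else ExecutionContext.fromExecutorService(Executors.newSingleThreadExecutor())
```

With `parallel = false`, every task submitted to the returned context runs on the same dedicated thread.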

@sjrd sjrd commented Aug 26, 2023

Option 3 with IO on the same unique thread seems like the most reliable and simple option, so I'd be in favor of that, yes.

@gzm0 gzm0 commented Aug 26, 2023

I have added an additional commit honoring the parallel config. I believe I have addressed all the comments at this point.

@gzm0 gzm0 requested a review from sjrd August 26, 2023 13:34
@gzm0 gzm0 force-pushed the parallel-linker branch from 7dd2e3d to 9a0143c Compare August 27, 2023 07:36
@gzm0 gzm0 commented Aug 27, 2023

I have tested manually that helloworld2_12/fastLinkJS and testSuite2_12/Test/fastLinkJS complete with parallel = false.

@sjrd sjrd left a comment (Member)

Excellent! :)

@sjrd sjrd merged commit 19d3ce0 into scala-js:main Aug 28, 2023
@gzm0 gzm0 deleted the parallel-linker branch August 30, 2023 04:20