
Optimize traverse #3283

Merged — 2 commits merged into typelevel:master on Feb 3, 2020
Conversation

travisbrown (Contributor) commented Feb 3, 2020

tl;dr: Traversing a List or Vector is probably the most common operation people do with this library, and the current implementations for these types have some room for optimization, with the changes in this PR giving up to 20% more throughput for List and 201% for Vector.
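For readers less familiar with the operation being benchmarked: traverse maps each element through an effectful function and collects the results inside the effect. A minimal sketch (assumes a Cats dependency on the classpath; parseAll is my name for illustration, not from the PR):

```scala
import cats.implicits._ // assumes the Cats library is on the classpath

// traverse maps each element through an effectful function and collects
// the results inside the effect: List[A] => (A => G[B]) => G[List[B]].
def parseAll(xs: List[String]): Either[String, List[Int]] =
  xs.traverse(s => s.toIntOption.toRight(s"not a number: $s"))

parseAll(List("1", "2", "3")) // Right(List(1, 2, 3))
parseAll(List("1", "x", "3")) // Left("not a number: x")
```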

I've put together a benchmark that compares the current traverse for List with two new implementations:

import cats.{Always, Applicative, Eval}

def traverseFoldRight[G[_], A, B](fa: List[A])(f: A => G[B])(implicit G: Applicative[G]): G[List[B]] =
  fa.foldRight[Eval[G[List[B]]]](Always(G.pure(Nil))) {
    case (h, t) => G.map2Eval(f(h), Eval.defer(t))(_ :: _)
  }.value

def traverseRec[G[_], A, B](fa: List[A])(f: A => G[B])(implicit G: Applicative[G]): G[List[B]] = {
  def loop(fa: List[A]): Eval[G[List[B]]] = fa match {
    case h :: t => G.map2Eval(f(h), Eval.defer(loop(t)))(_ :: _)
    case Nil    => Eval.now(G.pure(Nil))
  }
  loop(fa).value
}

The first uses the standard library's foldRight with an explicit Eval accumulator, instead of the Eval-based foldRight on Foldable. The second effectively just inlines the call to Cats' foldRight in the current implementation.
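Both versions thread an Eval through so that map2Eval can decline to force the deferred tail: an Applicative like Either stops consuming the list at the first failure. A small sketch of that behaviour (assumes Cats on the classpath; check and calls are my names for illustration):

```scala
import cats.implicits._ // assumes the Cats library is on the classpath

var calls = 0
def check(n: Int): Either[String, Int] = {
  calls += 1
  if (n < 0) Left(s"negative: $n") else Right(n)
}

// Either's map2Eval never forces the deferred tail after a Left,
// so traverse stops calling `check` at the first failure.
val res = List(1, -2, 3, 4).traverse(check)
// res == Left("negative: -2"), and calls == 2 rather than 4
```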

Both of these seem substantially faster than the current implementation when traversing with Right(_) (results shown for list sizes 10¹, 10², 10³, and 10⁴; higher numbers are better; all results are for Scala 2.13, but 2.12 is similar):

Benchmark                                       Mode  Cnt        Score       Error  Units
TraverseListBench.traverseCats1                thrpt   20  2522518.924 ±  4088.634  ops/s
TraverseListBench.traverseCats2                thrpt   20   284154.249 ±  1386.556  ops/s
TraverseListBench.traverseCats3                thrpt   20    26490.162 ±   764.213  ops/s
TraverseListBench.traverseCats4                thrpt   20     2645.683 ±     2.779  ops/s
TraverseListBench.traverseFoldRight1           thrpt   20  3109083.857 ± 10545.898  ops/s
TraverseListBench.traverseFoldRight2           thrpt   20   325352.357 ±   482.879  ops/s
TraverseListBench.traverseFoldRight3           thrpt   20    26009.438 ±    89.164  ops/s
TraverseListBench.traverseFoldRight4           thrpt   20     2609.019 ±    15.222  ops/s
TraverseListBench.traverseRec1                 thrpt   20  3053589.800 ±  8292.173  ops/s
TraverseListBench.traverseRec2                 thrpt   20   340495.016 ±   803.026  ops/s
TraverseListBench.traverseRec3                 thrpt   20    30449.658 ±    58.876  ops/s
TraverseListBench.traverseRec4                 thrpt   20     2945.153 ±     3.059  ops/s

The loop implementation also allocates less:

Benchmark                                                Mode  Cnt        Score        Error   Units
TraverseListBench.traverseCats1:gc.alloc.rate.norm      thrpt    5     2056.000 ±      0.001    B/op
TraverseListBench.traverseCats2:gc.alloc.rate.norm      thrpt    5    18616.000 ±      0.001    B/op
TraverseListBench.traverseCats3:gc.alloc.rate.norm      thrpt    5   198168.002 ±      0.001    B/op
TraverseListBench.traverseCats4:gc.alloc.rate.norm      thrpt    5  1998168.018 ±      0.012    B/op
TraverseListBench.traverseRec1:gc.alloc.rate.norm       thrpt    5     1728.000 ±      0.001    B/op
TraverseListBench.traverseRec2:gc.alloc.rate.norm       thrpt    5    16848.000 ±      0.001    B/op
TraverseListBench.traverseRec3:gc.alloc.rate.norm       thrpt    5   182016.001 ±      0.001    B/op
TraverseListBench.traverseRec4:gc.alloc.rate.norm       thrpt    5  1838016.016 ±      0.009    B/op

The results for a more complex parsing operation in ValidatedNel are similar.

I've done a similar comparison for Vector, but with an additional new candidate:

import cats.{Applicative, Eval}

def traverseIter[G[_], A, B](fa: Vector[A])(f: A => G[B])(implicit G: Applicative[G]): G[Vector[B]] = {
  var i = fa.length - 1
  var current: Eval[G[Vector[B]]] = Eval.now(G.pure(Vector.empty))

  while (i >= 0) {
    current = G.map2Eval(f(fa(i)), current)(_ +: _)
    i -= 1
  }

  current.value
}

I've also included implementations of all three new approaches for Vector that accumulate the result in a List and then convert at the end.
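The "via List" variants aren't shown above; here is a hypothetical reconstruction of the recursive one (the name traverseRecViaList and its exact shape are my guess at the idea, not copied from the diff):

```scala
import cats.{Applicative, Eval}

// Accumulate into a List (constant-time prepend), then convert to a
// Vector once at the end, instead of prepending to a Vector per element.
def traverseRecViaList[G[_], A, B](fa: Vector[A])(f: A => G[B])(implicit G: Applicative[G]): G[Vector[B]] = {
  def loop(i: Int): Eval[G[List[B]]] =
    if (i < fa.length) G.map2Eval(f(fa(i)), Eval.defer(loop(i + 1)))(_ :: _)
    else Eval.now(G.pure(Nil))
  G.map(loop(0).value)(_.toVector)
}
```

For example, traverseRecViaList(Vector(1, 2, 3))(n => Option(n + 1)) yields Some(Vector(2, 3, 4)).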


Benchmark                                       Mode  Cnt        Score       Error  Units
TraverseVectorBench.traverseCats1              thrpt   20  1716229.387 ± 12903.229  ops/s
TraverseVectorBench.traverseCats2              thrpt   20    98885.248 ±   187.467  ops/s
TraverseVectorBench.traverseCats3              thrpt   20     8095.486 ±    86.914  ops/s
TraverseVectorBench.traverseCats4              thrpt   20      786.319 ±     9.118  ops/s
TraverseVectorBench.traverseFoldRight1         thrpt   20  1940918.153 ±  4653.590  ops/s
TraverseVectorBench.traverseFoldRight2         thrpt   20    99467.268 ±   151.832  ops/s
TraverseVectorBench.traverseFoldRight3         thrpt   20     7906.035 ±    25.461  ops/s
TraverseVectorBench.traverseFoldRight4         thrpt   20      768.825 ±     4.546  ops/s
TraverseVectorBench.traverseFoldRightViaList1  thrpt   20  2444454.783 ± 27744.679  ops/s
TraverseVectorBench.traverseFoldRightViaList2  thrpt   20   250501.174 ±  1286.555  ops/s
TraverseVectorBench.traverseFoldRightViaList3  thrpt   20    22235.074 ±    55.709  ops/s
TraverseVectorBench.traverseFoldRightViaList4  thrpt   20     2195.451 ±     3.826  ops/s
TraverseVectorBench.traverseIter1              thrpt   20  1845529.178 ±  1799.628  ops/s
TraverseVectorBench.traverseIter2              thrpt   20    98067.794 ±   408.574  ops/s
TraverseVectorBench.traverseIter3              thrpt   20     8032.515 ±    49.259  ops/s
TraverseVectorBench.traverseIter4              thrpt   20      765.116 ±     3.384  ops/s
TraverseVectorBench.traverseIterViaList1       thrpt   20  2409083.473 ±  2141.445  ops/s
TraverseVectorBench.traverseIterViaList2       thrpt   20   255852.261 ±   488.992  ops/s
TraverseVectorBench.traverseIterViaList3       thrpt   20    22926.371 ±   134.168  ops/s
TraverseVectorBench.traverseIterViaList4       thrpt   20     2160.138 ±     2.741  ops/s
TraverseVectorBench.traverseRec1               thrpt   20  1994461.861 ± 12887.488  ops/s
TraverseVectorBench.traverseRec2               thrpt   20   101952.832 ±   233.014  ops/s
TraverseVectorBench.traverseRec3               thrpt   20     8120.346 ±    82.347  ops/s
TraverseVectorBench.traverseRec4               thrpt   20      792.459 ±    28.741  ops/s
TraverseVectorBench.traverseRecViaList1        thrpt   20  2628040.643 ± 35243.584  ops/s
TraverseVectorBench.traverseRecViaList2        thrpt   20   279305.281 ±   381.740  ops/s
TraverseVectorBench.traverseRecViaList3        thrpt   20    24388.417 ±    40.075  ops/s
TraverseVectorBench.traverseRecViaList4        thrpt   20     2728.899 ±     2.560  ops/s

Again a loop-style implementation is fastest, but it's the version that accumulates in a List and converts at the end, not the one that builds a Vector directly.
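The gap comes from how the two collections prepend: List's :: is a constant-time cons, while Vector's +: rebuilds part of its tree on every call, so building via a List and converting once wins. A standard-library-only illustration of the equivalence (not a benchmark; viaPrepend and viaList are my names):

```scala
// Build Vector(0, 1, ..., n - 1) two ways: by prepending to a Vector
// directly, and by prepending to a List (cheap) then converting once.
def viaPrepend(n: Int): Vector[Int] =
  (0 until n).foldRight(Vector.empty[Int])(_ +: _)

def viaList(n: Int): Vector[Int] =
  (0 until n).foldRight(List.empty[Int])(_ :: _).toVector

// Same result; the second does less work per element, mirroring why
// the "via List" traverse variants beat the direct Vector ones above.
assert(viaPrepend(1000) == viaList(1000))
```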

I've made these changes for List, Vector, and 2.13's ArraySeq, but not for Chain, Stream, or LazyList.

codecov-io commented Feb 3, 2020

Codecov Report

Merging #3283 into master will decrease coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #3283      +/-   ##
==========================================
- Coverage   93.14%   93.14%   -0.01%     
==========================================
  Files         378      378              
  Lines        7576     7575       -1     
  Branches      203      194       -9     
==========================================
- Hits         7057     7056       -1     
  Misses        519      519
Flag Coverage Δ
#scala_version_212 93.39% <100%> (-0.01%) ⬇️
#scala_version_213 92.91% <100%> (-0.02%) ⬇️
Impacted Files Coverage Δ
core/src/main/scala/cats/instances/list.scala 100% <100%> (ø) ⬆️
...src/main/scala-2.13+/cats/instances/arraySeq.scala 100% <100%> (ø) ⬆️
core/src/main/scala/cats/instances/vector.scala 100% <100%> (ø) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update bb7a180...e5e2968.

LukaJCB (Member) left a comment:

Nice, thank you!

kailuowang (Contributor) left a comment:

Thanks. I am now curious, are there other opportunities for optimization by inlining foldRight?

djspiewak (Member) left a comment:

At first glance, seems like we could potentially shave even more off of this with some dirtier internal tricks. This is a great start though.

Performance improvements on generic typeclass operations generally don't particularly interest me, since they should never be in the hot path anyway, but faster is always better than slower, regardless of the context.

@djspiewak djspiewak merged commit 56c1527 into typelevel:master Feb 3, 2020
@travisbrown travisbrown added this to the 2.2.0-M1 milestone Feb 18, 2020