Add first proper benchmarks #17
…ify benchmarks into a single subproject.
devirtualize neighbor and edge access
can you please elaborate?
Sure! Timings look like
So we can have a 2x speedup from the "devirtualized" optimization. The slowdown for the unshuffled case is presumably due to preventing inlining. The slowdown for the shuffled case is presumably due to branch misses that mess up the pipeline (due to the long reorder buffer, a branch miss can be much more expensive than the 20 cycle sticker price). Some relevant perf stats:
In other words, we see that the devirtualized variant has no branch misses, even in the shuffled case; and due to inlining it can run with only ~80 instructions. In contrast, the virtual variant has a megamorphic call site that prevents inlining and balloons instruction count to ~170; more to the point, it has ~0.9 mis-predicted branches. These are indirect calls (i.e. via branch target buffer, not via branch history table) which is even worse. Unfortunately perfasm output doesn't do what I want for this specific case (I'm probably holding it wrong?). The benchmark in question looks like
@Benchmark
def orderSumVirtual(blackhole: Blackhole): Int = {
  var sumOrder = 0
  for (node <- nodeStart.iterator.asInstanceOf[Iterator[v2.nodes.AstNode]]) {
    sumOrder += node.order
  }
  if (blackhole != null) blackhole.consume(sumOrder)
  sumOrder
}
@Benchmark
def orderSumDevirtualized(blackhole: Blackhole): Int = {
  var sumOrder = 0
  val prop = nodeStart.head.graph.schema.getPropertyIdByLabel("ORDER")
  for (node <- nodeStart) {
    sumOrder += odb2.Accessors.getNodePropertySingle(node.graph, node.nodeKind, prop, node.seq(), -1)
  }
  if (blackhole != null) blackhole.consume(sumOrder)
  sumOrder
}
The actual code for
trait AstNode extends StoredNode with AstNodeBase {
  def order: Int
  ...
}
class Annotation(graph_4762: odb2.Graph, seq_4762: Int) extends StoredNode(graph_4762, 0.toShort, seq_4762) with Expression {
  def order: Int = odb2.Accessors.getNodePropertySingle(graph, nodeKind, 41, seq, -1: Int)
  ...
}
This means that we need to do an
I'd like to note that the implementation of
Some changes in graph accessors and schemagen. Add JMH benchmarks. Unify benchmarks into a single subproject.
The opsPerInvocation annoyed me enough to try to upstream it at openjdk/jmh#97
I think we can now start discussing whether to try to devirtualize neighbor and edge access.
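For readers who want the distinction in isolation, here is a minimal, self-contained sketch of the two access patterns compared in the benchmarks above. It is a toy model, not the odb2 API: `Graph`, `Node`, `KindA`/`KindB`/`KindC`, `Accessors.getOrder`, `sumVirtual` and `sumDevirtualized` are all hypothetical names invented for illustration. It only shows the shape of the call sites and makes no performance claim of its own.

```scala
// Hypothetical toy model for illustration only; not the odb2 API.
object DevirtualizationSketch {
  // Column-style storage: one array of ORDER values per node kind, indexed by seq.
  final class Graph(val orderByKind: Array[Array[Int]])

  // Base class with a virtual accessor, overridden once per node class.
  abstract class Node(val graph: Graph, val kind: Int, val seq: Int) {
    def order: Int
  }
  final class KindA(g: Graph, s: Int) extends Node(g, 0, s) {
    def order: Int = Accessors.getOrder(graph, kind, seq)
  }
  final class KindB(g: Graph, s: Int) extends Node(g, 1, s) {
    def order: Int = Accessors.getOrder(graph, kind, seq)
  }
  final class KindC(g: Graph, s: Int) extends Node(g, 2, s) {
    def order: Int = Accessors.getOrder(graph, kind, seq)
  }

  object Accessors {
    // A single static entry point that works for every node kind.
    def getOrder(graph: Graph, kind: Int, seq: Int): Int =
      graph.orderByKind(kind)(seq)
  }

  // Virtual variant: `order` is dispatched on the runtime class of each node,
  // so a mixed-kind array exercises one call site with several receiver types.
  def sumVirtual(nodes: Array[Node]): Int = {
    var sum = 0
    var i = 0
    while (i < nodes.length) {
      sum += nodes(i).order
      i += 1
    }
    sum
  }

  // Devirtualized variant: the call target is the same static method for every
  // node; the node kind is passed as data instead of selecting an override.
  def sumDevirtualized(nodes: Array[Node]): Int = {
    var sum = 0
    var i = 0
    while (i < nodes.length) {
      val n = nodes(i)
      sum += Accessors.getOrder(n.graph, n.kind, n.seq)
      i += 1
    }
    sum
  }
}
```

Both functions compute the same sum; the only difference is whether the per-node call goes through an overridden method or through a single static accessor.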