[SPARK-29061][SQL] Prints bytecode statistics in debugCodegen #25766

maropu · 2019-09-12T00:44:02Z

What changes were proposed in this pull request?

This PR proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in the debug output of debugCodegen. Since these metrics are critical for codegen framework development, I think it's worth printing them there. This PR intends to enable debugCodegen to print these metrics as follows:

scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
...

Why are the changes needed?

For more efficient development of the codegen framework

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested

maropu · 2019-09-12T00:44:50Z

How about this? @cloud-fan @rednaxelafx @viirya @mgaido91

SparkQA · 2019-09-12T05:02:12Z

Test build #110493 has finished for PR 25766 at commit 1b27080.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ByteCodeStats(maxClassCodeSize: Int, maxMethodCodeSize: Int, maxConstPoolSize: Int)
* Returns the bytecode statistics (max class bytecode size, max method bytecode size, and

rednaxelafx

I like this PR in general. Left some minor comments inline below.

rednaxelafx · 2019-09-12T06:22:46Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

-   * Returns the max bytecode size of the generated functions by inspecting janino private fields.
-   * Also, this method updates the metrics information.
+   * Returns the bytecode statistics (max class bytecode size, max method bytecode size, and
+   * max constant pool size) of generated classes by inspecting janino private fields.inspecting


Nit: inspecting janino private fields.inspecting janino private fields seems weird.
Also: could we always spell "Janino" as such?

oh... I'll fix soon. Thanks!

rednaxelafx · 2019-09-12T06:24:48Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

          }
        }
-        Some(stats)
+        (classCodeSize, methodCodeSizes.max, constPoolSize)


I'm curious: now that we've got a nice new ByteCodeStats type, why use a tuple here?

No strong reason... I just used a tuple to avoid the longer statement in https://github.com/apache/spark/pull/25766/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR1382:

ByteCodeStats(codeStats.reduce { case (v1, v2) =>
  (Math.max(v1.maxClassCodeSize, v2.maxClassCodeSize),
    Math.max(v1.maxMethodCodeSize, v2.maxMethodCodeSize),
    Math.max(v1.maxConstPoolSize, v2.maxConstPoolSize))
})

If there are other reviewers who like that, I'll update.

I find named fields much more readable than _1 _2 _3. In fact even with tuples I may have written the code like:

ByteCodeStats(codeStats.reduce {
  case ((maxClassCodeSize1, maxMethodCodeSize1, maxConstPoolSize1),
      (maxClassCodeSize2, maxMethodCodeSize2, maxConstPoolSize2)) =>
    (Math.max(maxClassCodeSize1, maxClassCodeSize2),
      Math.max(maxMethodCodeSize1, maxMethodCodeSize2),
      Math.max(maxConstPoolSize1, maxConstPoolSize2))
})

and...I'd say the v1.maxClassCodeSize version looks better here.

How about the latest code? I added a new metric (# of inner classes), so using a tuple in that part is ok?
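For comparison, here is a minimal, self-contained sketch of the named-field variant discussed in this thread. The `ByteCodeStats` definition below is a hypothetical stand-in for the three-field class in the diff, and `NamedFieldReduce.merge` is an illustrative name, not code from this PR.

```scala
// Hypothetical stand-in for the three-field class discussed in this thread.
case class ByteCodeStats(maxClassCodeSize: Int, maxMethodCodeSize: Int, maxConstPoolSize: Int)

object NamedFieldReduce {
  // Combine per-class statistics by taking the field-wise maximum.
  // Note: `reduce` throws on an empty sequence, so callers must pass at least one element.
  def merge(stats: Seq[ByteCodeStats]): ByteCodeStats =
    stats.reduce { (v1, v2) =>
      ByteCodeStats(
        math.max(v1.maxClassCodeSize, v2.maxClassCodeSize),
        math.max(v1.maxMethodCodeSize, v2.maxMethodCodeSize),
        math.max(v1.maxConstPoolSize, v2.maxConstPoolSize))
    }
}
```

With named fields, each `math.max` call states which metric it combines, which is the readability point raised above.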

cloud-fan · 2019-09-12T07:48:10Z

I like this feature, thanks to @maropu for this good idea!

mgaido91

can we also add some UTs?

mgaido91 · 2019-09-12T07:45:06Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+/**
+ * Java bytecode statistics of a compiled class by Janino.
+ */
+case class ByteCodeStats(maxClassCodeSize: Int, maxMethodCodeSize: Int, maxConstPoolSize: Int)


what about also adding the number of inner classes?

It looks nice.

mgaido91 · 2019-09-12T07:45:27Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+
+object ByteCodeStats {
+
+  val unavailable = ByteCodeStats(-1, -1, -1)


nit:

Suggested change

val unavailable = ByteCodeStats(-1, -1, -1)

val UNAVAILABLE = ByteCodeStats(-1, -1, -1)

mgaido91 · 2019-09-12T07:45:59Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+
+  val unavailable = ByteCodeStats(-1, -1, -1)
+
+  def apply(codeStats: (Int, Int, Int)): ByteCodeStats = {


mmmh..do we really need this?

mgaido91 · 2019-09-12T07:47:40Z

sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala

-    for (((subtree, code), i) <- codegenSeq.zipWithIndex) {
-      append(s"== Subtree ${i + 1} / ${codegenSeq.size} ==\n")
+    for (((subtree, code, codeStats), i) <- codegenSeq.zipWithIndex) {
+      val codeStatsStr = s"maxClassCodeSize:${codeStats.maxClassCodeSize} " +


nit: what about separating them with semicolons?

You suggested this?

== Subtree 1 / 2 (maxClassCodeSize:3689; maxMethodCodeSize:226; maxConstantPoolSize:167) ==

Yes. Actually, any other separator you prefer is fine; I just find it more readable with a separator.

Yea, I think that suggested one looks ok to me. Thanks!

kiszk · 2019-09-12T10:51:55Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

    }
+
+    ByteCodeStats(codeStats.reduce[(Int, Int, Int)] { case (v1, v2) =>
+      (Math.max(v1._1, v2._1), Math.max(v1._2, v2._2), Math.max(v1._3, v2._3))


I like this direction.

Could we see these values for each class? Would it be better to show the class name and method name, too?

Currently, this PR prints statistics per whole-stage codegen entry, so the current one looks ok to me.

SparkQA · 2019-09-12T15:08:08Z

Test build #110515 has finished for PR 25766 at commit fa4234c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ByteCodeStats(
* Returns the bytecode statistics (max class bytecode size, max method bytecode size,

viirya

Overall, I think this is good and useful.

cloud-fan · 2019-09-12T15:42:30Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

    codeAttrField.setAccessible(true)
-    val codeSizes = classes.flatMap { case (_, classBytes) =>
-      CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.update(classBytes.length)
+    val codeStats = classes.map { case (_, classBytes) =>


I would like to make the code more readable, by

val (classSizes, maxMethodSizes, constPoolSize) = classes.map....unzip3
ByteCodeStats(
  maxClassCodeSize = classSizes.max,
  maxMethodCodeSize = maxMethodSizes.max,
  maxConstPoolSize = constPoolSize.max,
  // Minus 2 for `GeneratedClass` and an outer-most generated class
  numInnerClasses = classSizes.size - 2)
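The `unzip3` approach suggested above can be sketched in a self-contained form as follows. This is only an illustration: the four-field `ByteCodeStats` mirrors the named arguments in the suggestion, while `Unzip3Sketch.summarize` and its tuple input are hypothetical and replace the elided `classes.map...` step.

```scala
// Hypothetical four-field version, mirroring the named arguments in the suggestion.
case class ByteCodeStats(
    maxClassCodeSize: Int,
    maxMethodCodeSize: Int,
    maxConstPoolSize: Int,
    numInnerClasses: Int)

object Unzip3Sketch {
  // Each tuple is (class size, max method size, constant pool size) for one compiled class.
  def summarize(perClass: Seq[(Int, Int, Int)]): ByteCodeStats = {
    // Split the sequence of triples into three parallel sequences.
    val (classSizes, maxMethodSizes, constPoolSizes) = perClass.unzip3
    ByteCodeStats(
      maxClassCodeSize = classSizes.max,
      maxMethodCodeSize = maxMethodSizes.max,
      maxConstPoolSize = constPoolSizes.max,
      // Minus 2 for `GeneratedClass` and the outer-most generated class, as in the suggestion.
      numInnerClasses = classSizes.size - 2)
  }
}
```

Because the inner-class count is derived from the number of compiled classes, this sketch assumes the input always contains the two non-inner classes.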

cloud-fan · 2019-09-12T15:47:10Z

sql/core/src/test/scala/org/apache/spark/sql/execution/debug/DebuggingSuite.scala

+  test("Prints bytecode statistics in debugCodegen") {
+    Seq(("SELECT sum(v) FROM VALUES(1) t(v)", (0, 0)),
+      // We expect HashAggregate uses an inner class for fast hash maps
+      // in partial aggregates with keys.


I'd like to avoid end-to-end tests in this case. It's highly coupled with how we codegen these operators and is easy to break if we change the implementation in the future.

Can we add some UT that calls CodeGenerator.compile directly?

ok, I'll try.

How about the latest test? https://github.com/apache/spark/pull/25766/files#diff-8fcc5aeeefc8c2e921028d4b730d7d55R119

maropu · 2019-09-13T01:38:59Z

sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala

+      } else {
+        ""
+      }
+      val codeStatsStr = s"maxClassCodeSize:${codeStats.maxClassCodeSize}; " +


I added one more metric for the ratio of the constant pool in use, like maxConstantPoolSize:130(0.20% used):

scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]

cc: @rednaxelafx @kiszk

I am neutral on this. I am a bit worried that the wider header line makes it harder to read.

@maropu Could you elaborate on your idea to add this?

How about adding ratio of maxMethodCodeSize / CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT, too?

Yea, it might be worth adding, but this PR is already merged. We could revisit this in the future if we need this metric for debugging...
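As a rough illustration of how the "(0.20% used)" suffix in this thread can be derived, the sketch below divides the constant pool size by an assumed limit of 65535 entries (the class-file format stores `constant_pool_count` in a 16-bit field). The object and method names are hypothetical and are not taken from the patch.

```scala
import java.util.Locale

object CodeStatsHeader {
  // Assumption: 65535 is used as the constant pool limit, since the JVM
  // class-file format stores constant_pool_count in an unsigned 16-bit field.
  val MaxConstantPoolSize: Int = 65535

  // Format the share of the constant pool in use, e.g. "(0.20% used)".
  // Locale.ROOT keeps the decimal separator stable across environments.
  def constPoolUsage(size: Int): String = {
    val ratio = size.toDouble / MaxConstantPoolSize
    "(%.2f%% used)".formatLocal(Locale.ROOT, ratio * 100)
  }

  // Build the semicolon-separated statistics string shown in the subtree header.
  def header(
      maxClassCodeSize: Int,
      maxMethodCodeSize: Int,
      maxConstPoolSize: Int,
      numInnerClasses: Int): String =
    s"maxClassCodeSize:$maxClassCodeSize; " +
      s"maxMethodCodeSize:$maxMethodCodeSize; " +
      s"maxConstantPoolSize:$maxConstPoolSize${constPoolUsage(maxConstPoolSize)}; " +
      s"numInnerClasses:$numInnerClasses"
}
```

A ratio for maxMethodCodeSize, as proposed above, could follow the same pattern with a different denominator.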

viirya · 2019-09-13T04:24:25Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+ * Java bytecode statistics of a compiled class by Janino.
+ */
+case class ByteCodeStats(
+  maxClassCodeSize: Int, maxMethodCodeSize: Int, maxConstPoolSize: Int, numInnerClasses: Int)


Does a ByteCodeStats correspond to one compiled class? Is maxClassCodeSize the max code size among inner classes?

The current code just collects the max size among a compiled class and its inner classes. But, on second thought, I think we don't need to print the class size now because, IIUC, the size is not related to the JVM limits: https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.11

#25766 (comment)
WDYT? @kiszk

SparkQA · 2019-09-13T05:14:29Z

Test build #110554 has finished for PR 25766 at commit a5885e3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2019-09-13T06:38:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/debug/package.scala

+      } else {
+        ""
+      }
+      val codeStatsStr = s"maxClassCodeSize:${codeStats.maxClassCodeSize}; " +


How about maxClassSize instead of maxClassCodeSize? maxClassCodeSize may imply max bytecode size in a class.

SparkQA · 2019-09-13T23:41:36Z

Test build #110579 has finished for PR 25766 at commit be268de.

This patch fails to build.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ByteCodeStats(maxMethodCodeSize: Int, maxConstPoolSize: Int, numInnerClasses: Int)

SparkQA · 2019-09-14T04:05:33Z

Test build #110581 has finished for PR 25766 at commit dfc6a4c.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class ByteCodeStats(maxMethodCodeSize: Int, maxConstPoolSize: Int, numInnerClasses: Int)

cloud-fan · 2019-09-16T13:48:19Z

thanks, merging to master!

maropu · 2019-09-17T00:18:03Z

Thanks for the reviews, all!

Fix

1b27080

dongjoon-hyun added the SQL label Sep 12, 2019

rednaxelafx reviewed Sep 12, 2019

mgaido91 reviewed Sep 12, 2019

kiszk reviewed Sep 12, 2019

maropu force-pushed the PrintBytecodeStats branch from b544336 to 4e046fe Compare September 12, 2019 12:21

Address reviews

fa4234c

maropu force-pushed the PrintBytecodeStats branch from 4e046fe to fa4234c Compare September 12, 2019 12:23

viirya reviewed Sep 12, 2019

cloud-fan reviewed Sep 12, 2019

Address reviews

a5885e3

maropu commented Sep 13, 2019

viirya reviewed Sep 13, 2019

kiszk reviewed Sep 13, 2019

Drop class size

dfc6a4c

maropu force-pushed the PrintBytecodeStats branch from be268de to dfc6a4c Compare September 14, 2019 00:23

cloud-fan closed this in 6297287 Sep 16, 2019

