Skip to content

Conversation

@maropu
Copy link
Member

@maropu maropu commented Sep 12, 2019

What changes were proposed in this pull request?

This pr proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in debug prints, debugCodegen. Since these metrics are critical for codegen framework developments, I think its worth printing there. This pr intends to enable debugCodegen to print these metrics as following;

scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
...

Why are the changes needed?

For efficient developments

Does this PR introduce any user-facing change?

No

How was this patch tested?

Manually tested

@maropu
Copy link
Member Author

maropu commented Sep 12, 2019

How about this? @cloud-fan @rednaxelafx @viirya @mgaido91

@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110493 has finished for PR 25766 at commit 1b27080.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ByteCodeStats(maxClassCodeSize: Int, maxMethodCodeSize: Int, maxConstPoolSize: Int)
  • * Returns the bytecode statistics (max class bytecode size, max method bytecode size, and

Copy link
Contributor

@rednaxelafx rednaxelafx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this PR in general. Left some minor comments inline below.

* Returns the max bytecode size of the generated functions by inspecting janino private fields.
* Also, this method updates the metrics information.
* Returns the bytecode statistics (max class bytecode size, max method bytecode size, and
* max constant pool size) of generated classes by inspecting janino private fields.inspecting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: inspecting janino private fields.inspecting janino private fields seems weird.
Also: could we always spell "Janino" as such?

Copy link
Member Author

@maropu maropu Sep 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh... I'll fix soon. Thanks!

}
}
Some(stats)
(classCodeSize, methodCodeSizes.max, constPoolSize)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious: now that we've got a nice new ByteCodeStats type, why use a tuple here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No strong reason... I just did because I avoided the longer statement in https://github.com/apache/spark/pull/25766/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR1382;

    ByteCodeStats(codeStats.reduce { case (v1, v2) =>
      (Math.max(v1.maxClassCodeSize, v2.maxClassCodeSize),
        Math.max(v1.maxMethodCodeSize, v2.maxMethodCodeSize),
        Math.max(v1.maxConstPoolSize, v2.maxConstPoolSize))
    })

If there are other reviewers who like that, I'll update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find named fields much more readable than _1 _2 _3. In fact even with tuples I may have written the code like:

ByteCodeStats(codeStats.reduce { case ((maxClassCodeSize1, maxMethodCodeSize1, maxConstPoolSize), (maxClassCodeSize2, maxMethodCodeSize2, maxConstPoolSize2)) =>
      (Math.max(maxClassCodeSize1, maxClassCodeSize2),
        Math.max(maxMethodCodeSize1, maxMethodCodeSize2),
        Math.max(maxConstPoolSize1, maxConstPoolSize2))
    })

and...I'd say the v1.maxClassCodeSize version looks better here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about the latest code? I added a new metric (# of inner classes), so using a tuple in that part is ok?

@cloud-fan
Copy link
Contributor

I like this feature, thanks to @maropu for this good idea!

Copy link
Contributor

@mgaido91 mgaido91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we also add some UTs?

/**
* Java bytecode statistics of a compiled class by Janino.
*/
case class ByteCodeStats(maxClassCodeSize: Int, maxMethodCodeSize: Int, maxConstPoolSize: Int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about adding also the number od inner classes?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks nice.


object ByteCodeStats {

val unavailable = ByteCodeStats(-1, -1, -1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
val unavailable = ByteCodeStats(-1, -1, -1)
val UNAVAILABLE = ByteCodeStats(-1, -1, -1)


val unavailable = ByteCodeStats(-1, -1, -1)

def apply(codeStats: (Int, Int, Int)): ByteCodeStats = {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mmmh..do we really need this?

for (((subtree, code), i) <- codegenSeq.zipWithIndex) {
append(s"== Subtree ${i + 1} / ${codegenSeq.size} ==\n")
for (((subtree, code, codeStats), i) <- codegenSeq.zipWithIndex) {
val codeStatsStr = s"maxClassCodeSize:${codeStats.maxClassCodeSize} " +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: what about separate them by semicolumn?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You suggested this?

== Subtree 1 / 2 (maxClassCodeSize:3689; maxMethodCodeSize:226; maxConstantPoolSize:167) ==

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, actually if you prefer any other separator, I just find it more readable with a separator

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think that suggested one looks ok to me. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you

}

ByteCodeStats(codeStats.reduce[(Int, Int, Int)] { case (v1, v2) =>
(Math.max(v1._1, v2._1), Math.max(v1._2, v2._2), Math.max(v1._3, v2._3))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this direction.

May we see these values regarding different classes? Is it better to show class name and method name, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, this pr prints statistics per a whole-stage codegen entry, so the current one looks ok to me.

@SparkQA
Copy link

SparkQA commented Sep 12, 2019

Test build #110515 has finished for PR 25766 at commit fa4234c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ByteCodeStats(
  • * Returns the bytecode statistics (max class bytecode size, max method bytecode size,

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think this is good and useful.

codeAttrField.setAccessible(true)
val codeSizes = classes.flatMap { case (_, classBytes) =>
CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.update(classBytes.length)
val codeStats = classes.map { case (_, classBytes) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to make the code more readable, by

val (classSizes, maxMethodSizes, constPoolSize) = classes.map....unzip3
ByteCodeStats(
  maxClassCodeSize = classSizes.max,
  maxMethodCodeSize = maxMethodSizes.max,
  maxConstPoolSize = constPoolSize.max,
  // Minus 2 for `GeneratedClass` and an outer-most generated class
  numInnerClasses = classSizes.size - 2)

test("Prints bytecode statistics in debugCodegen") {
Seq(("SELECT sum(v) FROM VALUES(1) t(v)", (0, 0)),
// We expect HashAggregate uses an inner class for fast hash maps
// in partial aggregates with keys.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to avoid end-to-end tests in this case. It's highly coupled with how we codegen these operators and is easy to break if we change the implementation in the future.

Can we add some UT that calls CodeGenerator.compile directly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I'll try.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

} else {
""
}
val codeStatsStr = s"maxClassCodeSize:${codeStats.maxClassCodeSize}; " +
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added one more metric for the ratio of an used constant pool like maxConstantPoolSize:130(0.20% used);

scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]

cc: @rednaxelafx @kiszk

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am neutral on this. I am bit worry about the wider width regarding ease of reading._

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@maropu Could you elaborate on your idea to add this?

How about adding ratio of maxMethodCodeSize / CodeGenerator.DEFAULT_JVM_HUGE_METHOD_LIMIT, too?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, it might be worth adding it though, this pr already merged. We could revisit this in future if need this metric for debugging...

* Java bytecode statistics of a compiled class by Janino.
*/
case class ByteCodeStats(
maxClassCodeSize: Int, maxMethodCodeSize: Int, maxConstPoolSize: Int, numInnerClasses: Int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A ByteCodeStats matches to a compiled class? maxClassCodeSize is for max inner class code size?

Copy link
Member Author

@maropu maropu Sep 13, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code just collects the max size among a compiled class and inner classes. But, on second thoughs, I think now we don't need to print the class size cuz IIUC the size is not related to the JVM limits: https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.11

#25766 (comment)
WDYT? @kiszk

@SparkQA
Copy link

SparkQA commented Sep 13, 2019

Test build #110554 has finished for PR 25766 at commit a5885e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

} else {
""
}
val codeStatsStr = s"maxClassCodeSize:${codeStats.maxClassCodeSize}; " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about maxClassSize instead of maxClassCodeSize? maxClassCodeSize may imply max bytecode size in a class.

@SparkQA
Copy link

SparkQA commented Sep 13, 2019

Test build #110579 has finished for PR 25766 at commit be268de.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ByteCodeStats(maxMethodCodeSize: Int, maxConstPoolSize: Int, numInnerClasses: Int)

@SparkQA
Copy link

SparkQA commented Sep 14, 2019

Test build #110581 has finished for PR 25766 at commit dfc6a4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ByteCodeStats(maxMethodCodeSize: Int, maxConstPoolSize: Int, numInnerClasses: Int)

@cloud-fan
Copy link
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6297287 Sep 16, 2019
@maropu
Copy link
Member Author

maropu commented Sep 17, 2019

Thanks for the reviews, all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants