-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-22226][SQL] splitExpression can create too many method calls in the outer class #19480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
// The number of named constants that can exist in the class is limited by the Constant Pool | ||
// limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a | ||
// threshold of 1600k bytes to determine when a function should be inlined to a private, nested | ||
// threshold of 1000k bytes to determine when a function should be inlined to a private, nested |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason to change this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it for the possibly to be add method grouping all split methods? If yes, we can add the method into an inner class (see #19480 (comment)), so we don't need to change this threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, without this change, I got the Constant Pool limit exception of the {{NestedClass}}. Actually I tried to address this issue in this PR (#19447), but without the other changes in thee current PR I wasn't able to create a UT for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you try #19480 (comment), so we may avoid changing this threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't change that for the reason I explained there.
* | ||
* @param functionName String representing the name of the function | ||
* @param subclassName Optional value which is empty if the function is added to | ||
* the superclass, otherwise it contains the name of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
superclass or outerclass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it's outerclass, thanks, my bad.
// that subclass | ||
val code = s""" | ||
|private $returnType $func($argString) { | ||
| ${makeSplitFunction(foldFunctions(subclassFunctions.map(name => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need makeSplitFunction
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we do need for methods like compare
where we have to return something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is going to call the split methods, not the expression codes. In original splitExpressions
, makeSplitFunction
only works on the expression codes, not the calls of the methods. But here it applies it on calls of the methods.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. I see. Nvm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but this are still splits, just bigger.
The case which is critical for this is compare
. In the other cases this function is identity
. Let's consider that case: without using it here, you'd get a compilation error, because these "grouping methods" don't return anything by default; instead, if you add it here, you have the default return 0;
statement at the end.
s"$name(${arguments.map(_._2).mkString(", ")})")))} | ||
|} | ||
""".stripMargin | ||
addNewFunctionToClass(func, code, subclassName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The class to add codes into can be over the threshold too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So shall we call addNewFunction
to add the caller method into an inner class and call those split functions in that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, this is made intentionally because if we add it to the NestedClass
where these methods are added: they are not available in other nested classes...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we call them with instance?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we might, but I think this would not be the best option, since in this way we are "reusing" some ConstantPool entries (like Utf8
for the name of the function) and it keeps everything about those things in the same class.
Moreover, foreach instance referenced from other classes, like it would be in that case, we are adding new entries to the Constant Pool of that class.
Thus this is the option which minimizes the number of entries added to the Constant Pools.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry, I didn't get completely you last comments, might you elaborate it a bit more? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. I meant the increased inner classes/instances are added into the outerclass's Constant Pool entries. Outerclass should be more likely to be filled by various variables/methods...than inner classes. So I think to add into Constant Pool of inner classes should be less serious issue than outerclass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is also convincing that nested classes should be less. So maybe this is not a serious issue, although I'm not sure this can be completely safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but we are likely to recreate the same issue the PR is trying to solve in the last NestedClass
. The problem is that a method call, despite it is few bytes of code, introduce multiple entries in the constant pool. If we use addNewFunction
here, we will end up creating all the wrapper methods in the last NestedClass
, which will have the same issues as the outer class, becoming a new bottelneck.
Instead, in this way, we are splitting the method call among the several nested classes and each of them has its owns, thus it is reusing some constant pool entries.
As you pointed out, nested classes are less critical than the outer one, because the attributes are declared in the outer class and currently all the many small methods are called there. If we put addNewFunction
here, instead, we are making critical the last nested function because we are adding there all the many small function calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds reasonable to me.
* the outer class, otherwise it contains the name of the | ||
* inner class in which the function has been added. | ||
* @param subclassInstance Optional value which is empty if the function is added to | ||
* the superclass, otherwise it contains the name of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
superclass here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
val key = (f.subclassName.get, f.subclassInstance.get) | ||
acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName)) | ||
} | ||
.map { case ((subclassName, subclassInstance), subclassFunctions) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If subclassFunctions.length
is only 1, I think we don't need to add redundant wrapper caller?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right, even though this is very unlikely. We might also think of a threshold maybe. Which is the right approach according to you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think it can only happen at the first or last sub-class functions. Seems the functions might be only included in two (?) sub-classes. Most of functions will be wrapped in one function call.
I'm not sure the proper threshold for this, maybe 5?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I will create a field to make this easy to be changed.
foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})")) | ||
val outerClassFunctions = functions | ||
.filter(_.subclassName.isEmpty) | ||
.map(_.functionName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the calls to outerclass functions should not be an issue, even they could be many?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly, because since they are defined there, they add no entry to the constant pool
.filter(_.subclassName.isDefined) | ||
.foldLeft(Map.empty[(String, String), Seq[String]]) { case (acc, f) => | ||
val key = (f.subclassName.get, f.subclassInstance.get) | ||
acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The order of function calls should be important and be kept. Does the map guarantee that updated map still preserves the insertion order?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
only up to 4 keys, thanks for the nice catch, I will fix this.
.filter(_.subclassName.isEmpty) | ||
.map(_.functionName) | ||
|
||
val innerClassFunctions = functions |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might not be very intuitive for later readers. We'd better to add comments to explain it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will, thanks for the suggestion
@HyukjinKwon or @cloud-fan May you help to trigger the jenkins test? Thank you. |
@mgaido91 Thank you for working on this! Based on your explanation, sounds like a reasonable direction to me. |
ok to test |
Thank you @HyukjinKwon |
@@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper { | |||
} | |||
} | |||
|
|||
test("SPARK-22226: group splitted expressions into one method per nested class") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Besides the unit test, can you provide an end-to-end case that can trigger this issue too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a use case where I faced this problem. And I tried this patch on it. Unfortunately this contains a very complex business logic and I have not been able to reproduce it in a simple one. But if needed, I can try again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of copying your customer codes, can you making a fake one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 I can reproduce the issue by following test case. You can check it:
test("SPARK-22226: too much splitted expressions should not exceed constant pool limit") {
withSQLConf(
(SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) {
val colNumber = 1000
val baseDF = spark.range(10).toDF()
val newCols = (1 to colNumber).map { colIndex =>
expr(s"id + $colIndex").as(s"_$colIndex")
}
val input = baseDF.select(newCols: _*)
val aggs = (1 to colNumber).flatMap { colIndex =>
val colName = s"_$colIndex"
Seq(expr(s"stddev($colName)"),
expr(s"stddev_samp($colName)"),
expr(s"stddev_pop($colName)"),
expr(s"variance($colName)"),
expr(s"var_samp($colName)"),
expr(s"var_pop($colName)"),
expr(s"skewness($colName)"),
expr(s"kurtosis($colName)"))
}
input.agg(aggs.head, aggs.tail: _*).collect()
}
}
[info] Cause: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeExc
eption: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection
has grown past JVM limit of 0xFFFF
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressi
ons$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1079)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you very much for your help @viirya ! In my use cases it seemed to be connected to the dropDuplicates
method and I focused on it, but thanks to your suggestion now I realize that dropDuplicates
by itself is not enough, it needs also some functions applied to columns to generate the issue! Thank you so much. Where should I add this test case? I am adding it to DataFrameAggregateSuite
since this is related to aggregating some functions, is it ok? Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was adding it to DataFrameAggregateSuite
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya I have a good and a bad news... Thanks to your suggestion I have been able to understand and reproduce the issue. Moreover, I found also another issue which is fixed by this problem and I am adding a UT for that too: in some cases, we might have a
Code of method apply(...) grows beyond 64 KB
And with this PR the problem is fixed.
The bad thing is that the UT you provided still fails, but with a different error: actually it is always a Constant Pool limit exceeded exception, but it is in a NestedClass. From my analysis, this is caused by another problem, ie. that we might reference too many fields of the superclass in the NestedClasses. This might be addressed maybe trying to tune the magic number which I brought to 1000k in this PR, but I am pretty sure that it will be also addressed by the ongoing PR for SPARK-18016, since he is trying to reduce the number of variables. Thus I consider this out of scope for this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 Do you meant "2: compact primitive declarations into arrays" in SPARK-18016?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mgaido91 Thanks for trying it. Yeah, those expressions like skewness
are very complicated, so they're likely to cause the issue you encountered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya exactly, I meant that. Thank you for your suggestion. You have been very helpful to me.
nit: Do we need |
@kiszk I added just to show that there is one method for instance, but then I am missing the NestedClass1 declaration. I am adding it. Thanks. |
@mgaido91 thanks, this example makes sense. |
Test build #82682 has finished for PR 19480 at commit
|
Test build #82687 has finished for PR 19480 at commit
|
Test build #82697 has finished for PR 19480 at commit
|
@@ -1010,6 +1079,8 @@ object CodeGenerator extends Logging { | |||
// This is the value of HugeMethodLimit in the OpenJDK JVM settings | |||
val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000 | |||
|
|||
val MERGE_SPLIT_METHODS_THRESHOLD = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a short comment for this too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another magic number?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@gatorsmile it comes from this discussion.
The current change LGTM. If you can have an end-to-end test for this. This might be better. This seems not an issue very easy to encounter by end users. Can you try to have one? @mgaido91 |
Test build #83052 has finished for PR 19480 at commit
|
* the outer class, otherwise it contains the name of the | ||
* instance of the inner class in the outer class. | ||
*/ | ||
private[codegen] case class NewFunction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename it to NewFunctionSpec
?
* @param subclassName Optional value which is empty if the function is added to | ||
* the outer class, otherwise it contains the name of the | ||
* inner class in which the function has been added. | ||
* @param subclassInstance Optional value which is empty if the function is added to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw three ways to represent the same concept. subclass, inner class, nested classes.
How about renaming all of them to inner classes? Could you go over all the code changes in this PR to make them consistent?
} else { | ||
subclassFunctions.map(f => s"$subclassInstance.$f") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Build a separate private function for generating innerClassFunctions? Now, the function splitExpressions
is pretty large after this PR.
.filter(_.subclassName.isEmpty) | ||
.map(_.functionName) | ||
|
||
val argsString = arguments.map(_._2).mkString(", ") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move this just one line before line 876?
.filter(_.subclassName.isDefined) | ||
.foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) => | ||
val key = (f.subclassName.get, f.subclassInstance.get) | ||
acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To improve the readability, let us avoid using foldLeft
here. You can use a mutable map.
Test build #83079 has finished for PR 19480 at commit
|
Test build #83084 has finished for PR 19480 at commit
|
val argsString = arguments.map(_._2).mkString(", ") | ||
foldFunctions((outerClassFunctions ++ innerClassFunctions).map( | ||
name => s"$name($argsString)")) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
val (outerClassFunctions, innerClassFunctions) = functions.partition(_.innerClassName.isEmpty)
val argsString = arguments.map(_._2).mkString(", ")
val outerClassFunctionCalls = outerClassFunctions.map(f => s"${f.functionName}($argsString)")
val innerClassFunctionCalls = generateInnerClassesFunctionCalls(
innerClassFunctions,
func,
arguments,
returnType,
makeSplitFunction,
foldFunctions)
foldFunctions(outerClassFunctionCalls ++ innerClassFunctionCalls)
returnType: String, | ||
makeSplitFunction: String => String, | ||
foldFunctions: Seq[String] => String): Iterable[String] = { | ||
val innerClassToFunctions = mutable.ListMap.empty[(String, String), Seq[String]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not using LinkedHashMap
innerClassToFunctions.getOrElse(key, Seq.empty[String])) | ||
}) | ||
// for performance reasons, the functions are prepended, instead of appended, | ||
// thus they are in reversed order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you seen the perf issue when the number of functions is just thousands?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with 2.000 my tests on my machine (2.5 GHz Intel Core i7 OSX) are:
- with this solution 3ms
- with the append at the end 121 ms
And of course if the number increases, the difference grows.
addNewFunctionToClass(funcName, code, innerClassName) | ||
Seq(s"$innerClassInstance.$funcName") | ||
} else { | ||
innerClassFunctions.map(f => s"$innerClassInstance.$f") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, we are returning the function's invocations. Thus, passing arguments is more straightforward
innerClassFunctions.reverse.map(f => s"$innerClassInstance.$f($argInvocationString)")
|} | ||
""".stripMargin | ||
addNewFunctionToClass(funcName, code, innerClassName) | ||
Seq(s"$innerClassInstance.$funcName") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The same here.
Test build #83119 has finished for PR 19480 at commit
|
val key = (f.innerClassName.get, f.innerClassInstance.get) | ||
innerClassToFunctions.update(key, f.functionName +: | ||
innerClassToFunctions.getOrElse(key, Seq.empty[String])) | ||
}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
functions.foreach { f =>
val key = (f.innerClassName.get, f.innerClassInstance.get)
val value = f.functionName +: innerClassToFunctions.getOrElse(key, Seq.empty[String])
innerClassToFunctions.put(key, value)
}
|private $returnType $funcName($argDefinitionString) { | ||
| ${makeSplitFunction(body)} | ||
|} | ||
""".stripMargin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Adding a new function to each inner class which contains the invocation of all the
// ones which have been added to that inner class. For example,
// private class NestedClass {
// private void apply_862(InternalRow i) { ... }
// private void apply_863(InternalRow i) { ... }
// ...
// private void apply(InternalRow i) {
// apply_862(i);
// apply_863(i);
// ...
// }
// }
val body = foldFunctions(orderedFunctions.map(name => s"$name($argInvocationString)"))
val code =
s"""
|private $returnType $funcName($argDefinitionString) {
| ${makeSplitFunction(body)}
|}
""".stripMargin
LGTM except two comments. Thanks for your work!!! @mgaido91 |
Thank you all for your reviews. I addressed the last comments. |
// } | ||
// } | ||
val body = foldFunctions(orderedFunctions.map(name => | ||
s"$name($argInvocationString)")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one line or you need to use {}
functions.foreach(f => { | ||
val key = (f.innerClassName.get, f.innerClassInstance.get) | ||
val value = f.functionName +: innerClassToFunctions.getOrElse(key, Seq.empty[String]) | ||
innerClassToFunctions.update(key, value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: update
-> put
LGTM pending Jenkins |
@gatorsmile sorry, I wasn't aware. Thank you anyway. |
Test build #83127 has finished for PR 19480 at commit
|
Test build #83129 has finished for PR 19480 at commit
|
Thanks! Merged to master. |
What changes were proposed in this pull request?
SPARK-18016 introduced
NestedClass
to avoid that the many methods generated bysplitExpressions
contribute to the outer class' constant pool, making it growing too much. Unfortunately, despite their definition is stored in theNestedClass
, they all are invoked in the outer class and for each method invocation, there are two entries added to the constant pool: aMethodref
and aUtf8
entry (you can easily check this compiling a simple sample class withjaninoc
and looking at its Constant Pool). This limits the scalability of the solution with very large methods which are split in a lot of small ones. This means that currently we are generating classes like this one:This PR reduce the Constant Pool size of the outer class by adding a new method to each nested class: in this method we invoke all the small methods generated by
splitExpression
in that nested class. In this way, in the outer class there is only one method invocation per nested class, reducing by orders of magnitude the entries in its constant pool because of method invocations. This means that after the patch the generated code becomes:How was this patch tested?
Added UT and existing UTs