[SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions#19767
[SPARK-22543][SQL] fix java 64kb compile error for deeply nested expressions#19767cloud-fan wants to merge 8 commits intoapache:masterfrom
Conversation
|
Test build #83943 has finished for PR 19767 at commit
|
|
I like this approach. Does this pr cover all the issue that #18641 describes? or, orthogonal? Anyway, I checked the TPCDS perf with this current pr: https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing |
|
Seems a good approach that saves us much effort to add similar codes for many expressions. |
|
Also from the numbers provided by @maropu, looks no significant regression. |
|
Probably, I feel we better track the changes of actual bytecode size statistics (e.g, |
There was a problem hiding this comment.
Should we move the comment into the function?
There was a problem hiding this comment.
I don't have a strong preference, it's ok to have comment at function caller side.
There was a problem hiding this comment.
nit: when isNull is evaluated to true at runtime, we don't need to set value.
There was a problem hiding this comment.
yea it's already done when define setIsNull
|
@maropu it partially covers #18641 . One problem is that, for an expression, if its child generates code less than 1024, and it has many children, then we still have an issue. |
|
Looks good direction if we do not see performance degradation. |
There was a problem hiding this comment.
Can we always pass value as a return value instead of void? It can reduce # of global variables.
There was a problem hiding this comment.
good suggestion! Actually, this is a general strategy which can be applied to more places. If there are only boolean global variables, it's very easy to fold them into one array.
There was a problem hiding this comment.
Thanks.
IMHO, I am curious whether we will see no performance degradation by using one array to compact many boolean variables.
I am waiting for the updated result in this discussion. This is because the current code seems to measure performance of interpreter due to lack of warmup.
There was a problem hiding this comment.
Is that a bad idea to prepare some utility classes to store a pair (value, isNull) for this splitting cases? I feel class fields are valuable resources.
There was a problem hiding this comment.
creating objects will be a big overhead. I think having a global boolean variable is better.
There was a problem hiding this comment.
I originally thought we could avoid the overhead by using thread-local singleton? But, it's a bit weird, so the current code looks good.
There was a problem hiding this comment.
The current code can pass the value as a local variable if this method is inlined, or can pass the value on register for return value if this method is not inlined.
On the other hand, to use a object will always introduce memory accesses to access fields in caller and callee.
|
Test build #83963 has finished for PR 19767 at commit
|
|
LGTM |
|
should this go to 2.2? |
There was a problem hiding this comment.
Actually I think this removed part is orthogonal to what this PR did. Even condition, true, and false expressions are not more than threshold individually, their combination is still more than the threshold. We will pack them into a big method after this PR.
This PR deals the oversize gen'd codes in deeply nested expressions, not oversize combination of codes from the children.
There was a problem hiding this comment.
I already explained it in #19767 (comment)
Mostly it's ok because the threshold is just an estimation, not a big deal to make it 2 times larger. CASE WHEN may be a problem and we can evaluate it in #18641 after this PR gets merged.
There was a problem hiding this comment.
Two problems I think for this. One is even the two childs' code don't exceed the threshold individually, a method not over 64k but over 8k is still big and bad for JIT. One is we estimate it with code length, I'm not sure if two 1000 length childs won't definitely generate 64k method in the end.
There was a problem hiding this comment.
There is no way to guarantee it with the current string based codegen framework, even without this PR. 1000 length code may also generate 64kb byte code in the end.
1024 is not a good estimation at all, kind of random to me. So multiplying it with 2 doesn't seem a big issue. CASE WHEN may have issue.
There was a problem hiding this comment.
BTW if it's really an issue, we can add splitting logic in non-leaf/non-unary nodes. This is much less work than before because: 1. no need to care about unary nodes 2. the splitting logic can be simpler because all children are guaranteed to generate less than 1000 LOC.
| val funcName = ctx.freshName(nodeName) | ||
| val funcFullName = ctx.addNewFunction(funcName, | ||
| s""" | ||
| |private $javaType $funcName(InternalRow ${ctx.INPUT_ROW}) { |
There was a problem hiding this comment.
To continue the discussion in #19767 (comment)
I think there are more global variables can be eliminated by leveraging the method return value. However in some cases, we use global variables to avoid creating an object for each iteration, then we are facing a trade-off between GC overhead and global variable overhead. It would be great if java has something like C struct and can allocate objects on method stack...
There was a problem hiding this comment.
thanks for this fix. I like your approach here.
Actually what happens depends on the type of the variable and anyway I think that most of the time we are reinitializing anyway these objects, thus the only thing we are saving using global variables is the pointer and I am not sure if this is a big deal.
|
@felixcheung I'd like to keep it in master only, it has larger impaction than other related PRs. |
|
Test build #84085 has finished for PR 19767 at commit
|
|
will review it tonight. |
| if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) { | ||
| val setIsNull = if (ve.isNull != "false" && ve.isNull != "true") { | ||
| val globalIsNull = ctx.freshName("globalIsNull") | ||
| ctx.addMutableState("boolean", globalIsNull, s"$globalIsNull = false;") |
| val ve = doGenCode(ctx, ExprCode("", isNull, value)) | ||
|
|
||
| // TODO: support whole stage codegen too | ||
| if (ve.code.trim.length > 1024 && ctx.INPUT_ROW != null && ctx.currentVars == null) { |
There was a problem hiding this comment.
Could you change 1024 to 1? Just to ensure whether all the tests can pass and then change it back to 1024?
There was a problem hiding this comment.
I think it won't work because of hitting other limitations, e.g. JVM constant pool.
I'll try something bigger, like 100
| """.stripMargin) | ||
|
|
||
| ve.value = newValue | ||
| ve.code = s"$javaType $newValue = $funcFullName(${ctx.INPUT_ROW});" |
There was a problem hiding this comment.
Create a separate function for this?
|
LGTM except the above comments. |
|
Test build #84099 has finished for PR 19767 at commit
|
| } | ||
|
|
||
| val javaType = ctx.javaType(dataType) | ||
| val newValue = ctx.freshName("value") |
There was a problem hiding this comment.
why is this needed? I think we can use eval.value instead of it
There was a problem hiding this comment.
ev.value may be a global variable and here we need a local variable.
There was a problem hiding this comment.
why do we strictly need a local variable here? Can't we simply assign ev.value to the generated function return value?
There was a problem hiding this comment.
then how are we going to change this?
eval.code = s"$javaType $newValue = $funcFullName(${ctx.INPUT_ROW});"
Saving a local variable is nothing and I think we shouldn't complicate the code(check if a variable is global) because of this.
There was a problem hiding this comment.
ah, do you mean just do eval.value = s"$funcFullName(${ctx.INPUT_ROW})"? Let me try
There was a problem hiding this comment.
I meant:
eval.code = s"${eval.value} = $funcFullName(${ctx.INPUT_ROW});"
There was a problem hiding this comment.
this won't work because ${eval.value} is not declared if it's not a global variable. I went with
eval.code = ""
eval.value = s"$funcFullName(${ctx.INPUT_ROW})"
There was a problem hiding this comment.
I see, sorry, you're right. Then I think your previous solution is better: in this way if eval.value is used multiple times we are recomputing the function every time, thus your original implementation was fine, sorry for the bad comment.
|
Test build #84104 has finished for PR 19767 at commit
|
|
Test build #84111 has finished for PR 19767 at commit
|
|
Test build #84110 has finished for PR 19767 at commit
|
|
Test build #84112 has finished for PR 19767 at commit
|
|
LGTM |
|
Thanks! Merged to master. |
What changes were proposed in this pull request?
A frequently reported issue of Spark is the Java 64kb compile error. This is because Spark generates a very big method and it's usually caused by 3 reasons:
This PR focuses on 1. There are already several patches(#15620 #18972 #18641) trying to fix this issue and some of them are already merged. However this is an endless job as every non-leaf expression has this issue.
This PR proposes to fix this issue in
Expression.genCode, to make sure the code for a single expression won't grow too big.According to @maropu 's benchmark, no regression is found with TPCDS (thanks @maropu !): https://docs.google.com/spreadsheets/d/1K3_7lX05-ZgxDXi9X_GleNnDjcnJIfoSlSCDZcL4gdg/edit?usp=sharing
How was this patch tested?
existing test