-
Notifications
You must be signed in to change notification settings - Fork 28.6k
[SPARK-24505][SQL] Convert strings in codegen to blocks: Cast and BoundAttribute #21537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Test build #91697 has finished for PR 21537 at commit
|
retest this please. |
@@ -625,25 +625,23 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String | |||
val eval = child.genCode(ctx) | |||
val nullSafeCast = nullSafeCastFunction(child.dataType, dataType, ctx) | |||
|
|||
ev.copy(code = | |||
ev.copy(code = eval.code + | |||
code""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be removed, can't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh. Yes.
private[this] def timestampToDoubleCode(ts: String): String = s"$ts / 1000000.0" | ||
private[this] def decimalToTimestampCode(d: ExprValue): ExprValue = | ||
JavaCode.expression( | ||
s"($d.toBigDecimal().bigDecimal().multiply(new java.math.BigDecimal(1000000L))).longValue()", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it safe to do this? aren't we loosing the reference to $d
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, here we should use Block
. Fixed.
/** | ||
* Creates an `ExprValue` representing a local java variable of required data type. | ||
*/ | ||
def freshName(name: String, dt: DataType): VariableValue = JavaCode.variable(freshName(name), dt) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about some better name like getLocalVariable
or something similar? this is not just a fresh name...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Not sure if @cloud-fan wants me to change it in this PR or other #21537 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think either we decide to postpone the introduction of this API to a followup PR or we can change its name now that we are introducing it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about we move it to JavaCode
and follow the name convention there and call it freshVariable
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting it in CodeGenerator
is just for convenience that we can save a ctx
parameter. I'm fine to let it in JavaCode
too if you don't think it's verbose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh I missed the ctx parameter thing, let's leave it.
Test build #91706 has finished for PR 21537 at commit
|
/** | ||
* Creates an `ExprValue` representing a local boolean java variable. | ||
*/ | ||
def isNullFreshName(name: String): VariableValue = JavaCode.isNullVariable(freshName(name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm, now looking at this, isNullFreshName(name)
doesn't save much typing compared to freshName(name, BooleanType)
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
haha, indeed. At least there is only one parameter. Maybe we should just use freshName(name, BooleanType)
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea, but we should open another PR to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, don't we want to do this in this PR? I can change it to freshName(name, BooleanType)
together with the changes for other comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the new one in this PR but open another PR to rename JavaCode.isNullVariable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename JavaCode.isNullVariable
? You meant renaming isNullFreshName
to freshName(name, BooleanType)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isNullFreshName
is new, we don't need it and can just call freshName(name, BooleanType)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok.
Test build #91744 has finished for PR 21537 at commit
|
Test build #91743 has finished for PR 21537 at commit
|
retest this please. |
Test build #91752 has finished for PR 21537 at commit
|
Test build #91754 has finished for PR 21537 at commit
|
val buffer = ctx.freshName("buffer") | ||
val bufferClass = classOf[UTF8StringBuilder].getName | ||
val buffer = ctx.freshVariable("buffer", classOf[UTF8StringBuilder]) | ||
val bufferClass = JavaCode.className(classOf[UTF8StringBuilder]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, each variable defined by freshVariable
has a type. We can get a type or its class name from the variable (e.g. bufffer
). Therefore, it looks redundant to declare a name of each variable again (e.g. bufferClass).
Do we have an API to get a type of the variable or define an API to get a name of the class? This is because this pattern is very common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could add a class Variable
which GlobalVariable
and LocalVariable
inherit from having a declare
method taking an optional parameter initialValue
so we can just invoke it to declare each variable. But maybe this can also be a followup. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is fine with me to address this in another PR.
|
||
def dataToStringFunc(func: String, dataType: DataType) = { | ||
val funcName = ctx.freshName(func) | ||
val dataToStringCode = castToStringCode(dataType, ctx) | ||
val data = JavaCode.variable("data", dataType) | ||
val dataStr = JavaCode.variable("dataStr", StringType) | ||
ctx.addNewFunction(funcName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: probably we can return this as inline
, so we don't have to put it everywhere we use it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can consider this in follow-ups.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this method dataToStringFunc()
is not used in other files, it would be good to address it in this PR. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Sounds good. Will update later.
@@ -155,6 +170,17 @@ object Block { | |||
|
|||
val CODE_BLOCK_BUFFER_LENGTH: Int = 512 | |||
|
|||
/** | |||
* A custom string interpolator which inlines all types of input arguments into a string without |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: maybe this comment is better to be put to the InlineBlock
class, do you agree?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think part of this comment can be moved to InlineBlock
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will also do some improvement of document.
Test build #91822 has finished for PR 21537 at commit
|
retest this please. |
Test build #91831 has finished for PR 21537 at commit
|
Test build #91883 has finished for PR 21537 at commit
|
retest this please. |
Test build #91893 has finished for PR 21537 at commit
|
Is there any comments we should address in this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just have some minor comments/questions. Overall, I think it is ok, thanks for your work.
|
||
override def + (other: Block): Block = other match { | ||
case c: CodeBlock => Blocks(Seq(this, c)) | ||
case i: InlineBlock => InlineBlock(block + i.block) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this mean that if we have to consecutive InlineBlock
s we are concatenating them without any space? Is this ok?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we won't have too many cases of concatenating InlineBlock
. You can see InlineBlock
is mostly used to wrap a small piece of code like a java class name. I'm not sure if we need to add a newline here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, I don't think we should have an usage like:
val inline1 = InlineBlock(".....")
val inline2 = inline1 + InlineBlock(".....")
code"""
| $inline2
| ...
""".stripMargin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as of now probably we don't have... @cloud-fan @kiszk what do you think?
override def + (other: Block): Block = other match { | ||
case c: CodeBlock => Blocks(Seq(this, c)) | ||
case i: InlineBlock => InlineBlock(block + i.block) | ||
case b: Blocks => Blocks(Seq(this) ++ b.blocks) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this +: b.blocks
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussed with @cloud-fan, I think this abstraction Blocks
is not necessary. Its usage can be replaced with CodeBlock
. I will submit a PR to do that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we do that PR first? I feel it's easier to review this PR after we clean up the Block
framework.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. I will do that PR first. Will ping you on the PR when it's ready.
/** | ||
* Create an `InlineBlock` for Java Class name. | ||
*/ | ||
def className(javaClass: Class[_]): InlineBlock = InlineBlock(javaClass.getName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure about the names of these 3 methods... They are doing kind of the same thing but this name is pretty different form the other 2... What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is javaClass
better for you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about javaType
as the method below? They return the same thing, just with a different input, don't they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Looks good to me.
Test build #92167 has finished for PR 21537 at commit
|
|
||
private[this] def castToBooleanCode(from: DataType): CastFunction = from match { | ||
case StringType => | ||
val stringUtils = StringUtils.getClass.getName.stripSuffix("$") | ||
val stringUtils = inline"${StringUtils.getClass.getName.stripSuffix("$")}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference between inline and code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
inline is just used as a wrapper for string, as we disallow silent string interpolation. We expand the content of an inline into string into a code block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made inline not a Block
but just a JavaCode
now. After rethinking it, inline is used for embedding a piece of string into code block because we don't allow direct string interpolation. So inline as a Block
doesn't make such sense.
@gatorsmile I think this is just the partial adoption of what was proposed here and implemented in SPARK-24121 (#21193). Despite I agree with you that we should have created a design doc for this, I think now is a bit late for it. Indeed the base API is already there, since the design has mostly been done in SPARK-24121 (and not in the scope of this PR). What is missing now, though, is a throughout adoption. But there are already many PRs which based on that API: SPARK-22856, SPARK-23951. And the current codebase already partially adopts it. So I am not sure whether reverting this PR would be a solution: we should revert many if we want the change the current design. I think a design can be created anyway, at least in order to formalize what was discussed and agreed in the various PRs and have a single source of information. We can also do some modifications to the current design according to the comments which could come out discussing the design doc, but I see them more like further changes than reverting this, as it would mean reverting many things. The only last thing I'd suggest is that probably @viirya is the best one for creating the design doc, since the original proposal and implementation was done by him. Thanks. |
This is bad. The design needs to be carefully reviewed before implementing it. Basically, we breaks the basic principles of software engineering. It is very strange to write the design doc after introducing such a fundamental API change. I think we need more expertise from the existing compiler internal experts. That is why I proposed to let @kiszk to lead this. We are not inventing anything new in our codegen framework. |
This is not related to the credit or not. I think we need to introduce an IR like what the compiler is doing instead of continuous improvement on the existing one, which is already very hacky and hard to debug and maintain. We are wasting the efforts and resources if we follow the current direction. |
I am sorry if I misunderstood. This was discussed sufficiently in the linked PRs. These are internal APIs and does not introduce behaviour changes like AnalysisBarrier which was still a good enough idea to try IMHO because it brought many benefits to reduce reanalyzing plans - this resolved many JIRA on the other hand, and I don't think AnalysisBarrier was particularly a disaster. Sometimes we should take a step forward and see if that causes an actual problem or not too. That was a good enough try and sufficiently reviewed and discussed at that time. When something should be reverted, there should usually be specific concerns; otherwise, I wouldn't revert this for vague concerns for now. This is an internal API that we can freely change later as well. Also, @viirya made a lot of efforts here and he's pretty active. I don't think dragging someone into this is a good idea for now. |
AnalysisBarrier does not introduce a behavior change. However, this requires our analyzer rules must be idempotent. The most recent correctness bug also shows another big potential hole https://issues.apache.org/jira/browse/SPARK-25051. It could break Analyzer rules without notice, since AnalysisBarrier is a leaf node. Basically, this is a disaster for Spark 2.3 release. You can count how many regressions we found after the release. This is bad. Unfortunately, I made the merge without enough consideration. If possible, I would revert that PR, but it is too late. Now, we are facing the similar issue again. We should stop continuing this direction without a proper design and review. Designing/enhancing the codegen framework requires more inputs from the experts in this area. I do not think the current design is well discussed, even if I saw some discussions in the initial PR. I am not the expert in this area. That is why I pinged @kiszk to drive this. BTW, the internal APIs we can change easily, but the design need a review especially for the core components of Spark SQL. |
I am fine to not revert it since it is too late. So many related PRs have been merged, but we need to seriously consider writing and reviewing the design docs before changing the code generation framework of Spark SQL. |
I know and agree it ended up with many regressions and bugs (strictly I meant those by behaviour changes, which was my mistake). I have been following that and I am glad that it has been reverted by another way. But I am pretty sure AnalysisBarrier motivated another way to solve it and we found introducing AnalysisBarrier might be not a good idea after getting that in in practice. We should of course have the sufficient discussion ahead but I mean it doesn't not necessarily mean a disaster. We just tried one way, found problem / better alternative and fixed that by another try. It shouldn't be frequent but I think it would've needed more time or never been fixed without the original try. This PR is relatively less critical comparing to that, and I usually merge PRs if that looks less risky or similar cases are merged. It has been open more than one month and that's why I checked this one over a week. Ideally, it's good if we have sufficient discussion for every change but in practice we can't. I think we need more active committers or reviewers, rather then just leaving those PR open. Also, we should consider revert #21193 too if this one is considered to be reverted. I wonder why this one was specially being considered to be reverted, and you say that in this PR. @viirya and @mgaido91 have been taking looks into these areas as far as I can tell, and anyone could lead if they are willing to do that I believe. The works and efforts were mostly done by @viirya and in this case it looks reasonably it makes sense to ask him. It just looked to me weird because it sounds to me who's an expert and better. |
To Spark users, introducing AnalysisBarrier is a disaster. However, to the developers of Spark internal, this is just a bug. If you served the customers who are heavily using Spark, you will understand what I am talking about. It is even hard to debug when the Spark jobs are very complex. Normally, we never commit/merge any PR that is useless, especially when the PR changes are not tiny. Reverting this PRs are also very painful. That is why Reynold took a few days to finish it. It is not a fun job for him to rewrite it. Based on the current work, I can expect there are hundreds of PRs that will be submitted for changing the codegen templates and polishing the current code. The reason why we did not merge this PR is that we are doubting this is the right thing to do. @rednaxelafx I am not saying @viirya and @mgaido91 did a bad job to submit many PRs to improve the existing one. However, we need to think of the fundamental problems we are solving in the codegen. Instead of reinventing a compiler, how about letting the compiler internal expert (in our community, we have @kiszk) to lead the effort and offer a design for this. Coding and designing are different issues. If possible, we need to find the best person to drive it. If @viirya and @mgaido91 think they are familiar with compiler internal, I am also glad to see the designs. |
Yea, yea. I understand. I wasn't trying to say the severity of effect by introducing AnalysisBarrier was trivial. True, I understand it is not an easy job. Thank you Reynold for that. Yea, also I don't mean to say we should just go ahead without sufficient discussion. Wanted to point out that there are positive aspects of the effort and try about AnalysisBarrier too. It wasn't all bad in a way.
If that's true, the concerns should be mentioned here and discussed. Was there a discussion about it in the community and did I miss it? I would appreciate if we can talk here.
If there is a design concern for that and better suggestion, let's file a JIRA. I want to see the problem, concerns and possible suggestions as well. Yup, I got that it might be better for someone who has some expertise in that area but I was thinking that they should purely based upon the community work primarily - in that way, it looked reasonable @viirya goes ahead since it's basically his work. If not, to me I don't see any particular one is preferred. Just wanted to point out that the baseline is open for anyone not for specific persons. If anyone is willing to do this, anyone is welcome to go ahead. So, primarily they should voluntarily join in without other factors. |
We are fully swamped by the hotfix and regressions of 2.3 release and the new features that are targeting to 2.4. We should post some comments in this PR earlier. Designing an IR for our codegen is the right thing we should do. [If you do not agree on this, we can discuss about it.] How to design an IR is a challenging task. The whole community is welcome to submit the designs and PRs. Everyone can show the ideas. The best idea will win. @HyukjinKwon If you have a bandwidth, please also have a try |
Thank for involving me in an important thread. I was busy this morning in Japan. I think there are three topics in the thread.
For 1., in short, my opinion is likely to revert this PR from the view like a release manager. As we know, it is a time to cut a release branch. This PR partially introduce a new representation. If there would be a bug in code generation at Spark 2.4, it may introduce a complication of debugging and fixing. WDYT? |
I wouldn't revert this unless there are specific concerns about this. Do you see any bug by a mixture |
If there's a bug, then let's fix in another JIRA. If that's impossible to fix or sounds super risky and there's something I missed, let's revert. |
For 2. and 3., it is harder to say my opinion in the comment. Let me say short comments at first. For 2., if I remember correctly, @viirya once wrote the API document in a JIRA entry. it would be good to update and add some thoughts about design as a first step. I understand that it is hard to keep the up-to-date design document, in particular, during the open-source development. This is because we have a lot of excellent comments in a PR. For 3., at the early implementation of SQL codegen (i.e. use
Now, we meet a problem that it is hard to rewrite a method argument due to problem 1. In my understanding, the current effort led by @viirya is trying to resolve problem 1. Of course, it would be happy for me to co-lead a discussion of the IR design for 2. |
Sounds cool, let's move both to JIRAs or mailing lists. |
Thanks everyone for this discussion. I think we all agree that what we need as a first thing is a design doc and we can move the discussion there. @kiszk thank you for your comments. I remember too a document with the proposed API, but I didn't remember if it was a design doc or not and if there was an official agreement on it. Anyway, the document is this one: https://docs.google.com/document/d/1By_V-A2sxCWbP7dZ5EzHIuMSe8K0fQL9lqovGWXnsfs. Probably we can start from it for a proper SPIP? I think the main goal of this is well described there, ie. allowing param replacement for splitting code in wholestage-codegen scenario. Thanks everybody for this discussion. |
Thanks for all involving in this discussion! Sorry that I was on a long flight and feels too tired to reply now. I just want to say for now, no matter what decision we made, to continue to improve existing framework or have a IR design, I will definitely actively join in. |
@HyukjinKwon I am worrying about the design of a mixture of representation s"" and code""? When the design is not good, it is hard to maintain it and add new code based on this. Let @cloud-fan decide whether we should revert it or not. @mgaido91 The doc you posted is one of the problems the IR can resolve. However, we need a high-level and low-level design doc for building a new IR in codegen. @kiszk Thanks for leading this effort! You are the best candidate to lead this in the community. @viirya Thanks for your professional reply! Have a good rest. A general suggestion. To avoid introducing the regressions, how about implementing a new one without changing the existing one? We can easily switch to the new one when it becomes stable. Thanks all for making our codegen better! |
@gatorsmile Thank you for your reply. Could you elaborate on your suggestion?
Does it mean to work in a particular branch or to work in a fork repository until its implementation has been completed ? |
I don't see a notable risk here. That's just to avoid string interpolation, which makes less error-prone, which is discussed already and the code change is small. I hope we can move other discussions to other threads like JIRA or mailing list so that people can see. It's quite difficult to find such discussion to me actually. |
It's a good point that we should have a design doc for this codegen infrastructure improvement, since it's very critical to Spark. And we should have it reviewed by the community. There were some discussions on the PRs and JIRAs, but it didn't happen in the dev list. This is something we should do next. At this stage, I think it's too late to revert anything related to the codegen improvement. There are so many codegen templates get touched and I think reverting is riskier. But we should hold it now until the design doc is reviewed by the community in dev list. |
If we will continue on improving current codegen framework, I think it is good to have a design doc reviewed by the community. If we decide to have IR design and get rid of this string based framework, do we still need to have design doc for the current codegen improvement? Or we can focus on IR design doc? |
@kiszk The initial prototype or proof of concept can be in any personal branch. When we merge it to the master branch, we still need to separate it from the current codegen and make it configurable. After the release, the users can choose which one to be used. When the new IR is stable, we can then consider deprecate the current one. This is majorly for product stability. We need to follow the similar principle for any big project. @viirya @mgaido91 Let us first focus on the new IR design and prototype. |
I think we can set up a place (mailling list or JIRA) to discuss the further thing about IR design, as suggested by @HyukjinKwon. This can be a co-work from interesting parties. |
@kiszk Please create a JIRA and we can post more ideas there. Thanks! |
Sure, let me create a JIRA. Also, it would be good to setup a place for discussion that can support multiple threads easily. |
@kiszk, have we created a JIRA or sent an email to dev mailing list yet? or am I missing something already that happened? |
@HyukjinKwon sorry for being late. I was swamped with multiple things. I have just submitted it. Looking forward to seeing feedback at https://issues.apache.org/jira/browse/SPARK-25728. |
What changes were proposed in this pull request?
This is split from #21520. This includes changes of
BoundAttribute
andCast
.This patch also adds few convenient APIs:
How was this patch tested?
Existing tests.