Skip to content

[SPARK-24505][SQL] Convert strings in codegen to blocks: Cast and BoundAttribute #21537

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 11 commits into from

Conversation

viirya
Copy link
Member

@viirya viirya commented Jun 12, 2018

What changes were proposed in this pull request?

This is split from #21520. This includes changes of BoundAttribute and Cast.
This patch also adds few convenient APIs:

CodeGenerator.freshVariable(name: String, dt: DataType): VariableValue
CodeGenerator.freshVariable(name: String, javaClass: Class[_]): VariableValue

JavaCode.javaType(javaClass: Class[_]): Inline
JavaCode.javaType(dataType: DataType): Inline
JavaCode.boxedType(dataType: DataType): Inline

How was this patch tested?

Existing tests.

@viirya
Copy link
Member Author

viirya commented Jun 12, 2018

@SparkQA
Copy link

SparkQA commented Jun 12, 2018

Test build #91697 has finished for PR 21537 at commit 89d0252.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class InlineHelper(val sc: StringContext) extends AnyVal
  • case class InlineBlock(block: String) extends Block

@viirya
Copy link
Member Author

viirya commented Jun 12, 2018

retest this please.

@@ -625,25 +625,23 @@ case class Cast(child: Expression, dataType: DataType, timeZoneId: Option[String
val eval = child.genCode(ctx)
val nullSafeCast = nullSafeCastFunction(child.dataType, dataType, ctx)

ev.copy(code =
ev.copy(code = eval.code +
code"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed, can't it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh. Yes.

private[this] def timestampToDoubleCode(ts: String): String = s"$ts / 1000000.0"
private[this] def decimalToTimestampCode(d: ExprValue): ExprValue =
JavaCode.expression(
s"($d.toBigDecimal().bigDecimal().multiply(new java.math.BigDecimal(1000000L))).longValue()",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it safe to do this? aren't we loosing the reference to $d?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, here we should use Block. Fixed.

/**
* Creates an `ExprValue` representing a local java variable of required data type.
*/
def freshName(name: String, dt: DataType): VariableValue = JavaCode.variable(freshName(name), dt)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about some better name like getLocalVariable or something similar? this is not just a fresh name...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Not sure if @cloud-fan wants me to change it in this PR or other #21537 (comment).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think either we decide to postpone the introduction of this API to a followup PR or we can change its name now that we are introducing it...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about we move it to JavaCode and follow the name convention there and call it freshVariable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting it in CodeGenerator is just for convenience that we can save a ctx parameter. I'm fine to let it in JavaCode too if you don't think it's verbose.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I missed the ctx parameter thing, let's leave it.

@SparkQA
Copy link

SparkQA commented Jun 12, 2018

Test build #91706 has finished for PR 21537 at commit 89d0252.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • implicit class InlineHelper(val sc: StringContext) extends AnyVal
  • case class InlineBlock(block: String) extends Block

/**
* Creates an `ExprValue` representing a local boolean java variable.
*/
def isNullFreshName(name: String): VariableValue = JavaCode.isNullVariable(freshName(name))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm, now looking at this, isNullFreshName(name) doesn't save much typing compared to freshName(name, BooleanType)...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haha, indeed. At least there is only one parameter. Maybe we should just use freshName(name, BooleanType).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea, but we should open another PR to do this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, don't we want to do this in this PR? I can change it to freshName(name, BooleanType) together with the changes for other comments.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the new one in this PR but open another PR to rename JavaCode.isNullVariable

Copy link
Member Author

@viirya viirya Jun 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rename JavaCode.isNullVariable? You meant renaming isNullFreshName to freshName(name, BooleanType)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isNullFreshName is new, we don't need it and can just call freshName(name, BooleanType)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok.

@SparkQA
Copy link

SparkQA commented Jun 13, 2018

Test build #91744 has finished for PR 21537 at commit 531faf4.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 13, 2018

Test build #91743 has finished for PR 21537 at commit 7f486fe.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 13, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 13, 2018

Test build #91752 has finished for PR 21537 at commit 531faf4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 13, 2018

Test build #91754 has finished for PR 21537 at commit 531faf4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val buffer = ctx.freshName("buffer")
val bufferClass = classOf[UTF8StringBuilder].getName
val buffer = ctx.freshVariable("buffer", classOf[UTF8StringBuilder])
val bufferClass = JavaCode.className(classOf[UTF8StringBuilder])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, each variable defined by freshVariable has a type. We can get a type or its class name from the variable (e.g. bufffer). Therefore, it looks redundant to declare a name of each variable again (e.g. bufferClass).
Do we have an API to get a type of the variable or define an API to get a name of the class? This is because this pattern is very common.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could add a class Variable which GlobalVariable and LocalVariable inherit from having a declare method taking an optional parameter initialValue so we can just invoke it to declare each variable. But maybe this can also be a followup. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is fine with me to address this in another PR.


def dataToStringFunc(func: String, dataType: DataType) = {
val funcName = ctx.freshName(func)
val dataToStringCode = castToStringCode(dataType, ctx)
val data = JavaCode.variable("data", dataType)
val dataStr = JavaCode.variable("dataStr", StringType)
ctx.addNewFunction(funcName,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: probably we can return this as inline, so we don't have to put it everywhere we use it

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can consider this in follow-ups.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this method dataToStringFunc() is not used in other files, it would be good to address it in this PR. WDYT?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Sounds good. Will update later.

@@ -155,6 +170,17 @@ object Block {

val CODE_BLOCK_BUFFER_LENGTH: Int = 512

/**
* A custom string interpolator which inlines all types of input arguments into a string without
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe this comment is better to be put to the InlineBlock class, do you agree?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think part of this comment can be moved to InlineBlock.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also do some improvement of document.

@SparkQA
Copy link

SparkQA commented Jun 14, 2018

Test build #91822 has finished for PR 21537 at commit b592e66.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 14, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 14, 2018

Test build #91831 has finished for PR 21537 at commit b592e66.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 15, 2018

Test build #91883 has finished for PR 21537 at commit a972e0e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 15, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 15, 2018

Test build #91893 has finished for PR 21537 at commit a972e0e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Member Author

viirya commented Jun 18, 2018

Is there any comments we should address in this PR?

Copy link
Contributor

@mgaido91 mgaido91 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have some minor comments/questions. Overall, I think it is ok, thanks for your work.


override def + (other: Block): Block = other match {
case c: CodeBlock => Blocks(Seq(this, c))
case i: InlineBlock => InlineBlock(block + i.block)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this mean that if we have to consecutive InlineBlocks we are concatenating them without any space? Is this ok?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we won't have too many cases of concatenating InlineBlock. You can see InlineBlock is mostly used to wrap a small piece of code like a java class name. I'm not sure if we need to add a newline here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, I don't think we should have an usage like:

val inline1 = InlineBlock(".....")
val inline2 = inline1 + InlineBlock(".....")
code"""
     | $inline2
     | ...
   """.stripMargin

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as of now probably we don't have... @cloud-fan @kiszk what do you think?

override def + (other: Block): Block = other match {
case c: CodeBlock => Blocks(Seq(this, c))
case i: InlineBlock => InlineBlock(block + i.block)
case b: Blocks => Blocks(Seq(this) ++ b.blocks)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this +: b.blocks?

Copy link
Member Author

@viirya viirya Jun 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussed with @cloud-fan, I think this abstraction Blocks is not necessary. Its usage can be replaced with CodeBlock. I will submit a PR to do that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we do that PR first? I feel it's easier to review this PR after we clean up the Block framework.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. I will do that PR first. Will ping you on the PR when it's ready.

/**
* Create an `InlineBlock` for Java Class name.
*/
def className(javaClass: Class[_]): InlineBlock = InlineBlock(javaClass.getName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure about the names of these 3 methods... They are doing kind of the same thing but this name is pretty different form the other 2... What do you think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is javaClass better for you?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about javaType as the method below? They return the same thing, just with a different input, don't they?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. Looks good to me.

@SparkQA
Copy link

SparkQA commented Jun 21, 2018

Test build #92167 has finished for PR 21537 at commit 4b90551.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


private[this] def castToBooleanCode(from: DataType): CastFunction = from match {
case StringType =>
val stringUtils = StringUtils.getClass.getName.stripSuffix("$")
val stringUtils = inline"${StringUtils.getClass.getName.stripSuffix("$")}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference between inline and code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline is just used as a wrapper for string, as we disallow silent string interpolation. We expand the content of an inline into string into a code block.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made inline not a Block but just a JavaCode now. After rethinking it, inline is used for embedding a piece of string into code block because we don't allow direct string interpolation. So inline as a Block doesn't make such sense.

@mgaido91
Copy link
Contributor

@gatorsmile I think this is just the partial adoption of what was proposed here and implemented in SPARK-24121 (#21193).

Despite I agree with you that we should have created a design doc for this, I think now is a bit late for it. Indeed the base API is already there, since the design has mostly been done in SPARK-24121 (and not in the scope of this PR). What is missing now, though, is a throughout adoption. But there are already many PRs which based on that API: SPARK-22856, SPARK-23951. And the current codebase already partially adopts it.

So I am not sure whether reverting this PR would be a solution: we should revert many if we want the change the current design.

I think a design can be created anyway, at least in order to formalize what was discussed and agreed in the various PRs and have a single source of information. We can also do some modifications to the current design according to the comments which could come out discussing the design doc, but I see them more like further changes than reverting this, as it would mean reverting many things.

The only last thing I'd suggest is that probably @viirya is the best one for creating the design doc, since the original proposal and implementation was done by him.

Thanks.

@gatorsmile
Copy link
Member

gatorsmile commented Aug 15, 2018

This is bad. The design needs to be carefully reviewed before implementing it. Basically, we breaks the basic principles of software engineering. It is very strange to write the design doc after introducing such a fundamental API change.

I think we need more expertise from the existing compiler internal experts. That is why I proposed to let @kiszk to lead this. We are not inventing anything new in our codegen framework.

@mgaido91
Copy link
Contributor

Sure, I am fine with @kiszk leading the effort of course, I just wanted to give credit to @viirya for the (current) design as it was mostly done by him.

@gatorsmile
Copy link
Member

This is not related to the credit or not. I think we need to introduce an IR like what the compiler is doing instead of continuous improvement on the existing one, which is already very hacky and hard to debug and maintain. We are wasting the efforts and resources if we follow the current direction.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Aug 15, 2018

I am sorry if I misunderstood. This was discussed sufficiently in the linked PRs. These are internal APIs and does not introduce behaviour changes like AnalysisBarrier which was still a good enough idea to try IMHO because it brought many benefits to reduce reanalyzing plans - this resolved many JIRA on the other hand, and I don't think AnalysisBarrier was particularly a disaster. Sometimes we should take a step forward and see if that causes an actual problem or not too. That was a good enough try and sufficiently reviewed and discussed at that time.

When something should be reverted, there should usually be specific concerns; otherwise, I wouldn't revert this for vague concerns for now.

This is an internal API that we can freely change later as well. Also, @viirya made a lot of efforts here and he's pretty active. I don't think dragging someone into this is a good idea for now.

@gatorsmile
Copy link
Member

AnalysisBarrier does not introduce a behavior change. However, this requires our analyzer rules must be idempotent. The most recent correctness bug also shows another big potential hole https://issues.apache.org/jira/browse/SPARK-25051. It could break Analyzer rules without notice, since AnalysisBarrier is a leaf node. Basically, this is a disaster for Spark 2.3 release. You can count how many regressions we found after the release. This is bad. Unfortunately, I made the merge without enough consideration. If possible, I would revert that PR, but it is too late.

Now, we are facing the similar issue again. We should stop continuing this direction without a proper design and review. Designing/enhancing the codegen framework requires more inputs from the experts in this area. I do not think the current design is well discussed, even if I saw some discussions in the initial PR. I am not the expert in this area. That is why I pinged @kiszk to drive this.

BTW, the internal APIs we can change easily, but the design need a review especially for the core components of Spark SQL.

@gatorsmile
Copy link
Member

I am fine to not revert it since it is too late. So many related PRs have been merged, but we need to seriously consider writing and reviewing the design docs before changing the code generation framework of Spark SQL.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Aug 15, 2018

I know and agree it ended up with many regressions and bugs (strictly I meant those by behaviour changes, which was my mistake). I have been following that and I am glad that it has been reverted by another way. But I am pretty sure AnalysisBarrier motivated another way to solve it and we found introducing AnalysisBarrier might be not a good idea after getting that in in practice.

We should of course have the sufficient discussion ahead but I mean it doesn't not necessarily mean a disaster. We just tried one way, found problem / better alternative and fixed that by another try. It shouldn't be frequent but I think it would've needed more time or never been fixed without the original try.

This PR is relatively less critical comparing to that, and I usually merge PRs if that looks less risky or similar cases are merged. It has been open more than one month and that's why I checked this one over a week. Ideally, it's good if we have sufficient discussion for every change but in practice we can't. I think we need more active committers or reviewers, rather then just leaving those PR open.

Also, we should consider revert #21193 too if this one is considered to be reverted. I wonder why this one was specially being considered to be reverted, and you say that in this PR.

@viirya and @mgaido91 have been taking looks into these areas as far as I can tell, and anyone could lead if they are willing to do that I believe. The works and efforts were mostly done by @viirya and in this case it looks reasonably it makes sense to ask him. It just looked to me weird because it sounds to me who's an expert and better.

@gatorsmile
Copy link
Member

To Spark users, introducing AnalysisBarrier is a disaster. However, to the developers of Spark internal, this is just a bug. If you served the customers who are heavily using Spark, you will understand what I am talking about. It is even hard to debug when the Spark jobs are very complex.

Normally, we never commit/merge any PR that is useless, especially when the PR changes are not tiny. Reverting this PRs are also very painful. That is why Reynold took a few days to finish it. It is not a fun job for him to rewrite it.

Based on the current work, I can expect there are hundreds of PRs that will be submitted for changing the codegen templates and polishing the current code. The reason why we did not merge this PR is that we are doubting this is the right thing to do. @rednaxelafx

I am not saying @viirya and @mgaido91 did a bad job to submit many PRs to improve the existing one. However, we need to think of the fundamental problems we are solving in the codegen. Instead of reinventing a compiler, how about letting the compiler internal expert (in our community, we have @kiszk) to lead the effort and offer a design for this. Coding and designing are different issues. If possible, we need to find the best person to drive it. If @viirya and @mgaido91 think they are familiar with compiler internal, I am also glad to see the designs.

@HyukjinKwon
Copy link
Member

HyukjinKwon commented Aug 16, 2018

Yea, yea. I understand. I wasn't trying to say the severity of effect by introducing AnalysisBarrier was trivial. True, I understand it is not an easy job. Thank you Reynold for that. Yea, also I don't mean to say we should just go ahead without sufficient discussion. Wanted to point out that there are positive aspects of the effort and try about AnalysisBarrier too. It wasn't all bad in a way.

The reason why we did not merge this PR is that we are doubting this is the right thing to do. @rednaxelafx

If that's true, the concerns should be mentioned here and discussed. Was there a discussion about it in the community and did I miss it? I would appreciate if we can talk here.

Instead of reinventing a compiler, how about letting the compiler internal expert (in our community, we have @kiszk) to lead the effort and offer a design for this.

If there is a design concern for that and better suggestion, let's file a JIRA. I want to see the problem, concerns and possible suggestions as well.

Yup, I got that it might be better for someone who has some expertise in that area but I was thinking that they should purely based upon the community work primarily - in that way, it looked reasonable @viirya goes ahead since it's basically his work. If not, to me I don't see any particular one is preferred.

Just wanted to point out that the baseline is open for anyone not for specific persons. If anyone is willing to do this, anyone is welcome to go ahead. So, primarily they should voluntarily join in without other factors.

@gatorsmile
Copy link
Member

We are fully swamped by the hotfix and regressions of 2.3 release and the new features that are targeting to 2.4. We should post some comments in this PR earlier.

Designing an IR for our codegen is the right thing we should do. [If you do not agree on this, we can discuss about it.] How to design an IR is a challenging task. The whole community is welcome to submit the designs and PRs. Everyone can show the ideas. The best idea will win. @HyukjinKwon If you have a bandwidth, please also have a try

@kiszk
Copy link
Member

kiszk commented Aug 16, 2018

Thank for involving me in an important thread. I was busy this morning in Japan.

I think there are three topics in the thread.

  1. Merge or revert this PR
  2. Design document
  3. IR design

For 1., in short, my opinion is likely to revert this PR from the view like a release manager.

As we know, it is a time to cut a release branch. This PR partially introduce a new representation. If there would be a bug in code generation at Spark 2.4, it may introduce a complication of debugging and fixing.
As @mgaido91 pointed out, #20043 and #21026 have been merged. I think that they are a kind of refactoring (e.g. change the representation of literal, class, and so on). Thus, these two PRs can be there. However, this PR introduces a mixture of representation s"" and code"".

WDYT?

@HyukjinKwon
Copy link
Member

I wouldn't revert this unless there are specific concerns about this. Do you see any bug by a mixture
of representation s"" and code""?

@HyukjinKwon
Copy link
Member

If there's a bug, then let's fix in another JIRA. If that's impossible to fix or sounds super risky and there's something I missed, let's revert.

@kiszk
Copy link
Member

kiszk commented Aug 16, 2018

For 2. and 3., it is harder to say my opinion in the comment. Let me say short comments at first.

For 2., if I remember correctly, @viirya once wrote the API document in a JIRA entry. it would be good to update and add some thoughts about design as a first step. I understand that it is hard to keep the up-to-date design document, in particular, during the open-source development. This is because we have a lot of excellent comments in a PR.

For 3., at the early implementation of SQL codegen (i.e. use s"" to represent code), I thought there are two problems.

  1. lack of type of an expression (in other words, ExprCode did not have the type of value)
  2. lack of a structure of statements

Now, we meet a problem that it is hard to rewrite a method argument due to problem 1. In my understanding, the current effort led by @viirya is trying to resolve problem 1.
It is hard to rewrite a set of statements due to Problem 2. To resolve problem 2, we need more effort. In my opinion, we are addressing them step by step.

Of course, it would be happy for me to co-lead a discussion of the IR design for 2.

@HyukjinKwon
Copy link
Member

Sounds cool, let's move both to JIRAs or mailing lists.

@mgaido91
Copy link
Contributor

Thanks everyone for this discussion. I think we all agree that what we need as a first thing is a design doc and we can move the discussion there.

@kiszk thank you for your comments. I remember too a document with the proposed API, but I didn't remember if it was a design doc or not and if there was an official agreement on it. Anyway, the document is this one: https://docs.google.com/document/d/1By_V-A2sxCWbP7dZ5EzHIuMSe8K0fQL9lqovGWXnsfs. Probably we can start from it for a proper SPIP?

I think the main goal of this is well described there, ie. allowing param replacement for splitting code in wholestage-codegen scenario.

Thanks everybody for this discussion.

@viirya
Copy link
Member Author

viirya commented Aug 16, 2018

Thanks for all involving in this discussion!

Sorry that I was on a long flight and feels too tired to reply now. I just want to say for now, no matter what decision we made, to continue to improve existing framework or have a IR design, I will definitely actively join in.

@gatorsmile
Copy link
Member

gatorsmile commented Aug 16, 2018

@HyukjinKwon I am worrying about the design of a mixture of representation s"" and code""? When the design is not good, it is hard to maintain it and add new code based on this. Let @cloud-fan decide whether we should revert it or not.

@mgaido91 The doc you posted is one of the problems the IR can resolve. However, we need a high-level and low-level design doc for building a new IR in codegen.

@kiszk Thanks for leading this effort! You are the best candidate to lead this in the community.

@viirya Thanks for your professional reply! Have a good rest.

A general suggestion. To avoid introducing the regressions, how about implementing a new one without changing the existing one? We can easily switch to the new one when it becomes stable. Thanks all for making our codegen better!

@kiszk
Copy link
Member

kiszk commented Aug 16, 2018

@gatorsmile Thank you for your reply. Could you elaborate on your suggestion?

A general suggestion. To avoid introducing the regressions, how about implementing a new one without changing the existing one? We can easily switch to the new one when it becomes stable.

Does it mean to work in a particular branch or to work in a fork repository until its implementation has been completed ?

@HyukjinKwon
Copy link
Member

I don't see a notable risk here. That's just to avoid string interpolation, which makes less error-prone, which is discussed already and the code change is small.

I hope we can move other discussions to other threads like JIRA or mailing list so that people can see. It's quite difficult to find such discussion to me actually.

@cloud-fan
Copy link
Contributor

It's a good point that we should have a design doc for this codegen infrastructure improvement, since it's very critical to Spark. And we should have it reviewed by the community.

There were some discussions on the PRs and JIRAs, but it didn't happen in the dev list. This is something we should do next.

At this stage, I think it's too late to revert anything related to the codegen improvement. There are so many codegen templates get touched and I think reverting is riskier.

But we should hold it now until the design doc is reviewed by the community in dev list.

@viirya
Copy link
Member Author

viirya commented Aug 17, 2018

If we will continue on improving current codegen framework, I think it is good to have a design doc reviewed by the community.

If we decide to have IR design and get rid of this string based framework, do we still need to have design doc for the current codegen improvement? Or we can focus on IR design doc?

@gatorsmile
Copy link
Member

@kiszk The initial prototype or proof of concept can be in any personal branch. When we merge it to the master branch, we still need to separate it from the current codegen and make it configurable. After the release, the users can choose which one to be used. When the new IR is stable, we can then consider deprecate the current one. This is majorly for product stability. We need to follow the similar principle for any big project.

@viirya @mgaido91 Let us first focus on the new IR design and prototype.

@viirya
Copy link
Member Author

viirya commented Aug 17, 2018

I think we can set up a place (mailling list or JIRA) to discuss the further thing about IR design, as suggested by @HyukjinKwon. This can be a co-work from interesting parties.

@gatorsmile
Copy link
Member

@kiszk Please create a JIRA and we can post more ideas there. Thanks!

@kiszk
Copy link
Member

kiszk commented Aug 24, 2018

Sure, let me create a JIRA. Also, it would be good to setup a place for discussion that can support multiple threads easily.

@HyukjinKwon
Copy link
Member

@kiszk, have we created a JIRA or sent an email to dev mailing list yet? or am I missing something already that happened?

@kiszk
Copy link
Member

kiszk commented Oct 14, 2018

@HyukjinKwon sorry for being late. I was swamped with multiple things. I have just submitted it. Looking forward to seeing feedback at https://issues.apache.org/jira/browse/SPARK-25728.

@viirya viirya deleted the SPARK-24505-1 branch December 27, 2023 18:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants