Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-32106][SQL] Implement script transform in sql/core #29414

Closed
wants to merge 16 commits into from

Conversation

AngersZhuuuu
Copy link
Contributor

@AngersZhuuuu AngersZhuuuu commented Aug 12, 2020

What changes were proposed in this pull request?

  • Implement SparkScriptTransformationExec based on BaseScriptTransformationExec
  • Implement SparkScriptTransformationWriterThread based on BaseScriptTransformationWriterThread of writing data
  • Add rule SparkScripts to support convert script LogicalPlan to SparkPlan in Spark SQL (without hive mode)
  • Add SparkScriptTransformationSuite test spark spec case
  • add test in SQLQueryTestSuite

And we will close #29085 .

Why are the changes needed?

Support user use Script Transform without Hive

Does this PR introduce any user-facing change?

User can use Script Transformation without hive in no serde mode.
Such as :
**default no serde **

SELECT TRANSFORM(a, b, c)
USING 'cat' AS (a int, b string, c long)
FROM testData

no serde with spec ROW FORMAT DELIMITED

SELECT TRANSFORM(a, b, c)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'null'
USING 'cat' AS (a, b, c)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\u0004'
MAP KEYS TERMINATED BY '\u0005'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'NULL'
FROM testData

How was this patch tested?

Added UT

@AngersZhuuuu
Copy link
Contributor Author

FYI @maropu @cloud-fan

@SparkQA
Copy link

SparkQA commented Aug 12, 2020

Test build #127370 has finished for PR 29414 at commit b4c4b40.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class SparkScriptTransformationExec(
  • case class SparkScriptTransformationWriterThread(

@SparkQA
Copy link

SparkQA commented Aug 12, 2020

Test build #127373 has finished for PR 29414 at commit c58729b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'NULL'
USING 'cat' AS (a, b, c, d)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can ROW FORMAT DELIMETED with output schema work correctly?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

@AngersZhuuuu AngersZhuuuu Aug 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, could you add test cases for the parser, too? https://github.com/apache/spark/pull/29414/files#diff-36e2b29ae675caaa1fce16e74fbd8710R1135

Add some UT in transform.sql and fix origin wrong part

@SparkQA
Copy link

SparkQA commented Aug 13, 2020

Test build #127402 has finished for PR 29414 at commit fc5bbce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 13, 2020

Test build #127406 has finished for PR 29414 at commit fdfc987.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 24, 2020

Test build #127829 has finished for PR 29414 at commit dabae9b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Aug 24, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Aug 24, 2020

Test build #127834 has finished for PR 29414 at commit dabae9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu AngersZhuuuu changed the title [SPARK-32106][SQL] Implement script transform in sql/core [SPARK-32106][SQL][test-hadoop2.7][test-hive1.2] Implement script transform in sql/core Aug 24, 2020
@AngersZhuuuu
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Aug 24, 2020

Test build #127841 has finished for PR 29414 at commit dabae9b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

ping @maropu @HyukjinKwon @cloud-fan

@AngersZhuuuu
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 2, 2020

Test build #128204 has finished for PR 29414 at commit dabae9b.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Sep 2, 2020

Test build #128205 has finished for PR 29414 at commit dabae9b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

retest this please

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37371/

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37371/

@SparkQA
Copy link

SparkQA commented Dec 14, 2020

Test build #132769 has finished for PR 29414 at commit d3c65d4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

I am good with this change. I don't mind if somebody merges.

@SparkQA
Copy link

SparkQA commented Dec 15, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37396/

@SparkQA
Copy link

SparkQA commented Dec 15, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37396/

@SparkQA
Copy link

SparkQA commented Dec 15, 2020

Test build #132794 has finished for PR 29414 at commit a10c5a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AngersZhuuuu
Copy link
Contributor Author

Hope merge this and start next work

@AngersZhuuuu
Copy link
Contributor Author

gentle ping @maropu @cloud-fan

@alfozan
Copy link

alfozan commented Dec 17, 2020

gentle ping @tejasapatil @sameeragarwal

@maropu
Copy link
Member

maropu commented Dec 17, 2020

okay, I will merge this into master in a few days if nobody has more comments.

@maropu
Copy link
Member

maropu commented Dec 18, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37574/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37574/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132973 has finished for PR 29414 at commit a10c5a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Dec 18, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132972 has finished for PR 29414 at commit a10c5a6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37581/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37581/

@SparkQA
Copy link

SparkQA commented Dec 18, 2020

Test build #132980 has finished for PR 29414 at commit a10c5a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented Dec 22, 2020

okay, I believe we have much time to revisit this until the Spark v3.2.0 released, so I'll merge this into master for now so as to encourage @AngersZhuuuu 's following work. FYI: @HyukjinKwon @cloud-fan @dongjoon-hyun @gatorsmile

@maropu maropu closed this in 7466031 Dec 22, 2020
@maropu
Copy link
Member

maropu commented Dec 22, 2020

Merged to master.

@HyukjinKwon
Copy link
Member

Nice, I support your decision @maropu.

@dongjoon-hyun
Copy link
Member

Thank you for informing, @maropu !

struct<>
-- !query output
org.apache.spark.SparkException
Subprocess exited with status 127. Error: /bin/bash: some_non_existent_command: command not found
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to cause a test flakiness in GitHub Action. Could you take a look, @AngersZhuuuu and @maropu .

SQLQueryTestSuite.transform.sql
org.scalatest.exceptions.TestFailedException: transform.sql
Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2
SELECT TRANSFORM(a)
USING 'some_non_existent_command' AS (a)
FROM t

Copy link
Member

@dongjoon-hyun dongjoon-hyun Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@maropu maropu Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's too bad. Thanks for letting me know. I'll open a followup PR to fix it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for taking a look at that, @maropu .

Copy link
Member

@maropu maropu Dec 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review this: #30896

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review this: #30896

Thank you, @maropu
Yesterday I also found some different between UT and transform.sql. USING 'cat 1>&2' works well in SQLQuerySuite but won't work in transform.sql

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants