[SPARK-10310] [SQL] Fixes script transformation field/line delimiters #8860

Conversation

liancheng
Contributor

Please attribute this PR to Zhichao Li <zhichao.li@intel.com>.

This PR is based on PR #8476 authored by @zhichao-li. It fixes SPARK-10310 by adding a field delimiter SerDe property to the default LazySimpleSerDe, and enabling default record reader/writer classes.

Currently, we only support LazySimpleSerDe, used together with TextRecordReader and TextRecordWriter, and don't support customizing record reader/writer using RECORDREADER/RECORDWRITER clauses. This should be addressed in separate PR(s).
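The default behavior described above, which this PR makes script transformation follow, is tab-delimited fields and newline-delimited records. As a rough illustration (a hypothetical Python sketch of what a `TextRecordWriter`/`TextRecordReader` pair does, not Spark's actual code), the round trip through the transformation script's stdin/stdout looks like:

```python
# Hypothetical sketch of TextRecordWriter/TextRecordReader-style framing:
# '\t' separates fields, '\n' terminates records (LazySimpleSerDe defaults).

FIELD_DELIM = "\t"
LINE_DELIM = "\n"

def write_records(rows):
    """Serialize rows the way the transformation script's stdin would see them."""
    return LINE_DELIM.join(
        FIELD_DELIM.join(str(f) for f in row) for row in rows
    ) + LINE_DELIM

def read_records(text):
    """Parse the script's stdout back into rows of string fields."""
    return [line.split(FIELD_DELIM) for line in text.split(LINE_DELIM) if line]

rows = [("1", "a"), ("2", "b")]
assert write_records(rows) == "1\ta\n2\tb\n"
assert read_records(write_records(rows)) == [["1", "a"], ["2", "b"]]
```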

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42803 has finished for PR 8860 at commit 7c4b03b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhichao-li
Contributor

@liancheng I guess there's still an issue: if the user specifies the SerDe explicitly as "LazySimpleSerDe", it would use Text.write to serialize, which would not match the default behavior of tab as the field delimiter and \n as the line delimiter.
Maybe we can add more support in another PR, since most users depend only on the default behavior, and this covers the majority of usages. :)

@liancheng liancheng force-pushed the spark-10310/fix-script-trans-delimiters branch 2 times, most recently from f86b5bc to 8d36775 Compare September 22, 2015 18:59
@liancheng liancheng force-pushed the spark-10310/fix-script-trans-delimiters branch from 8d36775 to 387ac72 Compare September 22, 2015 19:02
@liancheng
Contributor Author

@zhichao-li I further special-cased LazySimpleSerDe, so that we always use TextRecordReader/TextRecordWriter together with it, and users can customize the field delimiter now. Please check this test case.
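The customization mentioned above, making the field delimiter a SerDe property instead of a hard-coded tab, can be sketched as follows (a hypothetical Python illustration of the idea, not Spark's API; the `field_delim` parameter plays the role of LazySimpleSerDe's `field.delim` property):

```python
# Hypothetical illustration: the field delimiter becomes a property, so a
# value other than the default '\t' can be used when serializing rows for
# the transformation script.

def serialize(rows, field_delim="\t", line_delim="\n"):
    # field_delim stands in for LazySimpleSerDe's 'field.delim' property.
    return line_delim.join(field_delim.join(row) for row in rows) + line_delim

# Default behavior: tab-separated fields, newline-terminated records.
assert serialize([["1", "a"]]) == "1\ta\n"
# Customized field delimiter, e.g. '|':
assert serialize([["1", "a"]], field_delim="|") == "1|a\n"
```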

@yhuai
Contributor

yhuai commented Sep 22, 2015

test this please

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42846 has finished for PR 8860 at commit f86b5bc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42848 has finished for PR 8860 at commit 387ac72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 22, 2015

Test build #42855 has finished for PR 8860 at commit 387ac72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizer(JavaEstimator, HasInputCol, HasOutputCol):
    • class CountVectorizerModel(JavaModel):
    • s"Failed to convert value $v (class of $
    • s"Failed to convert value $v (class of $
    • case class Sort(

@yhuai
Contributor

yhuai commented Sep 23, 2015

@zhichao-li Can you try this PR?

@zhichao-li
Contributor

LGTM

@yhuai
Contributor

yhuai commented Sep 23, 2015

Thanks! Merging to master and branch 1.5.

@asfgit asfgit closed this in 84f81e0 Sep 23, 2015
asfgit pushed a commit that referenced this pull request Sep 23, 2015
**Please attribute this PR to `Zhichao Li <zhichao.li@intel.com>`.**

This PR is based on PR #8476 authored by zhichao-li. It fixes SPARK-10310 by adding a field delimiter SerDe property to the default `LazySimpleSerDe`, and enabling default record reader/writer classes.

Currently, we only support `LazySimpleSerDe`, used together with `TextRecordReader` and `TextRecordWriter`, and don't support customizing record reader/writer using `RECORDREADER`/`RECORDWRITER` clauses. This should be addressed in separate PR(s).

Author: Cheng Lian <lian@databricks.com>

Closes #8860 from liancheng/spark-10310/fix-script-trans-delimiters.

(cherry picked from commit 84f81e0)
Signed-off-by: Yin Huai <yhuai@databricks.com>
@liancheng liancheng deleted the spark-10310/fix-script-trans-delimiters branch September 24, 2015 00:06
ashangit pushed a commit to ashangit/spark that referenced this pull request Oct 19, 2016
(cherry picked from commit 73d0621)