[SPARK-26745][SQL] Revert count optimization in JSON datasource by SPARK-24959 #23667

HyukjinKwon · 2019-01-28T04:51:55Z

What changes were proposed in this pull request?

This PR reverts the JSON count optimization part of #21909.

We cannot distinguish the cases below without parsing:

[{...}, {...}]

[]

{...}

# empty string

when we count(). A single line (input: IN) can yield 0 records, 1 record, or multiple records, depending on its content.
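To make the ambiguity concrete, here is a minimal plain-Python sketch (not Spark code) that counts the records contributed by each of the four inputs above. The helper name `records_in_line` and the rule "one record per element of a top-level array" are assumptions made for illustration.

```python
import json


def records_in_line(line: str) -> int:
    """Return how many records one input line contributes (assumed semantics)."""
    text = line.strip()
    if not text:
        return 0  # empty string: no record
    value = json.loads(text)  # the line must be parsed to know its shape
    if isinstance(value, list):
        return len(value)  # top-level array: one record per element
    return 1  # single object: one record


lines = ['[{"a": 1}, {"a": 2}]', '[]', '{"a": 3}', '']
per_line = [records_in_line(line) for line in lines]
print(per_line)       # [2, 0, 1, 0]
print(sum(per_line))  # 3 records, although there are 4 input lines
```

Counting lines would report 4 here, while the parsed count is 3, which is why a count that skips parsing cannot be trusted for this input.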

How was this patch tested?

Manually tested.

SparkQA · 2019-01-28T08:05:02Z

Test build #101745 has finished for PR 23667 at commit dd5b177.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-01-28T08:39:44Z

retest this please

SparkQA · 2019-01-28T12:47:22Z

Test build #101753 has finished for PR 23667 at commit dd5b177.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

It's a straight revert, right? Looks OK. I agree we need to undo it for now because of the correctness issue. And then merge #23602.

HyukjinKwon · 2019-01-30T02:51:35Z

Will get this in in a few days if there's no objection.

SparkQA · 2019-01-31T06:13:22Z

Test build #101926 has finished for PR 23667 at commit 466e3dd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2019-01-31T06:33:32Z

Merged to master.

I am going to open a backport soon.

cloud-fan · 2019-01-31T06:57:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/UnivocityParser.scala

+  } else {
+    // If `columnPruning` enabled and partition attributes scanned only,
+    // `schema` gets empty.
+    (_: String) => InternalRow.empty


It's too long ago and I can't remember the details. Does it mean we still have this count optimization for CSV? Does it work in multiline mode?

HyukjinKwon · 2019-01-31

Yes, it does for CSV when multiline is off. In multiline mode, it executes a different code path:

UnivocityParser.parseStream -> UnivocityParser.convert
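The shortcut in the diff above can be pictured with a small Python analogue. This is only a sketch of the idea, not the `UnivocityParser` code: `make_converter`, `header`, and `required` are hypothetical names, and the comma split ignores quoting and escaping.

```python
from typing import Callable, Dict, Sequence

Row = Dict[str, str]


def make_converter(header: Sequence[str], required: Sequence[str]) -> Callable[[str], Row]:
    """Build a per-line converter; skip tokenizing when no column is required."""
    if not required:
        # Nothing is read from the row, so return an empty row without parsing.
        return lambda _line: {}
    positions = [header.index(name) for name in required]

    def convert(line: str) -> Row:
        tokens = line.split(",")  # naive split: no quoting or escaping
        return {header[i]: tokens[i] for i in positions}

    return convert


header = ["id", "name"]
count_only = make_converter(header, [])
with_name = make_converter(header, ["name"])
rows = ["1,alice", "2,bob"]
print([count_only(r) for r in rows])  # [{}, {}]
print([with_name(r) for r in rows])   # [{'name': 'alice'}, {'name': 'bob'}]
```

The empty-schema branch emits one empty row per input line, so the resulting count is only right when every line holds exactly one record.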

… count

## What changes were proposed in this pull request?

This PR consists of the `test` components of #23665 only, minus the associated patch from that PR.

It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior.

This PR is intended to be deployed alongside #23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745).

## How was this patch tested?

Manual testing, existing `JsonSuite` unit tests.

Closes #23674 from sumitsu/json_emptyline_count_test.

Authored-by: Branden Smith <branden.smith@publicismedia.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
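The check described in that commit message can be sketched in plain Python as follows. It is not the `JsonSuite` test itself: every function name below is hypothetical, and `count_by_lines` merely stands in for a count that skips parsing.

```python
import json
from typing import Callable, List


def read_records(lines: List[str]) -> List[dict]:
    """Parse every line and collect the records it contains."""
    records: List[dict] = []
    for line in lines:
        text = line.strip()
        if not text:
            continue  # an empty line contributes no record
        value = json.loads(text)
        records.extend(value if isinstance(value, list) else [value])
    return records


def count_by_parsing(lines: List[str]) -> int:
    return len(read_records(lines))


def count_by_lines(lines: List[str]) -> int:
    return len(lines)  # stand-in for a count that skips parsing


def count_matches_read(lines: List[str], count_fn: Callable[[List[str]], int]) -> bool:
    counted = count_fn(lines)  # count first, before any data is read
    return counted == len(read_records(lines))


sample = ['{"a": 1}', '', '{"a": 2}', '']
print(count_matches_read(sample, count_by_parsing))  # True
print(count_matches_read(sample, count_by_lines))    # False: 4 lines, 2 records
```

Running the count before reading mirrors the ordering the commit message calls out, so a shortcut that only affects `count` cannot hide behind an earlier full parse.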

…ARK-24959

## What changes were proposed in this pull request?

This PR reverts JSON count optimization part of apache#21909.

We cannot distinguish the cases below without parsing:

```
[{...}, {...}]
```

```
[]
```

```
{...}
```

```bash
# empty string
```

when we `count()`. One line (input: IN) can be, 0 record, 1 record and multiple records and this is dependent on each input.

See also apache#23665 (comment).

## How was this patch tested?

Manually tested.

Closes apache#23667 from HyukjinKwon/revert-SPARK-24959.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

… count

This PR consists of the `test` components of #23665 only, minus the associated patch from that PR.

It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior.

This PR is intended to be deployed alongside #23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745).

Manual testing, existing `JsonSuite` unit tests.

Closes #23674 from sumitsu/json_emptyline_count_test.

Authored-by: Branden Smith <branden.smith@publicismedia.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 63bced9)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>

MaxGekk · 2020-07-07T07:09:42Z

Not only in JSON, in CSV as well.

HyukjinKwon mentioned this pull request Jan 28, 2019

[SPARK-26745][SQL] Skip empty lines in JSON-derived DataFrames when skipParsing optimization in effect #23665

Closed

sumitsu mentioned this pull request Jan 28, 2019

[SPARK-26745][SQL][TESTS] JsonSuite test case: empty line -> 0 record count #23674

Closed

srowen approved these changes Jan 29, 2019

Revert count optimization in JSON datasource by SPARK-24959

466e3dd

HyukjinKwon force-pushed the revert-SPARK-24959 branch from dd5b177 to 466e3dd on January 31, 2019 02:05

asfgit closed this in d4d6df2 Jan 31, 2019

cloud-fan reviewed Jan 31, 2019

HyukjinKwon deleted the revert-SPARK-24959 branch March 3, 2020 01:20
