[SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to String #6218

JoshRosen · 2015-05-17T08:56:05Z

In DataFrame.describe(), the count aggregate produces an integer, the avg and stdev aggregates produce doubles, and min and max aggregates can produce varying types depending on what type of column they're applied to. As a result, we should cast all aggregate results to String so that describe()'s output types match its declared output schema.
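
To illustrate the idea, here is a minimal pure-Python sketch, not Spark code. The function name `describe_column` and the use of a population standard deviation are assumptions made for the example. It shows how aggregates that naturally come out as different types can all be rendered as strings so that every value fits a single all-string output schema.

```python
import math
from typing import Dict, Sequence


def describe_column(values: Sequence[float]) -> Dict[str, str]:
    """Hypothetical stand-in for describe() on one numeric column.

    Assumes `values` is non-empty and contains no nulls.
    """
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation; chosen only for illustration.
    stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / n)

    # The raw aggregates have mixed types: int, float, float, and
    # whatever type the column itself holds for min and max.
    raw = {
        "count": n,
        "mean": mean,
        "stddev": stddev,
        "min": min(values),
        "max": max(values),
    }

    # Cast every aggregate to a string so the result matches a schema
    # in which every statistic is declared as a string.
    return {name: str(value) for name, value in raw.items()}


if __name__ == "__main__":
    for name, value in describe_column([16, 30, 19]).items():
        print(f"{name}: {value!r}")
```

Without the final cast, a consumer that trusts the declared schema and reads each field as a string would hit a type error on the first integer or double it meets.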

AmplabJenkins · 2015-05-17T08:57:09Z

Merged build triggered.

AmplabJenkins · 2015-05-17T08:57:16Z

Merged build started.

JoshRosen · 2015-05-17T08:57:42Z

I discovered this issue while extending CatalystTypeConverters to convert UnsafeRows into Scala Rows. My new converter is stricter about types, which caused a test failure that exposed this bug.

SparkQA · 2015-05-17T08:59:21Z

Test build #32934 has started for PR 6218 at commit 696206c.

rxin · 2015-05-17T09:13:43Z

LGTM

SparkQA · 2015-05-17T11:29:22Z

Test build #32934 timed out for PR 6218 at commit 696206c after a configured wait of 150m.

AmplabJenkins · 2015-05-17T11:29:28Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-05-17T11:29:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32934/
Test FAILed.

JoshRosen · 2015-05-17T14:54:03Z

Jenkins, retest this please.

AmplabJenkins · 2015-05-17T14:57:10Z

Merged build triggered.

AmplabJenkins · 2015-05-17T14:57:18Z

Merged build started.

SparkQA · 2015-05-17T14:59:06Z

Test build #32940 has started for PR 6218 at commit 696206c.

JoshRosen · 2015-05-17T19:52:20Z

Jenkins, retest this please.

AmplabJenkins · 2015-05-17T19:58:22Z

Merged build started.

SparkQA · 2015-05-17T20:00:05Z

Test build #32944 has started for PR 6218 at commit 696206c.

SparkQA · 2015-05-17T20:05:25Z

Test build #32944 has finished for PR 6218 at commit 696206c.

This patch fails Python style tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-17T20:06:29Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-05-17T20:06:29Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32944/
Test FAILed.

rxin · 2015-05-17T21:30:00Z

Jenkins, retest this please.

AmplabJenkins · 2015-05-17T21:32:10Z

Merged build triggered.

AmplabJenkins · 2015-05-17T21:32:19Z

Merged build started.

SparkQA · 2015-05-17T22:59:00Z

Test build #817 has started for PR 6218 at commit 696206c.

JoshRosen · 2015-05-18T00:02:39Z

Jenkins, retest this please.

AmplabJenkins · 2015-05-18T00:07:11Z

Merged build triggered.

AmplabJenkins · 2015-05-18T00:07:17Z

Merged build started.

SparkQA · 2015-05-18T00:11:32Z

Test build #32954 has started for PR 6218 at commit 696206c.

SparkQA · 2015-05-18T02:34:15Z

Test build #32954 has finished for PR 6218 at commit 696206c.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-18T02:34:21Z

Merged build finished. Test FAILed.

davies · 2015-05-18T20:58:51Z

This one is different; it may be related to the format of a double value (the number of trailing zeros).

shivaram · 2015-05-18T21:00:32Z

We use testthat https://github.com/hadley/testthat to do assertions. I am not sure if it supports ScalaTest-like error messages.

JoshRosen · 2015-05-18T21:05:38Z

Ah, looks like we can use expect_equal instead of expect_true: http://r-pkgs.had.co.nz/tests.html.

The problem here was an extra decimal point:

2. Failure(@test_sparkSQL.R#765): describe() on a DataFrame --------------------
collect(stats)[5, "age"] not equal to "30.0"
1 string mismatches:
x[1]: "30.0"
y[1]: "30"
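
A small Python sketch of why the two strings differ (an illustration only; the helper names are hypothetical and this is not the SparkR test code): formatting the same value as a double or as an integer yields strings that are numerically equal but fail an exact string comparison.

```python
def matches_exactly(actual: str, expected: str) -> bool:
    """Exact string comparison, as a string-equality assertion would do."""
    return actual == expected


def matches_numerically(actual: str, expected: str) -> bool:
    """Parse both sides as numbers first, so trailing zeros are ignored."""
    return float(actual) == float(expected)


# The same maximum age rendered from a double and from an integer.
as_double = str(float(30))
as_int = str(30)

if __name__ == "__main__":
    print(repr(as_double), repr(as_int))
    print("exact match:", matches_exactly(as_double, as_int))
    print("numeric match:", matches_numerically(as_double, as_int))
```

Since describe() returns strings, the expected value in the test has to use the same rendering as the cast that produced it.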

shivaram · 2015-05-18T21:09:12Z

Ah - good catch! You can also file a JIRA to audit the rest of the test cases to use expect_equal.

AmplabJenkins · 2015-05-18T21:12:13Z

Merged build triggered.

AmplabJenkins · 2015-05-18T21:12:22Z

Merged build started.

SparkQA · 2015-05-18T21:14:17Z

Test build #33020 has started for PR 6218 at commit 146b615.

JoshRosen · 2015-05-18T21:20:01Z

@shivaram I've filed https://issues.apache.org/jira/browse/SPARK-7714 for the SparkR test improvements.

SparkQA · 2015-05-18T23:03:18Z

Test build #33020 has finished for PR 6218 at commit 146b615.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-18T23:03:23Z

Merged build finished. Test FAILed.

AmplabJenkins · 2015-05-18T23:03:23Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33020/
Test FAILed.

JoshRosen · 2015-05-18T23:04:18Z

Test failures appear to be unrelated ORC failures, although that means that R didn't get a chance to run. I think that this has been fixed in master, so let's try another round of testing.

JoshRosen · 2015-05-18T23:04:21Z

Jenkins, retest this please.

AmplabJenkins · 2015-05-18T23:07:13Z

Merged build triggered.

AmplabJenkins · 2015-05-18T23:07:22Z

Merged build started.

SparkQA · 2015-05-18T23:09:22Z

Test build #33027 has started for PR 6218 at commit 146b615.

SparkQA · 2015-05-19T01:30:57Z

Test build #33027 has finished for PR 6218 at commit 146b615.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2015-05-19T01:31:02Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-19T01:31:02Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/33027/
Test PASSed.

… String

In `DataFrame.describe()`, the `count` aggregate produces an integer, the `avg` and `stdev` aggregates produce doubles, and `min` and `max` aggregates can produce varying types depending on what type of column they're applied to. As a result, we should cast all aggregate results to String so that `describe()`'s output types match its declared output schema.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6218 from JoshRosen/SPARK-7687 and squashes the following commits:

146b615 [Josh Rosen] Fix R test.
2974bd5 [Josh Rosen] Cast to string type instead
f206580 [Josh Rosen] Cast to double to fix SPARK-7687
307ecbf [Josh Rosen] Add failing regression test for SPARK-7687

(cherry picked from commit c9fa870)
Signed-off-by: Reynold Xin <rxin@databricks.com>

…c row accessors

This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features. At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConverter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods.

This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`.

The stricter use of types here has uncovered some bugs in other parts of Spark SQL:

- #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies
- #6218: DataFrame.describe() should cast all aggregates to String
- #6400: Use output schema, not relation schema, for data source input conversion

Spark SQL currently has undefined behavior when you try to create a DataFrame from user-specified rows whose values don't match the declared schema. According to the `createDataFrame()` Scaladoc:

> It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception.

Given this, it sounds like it's technically not a break of our API contract to fail fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hashingTF` passes a column of integer values for a "label" column which is supposed to contain floats. This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter.

In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, but this can be okay due to numeric conversions. In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows. Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch. Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits:

740341b [Josh Rosen] Optimize method dispatch for primitive type conversions
befc613 [Josh Rosen] Add tests to document Option-handling behavior.
5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite
6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it
3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first
6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException
677ff27 [Josh Rosen] Fix null handling bug; add tests.
8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator.
85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite
9c0e4e1 [Josh Rosen] Remove last use of convertToScala().
ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions.
7ca7fcb [Josh Rosen] Comments and cleanup
1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
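
The fail-fast behavior described above can be sketched in plain Python. This is a simplified illustration, not the Catalyst implementation: `to_internal_row` and the `(name, type)` schema representation are hypothetical, and Python's `TypeError` stands in for the JVM's `ClassCastException`.

```python
from typing import Any, Sequence, Tuple

Schema = Sequence[Tuple[str, type]]


def to_internal_row(row: Sequence[Any], schema: Schema) -> tuple:
    """Hypothetical inbound converter that rejects mismatched types eagerly."""
    if len(row) != len(schema):
        raise ValueError(
            f"row has {len(row)} fields but schema declares {len(schema)}"
        )
    for value, (name, expected) in zip(row, schema):
        # Assumption for this sketch: None models a SQL null and is
        # accepted for any declared type.
        if value is not None and type(value) is not expected:
            raise TypeError(
                f"column {name!r}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
    return tuple(row)


if __name__ == "__main__":
    schema = [("label", float), ("sentence", str)]
    print(to_internal_row((0.0, "Hi I heard about Spark"), schema))
    try:
        # An integer label where the schema declares a float.
        to_internal_row((0, "Hi I heard about Spark"), schema)
    except TypeError as err:
        print("rejected:", err)
```

Checking at conversion time surfaces the mismatch where the bad row enters, rather than later in whichever operator first reads the field.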


JoshRosen changed the title ~~[SPARK-7687] DataFrame.describe() should cast all aggregates to doubles~~ [SPARK-7687] [SQL] DataFrame.describe() should cast all aggregates to doubles May 17, 2015

JoshRosen mentioned this pull request May 17, 2015

[SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row accessors #6222

Closed

Fix R test.

146b615

JoshRosen force-pushed the SPARK-7687 branch from 5946da9 to 146b615 Compare May 18, 2015 21:08

asfgit closed this in c9fa870 May 19, 2015

JoshRosen deleted the SPARK-7687 branch May 24, 2015 04:57
