[SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] Parquet data source improvements #4308

Closed
liancheng wants to merge 14 commits into apache:master from liancheng:parquet-partition-discovery

Conversation

liancheng (Contributor) commented Feb 2, 2015

This PR adds three major improvements to Parquet data source:

  1. Partition discovery

    When reading Parquet files that reside in Hive-style partition directories, `ParquetRelation2` automatically discovers the partitioning information and infers partition column types.

    This is also partial work towards SPARK-5182, which aims to provide first-class partitioning support for the data source API. Related code in this PR can easily be extracted to the data source API level in future versions.

  2. Schema merging

    When enabled, the Parquet data source collects schema information from all Parquet part-files and tries to merge them. An exception is thrown when incompatible schemas are detected. This feature is controlled by the data source option `parquet.mergeSchema` and is enabled by default.

  3. Metastore Parquet table conversion moved to analysis phase

    This greatly simplifies the conversion logic. The `ParquetConversion` strategy can be dropped once the old Parquet implementation is removed.

This version of the Parquet data source aims to entirely replace the old Parquet implementation. However, the old version hasn't been removed yet; users can fall back to it by turning off the SQL configuration `spark.sql.parquet.useDataSourceApi`.
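
To make the first two improvements concrete, here is a minimal sketch of how they surface to users. The directory layout, paths, and column names are hypothetical, and it assumes the `parquetFile` API discussed later in this thread:

```scala
// A hypothetical Hive-style partitioned layout that partition discovery handles:
//
//   /data/events/year=2015/month=1/part-00000.parquet
//   /data/events/year=2015/month=2/part-00000.parquet
//
// Pointing the data source at the root directory discovers the partition
// columns and infers their types (integers here). With schema merging
// enabled (the default), the schemas of all part-files are merged into
// the final schema as well.
val events = sqlContext.parquetFile("/data/events")
events.printSchema()  // includes the inferred `year` and `month` columns
```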

Other JIRA tickets fixed as side effects in this PR:

  • SPARK-5509: `EqualTo` now uses a proper `Ordering` to compare binary types.
  • SPARK-3575: The Metastore schema is now preserved and passed to `ParquetRelation2` via the data source option `parquet.metastoreSchema`.

TODO:

  • More test cases for partition discovery

  • Fix write path after data source write support ([SPARK-5501][SPARK-5420][SQL] Write support for the data source API #4294) is merged

    It turned out to be non-trivial to fall back to the old Parquet implementation on the write path when the Parquet data source is enabled. Since we're planning to include data source write support in 1.3.0, I simply ignored the two test cases involving Parquet insertion for now.
    
  • Fix outdated comments and documentation

PS: This PR looks big, but more than half of the changed lines are trivial changes to test cases. To test Parquet both with and without the new data source, almost all Parquet test cases are moved into wrapper driver functions, which alone introduces hundreds of lines of changes (see the sketch below).
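
A hypothetical sketch of that wrapper-driver pattern (the helper name and structure are made up; only the `spark.sql.parquet.useDataSourceApi` key comes from this PR):

```scala
// Run the same test body twice: once with the new Parquet data source
// enabled, once with the old implementation.
def withParquetDataSource(enabled: Boolean)(body: => Unit): Unit = {
  val key = "spark.sql.parquet.useDataSourceApi"
  val original = sqlContext.getConf(key, "true")
  sqlContext.setConf(key, enabled.toString)
  try body finally sqlContext.setConf(key, original)
}

Seq(true, false).foreach { useDataSource =>
  withParquetDataSource(useDataSource) {
    // ... shared Parquet test assertions ...
  }
}
```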

if (r == null) null
else if (left.dataType != BinaryType) l == r
else BinaryType.ordering.compare(
  l.asInstanceOf[Array[Byte]], r.asInstanceOf[Array[Byte]]) == 0
Contributor Author

This fixes SPARK-5509. Hit this bug while testing Parquet filters for new data source implementation.

Contributor

btw this is really expensive. I'd use something like this: http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/primitives/UnsignedBytes.html

If you don't want to change it as part of this PR, file a jira ticket to track it.

Contributor Author

Filed SPARK-5553 to track this. I'd like to make sure equality comparison for binary types works properly in this PR. Also, we're already using `Ordering` to compare binary values in `LessThan`, `GreaterThan`, etc., so at least this isn't a performance regression.
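
For context, a sketch of the unsigned lexicographic comparison that the suggested Guava comparator (`UnsignedBytes.lexicographicalComparator()`) performs; illustrative only, not code from this PR:

```scala
def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int = {
  val len = math.min(a.length, b.length)
  var i = 0
  while (i < len) {
    // JVM bytes are signed, so mask to 0..255 to compare as unsigned values.
    val cmp = (a(i) & 0xff) - (b(i) & 0xff)
    if (cmp != 0) return cmp
    i += 1
  }
  // On a shared prefix, the shorter array sorts first.
  a.length - b.length
}
```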


SparkQA commented Feb 2, 2015

Test build #26514 has finished for PR 4308 at commit af3683e.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with SchemaRelationProvider

@liancheng changed the title [SPARK-5182] [SPARK-5528] [SQL] WIP: Parquet data source improvements → [SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] WIP: Parquet data source improvements on Feb 2, 2015

SparkQA commented Feb 2, 2015

Test build #26537 has finished for PR 4308 at commit 0277e47.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with SchemaRelationProvider


SparkQA commented Feb 3, 2015

Test build #26562 has finished for PR 4308 at commit 87689d5.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with SchemaRelationProvider


SparkQA commented Feb 3, 2015

Test build #26572 has finished for PR 4308 at commit 170a0f8.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with SchemaRelationProvider

@liancheng force-pushed the parquet-partition-discovery branch 3 times, most recently from 1b11851 to 07599a7, on February 3, 2015 at 03:53

SparkQA commented Feb 3, 2015

Test build #26595 has finished for PR 4308 at commit 1b11851.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class KafkaUtils(object):
    • trait Column extends DataFrame with ExpressionApi
    • class ColumnName(name: String) extends IncomputableColumn(name)
    • trait DataFrame extends DataFrameSpecificApi with RDDApi[Row]
    • class GroupedDataFrame protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expression])
    • protected[sql] class QueryExecution(val logical: LogicalPlan)
    • class DefaultSource extends RelationProvider with SchemaRelationProvider


SparkQA commented Feb 3, 2015

Test build #26596 has finished for PR 4308 at commit 07599a7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KafkaUtils(object):
    • public class JDBCUtils
    • trait Column extends DataFrame with ExpressionApi
    • class ColumnName(name: String) extends IncomputableColumn(name)
    • trait DataFrame extends DataFrameSpecificApi with RDDApi[Row]
    • class GroupedDataFrame protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expression])
    • protected[sql] class QueryExecution(val logical: LogicalPlan)
    • logWarning(s"Couldn't find class $driver", e);
    • implicit class JDBCDataFrame(rdd: DataFrame)
    • class DefaultSource extends RelationProvider with SchemaRelationProvider


SparkQA commented Feb 3, 2015

Test build #26591 has finished for PR 4308 at commit a760555.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with SchemaRelationProvider

def mergeCatalystSchemas(left: StructType, right: StructType): StructType =
  mergeCatalystDataTypes(left, right).asInstanceOf[StructType]

def mergeCatalystDataTypes(left: DataType, right: DataType): DataType =
Contributor

would be great to add more comments explaining what's going on

Contributor

also, should this live in Catalyst? Seems generally useful.

Contributor Author

Yeah, will move it to Catalyst in follow-up PRs.
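
For readers following along, a minimal sketch of the recursive merge such a helper performs, assuming structs merge field by field and conflicting leaf types raise an error (types are the 1.3-era `org.apache.spark.sql.types`; this is not the PR's actual implementation):

```scala
import org.apache.spark.sql.types._

def mergeDataTypes(left: DataType, right: DataType): DataType = (left, right) match {
  case (StructType(leftFields), StructType(rightFields)) =>
    // Fields present on both sides must merge; unique fields carry over as-is.
    val rightByName = rightFields.map(f => f.name -> f).toMap
    val merged = leftFields.map { lf =>
      rightByName.get(lf.name).fold(lf) { rf =>
        StructField(lf.name, mergeDataTypes(lf.dataType, rf.dataType), lf.nullable || rf.nullable)
      }
    }
    val leftNames = leftFields.map(_.name).toSet
    StructType(merged ++ rightFields.filterNot(f => leftNames(f.name)))
  case (ArrayType(leftElement, lNull), ArrayType(rightElement, rNull)) =>
    ArrayType(mergeDataTypes(leftElement, rightElement), lNull || rNull)
  case (l, r) if l == r => l
  case (l, r) => sys.error(s"Failed to merge incompatible data types $l and $r")
}
```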


SparkQA commented Feb 3, 2015

Test build #26601 has finished for PR 4308 at commit bcb3ad6.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class KafkaUtils(object):
    • public class JDBCUtils
    • trait Column extends DataFrame with ExpressionApi
    • class ColumnName(name: String) extends IncomputableColumn(name)
    • trait DataFrame extends DataFrameSpecificApi with RDDApi[Row]
    • class GroupedDataFrame protected[sql](df: DataFrameImpl, groupingExprs: Seq[Expression])
    • protected[sql] class QueryExecution(val logical: LogicalPlan)
    • logWarning(s"Couldn't find class $driver", e);
    • implicit class JDBCDataFrame(rdd: DataFrame)
    • class DefaultSource extends RelationProvider with SchemaRelationProvider

def parquetFile(path: String): DataFrame =
  DataFrame(this, parquet.ParquetRelation(path, Some(sparkContext.hadoopConfiguration), this))

@scala.annotation.varargs
def parquetFile(paths: String*): DataFrame =
Contributor

as commented on the other PR, use

def parquetFile(path: String, paths: String*): DataFrame

to make sure this is not ambiguous if we later add another varargs overload

Contributor Author

Thanks. Makes sense.

Contributor

Are we actually ever going to do that for this function? This makes it harder to do something like parquetFile(listOfFiles: _*), which I think is actually a common use case.

Contributor

i'll add that if we ever do overload this, we can do the disambiguation then.

Contributor

okay and i convinced @rxin too :)
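
A sketch of the trade-off settled above (the `files` value is hypothetical): with a plain varargs signature a collection splats directly, while the `(path, paths*)` form forces callers to split it.

```scala
val files = Seq("/data/a.parquet", "/data/b.parquet")

// With def parquetFile(paths: String*), splatting a collection is direct:
sqlContext.parquetFile(files: _*)

// With def parquetFile(path: String, paths: String*), callers must split it:
sqlContext.parquetFile(files.head, files.tail: _*)
```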

@liancheng force-pushed the parquet-partition-discovery branch from bcb3ad6 to 5584e24 on February 3, 2015 at 19:44

SparkQA commented Feb 3, 2015

Test build #26667 has finished for PR 4308 at commit 5584e24.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider with SchemaRelationProvider

@liancheng force-pushed the parquet-partition-discovery branch from 5584e24 to ae1ee78 on February 4, 2015 at 08:04

SparkQA commented Feb 4, 2015

Test build #26734 has finished for PR 4308 at commit ae1ee78.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor Author

retest this please

@liancheng
Contributor Author

The last build failure was caused by a flaky ML test case, which is now fixed in master.


SparkQA commented Feb 4, 2015

Test build #26769 has finished for PR 4308 at commit ae1ee78.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, String]
    • trait CreatableRelationProvider

@liancheng
Contributor Author

retest this please.

The last build failure reports that isFile and isRoot are not members of org.apache.hadoop.fs.FileStatus, which doesn't make sense (the pull request builder uses Hadoop 2.3.0, and these methods are definitely defined in FileStatus).

@liancheng changed the title [SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] WIP: Parquet data source improvements → [SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] Parquet data source improvements on Feb 4, 2015

object ParquetRelation2 {
  // Whether we should merge schemas collected from all Parquet part-files.
  val MERGE_SCHEMA = "parquet.mergeSchema"
Contributor

why prefix these with parquet? that seems redundant since you can only use them after specifying USING org.apache.spark.sql.parquet

Contributor

should we also have an option to turn off caching?

Contributor Author

Thanks. Will address these in follow-up PR(s).
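
For illustration, a hedged sketch of how such an option would be passed through the data source API, assuming the 1.3-era `SQLContext.load(source, options)` entry point (the path is made up):

```scala
// Load Parquet data through the data source API with schema merging disabled.
// "parquet.mergeSchema" is the MERGE_SCHEMA key defined above.
val df = sqlContext.load(
  "org.apache.spark.sql.parquet",
  Map(
    "path" -> "/data/events",
    "parquet.mergeSchema" -> "false"))
```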

@liancheng force-pushed the parquet-partition-discovery branch from 209f324 to 1ad361e on February 5, 2015 at 20:07
@liancheng
Contributor Author

Rebased (for the 8th time during the last 72 hours); should be ready to go once Jenkins nods. Will address comments in follow-up PRs.

@liancheng force-pushed the parquet-partition-discovery branch from 1ad361e to b6946e6 on February 5, 2015 at 21:31
@liancheng
Contributor Author

OK, rebased for the 9th time... Addressed all comments except for adding an option to disable metadata caching, which I'd like to include in another PR.


SparkQA commented Feb 5, 2015

Test build #26856 has finished for PR 4308 at commit 1ad361e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, String]
    • trait CreatableRelationProvider


SparkQA commented Feb 5, 2015

Test build #26858 has finished for PR 4308 at commit b6946e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • protected class CaseInsensitiveMap(map: Map[String, String]) extends Map[String, String]
    • trait CreatableRelationProvider

asfgit pushed a commit that referenced this pull request Feb 5, 2015
[SPARK-5182] [SPARK-5528] [SPARK-5509] [SPARK-3575] [SQL] Parquet data source improvements

Author: Cheng Lian <lian@databricks.com>

Closes #4308 from liancheng/parquet-partition-discovery and squashes the following commits:

b6946e6 [Cheng Lian] Fixes MiMA issues, addresses comments
8232e17 [Cheng Lian] Write support for Parquet data source
a49bd28 [Cheng Lian] Fixes spelling typo in trait name "CreateableRelationProvider"
808380f [Cheng Lian] Fixes issues introduced while rebasing
50dd8d1 [Cheng Lian] Addresses @rxin's comment, fixes UDT schema merging
adf2aae [Cheng Lian] Fixes compilation error introduced while rebasing
4e0175f [Cheng Lian] Fixes Python Parquet API, we need Py4J array to call varargs method
0d8ec1d [Cheng Lian] Adds more test cases
b35c8c6 [Cheng Lian] Fixes some typos and outdated comments
dd704fd [Cheng Lian] Fixes Python Parquet API
596c312 [Cheng Lian] Uses switch to control whether use Parquet data source or not
7d0f7a2 [Cheng Lian] Fixes Metastore Parquet table conversion
a1896c7 [Cheng Lian] Fixes all existing Parquet test suites except for ParquetMetastoreSuite
5654c9d [Cheng Lian] Draft version of Parquet partition discovery and schema merging

(cherry picked from commit a9ed511)
Signed-off-by: Michael Armbrust <michael@databricks.com>
@asfgit closed this in a9ed511 on Feb 5, 2015
@liancheng deleted the parquet-partition-discovery branch on February 5, 2015 at 23:47
/**
* Converts a string to a `Literal` with automatic type inference. Currently only supports
* [[IntegerType]], [[LongType]], [[FloatType]], [[DoubleType]], [[DecimalType.Unlimited]], and
* [[StringType]].

Would it be reasonable to support DateType and then fall back to StringType? In my experience, breaking data down by a date partition is pretty common and useful. My thinking is that if you see a string in the format YYYY-MM-DD (a format I'd personally recommend since it sorts alphabetically, though it doesn't have to be that one), you can probably safely assume the partition is intended to be a date.

I'm not super familiar with this code though, so I'll have to defer to others' expertise.

Contributor Author

Good point. Trying DateType before StringType makes sense. I can add this. Thanks for the suggestion!
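
A minimal sketch of the try-in-order inference the doc comment describes, extended with the date fallback suggested above (illustrative only; the real code builds Catalyst `Literal`s rather than plain values, and the helper name is made up):

```scala
import java.sql.Date
import scala.util.Try

// Try increasingly general types, falling back to a plain string when
// nothing else parses. Date is tried just before String, as suggested.
def inferPartitionColumnValue(raw: String): Any =
  Try(raw.toInt)
    .orElse(Try(raw.toLong))
    .orElse(Try(raw.toDouble))
    .orElse(Try(BigDecimal(raw)))
    .orElse(Try(Date.valueOf(raw)))  // accepts the yyyy-MM-dd format
    .getOrElse(raw)
```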
