
[SPARK-30289][SQL][TEST] Partitioned by Nested Column for InMemoryTable #26929


Closed
wants to merge 5 commits into apache:master from dbtsai:addTests

Conversation

dbtsai
Member

@dbtsai dbtsai commented Dec 18, 2019

What changes were proposed in this pull request?

  1. `InMemoryTable` was flattening the nested columns, and the flattened column names were then used to look up the field indices, which is not correct.

This PR implements partitioning by nested columns for `InMemoryTable`.

Why are the changes needed?

This PR implements partitioning by nested columns for `InMemoryTable`, so we can test this feature in DSv2.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests and new tests.

@dbtsai dbtsai added the SQL label Dec 18, 2019
@dbtsai dbtsai force-pushed the addTests branch 2 times, most recently from 39bdcf1 to 98fec47 Compare December 18, 2019 00:42
@dbtsai
Member Author

dbtsai commented Dec 18, 2019

@@ -59,8 +60,11 @@ class InMemoryTable(

def rows: Seq[InternalRow] = dataMap.values.flatMap(_.rows).toSeq

private val partFieldNames = partitioning.flatMap(_.references).toSeq.flatMap(_.fieldNames)
Member Author

The nested columns were flattened out here, and then we looked them up against the top-level columns, resulting in an IllegalArgumentException.
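To illustrate the problem, here is a minimal sketch (the schema and names are made up for illustration, not taken from the actual test table):

```scala
import org.apache.spark.sql.types._

// Table schema: a single top-level struct column `nested` with fields `id` and `data`.
val schema = new StructType()
  .add("nested", new StructType()
    .add("id", LongType)
    .add("data", StringType))

// For a table partitioned by nested.id, the transform's reference has
// fieldNames = Array("nested", "id"). Flattening all references into one flat
// Seq[String] loses the path structure:
val partFieldNames = Seq("nested", "id")

// Looking each part up as a top-level column then fails, because "id" is not a
// top-level field of the schema:
partFieldNames.map(schema.fieldIndex)  // throws IllegalArgumentException for "id"
```

Keeping each reference's full field-name path instead of a flat list of name parts avoids this bogus lookup.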

@SparkQA

SparkQA commented Dec 18, 2019

Test build #115475 has finished for PR 26929 at commit 98fec47.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Dec 18, 2019

I'll take a look at this tomorrow, but I think that not allowing partitioning by nested columns isn't the right solution. Iceberg can partition by columns nested in structs, but not columns inside lists or maps. Since it is just a logical grouping, I see no reason why it shouldn't be allowed. I think we just need to update the analysis check.

@@ -78,6 +78,8 @@ private[sql] case class V1Table(v1Table: CatalogTable) extends Table {
partitions += spec.asTransform
}

import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.TransformHelper
partitions.validatePartitionColumns()
Contributor

@rdblue rdblue Dec 18, 2019

I agree that v1 tables should throw an exception if nested columns are used because it wasn't supported in v1, but there is no need to disallow nested columns in all v2 sources.

Member Author

Digging into the v2 codebase a bit, I feel we currently need some work to actually get v2 sources to support nested columns in transforms. For example, in FileTable, override def partitioning: Array[Transform] is converted from PartitionSpec, which doesn't support nested columns at all. Since we are mixing v1 and v2 code here and there, do we have a plan to untangle them so it will be easier to extend nested column support in v2?

Contributor

FileTable doesn't support partitioning by nested columns, and that's okay because it is optional. Sources should just reject partitioning that is not supported.

@rdblue
Contributor

rdblue commented Dec 18, 2019

@dbtsai, is the intent of this to disable nested columns for just the sources that can't handle them (In-memory, file sources, and v1) or is it to disable them more broadly?

val t1 = s"${catalogAndNamespace}tbl"
withTable(t1) {
val e = intercept[IllegalArgumentException] {
sql(s"CREATE TABLE $t1 (nested struct<id:bigint, data:string>) " +
Contributor

We don't support CREATE TABLE USING fileSourceV2 yet; we only need to fix InMemoryTable.

@@ -39,6 +40,9 @@ abstract class FileTable(
userSpecifiedSchema: Option[StructType])
extends Table with SupportsRead with SupportsWrite {

// If `partitioning` contains nested columns, an `AnalysisException` will be thrown
partitioning.toSeq.validatePartitionColumns()
Contributor

The partitioning here is always inferred, so this check always passes.

Member

Yeah, it seems so. Even after we support passing the partitioning from TableProvider, the partitioning columns should already be validated in the catalog.

Member

@viirya viirya left a comment

SQL queries might look odd when partitioning by nested columns, e.g.,

For top-level columns we can write:
INSERT OVERWRITE TABLE table_with_partition PARTITION (p1='a', p2='b') SELECT 'blarr' FROM tmp_table

For nested columns it would be:
INSERT OVERWRITE TABLE table_with_partition PARTITION (p1='a', struct.sub1='b') SELECT 'blarr' FROM tmp_table

@dbtsai
Member Author

dbtsai commented Jan 14, 2020

I was on paternity leave, and sorry for the late reply.

@rdblue the intent of this was to disable partitioning by nested columns, because I saw some inconsistencies in the v2 codebase when I tried to create a new v2 filter API, and I thought partitioning by nested columns was not supported at all.

For example, in org.apache.spark.sql.catalyst.catalog.CatalogTable, we have partitionColumnNames: Seq[String] as partition columns instead of using NamedReference. How do we properly support using a nested column for partitioning? Are we parsing a string that contains . as a nested column? Can you give me an example?

@cloud-fan do you mean fix InMemoryTable so it supports it properly?

@rdblue
Contributor

rdblue commented Jan 15, 2020

@dbtsai, Iceberg supports partitioning by fields in structs. We think of structs as a logical grouping of columns because values are still 1-to-1 with the row.

Hive tables don't support partitioning by nested fields, which is why Hive tables should reject partition expressions that use nested fields. It's up to the catalog and table implementation to determine what is supported.

The parser will recognize any multi-part identifier as a column in a partition expression. It will split on . and also supports backticks for quoting, just like normal column and table identifiers.
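For example, the parser accepts statements of the following shape (a hedged sketch; the catalog, provider, and column names are assumptions, not taken from this PR):

```scala
// PARTITIONED BY accepts multi-part identifiers: point.x is split on the dot
// into the reference ["point", "x"], while backticks quote a name that itself
// contains a dot, so `a.b`.c refers to field c inside a top-level column named "a.b".
sql(
  """CREATE TABLE testcat.t (
    |  point STRUCT<x: DOUBLE, y: DOUBLE>,
    |  `a.b` STRUCT<c: INT>,
    |  data STRING)
    |USING foo
    |PARTITIONED BY (point.x, `a.b`.c)
    |""".stripMargin)
```

Whether such a transform is actually supported is then up to the catalog and table implementation, as described above.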

val (idTransforms, nonIdTransforms) = transforms.partition(_.isInstanceOf[IdentityTransform])

if (nonIdTransforms.nonEmpty) {
throw new AnalysisException("Transforms cannot be converted to partition columns: " +
nonIdTransforms.map(_.describe).mkString(", "))
Member Author

@dbtsai dbtsai Jan 16, 2020

@rdblue do we allow using a bucket transform as a partition column? It's not allowed in ResolveSessionCatalog.scala, but there is a test in DataFrameWriterV2Suite.scala, test("Create: partitioned by bucket(4, id)"), that exercises it.

Note that in that test there is a table property, "allow-unsupported-transforms", to enable it. What's the use case here?

Contributor

That table property is used to make the test table implementation accept configuration that it doesn't support when writing. It's used to test that the table was passed the right Transform, even though InMemoryTable only supports identity transforms.

ResolveSessionCatalog should convert bucket Transforms to and from BucketSpec.
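For reference, that DataFrameWriterV2Suite test looks roughly like the sketch below (written from memory, so treat the exact calls as assumptions; it presumes a SparkSession `spark`, a "source" view, and a testcat catalog backed by the in-memory implementation):

```scala
import org.apache.spark.sql.functions.{bucket, col}

// InMemoryTable only implements identity partitioning, so it would normally
// reject a bucket transform; the table property below tells the test table to
// accept it anyway, so the suite can verify that the right Transform reached
// the catalog.
spark.table("source")
  .writeTo("testcat.table_name")
  .tableProperty("allow-unsupported-transforms", "true")
  .partitionedBy(bucket(4, col("id")))
  .create()
```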

@cloud-fan
Contributor

I think Spark should support all kinds of PARTITION BY expressions as long as they can be translated to a v2 Transform. The catalog implementation should decide whether it supports them or not. For example, the Hive catalog doesn't support partitioning by nested columns.

For the particular test failure, I think we should fix InMemoryTable so that, when flattening the fields, we keep the full column path, not just the leaf name.
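That is what the updated patch does. A consolidated sketch of the idea, pulled out into a standalone helper for illustration (the class name and exact shape are mine, not the actual InMemoryTable code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType

// Hypothetical helper mirroring the approach in InMemoryTable.
class NestedPartitioning(schema: StructType, partitioning: Array[Transform]) {

  // Keep the full path of every partition reference instead of flattening the
  // name parts into one flat list.
  private val partCols: Array[Array[String]] =
    partitioning.flatMap(_.references).map(_.fieldNames)

  // Walk the path one level at a time to pull the partition value out of a row.
  private def extractor(fieldNames: Array[String], schema: StructType, row: InternalRow): Any = {
    val index = schema.fieldIndex(fieldNames(0))
    val value = row.get(index, schema(index).dataType)
    if (fieldNames.length > 1) {
      (value, schema(index).dataType) match {
        case (nestedRow: InternalRow, nestedSchema: StructType) =>
          extractor(fieldNames.drop(1), nestedSchema, nestedRow)
        case (_, dataType) =>
          throw new IllegalArgumentException(s"Unsupported type: ${dataType.simpleString}")
      }
    } else {
      value
    }
  }

  // The partition key of a row is the tuple of values at the referenced paths.
  def getKey(row: InternalRow): Seq[Any] =
    partCols.toSeq.map(fieldNames => extractor(fieldNames, schema, row))
}
```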

@dbtsai dbtsai changed the title [SPARK-30289][SQL] DSv2's partitioning should not accept nested columns [SPARK-30289][SQL] Partitioned by Nested Column for InMemoryTable Feb 7, 2020

@SparkQA

SparkQA commented Feb 11, 2020

Test build #118193 has finished for PR 26929 at commit cad1ab7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member Author

dbtsai commented Feb 13, 2020

Ping @cloud-fan @rdblue @dongjoon-hyun @viirya @gengliangwang again. I fixed InMemoryTable so it accepts nested columns as partition columns, with tests.

@rdblue
Contributor

rdblue commented Feb 14, 2020

+1

Thanks for updating tests, @dbtsai. This looks good to me and it's great to have cases for partitioning by nested fields.

value
}
}
partCols.map(filedNames => extractor(filedNames, schema, row))
Member

filedNames? fieldNames?

Member Author

Fixed. Thanks.

@viirya
Member

viirya commented Feb 14, 2020

Looks good and thanks for working on this.

@SparkQA

SparkQA commented Feb 14, 2020

Test build #118379 has finished for PR 26929 at commit 259d9d3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

if (fieldNames.length > 1) {
(value, schema(index).dataType) match {
case (row: InternalRow, nestedSchema: StructType) =>
extractor(fieldNames.slice(1, fieldNames.length), nestedSchema, row)
Contributor

nit: fieldNames.drop(1)

Member Author

Thanks. Addressed.

@SparkQA

SparkQA commented Feb 14, 2020

Test build #118408 has finished for PR 26929 at commit 21ebd26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2020

Test build #118411 has finished for PR 26929 at commit e2cd87f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai dbtsai closed this in d0f9614 Feb 14, 2020
dbtsai added a commit that referenced this pull request Feb 14, 2020

Closes #26929 from dbtsai/addTests.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
(cherry picked from commit d0f9614)
Signed-off-by: DB Tsai <d_tsai@apple.com>
@dbtsai
Member Author

dbtsai commented Feb 14, 2020

Thanks. Merged into master and branch-3.0.

@dbtsai dbtsai deleted the addTests branch February 18, 2020 23:40
@gatorsmile gatorsmile changed the title [SPARK-30289][SQL] Partitioned by Nested Column for InMemoryTable [SPARK-30289][TEST] Partitioned by Nested Column for InMemoryTable Feb 19, 2020
@gatorsmile gatorsmile changed the title [SPARK-30289][TEST] Partitioned by Nested Column for InMemoryTable [SPARK-30289][SQL][TEST] Partitioned by Nested Column for InMemoryTable Feb 19, 2020
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020

Closes apache#26929 from dbtsai/addTests.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>