[SPARK-5684][SQL]: Pass in partition name along with location information, as the location can be different (that is may not contain the partition keys) #4469
Conversation
@@ -362,7 +362,7 @@ case object BooleanType extends NativeType with PrimitiveType {
  * @group dataType
  */
 @DeveloperApi
-case object TimestampType extends NativeType {
+case object TimestampType extends NativeType with PrimitiveType {
This change is needed because, when a table is partitioned on a timestamp-type column, the Parquet iterator returns a GenericRow due to this check in ParquetTypes.scala:
def isPrimitiveType(ctype: DataType): Boolean =
classOf[PrimitiveType] isAssignableFrom ctype.getClass
and in ParquetConverter.scala we have:
protected[parquet] def createRootConverter(
    parquetSchema: MessageType,
    attributes: Seq[Attribute]): CatalystConverter = {
  // For non-nested types we use the optimized Row converter
  if (attributes.forall(a => ParquetTypesConverter.isPrimitiveType(a.dataType))) {
    new CatalystPrimitiveRowConverter(attributes.toArray)
  } else {
    new CatalystGroupConverter(attributes.toArray)
  }
}
which fails later here:
new Iterator[Row] {
  def hasNext = iter.hasNext
  def next() = {
    val row = iter.next()._2.asInstanceOf[SpecificMutableRow]
throwing a ClassCastException, because a GenericRow cannot be cast to a SpecificMutableRow.
Am I missing something here?
@liancheng please suggest ...
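As a side note on the ClassCastException above, a minimal hedged sketch (the setup is illustrative, not the actual Parquet read path; the class names are from the Spark 1.2 catalyst package): a GenericRow is not a SpecificMutableRow, so the unconditional cast fails as soon as the group converter is chosen.

import org.apache.spark.sql.catalyst.expressions.{GenericRow, Row, SpecificMutableRow}

// What the group converter hands back for a non-primitive schema (illustrative values).
val fromGroupConverter: Row = new GenericRow(Array[Any]("dummy", 9L))
// Throws ClassCastException: GenericRow cannot be cast to SpecificMutableRow.
val row = fromGroupConverter.asInstanceOf[SpecificMutableRow]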
Mind tagging this with [SQL] so it can get properly sorted?
Force-pushed from 1eab60c to 30fdcec.
ok to test
Hey @saucam, partitioning support in the old Parquet code path is quite limited (it only handles one partition column, whose type must be INT). PR #4308 and upcoming follow-up PRs aim to provide full support for multi-level partitioning and schema merging. Also, Parquet tables converted from Hive metastore tables will retain the schema and location information inherited from the metastore. We plan to deprecate the old Parquet implementation in favor of the new Parquet data source in 1.3, and would like to remove the old one once the new implementation proves stable enough.
Test build #27224 has finished for PR 4469 at commit
Hi @liancheng, thanks for the comments. We are using Spark 1.2.1, where the old Parquet support is in use. Can this be merged so that we have proper partitioning with different locations as well? I tried partitioning on 2 columns and it worked fine (and I also applied this patch to specify a different location).
Force-pushed from 30fdcec to 2dd9dbb.
Test build #27778 has finished for PR 4469 at commit
Hi @liancheng, any update on this one? I think it will be useful for people using Spark 1.2.1, since the old Parquet path might suit their needs better in that version.
Hey @saucam, I'm pretty hesitant to make big changes to branch-1.2 unless a lot of users are reporting a problem. Do the problems you describe still exist in branch-1.3, or should we close this issue?
Hi @marmbrus, this is a pretty common scenario in production: the data is generated in some directory, and partitions are later added to the table using ALTER TABLE ... ADD PARTITION (key=value) LOCATION '<directory where the data is generated, whose path does not contain key=value>'.
While parsing partition keys from the partition locations in ParquetRelation, it is assumed that the location path string always contains the partition keys, which is not true. A different location can be specified when adding partitions to the table, which results in a "key not found" exception when reading from such partitions (a sketch of the assumption follows, then the reproduction steps).
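To make the assumption concrete, here is a hedged sketch (the helper partitionValueFromPath and the example paths are illustrative, not code from Spark or from this patch) of what deriving a partition value from the location path amounts to, and why it throws for a custom location:

// Hypothetical helper, simplified: recover a partition value by scanning the
// location path for "key=value" segments.
def partitionValueFromPath(path: String, key: String): String =
  path.split("/")
    .map(_.split("=", 2))
    .collectFirst { case Array(k, v) if k == key => v }
    .getOrElse(throw new NoSuchElementException(s"key not found: $key"))

partitionValueFromPath("/warehouse/test_table/timestamp=9", "timestamp") // "9"
partitionValueFromPath("/data/pth/different", "timestamp")               // throws: key not found: timestamp

A custom partition location has no "timestamp=9" segment to scan, which is exactly the NoSuchElementException shown in the stack trace below.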
Create a partitioned Parquet table:
create table test_table (dummy string) partitioned by (timestamp bigint) stored as parquet;
Add a partition to the table and specify a different location:
alter table test_table add partition (timestamp=9) location '/data/pth/different'
Run a simple SELECT * query and we get an exception:
15/02/09 08:27:25 ERROR thriftserver.SparkSQLDriver: Failed in [select * from db4_mi2mi_binsrc1_default limit 5]
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 21, localhost): java.util.NoSuchElementException: key not found: timestamp
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:58)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:58)
at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
at org.apache.spark.sql.parquet.ParquetTableScan$$anonfun$execute$4$$anonfun$6.apply(ParquetTableOperations.scala:141)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
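For reference, a hedged sketch of the direction the PR title describes (the PartitionInput name and helper below are illustrative, not the actual patch): pass the partition spec from the metastore along with each partition location, so partition values never have to be re-parsed out of the path.

// Hypothetical shape of the idea: keep the metastore's partition spec next to
// the location instead of deriving it from the path string.
case class PartitionInput(location: String, spec: Map[String, String])

def partitionValue(input: PartitionInput, key: String): String =
  input.spec.getOrElse(key,
    throw new NoSuchElementException(s"key not found: $key"))

val p = PartitionInput("/data/pth/different", Map("timestamp" -> "9"))
partitionValue(p, "timestamp") // "9", even though the path contains no "timestamp=9"

With the spec carried explicitly, the lookup succeeds regardless of how the partition directory is named.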