@viirya commented Nov 20, 2024

Which issue does this PR close?

Closes #1102.

Rationale for this change

What changes are included in this PR?

How are these changes tested?

@viirya changed the title from "Support partition values in feature branch comet-parquet-exec" to "fix: Support partition values in feature branch comet-parquet-exec" on Nov 20, 2024
val dataSchemaParquet =
  new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)
val partitionSchemaParquet =
  new SparkToParquetSchemaConverter(conf).convert(scan.relation.partitionSchema)
Contributor

#1103 discusses how the schemas have already lost necessary information at this point. Should we construct a new partition schema from the true Parquet schema rather than the partitionSchema that may have lost/converted type information already?

Member Author

This is copied from existing code.

Actually, I can just convert the Spark schema to Arrow types in the JVM and serialize it to the native side. I did a similar thing in the shuffle writer. Then we won't lose any information.
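To illustrate the concern in this thread, here is a minimal, self-contained sketch of one way type information can collapse when only Parquet physical types are consulted, compared with a one-to-one mapping to Arrow type names. It is a toy model: `SparkType`, `parquetPhysical`, and `arrowType` are hypothetical names and are not Spark, Parquet, or Comet APIs. Whether this is the exact loss described in #1103 is an assumption.

```scala
// Toy stand-ins for a few Spark SQL types (hypothetical, not Spark's classes).
sealed trait SparkType
case object ByteType extends SparkType
case object ShortType extends SparkType
case object IntegerType extends SparkType
case object DateType extends SparkType
case object LongType extends SparkType

object SchemaLossSketch {
  // Parquet physical type used to store each type, ignoring logical annotations.
  // Several distinct Spark types share the INT32 physical type.
  def parquetPhysical(t: SparkType): String = t match {
    case ByteType | ShortType | IntegerType | DateType => "INT32"
    case LongType                                      => "INT64"
  }

  // Arrow type names: each toy Spark type keeps its own distinct Arrow type.
  def arrowType(t: SparkType): String = t match {
    case ByteType    => "Int8"
    case ShortType   => "Int16"
    case IntegerType => "Int32"
    case DateType    => "Date32"
    case LongType    => "Int64"
  }

  val all: Seq[SparkType] =
    Seq(ByteType, ShortType, IntegerType, DateType, LongType)

  def main(args: Array[String]): Unit =
    all.foreach { t =>
      println(s"$t -> parquet ${parquetPhysical(t)}, arrow ${arrowType(t)}")
    }
}
```

In this toy model, five input types map to only two Parquet physical types but to five distinct Arrow types, which is why sending Arrow types across the JVM/native boundary can preserve what a physical-type-only view would not.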

int64 start = 2;
int64 length = 3;
int64 file_size = 4;
repeated spark.spark_expression.Expr partition_values = 5;
Contributor

Aren't the partition values just strings?

Member Author

No. For Hive-partitioned tables, the partition values are directory names, which are strings, but once Spark reads these strings back, they are cast to the corresponding data types of the partition columns.
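The point above can be sketched in a few lines: the raw values come from path segments such as `year=2024`, and they become typed only after being cast against the partition schema. This is a simplified illustration, not Spark's actual partition-parsing code; `ColType`, `parsePath`, `castValue`, and `typedValues` are hypothetical names, and escaping and null handling are deliberately left out.

```scala
object PartitionValueSketch {
  // Hypothetical partition column types for this sketch.
  sealed trait ColType
  case object IntCol extends ColType
  case object StringCol extends ColType
  case object BoolCol extends ColType

  // Split a Hive-style path like "year=2024/month=11" into raw (name, string) pairs.
  def parsePath(path: String): Seq[(String, String)] =
    path.split("/").toSeq.filter(_.contains("=")).map { seg =>
      val i = seg.indexOf('=')
      seg.substring(0, i) -> seg.substring(i + 1)
    }

  // Cast one raw string to the declared column type.
  def castValue(raw: String, t: ColType): Any = t match {
    case IntCol    => raw.toInt
    case StringCol => raw
    case BoolCol   => raw.toBoolean
  }

  // Produce typed partition values for the columns present in the schema.
  def typedValues(path: String, schema: Map[String, ColType]): Map[String, Any] =
    parsePath(path).collect {
      case (name, raw) if schema.contains(name) =>
        name -> castValue(raw, schema(name))
    }.toMap
}
```

For example, with a schema that declares `year` and `month` as `IntCol` and `region` as `StringCol`, the path `year=2024/month=11/region=us` yields the integers 2024 and 11 plus the string `us`. This is why a plain string field would not be enough to carry the values once they have been cast.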

Contributor

Ah makes sense.

@viirya merged commit c3ad26e into apache:comet-parquet-exec on Nov 22, 2024
23 of 74 checks passed
@viirya deleted the partition_values branch on November 22, 2024 23:56