Revert optimization to reorder columns in parquet writer #17978

Merged 1 commit on Jun 21, 2023

raunaqmorarka (Member) commented Jun 20, 2023

Description

Some files produced by this optimization were ignored by Apache Spark.
Related to https://trinodb.slack.com/archives/CGB0QHWSW/p1687254718875619
Some versions of Databricks Runtime produce an exception when reading files with re-ordered columns.
Related to https://trinodb.slack.com/archives/CP1MUNEUX/p1685685945379909

Additional context and related issues

Original PR #17404
Files produced after that change were sometimes ignored by Apache Spark
and caused exceptions on DBR 12.2 LTS.
Apache Hive and the Trino Parquet reader had no problems reading any of the new files.
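
To triage a suspect file, it can help to check whether its column chunks follow the schema order. The sketch below is a hypothetical helper, not code from this PR. It assumes that "re-ordered columns" means the column chunks in a row group appear in a different order than the leaf columns of the file schema, which the discussion here does not spell out. The function name and its inputs (plain lists of column paths, collected with whatever Parquet metadata reader is at hand) are illustrative.

```python
from typing import List, Sequence, Tuple


def find_reordered_row_groups(
    schema_columns: Sequence[str],
    row_groups: Sequence[Sequence[str]],
) -> List[Tuple[int, List[str]]]:
    """Return (row_group_index, chunk_order) for every row group whose
    column chunk order differs from the schema's leaf column order.

    Assumption: both inputs are column paths already extracted from the
    file footer; this helper does not read Parquet files itself.
    """
    expected = list(schema_columns)
    mismatches = []
    for index, chunk_columns in enumerate(row_groups):
        actual = list(chunk_columns)
        # A row group is flagged when its chunks are not listed in
        # exactly the same order as the schema columns.
        if actual != expected:
            mismatches.append((index, actual))
    return mismatches


# Illustrative data only: the second row group stores "name" before "id".
schema = ["id", "name", "price"]
groups = [["id", "name", "price"], ["name", "id", "price"]]
for index, order in find_reordered_row_groups(schema, groups):
    print(f"row group {index} has chunk order {order}")
```

An empty result means every row group follows the schema order under the assumption above; it does not by itself prove that Spark or Databricks Runtime will read the file.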

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Hive, Delta, Hudi, Iceberg
* Fix parquet writer compatibility with Apache Spark and Databricks Runtime. ({issue}`17978`)

findinpath (Contributor) commented

Attaching zipped version of the problematic Parquet file retrieved from the original Slack discussion:

20230620_084514_00081_uva2j-f9aec2f4-825d-4f2a-af24-414e1cfdb404.zip

raunaqmorarka merged commit 91a41a8 into trinodb:master on Jun 21, 2023
raunaqmorarka deleted the pqw-revert branch on June 21, 2023 05:49
github-actions bot added this to the 420 milestone on Jun 21, 2023
Commits that referenced this pull request:
* raunaqmorarka added a commit to raunaqmorarka/presto on Jun 26, 2023
* raunaqmorarka added a commit on Jun 27, 2023
* nelsonspark pushed a commit to nelsonspark/trino on Jun 30, 2023

Each commit carries the same message:
Reproduces the problem fixed by #17978
by using CTAS on an existing file that reliably triggered the problem
and then reading the resulting table through Apache Hive and Spark.