forked from apache/spark
[SPARK-40034][SQL] PathOutputCommitters to support dynamic partitions
### What changes were proposed in this pull request?

Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a PathOutputCommitter is compatible with dynamic partition overwrite.

This patch has unit tests but not integration tests; it really needs to test the SQL commands through the manifest committer into gcs/abfs, or at least local fs. That would be possible once hadoop 3.3.5 is out...

### Why are the changes needed?

Hadoop 3.3.5 adds a new committer in mapreduce-core which works fast and correctly on azure and gcs. (It would also work on hdfs, but it's optimised for the cloud stores.)

The stores and the committer do meet the requirements of Spark SQL Dynamic Partition Overwrite, so it is OK for spark to work through it.

Spark does not know this; MAPREDUCE-7403 adds a way for any PathOutputCommitter to declare that it is compatible; the IntermediateManifestCommitter will do so. (apache/hadoop#4728)

### Does this PR introduce _any_ user-facing change?

No. There is documentation on the feature in the hadoop [manifest committer](https://github.com/apache/hadoop/blob/82372d0d22e696643ad97490bc902fb6d17a6382/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/site/markdown/manifest_committer.md) docs.

### How was this patch tested?

1. Unit tests in hadoop-cloud which work with hadoop versions with/without the matching change.
2. New integration tests in https://github.com/hortonworks-spark/cloud-integration which require spark to be built against hadoop with the manifest committer declaring compatibility.

Those new integration tests include

* spark sql test derived from spark's own [CloudRelationBasicSuite.scala#L212](https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/test/scala/org/apache/spark/sql/sources/CloudRelationBasicSuite.scala#L212)
* Dataset tests extended to verify support for/rejection of dynamic partition overwrite [AbstractCommitDataframeSuite.scala#L151](https://github.com/hortonworks-spark/cloud-integration/blob/master/cloud-examples/src/test/scala/com/cloudera/spark/cloud/committers/AbstractCommitDataframeSuite.scala#L151)

Tested against azure cardiff with the manifest committer; s3 london (s3a committers reject dynamic partition overwrites).

Closes apache#37468 from steveloughran/SPARK-40034-MAPREDUCE-7403-manifest-committer-partitioning.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
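
The capability probe described in the message can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in the patch: the nested `StreamCapabilities` interface is a local stand-in for Hadoop's `org.apache.hadoop.fs.StreamCapabilities`, the two committer classes are hypothetical, and the capability string is an assumption about the name MAPREDUCE-7403 uses.

```java
public class DynamicPartitionProbe {

    /** Local stand-in for org.apache.hadoop.fs.StreamCapabilities (assumption: same method shape). */
    interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    /** Assumed capability name; check the Hadoop release you build against. */
    static final String CAPABILITY_DYNAMIC_PARTITIONING =
        "mapreduce.job.committer.dynamic.partitioning";

    /**
     * Returns true only if the committer exposes the probe interface AND
     * answers yes for the dynamic partitioning capability. Committers that
     * predate the probe are treated as incompatible.
     */
    static boolean supportsDynamicPartitions(Object committer) {
        return committer instanceof StreamCapabilities
            && ((StreamCapabilities) committer)
                   .hasCapability(CAPABILITY_DYNAMIC_PARTITIONING);
    }

    /** Hypothetical committer that knows nothing about the probe. */
    static class LegacyCommitter { }

    /** Hypothetical committer that declares compatibility, as the manifest committer is said to. */
    static class ManifestLikeCommitter implements StreamCapabilities {
        @Override
        public boolean hasCapability(String capability) {
            return CAPABILITY_DYNAMIC_PARTITIONING.equals(capability);
        }
    }

    public static void main(String[] args) {
        System.out.println(supportsDynamicPartitions(new LegacyCommitter()));
        System.out.println(supportsDynamicPartitions(new ManifestLikeCommitter()));
    }
}
```

Probing through an interface check plus a string capability means the caller needs no compile-time dependency on any particular committer class, which is presumably why the unit tests can run against Hadoop versions with or without the matching change.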
1 parent e83aedd, commit 5a599de
Showing 5 changed files with 374 additions and 47 deletions.