[SPARK-42779][SQL] Allow V2 writes to indicate advisory shuffle partition size
### What changes were proposed in this pull request?
This PR adds an API for data sources to indicate the advisory partition size for V2 writes.
### Why are the changes needed?
Data sources have an API to request a particular distribution and ordering of data for V2 writes. If AQE is enabled, the default session advisory partition size (64MB) is used as the target. Unfortunately, this default is often suboptimal and can lead to small files, because columnar file formats compress the written data well, so a 64MB shuffle partition can shrink to a much smaller file on disk. Spark should allow data sources to indicate the advisory shuffle partition size, just like it lets data sources request a particular number of partitions. This would allow data sources to estimate the compression ratio and incorporate it into the requested advisory partition size.
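As a rough illustration of the last point, a data source could scale its target file size by an estimated compression ratio to arrive at the advisory shuffle partition size it requests. The sketch below is not taken from this PR: the class `AdvisoryPartitionSizeSketch`, the method `advisoryPartitionSizeBytes`, and the idea of expressing the ratio as "shuffle bytes per written byte" are all assumptions made for illustration.

```java
// Hypothetical helper, not part of Spark: derives an advisory shuffle
// partition size from a target file size and an estimated compression ratio.
final class AdvisoryPartitionSizeSketch {

    // Default session advisory partition size cited in the description (64MB).
    static final long DEFAULT_ADVISORY_BYTES = 64L * 1024 * 1024;

    /**
     * @param targetFileSizeBytes       desired on-disk size of each written file
     * @param estimatedCompressionRatio assumed shuffle bytes per written byte
     *                                  (e.g. 4.0 if files end up 4x smaller)
     * @return the shuffle partition size to request, in bytes
     */
    static long advisoryPartitionSizeBytes(long targetFileSizeBytes,
                                           double estimatedCompressionRatio) {
        if (targetFileSizeBytes <= 0) {
            throw new IllegalArgumentException("target file size must be positive");
        }
        // The negated comparison also rejects NaN.
        if (!(estimatedCompressionRatio > 0)) {
            throw new IllegalArgumentException("compression ratio must be positive");
        }
        double scaled = targetFileSizeBytes * estimatedCompressionRatio;
        // Saturate instead of overflowing for very large inputs.
        return scaled >= Long.MAX_VALUE ? Long.MAX_VALUE : (long) Math.ceil(scaled);
    }

    public static void main(String[] args) {
        long targetFile = 128L * 1024 * 1024; // aim for 128MB files on disk
        long advisory = advisoryPartitionSizeBytes(targetFile, 4.0);
        System.out.println("advisory partition size in bytes: " + advisory);
    }
}
```

With a ratio of 1.0 and a 64MB target, the helper returns the same value as the session default, so the ratio is the only thing that moves the request away from it.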
### Does this PR introduce _any_ user-facing change?
Yes. However, the changes are backward compatible.
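One common way for a Java interface to gain a method without breaking existing implementations is a default method that returns a "not specified" value. The sketch below shows that pattern together with a check modeled on the error message in this commit's diff ("The advisory partition size can't be specified with unspecified distribution."). It is an illustration only: `WriteRequirementsSketch`, `hasSpecifiedDistribution`, `advisoryPartitionSizeBytes`, `WriteRequirementsValidator`, and the use of 0 as the sentinel are hypothetical names and assumptions, not the Spark connector API.

```java
// Hypothetical interface, NOT the actual Spark connector API.
interface WriteRequirementsSketch {

    // Stands in for a method that implementations already provide.
    boolean hasSpecifiedDistribution();

    // Stands in for the newly added method. The default of 0 is an assumed
    // sentinel meaning "no preference", so implementations written before
    // the method existed keep compiling and behave as before.
    default long advisoryPartitionSizeBytes() {
        return 0L;
    }
}

// Hypothetical validator mirroring the rule stated by the error message.
final class WriteRequirementsValidator {

    static void validate(WriteRequirementsSketch requirements) {
        long advisory = requirements.advisoryPartitionSizeBytes();
        if (advisory < 0) {
            throw new IllegalArgumentException(
                "The advisory partition size must not be negative.");
        }
        // An advisory size only makes sense when a distribution is requested.
        if (advisory > 0 && !requirements.hasSpecifiedDistribution()) {
            throw new IllegalArgumentException(
                "The advisory partition size can't be specified with unspecified distribution.");
        }
    }

    public static void main(String[] args) {
        // An "old" implementation that never heard of the new method.
        WriteRequirementsSketch legacy = () -> false;
        validate(legacy); // passes: nothing was requested
        System.out.println("legacy advisory size: " + legacy.advisoryPartitionSizeBytes());
    }
}
```

Under these assumptions, an implementation that overrides neither method changes nothing for its users, which is the sense in which such an addition is backward compatible.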
### How was this patch tested?
This PR extends the existing tests for V2 write distribution and ordering.
Closes apache#40421 from aokolnychyi/spark-42779.
Lead-authored-by: aokolnychyi <aokolnychyi@apple.com>
Co-authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
+          "The advisory partition size can't be specified with unspecified distribution."
+        ]
+      }
+    }
+  },
   "LOCATION_ALREADY_EXISTS" : {
     "message" : [
       "Cannot name the managed table as <identifier>, as its associated location <location> already exists. Please pick a different table name, or remove the existing location first."
@@ -2931,11 +2953,6 @@
       "Unsupported data type <dataType>."
     ]
   },
-  "_LEGACY_ERROR_TEMP_1178" : {
-    "message" : [
-      "The number of partitions can't be specified with unspecified distribution. Invalid writer requirements detected."
-    ]
-  },
   "_LEGACY_ERROR_TEMP_1181" : {
     "message" : [
       "Stream-stream join without equality predicate is not supported."