Commit db9e1ac

Fokko authored and dongjoon-hyun committed
[SPARK-48177][BUILD] Upgrade Apache Parquet to 1.14.1
### What changes were proposed in this pull request?

### Why are the changes needed?

Fixes quite a few bugs on the Parquet side: https://github.com/apache/parquet-mr/blob/master/CHANGES.md#version-1140

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Using the existing unit tests

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #46447 from Fokko/fd-bump-parquet.

Authored-by: Fokko Driesprong <fokko@tabular.io>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
1 parent 4ee37ed commit db9e1ac

9 files changed: +647 -646 lines changed

dev/deps/spark-deps-hadoop-3-hive-2.3

Lines changed: 7 additions & 6 deletions
@@ -108,6 +108,7 @@ jackson-core/2.17.1//jackson-core-2.17.1.jar
 jackson-databind/2.17.1//jackson-databind-2.17.1.jar
 jackson-dataformat-cbor/2.17.1//jackson-dataformat-cbor-2.17.1.jar
 jackson-dataformat-yaml/2.17.1//jackson-dataformat-yaml-2.17.1.jar
+jackson-datatype-jdk8/2.17.0//jackson-datatype-jdk8-2.17.0.jar
 jackson-datatype-jsr310/2.17.1//jackson-datatype-jsr310-2.17.1.jar
 jackson-mapper-asl/1.9.13//jackson-mapper-asl-1.9.13.jar
 jackson-module-scala_2.13/2.17.1//jackson-module-scala_2.13-2.17.1.jar
@@ -235,12 +236,12 @@ orc-shims/2.0.1//orc-shims-2.0.1.jar
 oro/2.0.8//oro-2.0.8.jar
 osgi-resource-locator/1.0.3//osgi-resource-locator-1.0.3.jar
 paranamer/2.8//paranamer-2.8.jar
-parquet-column/1.13.1//parquet-column-1.13.1.jar
-parquet-common/1.13.1//parquet-common-1.13.1.jar
-parquet-encoding/1.13.1//parquet-encoding-1.13.1.jar
-parquet-format-structures/1.13.1//parquet-format-structures-1.13.1.jar
-parquet-hadoop/1.13.1//parquet-hadoop-1.13.1.jar
-parquet-jackson/1.13.1//parquet-jackson-1.13.1.jar
+parquet-column/1.14.1//parquet-column-1.14.1.jar
+parquet-common/1.14.1//parquet-common-1.14.1.jar
+parquet-encoding/1.14.1//parquet-encoding-1.14.1.jar
+parquet-format-structures/1.14.1//parquet-format-structures-1.14.1.jar
+parquet-hadoop/1.14.1//parquet-hadoop-1.14.1.jar
+parquet-jackson/1.14.1//parquet-jackson-1.14.1.jar
 pickle/1.5//pickle-1.5.jar
 py4j/0.10.9.7//py4j-0.10.9.7.jar
 remotetea-oncrpc/1.1.2//remotetea-oncrpc-1.1.2.jar
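
For reviewers who want to check a manifest change like the one above mechanically, here is a minimal Python sketch. It is not part of this commit or of Spark's tooling. It assumes each manifest line has four `/`-separated fields; the third field is empty in every line shown here and is treated as an optional classifier, which is an assumption. The names `Dep`, `parse_dep_line`, and `version_changes` are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class Dep(NamedTuple):
    artifact: str
    version: str
    classifier: str  # assumed meaning of the third field; empty in the lines shown
    jar: str


def parse_dep_line(line: str) -> Dep:
    """Split one manifest line, e.g. 'oro/2.0.8//oro-2.0.8.jar', into its four fields."""
    parts = line.strip().split("/")
    if len(parts) != 4:
        raise ValueError(f"unexpected manifest line: {line!r}")
    return Dep(*parts)


def version_changes(
    old_lines: Iterable[str], new_lines: Iterable[str]
) -> Dict[str, Tuple[Optional[str], str]]:
    """Map artifact -> (old version or None, new version) for new or bumped artifacts."""
    old = {d.artifact: d.version for d in map(parse_dep_line, old_lines)}
    new = {d.artifact: d.version for d in map(parse_dep_line, new_lines)}
    return {
        name: (old.get(name), version)
        for name, version in new.items()
        if old.get(name) != version
    }
```

Fed the removed and added lines of this hunk, `version_changes` would report each `parquet-*` artifact as `("1.13.1", "1.14.1")` and `jackson-datatype-jdk8` as `(None, "2.17.0")`, since that jar is new in the manifest.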

pom.xml

Lines changed: 1 addition & 1 deletion
@@ -137,7 +137,7 @@
 <kafka.version>3.7.0</kafka.version>
 <!-- After 10.17.1.0, the minimum required version is JDK19 -->
 <derby.version>10.16.1.1</derby.version>
-<parquet.version>1.13.1</parquet.version>
+<parquet.version>1.14.1</parquet.version>
 <orc.version>2.0.1</orc.version>
 <orc.classifier>shaded-protobuf</orc.classifier>
 <jetty.version>11.0.21</jetty.version>

sql/core/benchmarks/BuiltInDataSourceWriteBenchmark-jdk21-results.txt

Lines changed: 30 additions & 30 deletions
@@ -2,69 +2,69 @@
 Parquet writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Parquet(PARQUET_1_0) writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1839 1907 96 8.6 116.9 1.0X
-Output Single Double Column 1832 1841 13 8.6 116.5 1.0X
-Output Int and String Column 4356 4494 195 3.6 277.0 0.4X
-Output Partitions 3233 3303 99 4.9 205.5 0.6X
-Output Buckets 4393 4506 160 3.6 279.3 0.4X
+Output Single Int Column 1732 1745 19 9.1 110.1 1.0X
+Output Single Double Column 1754 1758 7 9.0 111.5 1.0X
+Output Int and String Column 4309 4363 76 3.7 273.9 0.4X
+Output Partitions 3252 3350 139 4.8 206.8 0.5X
+Output Buckets 4487 4575 124 3.5 285.3 0.4X
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Parquet(PARQUET_2_0) writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 2057 2066 13 7.6 130.8 1.0X
-Output Single Double Column 1805 1813 11 8.7 114.8 1.1X
-Output Int and String Column 4771 4775 6 3.3 303.3 0.4X
-Output Partitions 3337 3339 3 4.7 212.2 0.6X
-Output Buckets 4441 4463 31 3.5 282.3 0.5X
+Output Single Int Column 1938 1978 55 8.1 123.2 1.0X
+Output Single Double Column 1762 1769 10 8.9 112.0 1.1X
+Output Int and String Column 4920 4932 17 3.2 312.8 0.4X
+Output Partitions 3385 3389 7 4.6 215.2 0.6X
+Output Buckets 4528 4538 14 3.5 287.9 0.4X
 
 
 ================================================================================================
 ORC writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 ORC writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1144 1168 35 13.8 72.7 1.0X
-Output Single Double Column 1612 1628 22 9.8 102.5 0.7X
-Output Int and String Column 3911 3915 7 4.0 248.6 0.3X
-Output Partitions 2600 2648 67 6.0 165.3 0.4X
-Output Buckets 3449 3477 40 4.6 219.3 0.3X
+Output Single Int Column 1137 1142 7 13.8 72.3 1.0X
+Output Single Double Column 1700 1705 6 9.3 108.1 0.7X
+Output Int and String Column 4028 4096 97 3.9 256.1 0.3X
+Output Partitions 2562 2582 28 6.1 162.9 0.4X
+Output Buckets 3524 3530 9 4.5 224.1 0.3X
 
 
 ================================================================================================
 JSON writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 JSON writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1627 1636 13 9.7 103.4 1.0X
-Output Single Double Column 2389 2390 1 6.6 151.9 0.7X
-Output Int and String Column 4283 4299 22 3.7 272.3 0.4X
-Output Partitions 3171 3192 29 5.0 201.6 0.5X
-Output Buckets 4120 4124 6 3.8 261.9 0.4X
+Output Single Int Column 1618 1645 37 9.7 102.9 1.0X
+Output Single Double Column 2398 2399 1 6.6 152.5 0.7X
+Output Int and String Column 3766 3778 17 4.2 239.5 0.4X
+Output Partitions 3162 3164 3 5.0 201.0 0.5X
+Output Buckets 4015 4028 18 3.9 255.3 0.4X
 
 
 ================================================================================================
 CSV writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 21.0.3+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 CSV writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 3536 3557 31 4.4 224.8 1.0X
-Output Single Double Column 3863 3894 44 4.1 245.6 0.9X
-Output Int and String Column 6363 6377 19 2.5 404.5 0.6X
-Output Partitions 5128 5148 29 3.1 326.0 0.7X
-Output Buckets 6613 6626 18 2.4 420.5 0.5X
+Output Single Int Column 3985 3993 11 3.9 253.4 1.0X
+Output Single Double Column 4148 4210 88 3.8 263.7 1.0X
+Output Int and String Column 6728 6741 18 2.3 427.8 0.6X
+Output Partitions 5431 5447 23 2.9 345.3 0.7X
+Output Buckets 6927 6942 22 2.3 440.4 0.6X
 
 

sql/core/benchmarks/BuiltInDataSourceWriteBenchmark-results.txt

Lines changed: 30 additions & 30 deletions
@@ -2,69 +2,69 @@
 Parquet writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Parquet(PARQUET_1_0) writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1778 1861 116 8.8 113.1 1.0X
-Output Single Double Column 1750 1757 10 9.0 111.2 1.0X
-Output Int and String Column 4290 4408 167 3.7 272.8 0.4X
-Output Partitions 3089 3259 240 5.1 196.4 0.6X
-Output Buckets 4269 4289 29 3.7 271.4 0.4X
+Output Single Int Column 1813 1881 96 8.7 115.3 1.0X
+Output Single Double Column 1976 1977 1 8.0 125.6 0.9X
+Output Int and String Column 4403 4438 50 3.6 279.9 0.4X
+Output Partitions 3388 3421 46 4.6 215.4 0.5X
+Output Buckets 4670 4680 15 3.4 296.9 0.4X
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 Parquet(PARQUET_2_0) writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1731 1744 19 9.1 110.0 1.0X
-Output Single Double Column 1803 1804 2 8.7 114.6 1.0X
-Output Int and String Column 4665 4672 10 3.4 296.6 0.4X
-Output Partitions 3290 3308 26 4.8 209.2 0.5X
-Output Buckets 4261 4327 93 3.7 270.9 0.4X
+Output Single Int Column 1903 1926 33 8.3 121.0 1.0X
+Output Single Double Column 1998 1998 0 7.9 127.0 1.0X
+Output Int and String Column 4916 4936 29 3.2 312.6 0.4X
+Output Partitions 3366 3375 13 4.7 214.0 0.6X
+Output Buckets 4560 4583 33 3.4 289.9 0.4X
 
 
 ================================================================================================
 ORC writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 ORC writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1072 1075 4 14.7 68.1 1.0X
-Output Single Double Column 1579 1580 0 10.0 100.4 0.7X
-Output Int and String Column 3815 3875 85 4.1 242.5 0.3X
-Output Partitions 2510 2511 1 6.3 159.6 0.4X
-Output Buckets 3441 3471 43 4.6 218.7 0.3X
+Output Single Int Column 1034 1039 7 15.2 65.8 1.0X
+Output Single Double Column 1687 1691 7 9.3 107.2 0.6X
+Output Int and String Column 3941 3955 20 4.0 250.6 0.3X
+Output Partitions 2553 2674 172 6.2 162.3 0.4X
+Output Buckets 3544 3548 6 4.4 225.3 0.3X
 
 
 ================================================================================================
 JSON writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 JSON writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 1635 1639 5 9.6 104.0 1.0X
-Output Single Double Column 2218 2230 17 7.1 141.0 0.7X
-Output Int and String Column 3948 3997 68 4.0 251.0 0.4X
-Output Partitions 3165 3240 105 5.0 201.2 0.5X
-Output Buckets 4132 4142 15 3.8 262.7 0.4X
+Output Single Int Column 1669 1686 24 9.4 106.1 1.0X
+Output Single Double Column 2342 2369 37 6.7 148.9 0.7X
+Output Int and String Column 3776 3805 42 4.2 240.0 0.4X
+Output Partitions 3060 3064 7 5.1 194.5 0.5X
+Output Buckets 4009 4052 60 3.9 254.9 0.4X
 
 
 ================================================================================================
 CSV writer benchmark
 ================================================================================================
 
-OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1018-azure
+OpenJDK 64-Bit Server VM 17.0.11+9-LTS on Linux 6.5.0-1022-azure
 AMD EPYC 7763 64-Core Processor
 CSV writer benchmark: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
 ------------------------------------------------------------------------------------------------------------------------
-Output Single Int Column 3680 3696 22 4.3 234.0 1.0X
-Output Single Double Column 3554 3559 7 4.4 225.9 1.0X
-Output Int and String Column 6396 6402 9 2.5 406.6 0.6X
-Output Partitions 4937 4942 7 3.2 313.9 0.7X
-Output Buckets 6288 6300 17 2.5 399.8 0.6X
+Output Single Int Column 3877 3889 18 4.1 246.5 1.0X
+Output Single Double Column 4079 4086 10 3.9 259.3 1.0X
+Output Int and String Column 6266 6269 4 2.5 398.4 0.6X
+Output Partitions 5432 5438 8 2.9 345.4 0.7X
+Output Buckets 6528 6530 4 2.4 415.0 0.6X
 
 
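
To read the before/after benchmark rows as relative changes, a small Python sketch such as the following may help. It is an illustration only, not part of the commit. It assumes each result row ends with six whitespace-separated fields (best, average, standard deviation, rate, per-row time, relative factor), matching the column headers above. The names `parse_row` and `pct_change` are hypothetical.

```python
from typing import Dict, Tuple

# Column names are illustrative labels for the five numeric headers in the results files.
COLUMNS = ("best_ms", "avg_ms", "stdev_ms", "rate_m_per_s", "per_row_ns")


def parse_row(row: str) -> Tuple[str, Dict[str, float]]:
    """Split a result row into its case name and its five numeric columns."""
    fields = row.rsplit(None, 6)  # the case name itself may contain spaces
    if len(fields) != 7:
        raise ValueError(f"unexpected benchmark row: {row!r}")
    name, *numbers, _relative = fields  # the trailing 'Relative' factor (e.g. 1.0X) is ignored
    return name, dict(zip(COLUMNS, map(float, numbers)))


def pct_change(old: float, new: float) -> float:
    """Signed percentage change from old to new; negative means the new run was faster."""
    return (new - old) / old * 100.0
```

For example, in the JDK 21 results the PARQUET_1_0 "Output Single Int Column" best time goes from 1839 ms to 1732 ms, which `pct_change` reports as roughly -5.8%. The CSV "Output Single Int Column" best time in the same file goes from 3536 ms to 3985 ms, roughly +12.7%, even though CSV writing does not go through Parquet. The kernel build also differs between the two runs (6.5.0-1018-azure versus 6.5.0-1022-azure), so shifts of this size may reflect the run environment rather than the upgrade.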
