@dtenedor
Contributor

@dtenedor dtenedor commented Jan 3, 2023

What changes were proposed in this pull request?

This PR fixes a correctness bug related to column DEFAULT values in the ORC reader.

Why are the changes needed?

Without this fix, the ORC reader can return incorrect results for columns with DEFAULT values.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

This PR updates a unit test to verify that ORC scans return correct results.

@dtenedor dtenedor changed the title to [SPARK-41862][SQL] Fix correctness bug related to DEFAULT values in Orc reader on Jan 3, 2023
@github-actions github-actions bot added the SQL label Jan 3, 2023
@dtenedor
Contributor Author

dtenedor commented Jan 3, 2023

@gengliangwang and @dongjoon-hyun FYI, here is the Orc fix.

Member

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @dtenedor. I'll verify the perf today from my side too.

test("INSERT rows, ALTER TABLE ADD COLUMNS with DEFAULTs, then SELECT them") {
case class Config(
sqlConf: Option[(String, String)],
insertNullsToStorage: Boolean = true,
Member

Nice clean up too.

@dtenedor
Contributor Author

dtenedor commented Jan 3, 2023

> Thank you, @dtenedor. I'll verify the perf today from my side too.

I re-ran the Orc benchmark with this PR; here are the results for reference:

================================================================================================
SQL Single Numeric Column Scan
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
SQL Single TINYINT Column Scan:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1552           1609          80         10.1          98.7       1.0X
Native ORC MR                                      1383           1384           2         11.4          87.9       1.1X
Native ORC Vectorized                               196            245          56         80.3          12.4       7.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
SQL Single SMALLINT Column Scan:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1651           1666          22          9.5         105.0       1.0X
Native ORC MR                                      1267           1269           4         12.4          80.5       1.3X
Native ORC Vectorized                               163            202          49         96.4          10.4      10.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
SQL Single INT Column Scan:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1776           1812          51          8.9         112.9       1.0X
Native ORC MR                                      1371           1399          40         11.5          87.2       1.3X
Native ORC Vectorized                               211            274          72         74.7          13.4       8.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
SQL Single BIGINT Column Scan:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1767           1915         209          8.9         112.4       1.0X
Native ORC MR                                      1334           1395          86         11.8          84.8       1.3X
Native ORC Vectorized                               224            302          74         70.2          14.2       7.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
SQL Single FLOAT Column Scan:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1795           1876         114          8.8         114.1       1.0X
Native ORC MR                                      1513           1523          14         10.4          96.2       1.2X
Native ORC Vectorized                               279            323          66         56.4          17.7       6.4X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
SQL Single DOUBLE Column Scan:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1713           1762          70          9.2         108.9       1.0X
Native ORC MR                                      1372           1398          38         11.5          87.2       1.2X
Native ORC Vectorized                               309            334          18         50.9          19.7       5.5X


================================================================================================
Int and String Scan
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Int and String Scan:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  3571           3711         197          2.9         340.6       1.0X
Native ORC MR                                      2750           2824         104          3.8         262.3       1.3X
Native ORC Vectorized                              1749           1827         111          6.0         166.8       2.0X


================================================================================================
Partitioned Table Scan
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Partitioned Table:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Data column - Hive built-in ORC                    1976           2115         197          8.0         125.6       1.0X
Data column - Native ORC MR                        1670           1769         140          9.4         106.2       1.2X
Data column - Native ORC Vectorized                 223            271          47         70.4          14.2       8.8X
Partition column - Hive built-in ORC               1352           1395          61         11.6          86.0       1.5X
Partition column - Native ORC MR                   1054           1113          84         14.9          67.0       1.9X
Partition column - Native ORC Vectorized             67            120          58        235.3           4.3      29.6X
Both columns - Hive built-in ORC                   2130           2139          13          7.4         135.4       0.9X
Both columns - Native ORC MR                       1642           1646           6          9.6         104.4       1.2X
Both columns - Native ORC Vectorized                235            253          15         66.9          14.9       8.4X


================================================================================================
Repeated String Scan
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Repeated String:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1617           1685          96          6.5         154.2       1.0X
Native ORC MR                                      1292           1292           0          8.1         123.2       1.3X
Native ORC Vectorized                               243            255           8         43.1          23.2       6.6X


================================================================================================
String with Nulls Scan
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
String with Nulls Scan (0.0%):            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  2735           2816         114          3.8         260.8       1.0X
Native ORC MR                                      2157           2248         129          4.9         205.7       1.3X
Native ORC Vectorized                               677            686          15         15.5          64.5       4.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
String with Nulls Scan (50.0%):           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  2610           2620          14          4.0         248.9       1.0X
Native ORC MR                                      2134           2181          67          4.9         203.5       1.2X
Native ORC Vectorized                               962            983          32         10.9          91.8       2.7X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
String with Nulls Scan (95.0%):           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1606           1632          37          6.5         153.1       1.0X
Native ORC MR                                      1219           1239          28          8.6         116.3       1.3X
Native ORC Vectorized                               284            292           7         37.0          27.0       5.7X


================================================================================================
Single Column Scan From Wide Columns
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Single Column Scan from 100 columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  1384           1466         117          0.8        1319.7       1.0X
Native ORC MR                                       204            303          94          5.1         194.7       6.8X
Native ORC Vectorized                                99            116          17         10.6          94.7      13.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Single Column Scan from 200 columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  2313           2437         176          0.5        2205.5       1.0X
Native ORC MR                                       252            314          90          4.2         239.9       9.2X
Native ORC Vectorized                               177            280          97          5.9         169.0      13.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Single Column Scan from 300 columns:      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  3332           3456         175          0.3        3178.1       1.0X
Native ORC MR                                       336            348          12          3.1         320.2       9.9X
Native ORC Vectorized                               232            288          80          4.5         221.6      14.3X


================================================================================================
Struct scan
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Single Struct Column Scan with 10 Fields:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                   550            611          87          1.9         524.5       1.0X
Native ORC MR                                       456            488          38          2.3         434.7       1.2X
Native ORC Vectorized                               209            215           9          5.0         199.1       2.6X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Single Struct Column Scan with 100 Fields:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                   3925           4114         268          0.3        3743.0       1.0X
Native ORC MR                                       3635           3638           5          0.3        3466.4       1.1X
Native ORC Vectorized                               1840           1860          27          0.6        1755.2       2.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Single Struct Column Scan with 300 Fields:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  12554          12628         105          0.1       11972.9       1.0X
Native ORC MR                                      14810          14823          19          0.1       14123.6       0.8X
Native ORC Vectorized                              14279          14405         178          0.1       13617.3       0.9X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Single Struct Column Scan with 600 Fields:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
-------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                  26014          26116         145          0.0       24808.7       1.0X
Native ORC MR                                      39066          39790        1023          0.0       37256.4       0.7X
Native ORC Vectorized                              39048          39152         148          0.0       37238.7       0.7X


================================================================================================
Nested Struct scan
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Nested Struct Scan with 10 Elements, 10 Fields:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                        4719           4800         115          0.2        4500.1       1.0X
Native ORC MR                                            5100           5214         161          0.2        4864.0       0.9X
Native ORC Vectorized                                    1181           1185           5          0.9        1126.6       4.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Nested Struct Scan with 30 Elements, 10 Fields:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                       12792          12976         260          0.1       12199.6       1.0X
Native ORC MR                                           12832          12952         170          0.1       12237.6       1.0X
Native ORC Vectorized                                    3209           3214           7          0.3        3060.1       4.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_131-b11 on Mac OS X 10.16
Apple M1 Max
Nested Struct Scan with 10 Elements, 30 Fields:  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------------
Hive built-in ORC                                       10448          10517          98          0.1        9963.6       1.0X
Native ORC MR                                           13308          13308           0          0.1       12691.7       0.8X
Native ORC Vectorized                                    3292           3337          64          0.3        3139.1       3.2X
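
For readers cross-checking the tables above, here is a minimal Python sketch of how the derived columns appear to relate to each other. It uses only the numbers printed in the "SQL Single TINYINT Column Scan" table. Two things are assumptions inferred from those numbers rather than stated in the output: that `Relative` is the first row's Best Time divided by each row's Best Time, and that `Rate(M/s)` and `Per Row(ns)` are reciprocals of each other. The names `TINYINT_SCAN`, `relative_speedups`, and `rate_times_per_row` are hypothetical helpers for illustration only.

```python
# Values copied from the "SQL Single TINYINT Column Scan" table above:
# (case name, Best Time in ms, Rate in M rows/s, Per Row in ns, reported Relative).
TINYINT_SCAN = [
    ("Hive built-in ORC", 1552, 10.1, 98.7, 1.0),
    ("Native ORC MR", 1383, 11.4, 87.9, 1.1),
    ("Native ORC Vectorized", 196, 80.3, 12.4, 7.9),
]


def relative_speedups(rows):
    """Return {case: baseline Best Time / case Best Time}, rounded to one decimal.

    Assumption (inferred from the numbers): the first row is the baseline.
    """
    baseline_ms = rows[0][1]
    return {name: round(baseline_ms / best_ms, 1) for name, best_ms, *_ in rows}


def rate_times_per_row(rows):
    """Return {case: Rate(M/s) * Per Row(ns)}.

    If the two columns are reciprocals, the product is about 1000
    (1e6 rows/s times 1e-9 s/row, scaled by the printed units).
    """
    return {name: rate * per_row for name, _, rate, per_row, _ in rows}


if __name__ == "__main__":
    for name, speedup in relative_speedups(TINYINT_SCAN).items():
        print(f"{name}: {speedup}X")
```

The same check can be repeated on any other table by swapping in its rows; small deviations are expected because the printed values are already rounded.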


@HyukjinKwon
Member

Merged to master.

@dongjoon-hyun
Member

Thank you all. +1, late LGTM.

@LuciferYang
Contributor

late LGTM

dongjoon-hyun added a commit that referenced this pull request Jan 4, 2023
### What changes were proposed in this pull request?

This PR is a follow-up of #39370.

### Why are the changes needed?

To sync the patch with the recovered perf result.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.
- Java 8: https://github.com/dongjoon-hyun/spark/actions/runs/3834890434
- Java 11: https://github.com/dongjoon-hyun/spark/actions/runs/3834892478
- Java 17: https://github.com/dongjoon-hyun/spark/actions/runs/3834893844

Closes #39380 from dongjoon-hyun/SPARK-41862.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>