Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](parquet) return error if schema changed in complex types #31128

Merged
merged 1 commit into from
Feb 19, 2024

Conversation

AshinGau
Copy link
Member

@AshinGau AshinGau commented Feb 19, 2024

Proposed changes

Check the column type of complex type to prevent core dump in BE. ColumnReader will throw segmentation fault in the following case:
Change complex types in hive:

hive> create table struct_test(
           id int,
           sf struct<f1: int, f2: map<string, string>>) stored as parquet;

hive> insert into struct_test values
          (1, named_struct('f1', 1, 'f2', str_to_map('1:s2,2:s2'))),
          (2, named_struct('f1', 2, 'f2', str_to_map('k1:s3,k2:s4'))),
          (3, named_struct('f1', 3, 'f2', str_to_map('k1:s5,k2:s6')));

hive> alter table struct_test change sf sf struct<f1:int, f2: string>;

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

@AshinGau
Copy link
Member Author

run buildall

Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@wm1581066 wm1581066 added the usercase Important user case type label label Feb 19, 2024
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@doris-robot
Copy link

TPC-H: Total hot run time: 41832 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit d354ba97ee6cea62ec4c90bd0e7936df67b33fc4, data reload: false

------ Round 1 ----------------------------------
q1	17601	4968	4904	4904
q2	2046	143	138	138
q3	10902	1043	1050	1043
q4	4971	1021	1018	1018
q5	8044	3326	3380	3326
q6	204	142	141	141
q7	1289	800	797	797
q8	9923	2092	2080	2080
q9	7678	6697	6677	6677
q10	8318	2637	2644	2637
q11	429	220	217	217
q12	734	327	335	327
q13	17991	3653	3678	3653
q14	300	266	262	262
q15	591	540	495	495
q16	476	435	415	415
q17	935	842	862	842
q18	7412	6772	6707	6707
q19	1521	1511	1495	1495
q20	596	359	351	351
q21	6475	3997	3961	3961
q22	868	347	346	346
Total cold run time: 109304 ms
Total hot run time: 41832 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4863	4832	4834	4832
q2	293	195	187	187
q3	3612	3602	3622	3602
q4	2589	2586	2561	2561
q5	5760	5748	5795	5748
q6	214	126	128	126
q7	2260	1708	1726	1708
q8	3024	3084	3111	3084
q9	8729	8747	8675	8675
q10	6910	4277	4251	4251
q11	530	377	372	372
q12	772	555	555	555
q13	4193	3443	3411	3411
q14	266	251	231	231
q15	602	502	497	497
q16	491	440	445	440
q17	1689	1624	1617	1617
q18	8302	7565	7527	7527
q19	1635	1656	1647	1647
q20	2131	1847	1820	1820
q21	6732	6353	6323	6323
q22	591	512	535	512
Total cold run time: 66188 ms
Total hot run time: 59726 ms

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Feb 19, 2024
Copy link
Contributor

PR approved by at least one committer and no changes requested.

Copy link
Contributor

PR approved by anyone and no changes requested.

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.77% (8558/23924)
Line Coverage: 27.72% (69393/250373)
Region Coverage: 26.84% (36018/134186)
Branch Coverage: 23.65% (18419/77890)
Coverage Report: http://coverage.selectdb-in.cc/coverage/d354ba97ee6cea62ec4c90bd0e7936df67b33fc4_d354ba97ee6cea62ec4c90bd0e7936df67b33fc4/report/index.html

@doris-robot
Copy link

TPC-DS: Total hot run time: 177438 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit d354ba97ee6cea62ec4c90bd0e7936df67b33fc4, data reload: false

query1	924	347	344	344
query2	6512	1663	1632	1632
query3	6703	207	218	207
query4	23124	21106	21072	21072
query5	4341	391	454	391
query6	266	183	181	181
query7	4615	289	311	289
query8	246	204	209	204
query9	8441	2821	2818	2818
query10	414	229	238	229
query11	15007	14520	14435	14435
query12	148	86	85	85
query13	1696	415	429	415
query14	9391	7547	7740	7547
query15	209	200	189	189
query16	7581	260	256	256
query17	1397	571	549	549
query18	1967	267	261	261
query19	195	148	149	148
query20	86	79	86	79
query21	188	123	128	123
query22	4864	4849	4764	4764
query23	32689	31657	31465	31465
query24	13052	3512	3430	3430
query25	664	371	365	365
query26	1897	155	177	155
query27	3033	315	318	315
query28	6504	1845	1843	1843
query29	1176	623	613	613
query30	286	138	149	138
query31	935	739	772	739
query32	93	64	55	55
query33	738	233	236	233
query34	1087	494	502	494
query35	947	829	822	822
query36	998	905	874	874
query37	263	59	61	59
query38	3273	3211	3204	3204
query39	1365	1328	1332	1328
query40	284	102	104	102
query41	38	36	36	36
query42	106	96	98	96
query43	473	447	457	447
query44	1067	689	702	689
query45	204	186	174	174
query46	1037	789	772	772
query47	1637	1524	1635	1524
query48	423	354	348	348
query49	1232	314	307	307
query50	789	394	374	374
query51	5271	5163	5246	5163
query52	110	89	93	89
query53	388	293	302	293
query54	291	218	226	218
query55	84	90	77	77
query56	212	196	204	196
query57	1038	974	939	939
query58	224	196	202	196
query59	2232	2114	2141	2114
query60	236	213	221	213
query61	88	84	81	81
query62	588	375	356	356
query63	331	284	286	284
query64	6134	3064	3149	3064
query65	3323	3280	3237	3237
query66	1324	342	328	328
query67	14613	14548	14255	14255
query68	5100	539	567	539
query69	509	352	356	352
query70	1326	1236	1185	1185
query71	363	266	251	251
query72	6302	2788	2606	2606
query73	701	310	310	310
query74	6868	6476	6537	6476
query75	3174	2588	2548	2548
query76	3318	1136	1205	1136
query77	366	243	232	232
query78	9404	8839	8729	8729
query79	973	502	508	502
query80	518	358	351	351
query81	397	204	206	204
query82	229	86	86	86
query83	140	129	127	127
query84	229	86	80	80
query85	753	356	335	335
query86	287	303	311	303
query87	3460	3270	3302	3270
query88	2719	2283	2283	2283
query89	432	375	349	349
query90	1936	168	165	165
query91	151	126	130	126
query92	56	50	47	47
query93	1012	529	510	510
query94	1211	184	185	184
query95	469	8633	373	373
query96	577	265	268	265
query97	4441	4266	4260	4260
query98	228	201	198	198
query99	1075	734	738	734
Total cold run time: 269816 ms
Total hot run time: 177438 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 30.5 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit d354ba97ee6cea62ec4c90bd0e7936df67b33fc4, data reload: false

query1	0.04	0.03	0.02
query2	0.06	0.02	0.02
query3	0.22	0.07	0.08
query4	1.65	0.11	0.10
query5	0.50	0.48	0.49
query6	1.36	0.61	0.62
query7	0.02	0.02	0.01
query8	0.04	0.03	0.03
query9	0.52	0.46	0.46
query10	0.47	0.49	0.48
query11	0.12	0.09	0.09
query12	0.13	0.10	0.10
query13	0.59	0.59	0.59
query14	0.76	0.79	0.79
query15	0.82	0.80	0.81
query16	0.33	0.32	0.32
query17	0.93	0.90	0.90
query18	0.16	0.18	0.19
query19	1.78	1.61	1.64
query20	0.01	0.02	0.02
query21	15.43	0.62	0.57
query22	2.79	3.60	1.88
query23	17.34	1.00	0.99
query24	2.00	0.37	0.57
query25	0.62	0.07	0.05
query26	0.16	0.15	0.14
query27	0.06	0.05	0.05
query28	12.03	0.87	0.82
query29	12.67	3.31	3.30
query30	0.55	0.52	0.45
query31	2.77	0.35	0.37
query32	3.34	0.47	0.48
query33	3.14	3.16	3.17
query34	15.35	4.51	4.48
query35	4.51	4.50	4.49
query36	1.07	0.95	0.95
query37	0.07	0.05	0.05
query38	0.03	0.03	0.03
query39	0.02	0.02	0.02
query40	0.17	0.15	0.14
query41	0.07	0.02	0.02
query42	0.02	0.03	0.02
query43	0.03	0.02	0.02
Total cold run time: 104.75 s
Total hot run time: 30.5 s

@doris-robot
Copy link

Load test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'

Load test result on commit d354ba97ee6cea62ec4c90bd0e7936df67b33fc4 with default session variables
Stream load json:         19 seconds loaded 2358488459 Bytes, about 118 MB/s
Stream load orc:          59 seconds loaded 1101869774 Bytes, about 17 MB/s
Stream load parquet:      31 seconds loaded 861443392 Bytes, about 26 MB/s
Insert into select:       14.3 seconds inserted 10000000 Rows, about 699K ops/s

@yiguolei yiguolei merged commit 0484d6b into apache:master Feb 19, 2024
31 of 34 checks passed
yiguolei pushed a commit that referenced this pull request Feb 20, 2024
Check the column type of complex type to prevent core dump in BE. ColumnReader will throw segmentation fault in the following case:
Change complex types in hive:

hive> create table struct_test(
           id int,
           sf struct<f1: int, f2: map<string, string>>) stored as parquet;

hive> insert into struct_test values
          (1, named_struct('f1', 1, 'f2', str_to_map('1:s2,2:s2'))),
          (2, named_struct('f1', 2, 'f2', str_to_map('k1:s3,k2:s4'))),
          (3, named_struct('f1', 3, 'f2', str_to_map('k1:s5,k2:s6')));

hive> alter table struct_test change sf sf struct<f1:int, f2: string>;
morningman pushed a commit that referenced this pull request Feb 22, 2024
morningman pushed a commit that referenced this pull request Feb 22, 2024
yiguolei pushed a commit that referenced this pull request Feb 22, 2024
morningman pushed a commit that referenced this pull request Mar 22, 2024
Followup: #31128
This optimization allows doris to correctly read struct type data after changing the schema from hive.

## Changing  struct schema  in hive:
```sql
hive> create table struct_test(id int,sf struct<f1: int, f2: string>) stored as parquet;

hive> insert into struct_test values
    >           (1, named_struct('f1', 1, 'f2', 's1')),
    >           (2, named_struct('f1', 2, 'f2', 's2')),
    >           (3, named_struct('f1', 3, 'f2', 's3'));

hive> alter table struct_test change sf sf struct<f1:int, f3:string>;

hive> select * from struct_test;
OK
1	{"f1":1,"f3":null}
2	{"f1":2,"f3":null}
3	{"f1":3,"f3":null}
Time taken: 5.298 seconds, Fetched: 3 row(s)
```

The previous result of doris was:
```sql
mysql> select * from struct_test;
+------+-----------------------+
| id   | sf                    |
+------+-----------------------+
|    1 | {"f1": 1, "f3": "s1"} |
|    2 | {"f1": 2, "f3": "s2"} |
|    3 | {"f1": 3, "f3": "s3"} |
+------+-----------------------+
```

Now the result is same as hive:

```sql
mysql> select * from struct_test;
+------+-----------------------+
| id   | sf                    |
+------+-----------------------+
|    1 | {"f1": 1, "f3": null} |
|    2 | {"f1": 2, "f3": null} |
|    3 | {"f1": 3, "f3": null} |
+------+-----------------------+
```
yiguolei pushed a commit that referenced this pull request Mar 22, 2024
Followup: #31128
This optimization allows doris to correctly read struct type data after changing the schema from hive.

## Changing  struct schema  in hive:
```sql
hive> create table struct_test(id int,sf struct<f1: int, f2: string>) stored as parquet;

hive> insert into struct_test values
    >           (1, named_struct('f1', 1, 'f2', 's1')),
    >           (2, named_struct('f1', 2, 'f2', 's2')),
    >           (3, named_struct('f1', 3, 'f2', 's3'));

hive> alter table struct_test change sf sf struct<f1:int, f3:string>;

hive> select * from struct_test;
OK
1	{"f1":1,"f3":null}
2	{"f1":2,"f3":null}
3	{"f1":3,"f3":null}
Time taken: 5.298 seconds, Fetched: 3 row(s)
```

The previous result of doris was:
```sql
mysql> select * from struct_test;
+------+-----------------------+
| id   | sf                    |
+------+-----------------------+
|    1 | {"f1": 1, "f3": "s1"} |
|    2 | {"f1": 2, "f3": "s2"} |
|    3 | {"f1": 3, "f3": "s3"} |
+------+-----------------------+
```

Now the result is same as hive:

```sql
mysql> select * from struct_test;
+------+-----------------------+
| id   | sf                    |
+------+-----------------------+
|    1 | {"f1": 1, "f3": null} |
|    2 | {"f1": 2, "f3": null} |
|    3 | {"f1": 3, "f3": null} |
+------+-----------------------+
```
mongo360 pushed a commit to mongo360/doris that referenced this pull request Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.5-merged reviewed usercase Important user case type label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants