[opt](parquet) change parquet init footer read size to 48KB #46904

Merged
merged 1 commit into from
Jan 16, 2025

@morningman morningman commented Jan 13, 2025

What problem does this PR solve?

Change the initial footer read size from 128KB to 48KB to reduce the amount of data read when opening each Parquet file.
This matches the value used by Presto/Trino: a 1GB Parquet file typically has a footer of 30~40KB, so a 48KB read usually still covers the whole footer.

A user case with 30 thousand Parquet files shows the footer parse time dropping from:

ParseFooterTime:  avg  2s28ms,  max  3s707ms,  min  905.866ms

to

ParseFooterTime:  avg  886.364ms,  max  1s734ms,  min  391.846ms
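
For readers unfamiliar with why the initial read size matters, here is a minimal Python sketch of a speculative footer read. It is not the Doris implementation: the function names, the `make_fake_parquet` helper, and the "re-read the whole footer on a miss" fallback are assumptions made for illustration. It relies only on the Parquet file layout, where a file ends with the footer metadata, a 4-byte little-endian metadata length, and the `PAR1` magic.

```python
import io
import struct

PARQUET_MAGIC = b"PAR1"
TAIL_SIZE = 8  # 4-byte little-endian metadata length + 4-byte magic
INITIAL_READ_SIZE = 48 * 1024  # value chosen in this PR (previously 128 * 1024)


def read_footer_metadata(f, file_size, initial_read_size=INITIAL_READ_SIZE):
    """Return (metadata_bytes, number_of_reads) for a seekable binary file.

    Hypothetical helper: reads the last `initial_read_size` bytes once and
    only issues a second read if the footer did not fit in that buffer.
    """
    if file_size < TAIL_SIZE + len(PARQUET_MAGIC):
        raise ValueError("file too small to be a Parquet file")

    # First (speculative) read: the tail of the file.
    read_size = min(file_size, initial_read_size)
    f.seek(file_size - read_size)
    buf = f.read(read_size)

    if buf[-4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic at end of file")
    (metadata_len,) = struct.unpack("<I", buf[-8:-4])

    needed = metadata_len + TAIL_SIZE
    if needed > file_size:
        raise ValueError("footer length exceeds file size")

    if needed <= read_size:
        # The whole footer is already in the buffer: one read was enough.
        return buf[read_size - needed:read_size - TAIL_SIZE], 1

    # Footer is larger than the speculative read: fetch it with a second read.
    f.seek(file_size - needed)
    return f.read(metadata_len), 2


def make_fake_parquet(metadata, body_size=200_000):
    """Build bytes with a Parquet-like layout (illustration only, not a valid file)."""
    body = b"\x00" * body_size
    return PARQUET_MAGIC + body + metadata + struct.pack("<I", len(metadata)) + PARQUET_MAGIC
```

With a footer in the 30~40KB range described above, the 48KB read still completes in one request; only files with unusually large footers pay for a second read.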

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Thearas commented Jan 13, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaka11chen kaka11chen left a comment

LGTM

@github-actions bot added the "approved" label (indicates a PR has been approved by one committer) on Jan 13, 2025

PR approved by at least one committer and no changes requested.

PR approved by anyone and no changes requested.

@morningman

run buildall

@doris-robot

TPC-H: Total hot run time: 32614 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 201efdbbd1b4dbebe8b8a70c4eced5c7f8fff439, data reload: false

------ Round 1 ----------------------------------
q1	17616	6489	6018	6018
q2	2056	304	183	183
q3	10406	1206	751	751
q4	10241	858	435	435
q5	7921	2170	1954	1954
q6	209	179	161	161
q7	885	754	594	594
q8	9235	1360	1152	1152
q9	5122	4905	4938	4905
q10	6745	2316	1847	1847
q11	466	279	264	264
q12	336	352	229	229
q13	17761	3713	3093	3093
q14	236	222	219	219
q15	572	512	505	505
q16	637	625	590	590
q17	556	845	309	309
q18	6932	6500	6360	6360
q19	1528	968	554	554
q20	305	328	195	195
q21	2918	2203	1985	1985
q22	375	341	311	311
Total cold run time: 103058 ms
Total hot run time: 32614 ms

----- Round 2, with runtime_filter_mode=off -----
q1	6262	6217	6250	6217
q2	242	332	237	237
q3	2276	2635	2299	2299
q4	1416	1779	1329	1329
q5	4328	4686	4818	4686
q6	180	180	144	144
q7	2070	1968	1794	1794
q8	2627	2854	2762	2762
q9	7219	7314	7246	7246
q10	3048	3325	2704	2704
q11	590	534	527	527
q12	684	794	649	649
q13	3473	3896	3240	3240
q14	282	299	293	293
q15	567	516	512	512
q16	661	689	663	663
q17	1237	1762	1262	1262
q18	7701	7549	7337	7337
q19	815	946	1163	946
q20	1993	2095	1863	1863
q21	5714	5182	4962	4962
q22	629	594	587	587
Total cold run time: 54014 ms
Total hot run time: 52259 ms

@doris-robot

TPC-DS: Total hot run time: 194500 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 201efdbbd1b4dbebe8b8a70c4eced5c7f8fff439, data reload: false

query1	1320	977	928	928
query2	6462	2432	2354	2354
query3	11121	4755	4910	4755
query4	32884	23494	23699	23494
query5	4241	618	472	472
query6	293	203	188	188
query7	4009	483	302	302
query8	300	230	217	217
query9	9440	2668	2676	2668
query10	467	298	256	256
query11	17641	15209	14832	14832
query12	150	119	106	106
query13	1548	495	381	381
query14	9116	6540	7727	6540
query15	234	210	187	187
query16	8220	657	466	466
query17	1527	781	589	589
query18	2102	410	324	324
query19	223	186	178	178
query20	127	114	116	114
query21	205	126	106	106
query22	4497	4585	4452	4452
query23	34121	33058	33141	33058
query24	6528	2331	2363	2331
query25	529	485	424	424
query26	818	276	160	160
query27	2034	483	333	333
query28	5879	2477	2441	2441
query29	641	560	429	429
query30	230	185	155	155
query31	949	869	867	867
query32	84	60	58	58
query33	484	370	313	313
query34	746	886	520	520
query35	800	840	744	744
query36	993	1044	956	956
query37	123	108	79	79
query38	4070	4132	4122	4122
query39	1504	1508	1461	1461
query40	204	117	107	107
query41	52	57	51	51
query42	122	103	100	100
query43	536	522	484	484
query44	1344	831	833	831
query45	176	173	169	169
query46	884	1063	661	661
query47	1941	1908	1845	1845
query48	376	423	314	314
query49	720	498	388	388
query50	646	705	422	422
query51	7203	7042	7001	7001
query52	108	111	103	103
query53	234	260	216	216
query54	498	493	419	419
query55	81	85	82	82
query56	269	266	255	255
query57	1187	1221	1141	1141
query58	250	235	229	229
query59	3340	3206	3153	3153
query60	289	279	261	261
query61	117	116	121	116
query62	891	791	727	727
query63	223	193	191	191
query64	3501	1105	738	738
query65	3253	3215	3247	3215
query66	804	448	330	330
query67	16420	15737	15475	15475
query68	9008	698	517	517
query69	497	284	268	268
query70	1190	1139	1071	1071
query71	437	283	263	263
query72	6515	3886	3894	3886
query73	656	758	353	353
query74	10623	9261	8734	8734
query75	4679	3125	2651	2651
query76	4318	1266	753	753
query77	835	394	269	269
query78	10025	10034	9348	9348
query79	3682	806	583	583
query80	718	521	442	442
query81	483	274	243	243
query82	649	149	125	125
query83	199	167	160	160
query84	289	97	76	76
query85	729	343	298	298
query86	356	323	300	300
query87	4626	4257	4360	4257
query88	4266	2171	2141	2141
query89	414	339	290	290
query90	1864	193	189	189
query91	132	133	111	111
query92	68	57	53	53
query93	1989	873	534	534
query94	659	389	297	297
query95	333	267	254	254
query96	484	607	278	278
query97	2882	2949	2798	2798
query98	224	202	194	194
query99	1631	1510	1380	1380
Total cold run time: 297789 ms
Total hot run time: 194500 ms

@doris-robot

ClickBench: Total hot run time: 32.11 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 201efdbbd1b4dbebe8b8a70c4eced5c7f8fff439, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.04
query3	0.23	0.07	0.07
query4	1.60	0.11	0.11
query5	0.41	0.39	0.38
query6	1.16	0.66	0.64
query7	0.03	0.02	0.02
query8	0.04	0.04	0.04
query9	0.58	0.51	0.51
query10	0.56	0.57	0.57
query11	0.15	0.10	0.10
query12	0.14	0.11	0.12
query13	0.60	0.60	0.60
query14	2.72	2.87	2.73
query15	0.89	0.84	0.83
query16	0.40	0.38	0.40
query17	1.06	1.02	1.07
query18	0.24	0.21	0.20
query19	1.97	1.88	2.01
query20	0.01	0.01	0.01
query21	15.37	1.01	0.60
query22	0.75	0.75	0.70
query23	15.27	1.44	0.54
query24	2.94	1.52	1.94
query25	0.15	0.06	0.13
query26	0.23	0.15	0.13
query27	0.07	0.04	0.04
query28	14.65	1.51	1.06
query29	12.61	4.08	3.43
query30	0.25	0.08	0.06
query31	2.84	0.60	0.38
query32	3.24	0.55	0.46
query33	3.09	3.21	3.09
query34	16.77	5.17	4.52
query35	4.56	4.53	4.47
query36	0.74	0.52	0.48
query37	0.10	0.06	0.07
query38	0.05	0.04	0.03
query39	0.04	0.02	0.02
query40	0.16	0.13	0.14
query41	0.08	0.02	0.02
query42	0.03	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 106.93 s
Total hot run time: 32.11 s

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 40.39% (10529/26071)
Line Coverage: 31.12% (89145/286494)
Region Coverage: 30.23% (45575/150773)
Branch Coverage: 26.52% (23160/87346)
Coverage Report: http://coverage.selectdb-in.cc/coverage/201efdbbd1b4dbebe8b8a70c4eced5c7f8fff439_201efdbbd1b4dbebe8b8a70c4eced5c7f8fff439/report/index.html

@suxiaogang223 suxiaogang223 left a comment

+1

@morningman morningman merged commit c16567e into apache:master Jan 16, 2025
30 of 32 checks passed
morningman added a commit that referenced this pull request Feb 17, 2025
lzyy2024 pushed a commit to lzyy2024/doris that referenced this pull request Feb 21, 2025