[refactor](jni) unified jni framework for jdbc catalog #26317

Merged
1 commit merged into apache:master from unified_jni_fra_for_jdbc_catalog
Nov 13, 2023

zy-kkk
Member

@zy-kkk zy-kkk commented Nov 2, 2023

Proposed changes

Issue Number: close #xxx

This commit overhauls the JDBC connector. Instead of fetching individual ResultSet items through JNI calls, the connector now transfers data through the unified VectorTable data structure.

Key Changes:

  • Old Approach: Data was first read from the JDBC ResultSet into a List<Object[]>. The C++ code in jdbc_connector then read each column separately, making JNI calls into Java for the required format conversion and memory alignment.
  • New Approach with VectorTable: All data conversion is handled in Java. The C++ side only needs the VectorTable memory address to fill the Block structure directly.
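
As a rough illustration of the difference, the sketch below converts row-oriented data (the old List<Object[]> shape) into a single off-heap column on the Java side, so that native code would only need the buffer address. The class names, the one-byte null map, and the buffer layout are assumptions made for this example; they are not the actual VectorTable layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: NOT the actual Doris VectorTable layout. */
final class ColumnarSketch {
    /** One nullable INT column stored off-heap: a null map plus fixed-width values. */
    static final class IntColumn {
        final ByteBuffer nullMap; // 1 byte per row, 1 means NULL (assumed convention)
        final ByteBuffer values;  // 4 bytes per row, native byte order
        final int numRows;

        IntColumn(int numRows) {
            this.numRows = numRows;
            this.nullMap = ByteBuffer.allocateDirect(numRows);
            this.values = ByteBuffer.allocateDirect(numRows * Integer.BYTES)
                    .order(ByteOrder.nativeOrder());
        }
    }

    /** Converts one column of row-oriented data into a contiguous buffer, entirely in Java. */
    static IntColumn toIntColumn(List<Object[]> rows, int columnIndex) {
        IntColumn column = new IntColumn(rows.size());
        for (int row = 0; row < rows.size(); row++) {
            Object cell = rows.get(row)[columnIndex];
            boolean isNull = (cell == null);
            column.nullMap.put(row, (byte) (isNull ? 1 : 0));
            column.values.putInt(row * Integer.BYTES, isNull ? 0 : (Integer) cell);
        }
        return column;
    }

    public static void main(String[] args) {
        List<Object[]> rows = Arrays.asList(
                new Object[] {1, "a"},
                new Object[] {null, "b"},
                new Object[] {3, "c"});
        IntColumn column = toIntColumn(rows, 0);
        for (int row = 0; row < column.numRows; row++) {
            boolean isNull = column.nullMap.get(row) == 1;
            int value = column.values.getInt(row * Integer.BYTES);
            System.out.println("row " + row + ": " + (isNull ? "NULL" : String.valueOf(value)));
        }
    }
}
```

With a direct buffer like this, a native caller could obtain the address once (for example through the JNI function GetDirectBufferAddress) and copy the whole column, rather than calling back into Java per value.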

Benefits:

  • Simplified Logic: VectorTable removes most of the data fetching and conversion logic from the Backend (BE) side of the JDBC Catalog. The C++ code is cleaner and easier to maintain, with a clear separation of responsibilities.
  • Enhanced Efficiency: Data transformation stays within the Java layer, which reduces the number of JNI crossings between Java and C++ and the overhead they carry.
  • Improved Performance: Direct access to the VectorTable is expected to speed up data processing, improve memory management, and improve overall read performance from JDBC sources.

This refactor is part of the ongoing effort to improve the architecture and performance of the system. Streamlining the data path from JDBC to the internal data structures should make data operations more responsive and lower latency in data handling.

Impact:

The changes are expected to be backward compatible with existing JDBC data sources. However, thorough testing is recommended to ensure that all edge cases are handled correctly.

We encourage the community to test this new implementation and provide feedback on any issues or performance improvements observed.

Pending Tasks:

  • Optimization for Specific Data Types: JSONB, bitmap, and HLL values are currently read into a String Column and then cast to their target type. This is an interim solution; work is ongoing on a memory-aligned method that removes the need for casting in the C++ layer.

We aim to address this in future updates. Removing the cast should reduce the overhead and potential bottlenecks of the current process and streamline the data flow from JDBC to the internal representations for these complex data types.
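
The two-step handling described above can be pictured with a minimal sketch: values are first stored as text in a string column, and a separate pass then casts each string to its target type. Everything here is hypothetical (the class name, the offsets-plus-bytes layout, and the use of String::length as a stand-in for a real cast); in the PR the cast happens in the C++ layer.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

/** Illustrative only: a string column as offsets plus bytes, followed by a cast pass. */
final class StringThenCastSketch {
    final int[] endOffsets; // endOffsets[i] is the position where row i ends in data
    final byte[] data;      // all rows, UTF-8 encoded, stored back to back

    StringThenCastSketch(int[] endOffsets, byte[] data) {
        this.endOffsets = endOffsets;
        this.data = data;
    }

    /** Step 1: store every value as text, regardless of its logical type. */
    static StringThenCastSketch fromStrings(List<String> values) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int[] endOffsets = new int[values.size()];
        for (int i = 0; i < values.size(); i++) {
            byte[] encoded = values.get(i).getBytes(StandardCharsets.UTF_8);
            bytes.write(encoded, 0, encoded.length);
            endOffsets[i] = bytes.size();
        }
        return new StringThenCastSketch(endOffsets, bytes.toByteArray());
    }

    /** Reads one row back as a string. */
    String get(int row) {
        int start = (row == 0) ? 0 : endOffsets[row - 1];
        return new String(data, start, endOffsets[row] - start, StandardCharsets.UTF_8);
    }

    /** Step 2: a second pass converts each string into the target type. */
    <T> List<T> castAll(Function<String, T> cast) {
        List<T> out = new ArrayList<>(endOffsets.length);
        for (int row = 0; row < endOffsets.length; row++) {
            out.add(cast.apply(get(row)));
        }
        return out;
    }

    public static void main(String[] args) {
        StringThenCastSketch column = fromStrings(Arrays.asList("{\"a\":1}", "[1,2]"));
        // String::length is only a placeholder for a real cast such as JSON text to JSONB.
        List<Integer> lengths = column.castAll(String::length);
        System.out.println(lengths);
    }
}
```

The extra pass in step 2 is the cost the pending task aims to remove: every value is materialized as text and then visited a second time.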

Community involvement is crucial in this phase. We welcome contributions and suggestions on how to best approach this optimization for specialized data types. Your input will be invaluable in shaping the next iteration of our JDBC connector.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

Contributor

github-actions bot commented Nov 2, 2023

clang-tidy review says "All clean, LGTM! 👍"

@zy-kkk zy-kkk force-pushed the unified_jni_fra_for_jdbc_catalog branch from 321f2a8 to 195d61c Compare November 3, 2023 09:43
@zy-kkk zy-kkk marked this pull request as ready for review November 3, 2023 09:43
@zy-kkk
Member Author

zy-kkk commented Nov 3, 2023

run buildall

Contributor

github-actions bot commented Nov 3, 2023

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.29% (8451/22663)
Line Coverage: 29.71% (68432/230304)
Region Coverage: 28.35% (35411/124911)
Branch Coverage: 25.09% (18075/72028)
Coverage Report: http://coverage.selectdb-in.cc/coverage/195d61c36cb888a6c7b89ec6543f95bda525d342_195d61c36cb888a6c7b89ec6543f95bda525d342/report/index.html

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.48 seconds
stream load tsv: 574 seconds loaded 74807831229 Bytes, about 124 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162343926 Bytes
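
The throughput figures in this report can be reproduced from the byte counts and durations. The sketch below assumes, based only on the numbers shown here and not on the pipeline's source, that "MB/s" means bytes divided by 1024 × 1024 and that results are truncated rather than rounded. The class and method names are hypothetical.

```java
/** Reproduces the rates printed in the clickbench report under the assumptions stated above. */
final class LoadThroughput {
    /** Bytes per second expressed in units of 1024 * 1024 bytes, truncated. */
    static long mibPerSecond(long bytes, long seconds) {
        return bytes / seconds / (1024L * 1024L);
    }

    /** Rows per second expressed in thousands, truncated. */
    static long kiloRowsPerSecond(long rows, double seconds) {
        return (long) (rows / seconds / 1000.0);
    }

    public static void main(String[] args) {
        System.out.println("tsv:     " + mibPerSecond(74807831229L, 574L) + " MB/s");
        System.out.println("json:    " + mibPerSecond(2358488459L, 20L) + " MB/s");
        System.out.println("orc:     " + mibPerSecond(1101869774L, 65L) + " MB/s");
        System.out.println("parquet: " + mibPerSecond(861443392L, 33L) + " MB/s");
        System.out.println("insert:  " + kiloRowsPerSecond(10000000L, 28.9) + "K ops/s");
    }
}
```

Truncation matters for the parquet line: 861443392 bytes in 33 seconds is about 24.9 such units per second, which the report shows as 24.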

@zy-kkk zy-kkk force-pushed the unified_jni_fra_for_jdbc_catalog branch from 195d61c to 51069d7 Compare November 6, 2023 10:00
@zy-kkk
Member Author

zy-kkk commented Nov 6, 2023

run buildall

Contributor

@github-actions github-actions bot left a comment

clang-tidy made some suggestions

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.00% (8391/22677)
Line Coverage: 29.47% (67902/230417)
Region Coverage: 28.12% (35135/124944)
Branch Coverage: 24.92% (17952/72026)
Coverage Report: http://coverage.selectdb-in.cc/coverage/51069d76964512c4d29291020944f74247415593_51069d76964512c4d29291020944f74247415593/report/index.html

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.49 seconds
stream load tsv: 554 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162134165 Bytes

@zy-kkk zy-kkk force-pushed the unified_jni_fra_for_jdbc_catalog branch from 51069d7 to 129f3a1 Compare November 6, 2023 10:59
@zy-kkk
Member Author

zy-kkk commented Nov 6, 2023

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 37.01% (8393/22677)
Line Coverage: 29.48% (67927/230417)
Region Coverage: 28.13% (35152/124944)
Branch Coverage: 24.93% (17953/72026)
Coverage Report: http://coverage.selectdb-in.cc/coverage/129f3a1b2ee51345108ff4fb66a40c8642567ad7_129f3a1b2ee51345108ff4fb66a40c8642567ad7/report/index.html

AshinGau
AshinGau previously approved these changes Nov 6, 2023
@AshinGau
Member

AshinGau commented Nov 6, 2023

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 6, 2023
Contributor

github-actions bot commented Nov 6, 2023

PR approved by at least one committer and no changes requested.

Contributor

github-actions bot commented Nov 6, 2023

PR approved by anyone and no changes requested.

@AshinGau
Member

AshinGau commented Nov 6, 2023

Please finish the Proposed changes in detail.

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 46.56 seconds
stream load tsv: 557 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.1 seconds inserted 10000000 Rows, about 343K ops/s
storage size: 17162410900 Bytes

chenlinzhong
chenlinzhong previously approved these changes Nov 7, 2023
Contributor

@chenlinzhong chenlinzhong left a comment

LGTM

@zy-kkk zy-kkk dismissed stale reviews from chenlinzhong and AshinGau via 9fd3ecd November 7, 2023 03:26
@zy-kkk zy-kkk force-pushed the unified_jni_fra_for_jdbc_catalog branch from 129f3a1 to 9fd3ecd Compare November 7, 2023 03:26
@zy-kkk
Member Author

zy-kkk commented Nov 7, 2023

run buildall

@github-actions github-actions bot removed the approved Indicates a PR has been approved by one committer. label Nov 7, 2023
@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.99% (8390/22679)
Line Coverage: 29.47% (67920/230467)
Region Coverage: 28.12% (35140/124950)
Branch Coverage: 24.93% (17960/72044)
Coverage Report: http://coverage.selectdb-in.cc/coverage/9fd3ecdf24ca691d51e37b622374849aa95a14c5_9fd3ecdf24ca691d51e37b622374849aa95a14c5/report/index.html

@BePPPower
Contributor

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.75% (8401/22857)
Line Coverage: 29.29% (68122/232585)
Region Coverage: 27.91% (35222/126194)
Branch Coverage: 24.72% (18000/72810)
Coverage Report: http://coverage.selectdb-in.cc/coverage/39b249522f347ccdf57ad3808bc15f52dc4e3486_39b249522f347ccdf57ad3808bc15f52dc4e3486/report/index.html

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 45.06 seconds
stream load tsv: 553 seconds loaded 74807831229 Bytes, about 129 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.8 seconds inserted 10000000 Rows, about 347K ops/s
storage size: 17162400956 Bytes

@zy-kkk zy-kkk added the not-merge/2.0 do not merge into 2.0 branch label Nov 9, 2023
@zy-kkk zy-kkk force-pushed the unified_jni_fra_for_jdbc_catalog branch from 646a9cb to 52a9f0c Compare November 10, 2023 07:27
@zy-kkk
Member Author

zy-kkk commented Nov 10, 2023

run buildall

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.77% (8404/22856)
Line Coverage: 29.33% (68149/232358)
Region Coverage: 27.93% (35231/126121)
Branch Coverage: 24.77% (18027/72790)
Coverage Report: http://coverage.selectdb-in.cc/coverage/52a9f0c83ba59e09d724c2cb01afdc04f8cce7dc_52a9f0c83ba59e09d724c2cb01afdc04f8cce7dc/report/index.html

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.6 seconds
stream load tsv: 556 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 28.7 seconds inserted 10000000 Rows, about 348K ops/s
storage size: 17162459609 Bytes

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 52a9f0c83ba59e09d724c2cb01afdc04f8cce7dc, data reload: false

run tpch-sf100 query with default conf and session variables
q1	5256	5029	5006	5006
q2	388	209	204	204
q3	2075	2116	2071	2071
q4	1487	1484	1464	1464
q5	4122	4161	4116	4116
q6	255	134	136	134
q7	2113	1594	1599	1594
q8	2786	2774	2776	2774
q9	10369	10421	10254	10254
q10	3499	3563	3550	3550
q11	374	259	244	244
q12	465	297	295	295
q13	4525	4094	4076	4076
q14	328	283	300	283
q15	656	582	566	566
q16	701	616	595	595
q17	1162	1105	1100	1100
q18	7805	7515	7328	7328
q19	1717	1723	1688	1688
q20	596	359	347	347
q21	4970	4609	4626	4609
q22	539	427	458	427
Total cold run time: 56188 ms
Total hot run time: 52725 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4928	4980	4939	4939
q2	367	260	251	251
q3	4083	3951	3935	3935
q4	2777	2769	2762	2762
q5	6527	6488	6503	6488
q6	249	127	132	127
q7	3124	2755	2701	2701
q8	4779	4789	4811	4789
q9	17726	17748	17601	17601
q10	4084	4163	4122	4122
q11	717	653	664	653
q12	1034	819	819	819
q13	4291	3898	3885	3885
q14	379	370	351	351
q15	667	572	554	554
q16	775	675	662	662
q17	3869	3948	4006	3948
q18	9456	9064	9250	9064
q19	1952	1774	1787	1774
q20	2393	2063	2061	2061
q21	8850	8665	8788	8665
q22	954	866	839	839
Total cold run time: 83981 ms
Total hot run time: 80990 ms
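
The columns in the tables above are unlabeled. Judging from the numbers alone, they appear to be the query name, a cold run, two further runs, and the smaller of those two runs; this reading is an inference, not something the report states. The sketch below, with hypothetical class and method names, computes the two totals under that reading.

```java
import java.util.Arrays;
import java.util.List;

/** Derives the cold and hot totals from TPC-H result rows, under the assumed column meaning. */
final class TpchTotals {
    /** Sums the first timing column (assumed to be the cold run) over all query rows. */
    static long totalCold(List<String> rows) {
        long total = 0;
        for (String row : rows) {
            String[] fields = row.trim().split("\\s+");
            total += Long.parseLong(fields[1]);
        }
        return total;
    }

    /** Sums, per query, the smaller of the second and third timing columns (assumed hot runs). */
    static long totalHot(List<String> rows) {
        long total = 0;
        for (String row : rows) {
            String[] fields = row.trim().split("\\s+");
            total += Math.min(Long.parseLong(fields[2]), Long.parseLong(fields[3]));
        }
        return total;
    }

    public static void main(String[] args) {
        // Three rows copied from the first table of this report; the full table has 22.
        List<String> rows = Arrays.asList(
                "q1\t5256\t5029\t5006\t5006",
                "q2\t388\t209\t204\t204",
                "q6\t255\t134\t136\t134");
        System.out.println("cold: " + totalCold(rows) + " ms");
        System.out.println("hot:  " + totalHot(rows) + " ms");
    }
}
```

Applying the same rule to all 22 rows of the first table in this report gives 56188 ms and 52725 ms, which matches the totals it prints.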

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 10, 2023
Contributor

PR approved by at least one committer and no changes requested.

@zy-kkk
Member Author

zy-kkk commented Nov 10, 2023

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 52a9f0c83ba59e09d724c2cb01afdc04f8cce7dc, data reload: false

run tpch-sf100 query with default conf and session variables
q1	5291	5085	5099	5085
q2	371	184	198	184
q3	2086	2109	2062	2062
q4	1470	1427	1444	1427
q5	4145	4189	4126	4126
q6	258	133	139	133
q7	2088	1603	1629	1603
q8	2793	2765	2757	2757
q9	10719	10315	10326	10315
q10	3477	3561	3594	3561
q11	386	253	256	253
q12	467	295	303	295
q13	4554	4053	4110	4053
q14	319	285	287	285
q15	632	565	574	565
q16	702	615	600	600
q17	1151	1115	1105	1105
q18	7874	7353	7433	7353
q19	1725	1700	1712	1700
q20	593	381	354	354
q21	4931	4577	4581	4577
q22	534	410	446	410
Total cold run time: 56566 ms
Total hot run time: 52803 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	4970	5049	5069	5049
q2	324	256	246	246
q3	3987	4024	3999	3999
q4	2788	2758	2749	2749
q5	6568	6480	6503	6480
q6	249	131	131	131
q7	3175	2694	2761	2694
q8	4771	4800	4755	4755
q9	17823	17572	17786	17572
q10	4090	4148	4200	4148
q11	721	649	654	649
q12	1025	823	829	823
q13	4331	3924	3921	3921
q14	392	354	343	343
q15	632	568	559	559
q16	774	747	731	731
q17	3888	3869	3917	3869
q18	9331	9219	9214	9214
q19	1877	1793	1807	1793
q20	2381	2055	2040	2040
q21	8823	8825	8948	8825
q22	982	881	858	858
Total cold run time: 83902 ms
Total hot run time: 81448 ms

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.76% (8403/22856)
Line Coverage: 29.32% (68130/232358)
Region Coverage: 27.93% (35223/126121)
Branch Coverage: 24.75% (18018/72790)
Coverage Report: http://coverage.selectdb-in.cc/coverage/52a9f0c83ba59e09d724c2cb01afdc04f8cce7dc_52a9f0c83ba59e09d724c2cb01afdc04f8cce7dc/report/index.html

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.93 seconds
stream load tsv: 560 seconds loaded 74807831229 Bytes, about 127 MB/s
stream load json: 21 seconds loaded 2358488459 Bytes, about 107 MB/s
stream load orc: 66 seconds loaded 1101869774 Bytes, about 15 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 29.1 seconds inserted 10000000 Rows, about 343K ops/s
storage size: 17162787818 Bytes

@zy-kkk
Member Author

zy-kkk commented Nov 12, 2023

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 52a9f0c83ba59e09d724c2cb01afdc04f8cce7dc, data reload: false

run tpch-sf100 query with default conf and session variables
q1	5356	5086	5120	5086
q2	378	204	202	202
q3	2081	2083	2048	2048
q4	1479	1451	1423	1423
q5	4156	4173	4086	4086
q6	259	134	135	134
q7	2095	1598	1625	1598
q8	2754	2742	2759	2742
q9	10406	10305	10270	10270
q10	3484	3570	3581	3570
q11	381	259	271	259
q12	484	301	310	301
q13	4537	4121	4142	4121
q14	330	296	287	287
q15	658	568	562	562
q16	711	620	596	596
q17	1141	1093	1082	1082
q18	7722	7385	7417	7385
q19	1687	1714	1713	1713
q20	572	356	370	356
q21	4944	4565	4558	4558
q22	545	458	431	431
Total cold run time: 56160 ms
Total hot run time: 52810 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	5102	5103	4967	4967
q2	348	261	246	246
q3	4161	3955	3983	3955
q4	2813	2784	2746	2746
q5	6516	6493	6494	6493
q6	248	129	129	129
q7	3215	2681	2722	2681
q8	4837	4787	4795	4787
q9	17735	17785	17683	17683
q10	4055	4144	4134	4134
q11	766	649	632	632
q12	1009	855	800	800
q13	4302	3956	3906	3906
q14	384	347	361	347
q15	631	572	547	547
q16	774	705	707	705
q17	3936	3915	3953	3915
q18	9382	9249	9239	9239
q19	1802	1773	1762	1762
q20	2395	2056	2069	2056
q21	8999	8861	8834	8834
q22	946	836	873	836
Total cold run time: 84356 ms
Total hot run time: 81400 ms

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.48 seconds
stream load tsv: 551 seconds loaded 74807831229 Bytes, about 129 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 33 seconds loaded 861443392 Bytes, about 24 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162276403 Bytes

@zy-kkk zy-kkk force-pushed the unified_jni_fra_for_jdbc_catalog branch from 52a9f0c to 25c4924 Compare November 13, 2023 00:59
@zy-kkk
Member Author

zy-kkk commented Nov 13, 2023

run buildall

@doris-robot

TPC-H test result on machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
Tpch sf100 test result on commit 25c49242d9cab892615269e9bf26c7dbb3684cfa, data reload: false

run tpch-sf100 query with default conf and session variables
q1	5270	5120	5307	5120
q2	372	199	199	199
q3	2093	2081	2035	2035
q4	1472	1432	1419	1419
q5	4142	4176	4085	4085
q6	253	133	138	133
q7	2081	1592	1604	1592
q8	2772	2750	2769	2750
q9	10405	10446	10301	10301
q10	3514	3554	3563	3554
q11	370	255	253	253
q12	456	303	302	302
q13	4519	4129	4104	4104
q14	329	281	295	281
q15	626	580	590	580
q16	703	621	594	594
q17	1143	1081	1093	1081
q18	7834	7394	7484	7394
q19	1705	1718	1700	1700
q20	606	376	353	353
q21	4945	4587	4554	4554
q22	522	444	429	429
Total cold run time: 56132 ms
Total hot run time: 52813 ms

run tpch-sf100 query with default conf and set session variable runtime_filter_mode=off
q1	5070	5042	4978	4978
q2	360	232	237	232
q3	4055	3929	3975	3929
q4	2757	2752	2740	2740
q5	6502	6487	6469	6469
q6	246	130	129	129
q7	3115	2721	2749	2721
q8	4808	4789	4762	4762
q9	17752	17672	17721	17672
q10	4086	4172	4126	4126
q11	719	622	613	613
q12	1004	808	814	808
q13	4301	3902	3901	3901
q14	387	353	370	353
q15	646	588	584	584
q16	791	716	716	716
q17	3934	3856	3903	3856
q18	9468	9139	9338	9139
q19	1859	1779	1783	1779
q20	2388	2043	2018	2018
q21	8837	8972	8703	8703
q22	927	872	898	872
Total cold run time: 84012 ms
Total hot run time: 81100 ms

@doris-robot

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 44.71 seconds
stream load tsv: 555 seconds loaded 74807831229 Bytes, about 128 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 32 seconds loaded 861443392 Bytes, about 25 MB/s
insert into select: 29.0 seconds inserted 10000000 Rows, about 344K ops/s
storage size: 17162329434 Bytes

@doris-robot

TeamCity be ut coverage result:
Function Coverage: 36.77% (8404/22853)
Line Coverage: 29.34% (68205/232474)
Region Coverage: 27.92% (35221/126143)
Branch Coverage: 24.74% (18007/72776)
Coverage Report: http://coverage.selectdb-in.cc/coverage/25c49242d9cab892615269e9bf26c7dbb3684cfa_25c49242d9cab892615269e9bf26c7dbb3684cfa/report/index.html

Contributor

@LemonLiTree LemonLiTree left a comment

LGTM

@AshinGau AshinGau merged commit 2f32a72 into apache:master Nov 13, 2023
@zy-kkk zy-kkk deleted the unified_jni_fra_for_jdbc_catalog branch November 13, 2023 15:08
seawinde pushed a commit to seawinde/doris that referenced this pull request Nov 14, 2023
XuJianxu pushed a commit to XuJianxu/doris that referenced this pull request Dec 14, 2023
Labels
approved Indicates a PR has been approved by one committer. not-merge/2.0 do not merge into 2.0 branch reviewed
7 participants