[NoMergeLatest](exit) call stop before brpc server stop to stop queries and allow brpc exist gracefully #54781

Closed
yiguolei wants to merge 1 commit into apache:branch-3.1 from yiguolei:branch-3.1

Conversation

@yiguolei
Contributor

@yiguolei yiguolei commented Aug 14, 2025

What problem does this PR solve?

Thread 1 (Thread 0x7f2a03130040 (LWP 3064597) "doris_be"):
#0 futex_wait_cancelable (private=<optimized out>, expected=0, futex_word=0x612000433780) at ../sysdeps/nptl/futex-internal.h:183
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x612000433730, cond=0x612000433758) at pthread_cond_wait.c:508
#2 __pthread_cond_wait (cond=0x612000433758, mutex=0x612000433730) at pthread_cond_wait.c:647
#3 0x000055e4848e6a68 in brpc::Acceptor::Join() ()
#4 0x000055e4848d2cdd in brpc::Server::Join() ()
#5 0x000055e44aeec3d8 in doris::BRpcService::join (this=<optimized out>) at /root/doris/be/src/service/brpc_service.cpp:107
#6 0x000055e44aeec155 in doris::BRpcService::~BRpcService (this=0x612000433780) at /root/doris/be/src/service/brpc_service.cpp:59
#7 0x000055e446772f04 in std::default_delete<doris::BRpcService>::operator() (this=<optimized out>, __ptr=0x6020005e64d0) at /var/local/ldb-toolchain-018/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:85
#8 std::__uniq_ptr_impl<doris::BRpcService, std::default_delete<doris::BRpcService> >::reset (this=this@entry=0x7f2a0120f0c0, __p=0x80, __p@entry=0x0) at /var/local/ldb-toolchain-018/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:182
#9 0x000055e446747225 in std::unique_ptr<doris::BRpcService, std::default_delete<doris::BRpcService> >::reset (__p=0x0, this=<optimized out>) at /var/local/ldb-toolchain-018/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:456
#10 main (argc=<optimized out>, argv=<optimized out>) at /root/doris/be/src/service/doris_main.cpp:631
Detaching from program: /mnt/ssd01/pipline/OpenSourceDoris/clusterEnv/P0/Cluster0/be/lib/doris_be, process 3064597
[Inferior 1 (process 3064597) detached]

When the Doris BE is asked to exit gracefully, the main function blocks in the brpc server's join method, which waits for all brpc closures to finish. But even after that wait, the task schedulers and fragments have not been stopped, so some queries continue to run.

This PR stops the three core thread pools before the brpc server is stopped, which stops all queries.
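
The shutdown ordering described above can be illustrated with a minimal, self-contained sketch. It is not the Doris code: `FakeRpcServer`, `FakeWorkerPool`, and `shutdown_in_order` are hypothetical names invented for this example, and the "server" only counts in-flight requests. The point is that a join which waits for in-flight work can only return once the workers holding that work open have been told to stop.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for an RPC server: join() blocks until every
// in-flight request has been finished.
class FakeRpcServer {
public:
    void begin_request() {
        std::lock_guard<std::mutex> lock(mu_);
        ++in_flight_;
    }
    void finish_request() {
        std::lock_guard<std::mutex> lock(mu_);
        if (--in_flight_ == 0) cv_.notify_all();
    }
    void join() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return in_flight_ == 0; });
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    int in_flight_ = 0;
};

// Hypothetical stand-in for a query thread pool: every worker keeps one
// request open until stop() is called.
class FakeWorkerPool {
public:
    FakeWorkerPool(FakeRpcServer& server, int workers) : server_(server) {
        for (int i = 0; i < workers; ++i) {
            server_.begin_request();
            threads_.emplace_back([this] {
                while (!stopped_.load()) {
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                }
                server_.finish_request();
            });
        }
    }
    ~FakeWorkerPool() { stop(); }

    // Signals the workers to stop and waits for them; safe to call twice.
    void stop() {
        stopped_.store(true);
        for (auto& t : threads_) {
            if (t.joinable()) t.join();
        }
    }

private:
    FakeRpcServer& server_;
    std::atomic<bool> stopped_{false};
    std::vector<std::thread> threads_;
};

// Stops every pool first and only then joins the server.
// Returns the number of pools that were stopped.
int shutdown_in_order(FakeRpcServer& server, std::vector<FakeWorkerPool*>& pools) {
    int stopped = 0;
    for (FakeWorkerPool* pool : pools) {
        pool->stop();
        ++stopped;
    }
    server.join();  // returns because no request is still in flight
    return stopped;
}
```

In this sketch, calling `server.join()` before the pools are stopped would block indefinitely, which mirrors the hang shown in the stack trace.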

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@yiguolei yiguolei requested a review from morrySnow as a code owner August 14, 2025 09:30
@Thearas
Contributor

Thearas commented Aug 14, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@yiguolei
Contributor Author

run buildall

@doris-robot

TPC-H: Total hot run time: 32758 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit cd11136bbcf3c3e3cc4b63857516d70654cdd051, data reload: false

------ Round 1 ----------------------------------
q1	17657	5619	5449	5449
q2	2058	278	198	198
q3	10499	1299	738	738
q4	10231	878	457	457
q5	7888	2430	2167	2167
q6	185	171	136	136
q7	901	740	621	621
q8	9335	1435	1200	1200
q9	5381	4923	4898	4898
q10	6745	2261	1808	1808
q11	471	286	263	263
q12	332	352	205	205
q13	17770	3701	3016	3016
q14	232	224	215	215
q15	537	476	453	453
q16	422	428	366	366
q17	618	875	364	364
q18	7073	6567	6403	6403
q19	1370	978	566	566
q20	341	337	202	202
q21	2977	2214	2014	2014
q22	1045	1021	1019	1019
Total cold run time: 104068 ms
Total hot run time: 32758 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5542	5609	5550	5550
q2	244	336	231	231
q3	2226	2653	2315	2315
q4	1353	1789	1340	1340
q5	4424	4901	5123	4901
q6	165	161	128	128
q7	2090	1973	1806	1806
q8	2619	2774	2702	2702
q9	7207	7116	7164	7116
q10	3041	3347	2822	2822
q11	575	515	478	478
q12	659	803	600	600
q13	3401	3761	3263	3263
q14	301	301	267	267
q15	521	483	471	471
q16	451	489	428	428
q17	1212	1749	1284	1284
q18	7716	7557	7316	7316
q19	844	1182	1062	1062
q20	2004	2082	1861	1861
q21	5322	4981	4770	4770
q22	1119	1075	1018	1018
Total cold run time: 53036 ms
Total hot run time: 51729 ms

@doris-robot

TPC-DS: Total hot run time: 190126 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit cd11136bbcf3c3e3cc4b63857516d70654cdd051, data reload: false

query1	978	393	386	386
query2	6536	1938	1893	1893
query3	6707	212	226	212
query4	33902	23742	23385	23385
query5	4338	611	474	474
query6	283	199	197	197
query7	4645	505	322	322
query8	304	250	246	246
query9	9625	2660	2604	2604
query10	457	342	267	267
query11	18490	15476	15268	15268
query12	155	109	105	105
query13	1625	541	412	412
query14	9387	6688	6676	6676
query15	223	192	183	183
query16	7803	674	489	489
query17	1506	741	573	573
query18	1997	411	306	306
query19	207	187	163	163
query20	120	111	110	110
query21	205	127	108	108
query22	4324	4342	4130	4130
query23	34134	33825	33359	33359
query24	7687	2680	2630	2630
query25	539	479	427	427
query26	1240	287	184	184
query27	2526	458	339	339
query28	5598	2183	2122	2122
query29	847	584	454	454
query30	248	197	159	159
query31	1007	868	825	825
query32	95	65	67	65
query33	538	365	322	322
query34	739	831	524	524
query35	782	813	721	721
query36	1019	1071	966	966
query37	107	96	72	72
query38	3949	3895	3887	3887
query39	1479	1458	1438	1438
query40	218	121	110	110
query41	56	57	54	54
query42	125	104	104	104
query43	488	502	469	469
query44	1326	812	806	806
query45	190	180	171	171
query46	894	1049	674	674
query47	1871	1934	1856	1856
query48	419	443	373	373
query49	823	515	440	440
query50	671	686	429	429
query51	7231	7193	7205	7193
query52	110	104	95	95
query53	231	255	192	192
query54	578	569	484	484
query55	83	82	81	81
query56	295	279	280	279
query57	1239	1197	1166	1166
query58	283	227	226	226
query59	2940	3128	2916	2916
query60	303	305	276	276
query61	158	113	110	110
query62	796	717	688	688
query63	234	194	196	194
query64	4607	971	637	637
query65	3323	3224	3202	3202
query66	1099	428	314	314
query67	15745	15736	15528	15528
query68	7863	854	538	538
query69	485	305	273	273
query70	1243	1146	1117	1117
query71	499	296	268	268
query72	5686	3731	3810	3731
query73	648	747	362	362
query74	10223	9364	8831	8831
query75	3191	3152	2652	2652
query76	3241	1191	770	770
query77	474	383	285	285
query78	10352	10403	9629	9629
query79	2905	916	611	611
query80	670	535	437	437
query81	508	255	223	223
query82	653	126	89	89
query83	175	172	151	151
query84	245	93	86	86
query85	811	354	296	296
query86	393	315	294	294
query87	4263	4399	4262	4262
query88	4919	2406	2377	2377
query89	408	338	295	295
query90	1889	192	191	191
query91	139	156	111	111
query92	67	59	55	55
query93	1894	930	544	544
query94	699	386	299	299
query95	346	282	267	267
query96	491	609	285	285
query97	3205	3273	3164	3164
query98	230	208	209	208
query99	1534	1411	1355	1355
Total cold run time: 293036 ms
Total hot run time: 190126 ms

@doris-robot

ClickBench: Total hot run time: 28.51 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit cd11136bbcf3c3e3cc4b63857516d70654cdd051, data reload: false

query1	0.04	0.03	0.03
query2	0.07	0.03	0.03
query3	0.24	0.07	0.06
query4	1.62	0.11	0.10
query5	0.52	0.50	0.55
query6	1.14	0.74	0.72
query7	0.02	0.02	0.01
query8	0.04	0.04	0.03
query9	0.56	0.50	0.49
query10	0.56	0.55	0.57
query11	0.14	0.10	0.10
query12	0.14	0.11	0.11
query13	0.62	0.61	0.59
query14	0.79	0.80	0.79
query15	0.86	0.84	0.82
query16	0.38	0.37	0.37
query17	1.04	1.04	0.98
query18	0.24	0.23	0.22
query19	1.97	1.77	1.87
query20	0.02	0.01	0.01
query21	15.39	0.97	0.56
query22	0.74	0.69	0.54
query23	15.34	1.47	0.53
query24	3.47	1.58	0.96
query25	0.22	0.09	0.05
query26	0.22	0.14	0.12
query27	0.05	0.04	0.04
query28	13.82	0.98	0.44
query29	12.60	3.93	3.33
query30	0.25	0.09	0.07
query31	2.82	0.58	0.37
query32	3.23	0.54	0.45
query33	2.96	3.03	3.06
query34	16.68	5.22	4.46
query35	4.57	4.57	4.50
query36	0.68	0.50	0.47
query37	0.09	0.07	0.06
query38	0.04	0.04	0.03
query39	0.03	0.02	0.03
query40	0.17	0.13	0.13
query41	0.08	0.02	0.02
query42	0.04	0.02	0.02
query43	0.04	0.03	0.03
Total cold run time: 104.54 s
Total hot run time: 28.51 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 0.00% (0/6) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 45.46% (12713/27965)
Line Coverage 36.34% (113307/311787)
Region Coverage 33.98% (64854/190866)
Branch Coverage 31.00% (34012/109724)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (6/6) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 77.53% (21354/27543)
Line Coverage 71.79% (223443/311266)
Region Coverage 69.82% (133885/191765)
Branch Coverage 63.45% (69982/110288)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 100.00% (6/6) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 76.29% (21012/27543)
Line Coverage 69.68% (216876/311266)
Region Coverage 67.70% (129834/191765)
Branch Coverage 61.29% (67599/110288)

@yiguolei yiguolei changed the title from "[be](exit) call stop before brpc server stop to stop queries and allow brpc exist gracefully" to "[NoMergeLatest](exit) call stop before brpc server stop to stop queries and allow brpc exist gracefully" on Aug 15, 2025
@yiguolei
Contributor Author

run p0

PipelineTask is also held by the task queue (apache#49753), so it may be
the last object to be destructed. But a pipeline task holds other
objects, such as operators and shared state, so it should release their
memory manually.
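
A minimal sketch of the ownership problem described above, using hypothetical types (`SharedState`, `Task`, and `state_refs_after_close` are invented for illustration and are not the Doris classes). While a queue still holds a `shared_ptr` to the task, dropping the owner's reference does not run the task's destructor, so anything the task owns stays alive unless it is released explicitly.

```cpp
#include <cassert>
#include <deque>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical resource owned by a task.
struct SharedState {
    std::vector<char> buffer = std::vector<char>(4096);
};

// Hypothetical task that may outlive its owner because a queue also holds it.
class Task {
public:
    explicit Task(std::shared_ptr<SharedState> state) : state_(std::move(state)) {}

    // Releases owned resources explicitly instead of waiting for ~Task().
    void finalize() { state_.reset(); }

    bool holds_state() const { return state_ != nullptr; }

private:
    std::shared_ptr<SharedState> state_;
};

// Returns how many references to the shared state remain after the owner has
// dropped its task pointer while the queue still holds one.
long state_refs_after_close(bool finalize_first) {
    auto state = std::make_shared<SharedState>();
    std::deque<std::shared_ptr<Task>> queue;

    auto task = std::make_shared<Task>(state);
    queue.push_back(task);  // the queue keeps the task alive

    if (finalize_first) task->finalize();
    task.reset();  // the owner is done with the task

    return state.use_count();
}
```

Without the explicit `finalize()` call the state is still referenced by the queued task (two references, counting the local one); with it, only the local reference remains.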

F20250908 20:07:41.329619 39575 mem_tracker_limiter.cpp:112] mem tracker
label: Query#Id=ec8535b35ed34f54-afd752d5d1dd97c1, consumption: 16640,
peak consumption: 16640, mem tracker not equal to 0 when mem
tracker destruct, this usually means that memory tracking is inaccurate
and SCOPED_ATTACH_TASK and SCOPED_SWITCH_THREAD_MEM_TRACKER_LIMITER are
not used correctly. If the log is truncated, search for `Address
Sanitizer` in the be.INFO log to see more information.1. For query and
load, memory leaks may have occurred, it is expected that the query mem
tracker will be bound to the thread context using SCOPED_ATTACH_TASK and
SCOPED_SWITCH_THREAD_MEM_TRACKER_LIMITER before all memory alloc and
free. 2. If a memory alloc is recorded by this tracker, it is expected
that be recorded in this tracker when memory is freed. 3. Merge the
remaining memory tracking value by this tracker into Orphan, if you
observe that Orphan is not equal to 0 in the mem tracker web or log,
this indicates that there may be a memory leak. 4. If you need to
transfer memory tracking value between two trackers, can use
transfer_to..[Address Sanitizer]:
 memory not be freed:
[Address Sanitizer] buf not be freed, mem tracker label:
Query#Id=ec8535b35ed34f54-afd752d5d1dd97c1, consumption: 16640, peak
consumption: 16640, buf: 0x7d87c8761d00, size 4096, strack trace:
0# doris::Allocator<false, false, false, doris::DefaultMemoryAllocator,
false>::alloc(unsigned long, unsigned long)
1# void doris::vectorized::PODArrayBase<1ul, 4096ul,
doris::Allocator<false, false, false, doris::DefaultMemoryAllocator,
false>, 16ul, 15ul>::alloc<>(unsigned long)
2# void doris::vectorized::PODArray<signed char, 4096ul,
doris::Allocator<false, false, false, doris::DefaultMemoryAllocator,
false>, 16ul, 15ul>::push_back<long>(long&&)
3#
doris::vectorized::ColumnVector<(doris::PrimitiveType)3>::insert(doris::vectorized::Field
const&)
4# doris::vectorized::IDataType::create_column_const(unsigned long,
doris::vectorized::Field const&) const
5# doris::vectorized::VLiteral::init(doris::TExprNode const&)
6# doris::vectorized::VLiteral::VLiteral(doris::TExprNode const&, bool)
7#
std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<doris::vectorized::VLiteral,
std::allocator<void>, doris::TExprNode
const&>(doris::vectorized::VLiteral*&,
std::_Sp_alloc_shared_tag<std::allocator<void> >, doris::TExprNode
const&)
8# doris::vectorized::VExpr::create_expr(doris::TExprNode const&,
std::shared_ptr<doris::vectorized::VExpr>&)
9#
doris::vectorized::VExpr::create_tree_from_thrift(std::vector<doris::TExprNode,
std::allocator<doris::TExprNode> > const&, int*,
std::shared_ptr<doris::vectorized::VExpr>&,
std::shared_ptr<doris::vectorized::VExprContext>&)
10# doris::vectorized::VExpr::create_expr_tree(doris::TExpr const&,
std::shared_ptr<doris::vectorized::VExprContext>&)
11# doris::pipeline::OperatorXBase::init(doris::TPlanNode const&,
doris::RuntimeState*)
12#
doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::init(doris::TPlanNode
const&, doris::RuntimeState*)
13#
doris::pipeline::PipelineFragmentContext::_create_tree_helper(doris::ObjectPool*,
std::vector<doris::TPlanNode, std::allocator<doris::TPlanNode> > const&,
doris::TPipelineFragmentParams const&, doris::DescriptorTbl const&,
std::shared_ptr<doris::pipeline::OperatorXBase>, int*,
std::shared_ptr<doris::pipeline::OperatorXBase>*,
std::shared_ptr<doris::pipeline::Pipeline>&, int, bool)
14#
doris::pipeline::PipelineFragmentContext::_build_pipelines(doris::ObjectPool*,
doris::TPipelineFragmentParams const&, doris::DescriptorTbl const&,
std::shared_ptr<doris::pipeline::OperatorXBase>*,
std::shared_ptr<doris::pipeline::Pipeline>)
15#
doris::pipeline::PipelineFragmentContext::prepare(doris::TPipelineFragmentParams
const&, doris::ThreadPool*)
16#
doris::FragmentMgr::exec_plan_fragment(doris::TPipelineFragmentParams
const&, doris::QuerySource, std::function<void (doris::RuntimeState*,
doris::Status*)> const&, doris::TPipelineFragmentParamsList const&)
17#
doris::FragmentMgr::exec_plan_fragment(doris::TPipelineFragmentParams
const&, doris::QuerySource, doris::TPipelineFragmentParamsList const&)
18#
doris::PInternalService::_exec_plan_fragment_impl(std::__cxx11::basic_string<char,
std::char_traits<char>, std::allocator<char> > const&,
doris::PFragmentRequestVersion, bool, std::function<void
(doris::RuntimeState*, doris::Status*)> const&)
19#
doris::PInternalService::_exec_plan_fragment_in_pthread(google::protobuf::RpcController*,
doris::PExecPlanFragmentRequest const*, doris::PExecPlanFragmentResult*,
google::protobuf::Closure*)
20# doris::WorkThreadPool<false>::work_thread(int)
21# execute_native_thread_routine
22# asan_thread_start(void*)
23# ?
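
The invariant behind the fatal log above (every allocation recorded by a tracker must also be recorded as freed before the tracker is destroyed) can be sketched with a toy tracker. `ToyMemTracker` and `leaked_bytes` are hypothetical names for this example only; the real Doris tracker is more involved and is not reproduced here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>

// Hypothetical tracker that records allocations and frees for one query.
class ToyMemTracker {
public:
    explicit ToyMemTracker(std::string label) : label_(std::move(label)) {}

    void consume(std::size_t bytes) {
        consumption_ += static_cast<std::int64_t>(bytes);
        if (consumption_ > peak_) peak_ = consumption_;
    }
    void release(std::size_t bytes) {
        consumption_ -= static_cast<std::int64_t>(bytes);
    }

    const std::string& label() const { return label_; }
    std::int64_t consumption() const { return consumption_; }
    std::int64_t peak() const { return peak_; }

    // The condition a tracker would check when it is destroyed.
    bool balanced() const { return consumption_ == 0; }

private:
    std::string label_;
    std::int64_t consumption_ = 0;
    std::int64_t peak_ = 0;
};

// Records two 4096-byte buffers and frees either both or only one of them.
// Returns the bytes still attributed to the tracker at the end.
std::int64_t leaked_bytes(bool free_everything) {
    ToyMemTracker tracker("Query#toy");
    tracker.consume(4096);
    tracker.consume(4096);
    tracker.release(4096);
    if (free_everything) tracker.release(4096);
    return tracker.consumption();
}
```

A nonzero result corresponds to the "mem tracker not equal to 0 when mem tracker destruct" condition reported in the log.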

Labels

None yet

Projects

None yet

4 participants