[Enhancement] Optimize Tablet Report #54848
Conversation
@@ -34,6 +34,7 @@ struct TMasterInfo {
     11: optional list<string> disabled_disks
     12: optional list<string> decommissioned_disks
     13: optional bool encrypted;
+    14: optional bool stop_regular_tablet_report;
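For context, here is a minimal sketch of how the FE leader might propagate this flag through the heartbeat it sends to each BE. `TMasterInfo` and `stop_regular_tablet_report` come from the diff above; the setter name assumes the standard Thrift-generated Java bindings, and `HeartbeatSender`, `buildMasterInfo`, and `pullModeEnabled` are hypothetical names, not the PR's actual code:

```java
// Assumed import of the Thrift-generated class for the struct in the diff.
import com.starrocks.thrift.TMasterInfo;

public class HeartbeatSender {
    private final boolean pullModeEnabled;

    public HeartbeatSender(boolean pullModeEnabled) {
        this.pullModeEnabled = pullModeEnabled;
    }

    public TMasterInfo buildMasterInfo() {
        TMasterInfo info = new TMasterInfo();
        // Tell the BE to skip its periodic full tablet report; the leader
        // pulls tablets itself. BEs may still push emergency reports.
        info.setStop_regular_tablet_report(pullModeEnabled);
        return info;
    }
}
```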
Please add a comment explaining what this flag means, and note in which version it can be deprecated.
+1
[Java-Extensions Incremental Coverage Report] ✅ pass : 0 / 0 (0%)
[FE Incremental Coverage Report] ✅ pass : 71 / 84 (84.52%) file detail
[BE Incremental Coverage Report] ✅ pass : 28 / 30 (93.33%) file detail
Why I'm doing:
In StarRocks, the FE periodically diffs the tablets on each BE against the tablets recorded in metadata, then repairs any inconsistencies. In the current implementation, every BE pushes its full set of tablets to the FE Leader on a fixed schedule (every 1 minute by default); the Leader keeps these reports in a queue and processes one BE's tablets at a time on a single thread. In a large cluster, the FE usually cannot process reports as fast as the BEs produce them, so the queue can end up holding the tablets of every BE at once, wasting memory. This optimization switches to an active pull mode on the Leader, which bounds the reporting queue to roughly one BE's worth of tablets.
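To make the memory issue concrete, here is a hedged sketch of the pre-optimization push model: every BE pushes a full report into one shared queue drained by a single thread, so with N BEs reporting every minute and slow processing, up to N full reports can sit in memory at once. All names here are illustrative, not StarRocks code:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class PushModeQueue {
    record TabletReport(long backendId, List<Long> tabletIds) {}

    // Unbounded: nothing prevents it from holding every BE's tablets at once.
    private final BlockingQueue<TabletReport> queue = new LinkedBlockingQueue<>();

    // Called by the report RPC handler whenever any BE pushes (default: once a minute).
    public void onReport(TabletReport report) {
        queue.offer(report);
    }

    // Single consumer thread: processes one BE's report at a time.
    public void processLoop() throws InterruptedException {
        while (true) {
            TabletReport report = queue.take();
            // ... diff report.tabletIds() against FE metadata, repair mismatches
        }
    }
}
```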
What I'm doing:
After this optimization, a new TabletController daemon periodically pulls the full set of tablets from each Backend; a sketch of this loop follows below. The pull condition is
The BE still retains the ability to push tablet reports to the FE Leader, but only for emergencies, such as a corrupted disk whose replicas must be removed from FE metadata immediately.
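Here is a minimal sketch of the pull loop described above: a TabletController daemon on the FE Leader walks the Backends and pulls and processes one full tablet report at a time, so the in-memory backlog never exceeds roughly one BE's tablets. Apart from the name `TabletController`, which the PR introduces, all names are assumptions for illustration:

```java
import java.util.List;

public class TabletController extends Thread {
    interface BackendClient {
        List<Long> pullAllTablets(long backendId); // assumed RPC to one BE
    }

    private final List<Long> backendIds;
    private final BackendClient client;
    private final long intervalMs;

    public TabletController(List<Long> backendIds, BackendClient client, long intervalMs) {
        this.backendIds = backendIds;
        this.client = client;
        this.intervalMs = intervalMs;
    }

    @Override
    public void run() {
        while (!isInterrupted()) {
            for (long beId : backendIds) {
                // Pull and fully process one BE before touching the next,
                // bounding memory to a single BE's report.
                List<Long> tablets = client.pullAllTablets(beId);
                // ... diff tablets against FE metadata and repair mismatches
            }
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                return;
            }
        }
    }
}
```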
Test (a cluster with 5 million tablets)
After optimization: [screenshot: FE GC time]
Before optimization: [screenshot: FE GC time]
We can see that the GC time has become smoother.