Handle shard ownership lost when reading history tasks #7101

yycptt · 2025-01-17T00:47:30Z

What changed?

Wrap ExecutionManager.GetHistoryTasks in shard context to handle shardOwnershipLost errors.
This PR only fixes the issue we are currently seeing. A follow-up is needed to audit all usages of this method and of the other methods on the execution manager.

Why?

If a shard has no (API) traffic and its ownership has already been lost in the background (as can happen with some persistence implementations), task processing can get stuck forever: it keeps trying to load tasks and keeps hitting the same shard ownership lost error.

How did you test it?

Unit test

Potential risks

Documentation

Is hotfix candidate?

ychebotarev · 2025-01-17T00:55:29Z

service/history/shard/context_impl.go

+	ctx context.Context,
+	request *persistence.GetHistoryTasksRequest,
+) (*persistence.GetHistoryTasksResponse, error) {
+	if err := s.errorByState(); err != nil {


If I understand correctly, the main change is that you "wrap" s.executionManager.GetHistoryTasks with s.errorByState() and s.handleReadError()?

Yes. s.handleReadError is the key part: it checks for the shard ownership lost error and puts the shard context into an invalid state. Once the shard controller realizes the shard context is invalid, a new shard is created.
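To make the wrapping pattern discussed above concrete, here is a minimal, self-contained sketch. It is not the code from this PR: the types (`shardContext`, `executionManager`, `lostOwnershipManager`), the sentinel errors, and the `demo` helper are all hypothetical stand-ins, and the real `errorByState` and `handleReadError` in `service/history/shard/context_impl.go` likely do more than shown here. The sketch only illustrates the idea: check the shard state first, delegate the read, then let the read-error handler invalidate the shard when ownership is lost.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Hypothetical sentinel errors, standing in for the real persistence
// and shard errors.
var (
	errShardOwnershipLost = errors.New("shard ownership lost")
	errShardInvalid       = errors.New("shard context is invalid")
)

// Simplified stand-ins for the persistence request/response types.
type getHistoryTasksRequest struct{ BatchSize int }
type getHistoryTasksResponse struct{ Tasks []string }

// executionManager is a hypothetical, cut-down persistence interface.
type executionManager interface {
	GetHistoryTasks(ctx context.Context, req *getHistoryTasksRequest) (*getHistoryTasksResponse, error)
}

// shardContext is a toy shard context that tracks only validity.
type shardContext struct {
	mu      sync.Mutex
	invalid bool
	em      executionManager
}

// errorByState rejects requests once the shard context is invalid.
func (s *shardContext) errorByState() error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.invalid {
		return errShardInvalid
	}
	return nil
}

// handleReadError marks the shard context invalid when a read reports
// that ownership was lost, then returns the error unchanged.
func (s *shardContext) handleReadError(err error) error {
	if errors.Is(err, errShardOwnershipLost) {
		s.mu.Lock()
		s.invalid = true
		s.mu.Unlock()
	}
	return err
}

// GetHistoryTasks wraps the persistence call with the two checks.
func (s *shardContext) GetHistoryTasks(ctx context.Context, req *getHistoryTasksRequest) (*getHistoryTasksResponse, error) {
	if err := s.errorByState(); err != nil {
		return nil, err
	}
	resp, err := s.em.GetHistoryTasks(ctx, req)
	if err = s.handleReadError(err); err != nil {
		return nil, err
	}
	return resp, nil
}

// lostOwnershipManager always reports lost ownership and counts calls.
type lostOwnershipManager struct{ calls int }

func (m *lostOwnershipManager) GetHistoryTasks(context.Context, *getHistoryTasksRequest) (*getHistoryTasksResponse, error) {
	m.calls++
	return nil, errShardOwnershipLost
}

// demo reads twice and returns both errors and the persistence call count.
func demo() (first, second error, calls int) {
	em := &lostOwnershipManager{}
	s := &shardContext{em: em}
	req := &getHistoryTasksRequest{BatchSize: 10}
	_, first = s.GetHistoryTasks(context.Background(), req)
	_, second = s.GetHistoryTasks(context.Background(), req)
	return first, second, em.calls
}

func main() {
	first, second, calls := demo()
	fmt.Println(first, second, calls)
}
```

In this sketch the first read surfaces the ownership lost error and invalidates the shard context, so the second read fails fast in `errorByState` without touching persistence again. That fail-fast signal is what would let a controller notice the invalid context and replace it, instead of the task loader retrying the same read indefinitely.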

Handle shard ownership lost when reading history tasks

4132c15

yycptt requested a review from alfred-landrum January 17, 2025 00:47

yycptt requested a review from a team as a code owner January 17, 2025 00:47

alfred-landrum approved these changes Jan 17, 2025

pdoerner approved these changes Jan 17, 2025

ychebotarev approved these changes Jan 17, 2025

yiminc approved these changes Jan 17, 2025

yycptt enabled auto-merge (squash) January 17, 2025 01:00

yycptt merged commit 72fa968 into temporalio:main Jan 17, 2025
49 checks passed

yycptt deleted the sol-get-history-tasks branch January 17, 2025 01:11
