checkpoint coordinator: handle failure on saving zero checkpoint #13917

Merged: yumkam merged 4 commits into ydb-platform:main on Jan 29, 2025

yumkam
Collaborator

@yumkam yumkam commented Jan 28, 2025

Changelog entry

...

Changelog category

  • Bugfix
  • Not for changelog (changelog entry is not required)

Additional information

...

@yumkam yumkam changed the title checkoint coordinator: handle failure on saving zero checkpoint [WIP][RFC] checkoint coordinator: handle failure on saving zero checkpoint Jan 28, 2025
@github-actions github-actions bot added bugfix and removed bugfix labels Jan 28, 2025

github-actions bot commented Jan 28, 2025

2025-01-28 08:54:13 UTC Pre-commit check linux-x86_64-relwithdebinfo for 8fe70cd has started.
2025-01-28 08:54:24 UTC Artifacts will be uploaded here
2025-01-28 08:57:23 UTC ya make is running...
2025-01-28 09:36:59 UTC Check cancelled

github-actions bot commented Jan 28, 2025

2025-01-28 08:55:39 UTC Pre-commit check linux-x86_64-release-asan for 8fe70cd has started.
2025-01-28 08:55:51 UTC Artifacts will be uploaded here
2025-01-28 08:58:58 UTC ya make is running...
2025-01-28 09:36:57 UTC Check cancelled

@@ -372,11 +372,12 @@ void TCheckpointCoordinator::Handle(const TEvCheckpointCoordinator::TEvScheduleC
     CC_LOG_D("Got TEvScheduleCheckpointing");
     ScheduleNextCheckpoint();
     const auto checkpointsInFly = PendingCheckpoints.size() + PendingCommitCheckpoints.size();
-    if (checkpointsInFly >= Settings.GetMaxInflight() || InitingZeroCheckpoint) {
+    if (checkpointsInFly >= Settings.GetMaxInflight() || (InitingZeroCheckpoint && !FailedZeroCheckpoint)) {
Collaborator

Then the tests are written incorrectly: there is a test named ShouldAbortPreviousCheckpointsIfNodeStateCantBeSaved.
It contains MockRunGraph(), which apparently should not be there. And it is unclear when it is supposed to arrive.

Collaborator Author

It should arrive after the ZeroCheckpoint has been saved (and the test simulates only one checkpoint save, the first one, and that save is unsuccessful). It has no effect on anything, but it would be more correct to remove it.

Collaborator Author

In principle, asserts could also be added to Handle(EvRunGraph).

    Y_ABORT_UNLESS(InitingZeroCheckpoint);
    Y_ABORT_UNLESS(!FailedZeroCheckpoint);

Normally this should work (TEvZeroCheckpointDone->SetLoadFromCheckpointMode->Ping(SetLoadFrom...Cookie)->EvRunGraph), but something is messed up in the tests and they crash with these asserts :-|
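
A minimal sketch of the invariant proposed in this comment, using the standard `assert` macro as a stand-in for the YDB macros. `TZeroCheckpointFlags`, `RunGraphInvariantHolds` and `HandleRunGraph` are hypothetical names introduced for illustration; only the two flag names come from the PR, and the real handler's body is not shown in this conversation.

```cpp
#include <cassert>

// Hypothetical stand-in for the coordinator's two flags; not the YDB class.
struct TZeroCheckpointFlags {
    bool InitingZeroCheckpoint = false;
    bool FailedZeroCheckpoint = false;
};

// The condition the two proposed asserts check together:
// the zero checkpoint is being initialized and has not failed.
bool RunGraphInvariantHolds(const TZeroCheckpointFlags& flags) {
    return flags.InitingZeroCheckpoint && !flags.FailedZeroCheckpoint;
}

// Sketch of a handler that verifies the invariant on entry.
// Standard assert is compiled out when NDEBUG is defined, so the check
// only fires in builds where assertions are enabled.
void HandleRunGraph(const TZeroCheckpointFlags& flags) {
    assert(RunGraphInvariantHolds(flags));
    // ... the actual graph start would go here (not modeled) ...
}
```

A check that is compiled out in release builds, as in this sketch, catches broken test setups without turning a violated invariant into a production abort.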

Collaborator Author

Sorted out the tests and added an assert for non-release builds.

Collaborator

So now, if the ZeroCheckpoint fails, TEvZeroCheckpointDone is not sent and the graph is not started. Yet checkpoints will start flowing. Maybe it is better to fail the graph explicitly.

Collaborator Author

If I understood the TODO correctly, the graph runs right away. Previously, though, a zero checkpoint that failed to save broke checkpointing altogether, whereas now the first checkpoint that saves successfully simply plays the role of the zero checkpoint.
If the error is transient and a simple retry helps, all is well.
If the error is not transient and checkpoints are not being saved at all, this is no different from the same failure starting after the zero checkpoint: alerts on the failure rate are needed.
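
A minimal sketch of the guard change under discussion, assuming the body of the `if` simply skips starting a new checkpoint (the diff above does not show it). `TGuardState`, `SkipBefore`, `SkipAfter` and `DemoFailedZeroCheckpoint` are hypothetical names, not YDB code; only the flag names and the two conditions are taken from the diff.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified snapshot of the values the guard reads.
struct TGuardState {
    std::size_t CheckpointsInFly = 0;
    std::size_t MaxInflight = 1;
    bool InitingZeroCheckpoint = false;
    bool FailedZeroCheckpoint = false;
};

// Condition before the change: an unfinished zero checkpoint always
// blocks scheduling, even if its save has already failed.
bool SkipBefore(const TGuardState& s) {
    return s.CheckpointsInFly >= s.MaxInflight || s.InitingZeroCheckpoint;
}

// Condition after the change: a failed zero checkpoint no longer blocks
// scheduling, so a later checkpoint can be attempted.
bool SkipAfter(const TGuardState& s) {
    return s.CheckpointsInFly >= s.MaxInflight
        || (s.InitingZeroCheckpoint && !s.FailedZeroCheckpoint);
}

// State after a failed zero-checkpoint save: nothing in flight,
// initialization still pending, failure recorded.
void DemoFailedZeroCheckpoint() {
    TGuardState s;
    s.InitingZeroCheckpoint = true;
    s.FailedZeroCheckpoint = true;
    assert(SkipBefore(s));   // old guard: scheduling is skipped
    assert(!SkipAfter(s));   // new guard: scheduling proceeds
}
```

In this model the in-flight limit still applies in both versions; only the handling of a failed zero checkpoint differs.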

Collaborator

How tangled it all is :) If the graph starts right away, then InitingZeroCheckpoint is merely a flag indicating that TEvZeroCheckpointDone must be sent on the first successful checkpoint. And the condition checkpointsInFly >= Settings.GetMaxInflight() does not need any of these flags.

Collaborator Author

There is also a separate branch for restoring from a checkpoint; InitingZeroCheckpoint is set there as well.

github-actions bot commented Jan 28, 2025

2025-01-28 09:39:21 UTC Pre-commit check linux-x86_64-relwithdebinfo for 7aae092 has started.
2025-01-28 09:39:33 UTC Artifacts will be uploaded here
2025-01-28 09:42:32 UTC ya make is running...
🟢 2025-01-28 10:42:36 UTC Tests successful.

Test history | Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
25204 | 22667 | 0 | 0 | 2414 | 123

🟢 2025-01-28 10:44:50 UTC Build successful.
🟡 2025-01-28 10:45:05 UTC ydbd size 2.1 GiB changed* by +127.8 KiB, which is >= 100.0 KiB vs main: Warning

ydbd size dash | main: 44b7587 | merge: 7aae092 | diff | diff %
ydbd size | 2 224 137 040 Bytes | 2 224 267 872 Bytes | +127.8 KiB | +0.006%
ydbd stripped size | 470 246 384 Bytes | 470 252 080 Bytes | +5.6 KiB | +0.001%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Jan 28, 2025

2025-01-28 09:39:29 UTC Pre-commit check linux-x86_64-release-asan for 7aae092 has started.
2025-01-28 09:39:57 UTC Artifacts will be uploaded here
2025-01-28 09:42:59 UTC ya make is running...
🟡 2025-01-28 11:08:06 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
11088 | 11014 | 0 | 22 | 17 | 35

2025-01-28 11:09:08 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-01-28 11:27:27 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
106 (only retried tests) | 72 | 0 | 2 | 4 | 28

2025-01-28 11:27:36 UTC ya make is running... (failed tests rerun, try 3)
🟡 2025-01-28 11:39:19 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
58 (only retried tests) | 24 | 0 | 2 | 5 | 27

🟢 2025-01-28 11:39:26 UTC Build successful.
🟢 2025-01-28 11:39:55 UTC ydbd size 3.6 GiB changed* by +528 Bytes, which is < 100.0 KiB vs main: OK

ydbd size dash | main: 777ff9a | merge: 7aae092 | diff | diff %
ydbd size | 3 864 980 680 Bytes | 3 864 981 208 Bytes | +528 Bytes | +0.000%
ydbd stripped size | 1 351 619 888 Bytes | 1 351 620 336 Bytes | +448 Bytes | +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

... on simulating save failure.
Only one checkpoint save (the zeroth) is simulated. Because that save is
simulated as a failure, EvZeroCheckpointDone is not expected to be sent.
The expected call sequence after a successful save is
EvZeroCheckpointDone->EvForwardPing(SetLoadMode...Cookie)->EvRunGraph
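
The expectation in this commit message can be sketched as a tiny event recorder. This is an illustrative model only: `EventsAfterZeroCheckpointSave` is a hypothetical function, the event names are plain strings copied from the message, and the real test uses the project's own mocks, which are not shown here.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model: returns the events expected to follow the attempt
// to save the zero checkpoint, in order.
std::vector<std::string> EventsAfterZeroCheckpointSave(bool saveSucceeded) {
    std::vector<std::string> events;
    if (!saveSucceeded) {
        // A failed save starts nothing: the chain below never begins,
        // so no EvRunGraph should be expected by the test either.
        return events;
    }
    events.push_back("EvZeroCheckpointDone");
    events.push_back("EvForwardPing");
    events.push_back("EvRunGraph");
    return events;
}
```

Because EvRunGraph sits at the end of the chain, a test that simulates only a failed zero-checkpoint save has no reason to mock it.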

github-actions bot commented Jan 28, 2025

2025-01-28 12:29:41 UTC Pre-commit check linux-x86_64-relwithdebinfo for c0e5898 has started.
2025-01-28 12:29:55 UTC Artifacts will be uploaded here
2025-01-28 12:32:45 UTC ya make is running...
2025-01-28 12:37:20 UTC Check cancelled

github-actions bot commented Jan 28, 2025

2025-01-28 12:30:05 UTC Pre-commit check linux-x86_64-release-asan for c0e5898 has started.
2025-01-28 12:30:26 UTC Artifacts will be uploaded here
2025-01-28 12:33:19 UTC ya make is running...
2025-01-28 12:37:15 UTC Check cancelled

github-actions bot commented Jan 28, 2025

2025-01-28 12:38:56 UTC Pre-commit check linux-x86_64-release-asan for 55a3c8b has started.
2025-01-28 12:39:16 UTC Artifacts will be uploaded here
2025-01-28 12:42:08 UTC ya make is running...
🟡 2025-01-28 13:38:34 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
11089 | 11020 | 0 | 21 | 14 | 34

2025-01-28 13:39:48 UTC ya make is running... (failed tests rerun, try 2)
🟡 2025-01-28 13:51:50 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet Going to retry failed tests...

Test history | Ya make output | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
102 (only retried tests) | 65 | 0 | 3 | 7 | 27

2025-01-28 13:51:58 UTC ya make is running... (failed tests rerun, try 3)
🟡 2025-01-28 14:03:29 UTC Some tests failed, follow the links below. This fail is not in blocking policy yet

Test history | Ya make output | Test bloat | Test bloat | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
62 (only retried tests) | 27 | 0 | 2 | 7 | 26

🟢 2025-01-28 14:03:36 UTC Build successful.
🟢 2025-01-28 14:04:03 UTC ydbd size 3.6 GiB changed* by +528 Bytes, which is < 100.0 KiB vs main: OK

ydbd size dash | main: e756b37 | merge: 55a3c8b | diff | diff %
ydbd size | 3 864 974 472 Bytes | 3 864 975 000 Bytes | +528 Bytes | +0.000%
ydbd stripped size | 1 351 618 096 Bytes | 1 351 618 544 Bytes | +448 Bytes | +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

github-actions bot commented Jan 28, 2025

2025-01-28 12:41:10 UTC Pre-commit check linux-x86_64-relwithdebinfo for 55a3c8b has started.
2025-01-28 12:41:13 UTC Artifacts will be uploaded here
2025-01-28 12:44:14 UTC ya make is running...
🟢 2025-01-28 14:57:22 UTC Tests successful.

Test history | Ya make output | Test bloat

TESTS | PASSED | ERRORS | FAILED | SKIPPED | MUTED?
25205 | 22645 | 0 | 0 | 2432 | 128

🟢 2025-01-28 15:00:24 UTC Build successful.
🟢 2025-01-28 15:00:43 UTC ydbd size 2.1 GiB changed* by +384 Bytes, which is < 100.0 KiB vs main: OK

ydbd size dash | main: e756b37 | merge: 55a3c8b | diff | diff %
ydbd size | 2 224 264 464 Bytes | 2 224 264 848 Bytes | +384 Bytes | +0.000%
ydbd stripped size | 470 251 248 Bytes | 470 251 504 Bytes | +256 Bytes | +0.000%

*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation

@yumkam yumkam changed the title [WIP][RFC] checkoint coordinator: handle failure on saving zero checkpoint [RFC] checkoint coordinator: handle failure on saving zero checkpoint Jan 28, 2025
@github-actions github-actions bot added bugfix and removed bugfix labels Jan 28, 2025
@yumkam yumkam marked this pull request as ready for review January 28, 2025 16:31
@yumkam yumkam requested a review from a team as a code owner January 28, 2025 16:31
@yumkam yumkam requested a review from kardymonds January 28, 2025 16:31
@yumkam yumkam changed the title [RFC] checkoint coordinator: handle failure on saving zero checkpoint checkoint coordinator: handle failure on saving zero checkpoint Jan 29, 2025
@github-actions github-actions bot added bugfix and removed bugfix labels Jan 29, 2025
@yumkam yumkam merged commit 755e82b into ydb-platform:main Jan 29, 2025
14 checks passed
yumkam added a commit to yumkam/ydb that referenced this pull request Jan 29, 2025
2 participants