
Creation time is added to wal records #17233

Open · wants to merge 2 commits into base: main
Conversation

KosovGrigorii

@k8s-ci-robot

Hi @KosovGrigorii. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: Grigorii Kosov <kosov.gr@gmail.com>
Signed-off-by: KosovGrigorii <72564996+KosovGrigorii@users.noreply.github.com>
…o WAL records

The WAL test was changed due to the changes in the WAL record structure.
In TestRepairWriteTearLast I changed the number of expected entries from 40 to 29, because one WAL record now takes more space and therefore fewer records remain after a WAL file is truncated.
Signed-off-by: Grigorii Kosov <kosov.gr@gmail.com>
Signed-off-by: KosovGrigorii <72564996+KosovGrigorii@users.noreply.github.com>
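The diff itself is not shown in this conversation view. As a minimal sketch of the general idea, assuming a hypothetical CreatedAt field added to the WAL record and stamped from the local clock at write time (the names below are illustrative, not the PR's actual code):

```go
package walpb

import "time"

// Record is a simplified stand-in for the WAL record type. CreatedAt is the
// hypothetical field this PR adds: the wall-clock time on the local member
// when the record was written.
type Record struct {
	Type      int64
	Crc       uint32
	Data      []byte
	CreatedAt int64 // Unix nanoseconds, purely illustrative
}

// newRecord stamps the record with the local clock at write time. This is
// each member's own clock, so the values are not directly comparable across
// members (the clock-drift concern raised later in this thread).
func newRecord(typ int64, data []byte) Record {
	return Record{
		Type:      typ,
		Data:      data,
		CreatedAt: time.Now().UnixNano(),
	}
}
```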
@KosovGrigorii force-pushed the wal-reord-creation-time branch from 08bc5da to 3dc1e4f on January 12, 2024 at 10:42
@siyuanfoundation
Contributor

/cc @vivekpatani

@k8s-ci-robot

@siyuanfoundation: GitHub didn't allow me to request PR reviews from the following users: vivekpatani.

Note that only etcd-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @vivekpatani

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Denchick

Denchick commented Oct 11, 2024

/cc @wenjiaswe

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KosovGrigorii, x4m
Once this PR has been reviewed and has the lgtm label, please assign serathius for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Denchick

Denchick commented Oct 15, 2024

/cc @ahrtr

Hi!

We're currently working on the SPQR project, which is about PostgreSQL sharding. We store our metadata in etcd and already know how to restore shards to a specific point in time using WAL-G.

But there is no PITR support for etcd, so this PR is very important to us. Could you please approve the GitHub Workflows run?

@k8s-ci-robot k8s-ci-robot requested a review from ahrtr October 15, 2024 16:43
@jmhbnz
Member

jmhbnz commented Nov 19, 2024

/ok-to-test

@x4m

x4m commented Nov 27, 2024

Bump

@serathius
Member

serathius commented Dec 2, 2024

Please propose a design for how the time will be used and how you will reconcile clock drift in a multi-member scenario. Note that WAL writing time can be arbitrarily delayed, as entries are written by members independently. You can have a node down for maintenance for a week, and after it rejoins the cluster it writes its WAL entries with timestamps a week later than the other nodes.

That's why etcd doesn't save a timestamp anywhere. It never depends on absolute time, only on time periods (a 1s request timeout or a 10s TTL). Adding it back needs to be done with proper consideration for a distributed system like etcd.

@x4m

x4m commented Dec 3, 2024

@serathius thank you for your inquiry.

When restoring a database from a backup to a point in time we do not need to coordinate the nodes of a fault-tolerance group. We need just one node, restoring it to the specific point and leaving the rest of the history aside.
In a sharded cluster point-in-time recovery is trickier. E.g. in Greenplum we inject points of consistency into the WAL when no multishard transactions are in flight. In MongoDB we have global history marks, so we translate a given time into a history mark on any shard and then restore the cluster to that mark.
I do not know which approach will be taken here. But we have done it, and done it many times.

It never depends on absolute time

Indeed. A response time of almost a year is completely novel to us. The developers who were working on this have already moved to other teams. We will find someone, of course.

@ahrtr
Member

ahrtr commented Jan 8, 2025

  • As mentioned above by @serathius, we don't save any timestamp anywhere. We don't see any value to etcd itself in adding a timestamp to the WAL file.
  • etcd stores data in bbolt; the WAL data is just a protection mechanism against potential data loss in abnormal cases, i.e. when the VM or process crashes. I don't see how to do point-in-time recovery using the WAL files/data.

@x4m

x4m commented Jan 8, 2025

@ahrtr consider the symmetric point of view: etcd stores data in the WAL, and bbolt is just a cache for the case when no crashes happened.
To support this point of view I can name some invariants that I believe are upheld:

  1. The WAL is always flushed before bbolt is written.
  2. Each individual WAL record might be the last one seen during recovery and still leaves the database in a consistent state.

PITR is just recovery constrained to some point in time. That's how PITR works in Oracle, Postgres, MS-SQL, Cloudberry and many other databases.
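As a rough illustration of that claim (not the PR's or etcd's actual recovery code), here is a minimal PITR replay sketch in Go, assuming each decoded WAL record carries the hypothetical CreatedAt timestamp discussed in this PR:

```go
package pitr

import "time"

// Record is a minimal stand-in for a decoded WAL record that carries the
// hypothetical CreatedAt timestamp (Unix nanoseconds) this PR proposes.
type Record struct {
	Data      []byte
	CreatedAt int64
}

// ReplayToPointInTime applies records in WAL order and stops at the first
// record stamped after the recovery target.
func ReplayToPointInTime(records []Record, target time.Time, apply func(Record) error) error {
	cutoff := target.UnixNano()
	for _, rec := range records {
		if rec.CreatedAt > cutoff {
			break // first record written after the target time: stop replay here
		}
		if err := apply(rec); err != nil {
			return err
		}
	}
	return nil
}
```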

We don't see any value to etcd itself in adding a timestamp to the WAL file.

Do you mean "the etcd project doesn't see any value in PITR"?

@serathius
Member

PITR is just recovery constrained to some point in time. That's how PITR works in Oracle, Postgres, MS-SQL, Cloudberry and many other databases.

Those databases are all single-master systems where one node decides the time. etcd explicitly avoids trusting the clock of any member and doesn't use absolute time. I'm not saying you could not solve the problem, but it doesn't seem like something that could be easily integrated into etcd.

@x4m

x4m commented Jan 9, 2025

@serathius it is not a problem for PITR at all; just use any source of time.

Those databases are all single-master systems where one node decides the time.

No, that's not true. We have different timelines, we have different primaries in different shards, and of course we have different primaries at different moments in time.

In a Postgres HA installation every WAL record carries a time recorded according to the clock of the current primary. But you can easily promote a standby at any given moment, and from then on the history is recorded according to the clock of the new primary.

In MySQL things are more interesting, because you have a vector clock plus GTIDs: every node writes its own binlogs, and nodes slice the logs differently, so we record history with an overlap. I.e., when the primary switches over to a secondary, there is a tail of committed transactions on the old primary that are also available on the secondary; we deduplicate them using GTIDs and use the earliest timestamp from the different logs.

Leaderless replication is much easier with respect to PITR. It is not even sharded, so we do not have to deal with cross-shard consistency as we do in MongoDB, Greenplum and Cloudberry.

Please describe a specific problem, not the generic "etcd explicitly avoids trusting clocks", because for point-in-time recovery you need time anyway.

Time is obviously non-monotonic and you cannot trust it for event ordering within the database. But for matching events with the real world it is perfectly fine. You can trust input from your user like "yesterday my data was OK" or "I dropped important information 1 hour ago", no matter what time source you use to determine where to stop recovery.

@x4m

x4m commented Jan 9, 2025

To prevent developers from misusing the timestamps we can truncate them to, say, minutes. Anyone who tries to order events with such a timestamp will quickly discover that it's impossible. And for PITR it's fine if you can only specify the recovery minute rather than a millisecond.
FWIW that's what macOS does with nanoseconds when you invoke clock_gettime(RAW): internally macOS has 16ns measurements, but it truncates the timestamp to a round microsecond.
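A minimal sketch of that coarsening in Go (illustrative only, not anything in the PR):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now()
	// Keep only minute granularity: useless for ordering events,
	// but good enough to pick a recovery point.
	coarse := now.Truncate(time.Minute)
	fmt.Println(now.Format(time.RFC3339Nano))
	fmt.Println(coarse.Format(time.RFC3339Nano))
}
```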

@ahrtr
Member

ahrtr commented Jan 9, 2025

2. Each individual WAL record might be the last one seen during recovery and still leaves the database in a consistent state.

This might not be true. Only committed WAL records are guaranteed to be safe. For example, if you recover to a WAL record which wasn't committed, then you might end up committing a failed client write.

Do you mean "the etcd project doesn't see any value in PITR"?

No. I meant there is no value to etcd itself in adding a timestamp to the WAL file.

Backup & restore is important, and I did not say PITR isn't important. I was saying I do not see how to easily do point-in-time recovery using the WAL files/data.

Usually PITR depends on a base backup (a full backup, i.e. a snapshot) plus incremental logs (i.e. WAL logs). But in etcd, the V2 snapshot files (*.snap files) do not contain any real k/v data, and we are deprecating them. All data is stored in the bbolt db, but the relationship between bbolt and the WAL files isn't snapshot + incremental logs (usually a snapshot is static and immutable, whereas the bbolt db is always dynamic). It's hard to recover to a point (data record) which is already in the bbolt db.

It isn't clear how you will implement PITR. Are you going to periodically back up the bbolt db and WAL files, and then select the right backup that falls into the time range when doing PITR?

@x4m

x4m commented Jan 9, 2025

@ahrtr our current WIP state is described here. We take base backups and apply the WAL that was written after the snapshot. If we got something wrong, let's redesign, but so far we only hear reiterations of properties of time that are very well known and do not actually create a problem.
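A sketch of the workflow described above, under the assumption of periodic base backups plus archived WAL (the types and function below are hypothetical, not WAL-G's or etcd's API):

```go
package pitr

import "time"

// Backup is a hypothetical descriptor of one base backup (e.g. a copy of the
// bbolt db) together with the time it was taken.
type Backup struct {
	Path    string
	TakenAt time.Time
}

// PickBaseBackup returns the most recent backup taken at or before the
// recovery target; the WAL written after it would then be replayed up to the
// target (see the replay sketch above). The second return value is false if
// no backup predates the target.
func PickBaseBackup(backups []Backup, target time.Time) (Backup, bool) {
	var best Backup
	found := false
	for _, b := range backups {
		if b.TakenAt.After(target) {
			continue
		}
		if !found || b.TakenAt.After(best.TakenAt) {
			best = b
			found = true
		}
	}
	return best, found
}
```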

Development

Successfully merging this pull request may close these issues.

Point in time recovery in ETCD
8 participants