
Creation time is added to wal records #17233

Open · wants to merge 2 commits into base: main
Conversation

KosovGrigorii

@k8s-ci-robot

Hi @KosovGrigorii. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: Grigorii Kosov <kosov.gr@gmail.com>
Signed-off-by: KosovGrigorii <72564996+KosovGrigorii@users.noreply.github.com>
…o WAL records

The WAL test was changed due to the changes in the WAL record structure.
In TestRepairWriteTearLast I changed the number of expected entries from 40 to 29, because one WAL record now takes more space and therefore fewer records remain after a WAL file is truncated.
Signed-off-by: Grigorii Kosov <kosov.gr@gmail.com>
Signed-off-by: KosovGrigorii <72564996+KosovGrigorii@users.noreply.github.com>
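The diff itself is not shown in this conversation view. As a minimal sketch of the general idea, assuming a hypothetical CreatedAt field added to the WAL record and stamped from the local clock at write time (the names below are illustrative, not the PR's actual code):

```go
package walpb

import "time"

// Record is a simplified stand-in for the WAL record type. CreatedAt is the
// hypothetical field this PR adds: the wall-clock time on the local member
// when the record was written.
type Record struct {
	Type      int64
	Crc       uint32
	Data      []byte
	CreatedAt int64 // Unix nanoseconds, purely illustrative
}

// newRecord stamps the record with the local clock at write time. This is
// each member's own clock, so the values are not directly comparable across
// members (the clock-drift concern raised later in this thread).
func newRecord(typ int64, data []byte) Record {
	return Record{
		Type:      typ,
		Data:      data,
		CreatedAt: time.Now().UnixNano(),
	}
}
```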
@KosovGrigorii force-pushed the wal-reord-creation-time branch from 08bc5da to 3dc1e4f on January 12, 2024 at 10:42
@siyuanfoundation
Contributor

/cc @vivekpatani

@k8s-ci-robot

@siyuanfoundation: GitHub didn't allow me to request PR reviews from the following users: vivekpatani.

Note that only etcd-io members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc @vivekpatani

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Denchick

Denchick commented Oct 11, 2024

/cc @wenjiaswe

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: KosovGrigorii, x4m
Once this PR has been reviewed and has the lgtm label, please assign serathius for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Denchick

Denchick commented Oct 15, 2024

/cc @ahrtr

Hi!

We're currently working on the SPQR project, which is about PostgreSQL sharding. We store our metadata in etcd and already know how to restore shards to a specific point in time using WAL-G.

But there is no PITR support for etcd, so this PR is very important to us. Could you please approve the GitHub Workflows run?

@k8s-ci-robot k8s-ci-robot requested a review from ahrtr October 15, 2024 16:43
@jmhbnz
Member

jmhbnz commented Nov 19, 2024

/ok-to-test

@x4m

x4m commented Nov 27, 2024

Bump

@serathius
Member

serathius commented Dec 2, 2024

Please propose a design for how the time will be used and how you will reconcile clock drift in a multi-member scenario. Note that WAL writing time can be arbitrarily delayed, as entries are written by members independently. You can have a node down for maintenance for a week, and after it rejoins the cluster it writes its WAL entries with timestamps a week later than the other nodes.

That's why etcd doesn't save a timestamp anywhere. It never depends on absolute time, only on time periods (a 1s request timeout or a 10s TTL). Adding it back needs to be done with proper consideration for a distributed system like etcd.

@x4m

x4m commented Dec 3, 2024

@serathius thank you for your inquiry.

When restoring a database from a backup to a point in time we do not need to coordinate the nodes of a fault-tolerance group. We need just one node, restoring it to the specific point and leaving the rest of the history aside.
In a sharded cluster point-in-time recovery is trickier. E.g. in Greenplum we inject points of consistency into the WAL when no multishard transactions are in flight. In MongoDB we have global history marks, so we translate a given time into a history mark on any shard and then restore the cluster to that mark.
I do not know which approach will be taken here. But we have done it, and done it many times.

It never depends on absolute time

Indeed. A response time of almost a year is completely novel to us. The developers who were working on this have already moved to other teams. We will find someone, of course.

@ahrtr
Member

ahrtr commented Jan 8, 2025

  • As mentioned above by @serathius, we don't save any timestamp anywhere. We don't see any value to etcd itself in adding a timestamp to the WAL file.
  • etcd stores data in bbolt; the WAL data is just a protection mechanism against potential data loss in abnormal cases, i.e. when the VM or process crashes. I don't see how to do point-in-time recovery using the WAL files/data.

@x4m

x4m commented Jan 8, 2025

@ahrtr consider the symmetric point of view: etcd stores data in the WAL, and bbolt is just a cache for the case when no crashes happened.
To support this point of view I can name some invariants that I believe are upheld:

  1. The WAL is always flushed before bbolt is written.
  2. Each individual WAL record might be the last one seen during recovery and still leaves the database in a consistent state.

PITR is just recovery constrained to some point in time. That's how PITR works in Oracle, Postgres, MS-SQL, Cloudberry and many other databases.
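As a rough illustration of that claim (not the PR's or etcd's actual recovery code), here is a minimal PITR replay sketch in Go, assuming each decoded WAL record carries the hypothetical CreatedAt timestamp discussed in this PR:

```go
package pitr

import "time"

// Record is a minimal stand-in for a decoded WAL record that carries the
// hypothetical CreatedAt timestamp (Unix nanoseconds) this PR proposes.
type Record struct {
	Data      []byte
	CreatedAt int64
}

// ReplayToPointInTime applies records in WAL order and stops at the first
// record stamped after the recovery target.
func ReplayToPointInTime(records []Record, target time.Time, apply func(Record) error) error {
	cutoff := target.UnixNano()
	for _, rec := range records {
		if rec.CreatedAt > cutoff {
			break // first record written after the target time: stop replay here
		}
		if err := apply(rec); err != nil {
			return err
		}
	}
	return nil
}
```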

We don't see any value to etcd itself in adding a timestamp to the WAL file.

Do you mean "the etcd project doesn't see any value in PITR"?

@serathius
Member

PITR is just recovery constrained to some point in time. That's how PITR works in Oracle, Postgres, MS-SQL, Cloudberry and many other databases.

Those databases are all single-master systems where one node decides the time. etcd explicitly avoids trusting the clock of any member and doesn't use absolute time. I'm not saying you could not solve the problem, but it doesn't seem like something that could be easily integrated into etcd.

@x4m

x4m commented Jan 9, 2025

@serathius it is not a problem for PITR at all; just use any source of time.

Those databases are all single-master systems where one node decides the time.

No, that's not true. We have different timelines, we have different primaries in different shards, and of course we have different primaries at different moments in time.

In a Postgres HA installation every WAL record carries a time recorded according to the clock of the current primary. But you can easily promote a standby at any given moment, and from then on the history is recorded according to the clock of the new primary.

In MySQL things are more interesting, because you have a vector clock plus GTIDs: every node writes its own binlogs, and nodes slice the logs differently, so we record history with an overlap. I.e., when the primary switches over to a secondary, there is a tail of committed transactions on the old primary that are also available on the secondary; we deduplicate them using GTIDs and use the earliest timestamp from the different logs.

Leaderless replication is much easier with respect to PITR. It is not even sharded, so we do not have to deal with cross-shard consistency as we do in MongoDB, Greenplum and Cloudberry.

Please describe a specific problem, not the generic "etcd explicitly avoids trusting clocks", because for point-in-time recovery you need time anyway.

Time is obviously non-monotonic and you cannot trust it for event ordering within the database. But for matching events with the real world it is perfectly fine. You can trust input from your user like "yesterday my data was OK" or "I dropped important information 1 hour ago", no matter what time source you use to determine where to stop recovery.

@x4m

x4m commented Jan 9, 2025

To prevent developers from misusing the timestamps we can truncate them to, say, minutes. Anyone who tries to order events with such a timestamp will quickly discover that it's impossible. And for PITR it's fine if you can only specify the recovery minute rather than a millisecond.
FWIW that's what macOS does with nanoseconds when you invoke clock_gettime(RAW): internally macOS has 16ns measurements, but it truncates the timestamp to a round microsecond.
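A minimal sketch of that coarsening in Go (illustrative only, not anything in the PR):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	now := time.Now()
	// Keep only minute granularity: useless for ordering events,
	// but good enough to pick a recovery point.
	coarse := now.Truncate(time.Minute)
	fmt.Println(now.Format(time.RFC3339Nano))
	fmt.Println(coarse.Format(time.RFC3339Nano))
}
```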

@ahrtr
Member

ahrtr commented Jan 9, 2025

2. Each individual WAL record might be the last one seen during recovery and still leaves the database in a consistent state.

This might not be true. Only committed WAL records are guaranteed to be safe. For example, if you recover to a WAL record which wasn't committed, then you might end up committing a failed client write.

Do you mean "the etcd project doesn't see any value in PITR"?

No. I meant there is no value to etcd itself in adding a timestamp to the WAL file.

Backup & restore is important, and I did not say PITR isn't important. I was saying I do not see how to easily do point-in-time recovery using the WAL files/data.

Usually PITR depends on a base backup (a full backup, i.e. a snapshot) plus incremental logs (i.e. WAL logs). But in etcd, the V2 snapshot files (*.snap files) do not contain any real k/v data, and we are deprecating them. All data is stored in the bbolt db, but the relationship between bbolt and the WAL files isn't snapshot + incremental logs (usually a snapshot is static and immutable, whereas the bbolt db is always dynamic). It's hard to recover to a point (data record) which is already in the bbolt db.

It isn't clear how you will implement PITR. Are you going to periodically back up the bbolt db and WAL files, and then select the right backup that falls into the time range when doing PITR?

@x4m

x4m commented Jan 9, 2025

@ahrtr our current WIP state is described here. We take base backups and apply the WAL that was written after the snapshot. If we got something wrong, let's redesign, but so far we only hear reiterations of properties of time that are very well known and do not actually create a problem.
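A sketch of the workflow described above, under the assumption of periodic base backups plus archived WAL (the types and function below are hypothetical, not WAL-G's or etcd's API):

```go
package pitr

import "time"

// Backup is a hypothetical descriptor of one base backup (e.g. a copy of the
// bbolt db) together with the time it was taken.
type Backup struct {
	Path    string
	TakenAt time.Time
}

// PickBaseBackup returns the most recent backup taken at or before the
// recovery target; the WAL written after it would then be replayed up to the
// target (see the replay sketch above). The second return value is false if
// no backup predates the target.
func PickBaseBackup(backups []Backup, target time.Time) (Backup, bool) {
	var best Backup
	found := false
	for _, b := range backups {
		if b.TakenAt.After(target) {
			continue
		}
		if !found || b.TakenAt.After(best.TakenAt) {
			best = b
			found = true
		}
	}
	return best, found
}
```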

Development

Successfully merging this pull request may close these issues.

Point in time recovery in ETCD
8 participants