core/rawdb: freezer index repair #29792

Merged: 6 commits merged into ethereum:master, Oct 1, 2024

Conversation

@rjl493456442 (Member) commented May 16, 2024

This pull request removes the fsync of index files in the freezer.ModifyAncients function for a
performance gain.

Originally, an fsync was issued after each freezer write operation to ensure the written data actually
reached the disk. Unfortunately, it turns out fsync can be relatively slow, especially on
macOS; see #28754 for more information.

In this pull request, the fsync for the index file is removed, since the index file can be recovered
even after an unclean shutdown. The fsync for the data file is kept, as we have no meaningful
way to validate data correctness after an unclean shutdown.


But why do we need the fsync in the first place?

It's necessary for the freezer to survive/recover after a machine crash (e.g. a power failure).
On Linux, when a file write is performed, the file metadata update and the data update are
not necessarily persisted at the same time. Typically, the metadata is flushed/journalled
ahead of the file data. Therefore, we make the pessimistic assumption that the file is first
extended with invalid "garbage" data (normally zero bytes) and that the correct data only
replaces the garbage afterwards.

We have observed that the freezer's index file often contains garbage entries with zero values
(filenumber = 0, offset = 0) after a machine power failure. This shows that the index file was extended
without the data being flushed, and such corruption can eventually destroy the whole freezer.
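
To make the failure mode concrete, here is a small sketch of what such a garbage entry looks like, assuming a 6-byte index-entry encoding (a 2-byte file number followed by a 4-byte offset, big-endian). The PR text itself does not spell out the encoding, so treat the layout and names below as illustrative rather than the exact ones in core/rawdb:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// indexEntry is an illustrative decoding of one 6-byte index item:
// a 2-byte file number followed by a 4-byte offset, both big-endian.
type indexEntry struct {
	filenum uint32 // number of the data file holding the item
	offset  uint32 // end offset of the item within that data file
}

func (e *indexEntry) unmarshalBinary(b []byte) {
	e.filenum = uint32(binary.BigEndian.Uint16(b[:2]))
	e.offset = binary.BigEndian.Uint32(b[2:6])
}

// looksLikeGarbage reports whether an entry is all-zero, which is what the
// OS leaves behind when the file is extended before the data reaches disk.
// (As discussed later in this PR, an all-zero entry can also be legitimate
// if the stored items are zero-sized, so this check alone is ambiguous.)
func (e *indexEntry) looksLikeGarbage() bool {
	return e.filenum == 0 && e.offset == 0
}

func main() {
	// Six zero bytes: exactly what a power failure can leave at the tail.
	var entry indexEntry
	entry.unmarshalBinary(bytes.Repeat([]byte{0}, 6))
	fmt.Println("garbage entry:", entry.looksLikeGarbage()) // true
}
```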

Performing an fsync after each write operation shrinks the window during which data has not yet
reached the disk and ensures, to the greatest extent possible, that the on-disk data is correct.


How can we maintain this guarantee without relying on fsync?

Because the items in the index file are strictly ordered, we can leverage this property to
detect corrupted entries and truncate them when the freezer is opened. Specifically, the following
validation rules are applied to each index file:

For two consecutive index items:

  • If their file numbers are the same, the offset of the latter one MUST NOT be less than that of the former.
  • If the file number of the latter one equals that of the former plus one, the offset of the latter one MUST NOT be 0.
  • If their file numbers are not equal and the latter's file number is not the former's plus one, the latter item is invalid.

Additionally, the first non-head item must refer to the earliest data file, or to the next file if the
earliest file is not large enough to hold the first item (a very special case, only theoretically
possible in tests).

With these validation rules, we can detect invalid items in the index file with high probability.
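
As a concrete illustration, here is a standalone sketch of the pairwise check described above (the review snippet further down calls it t.checkIndexItems, implemented as a method on freezerTable); the names and error messages here are illustrative:

```go
package main

import "fmt"

// indexEntry is the decoded form of one index item (illustrative names).
type indexEntry struct {
	filenum uint32 // data file number
	offset  uint32 // end offset within that data file
}

// checkIndexItems applies the pairwise rules above to two consecutive
// entries a (former) and b (latter). A non-nil error marks b as the first
// corrupted item, and the index file gets truncated just before it.
func checkIndexItems(a, b indexEntry) error {
	// File numbers may only stay the same or advance by exactly one.
	if b.filenum != a.filenum && b.filenum != a.filenum+1 {
		return fmt.Errorf("inconsistent file number, prev: %d, next: %d", a.filenum, b.filenum)
	}
	// Within the same data file, offsets must be non-decreasing.
	if b.filenum == a.filenum && b.offset < a.offset {
		return fmt.Errorf("unordered offset, prev: %d, next: %d", a.offset, b.offset)
	}
	// When moving to the next data file, the offset must not be zero:
	// a valid item recorded against a new file ends at a non-zero offset,
	// whereas zero-filled garbage does not.
	if b.filenum == a.filenum+1 && b.offset == 0 {
		return fmt.Errorf("zero offset in new data file %d", b.filenum)
	}
	return nil
}

func main() {
	good := checkIndexItems(indexEntry{2, 100}, indexEntry{2, 160})
	bad := checkIndexItems(indexEntry{2, 100}, indexEntry{0, 0}) // zero-filled garbage
	fmt.Println(good, bad)
}
```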


Unfortunately, the following scenarios are not covered and could still lead to freezer corruption:

All items in the index file are zero

It's impossible to distinguish whether they are genuinely zero (e.g. all the data entries maintained
in the freezer are zero-sized) or just garbage left behind by the OS. In this case, these index items
are kept and the entire data file is truncated accordingly; in other words, the freezer is corrupted.

However, the probability of this situation occurring is quite low, and even if it does occur, the
freezer is close to an empty state anyway. Re-running the state sync should be acceptable.

Index file is intact while the corresponding data file is corrupted

It might be possible that the data file is corrupted: its size was extended correctly but filled with
garbage (e.g. zero bytes). In this case, it's impossible to detect the corruption through index validation.

We can either fsync the data file, or simply trust that if the index file is intact, the data file is
very likely intact as well. In this pull request, the first option is taken.

@rjl493456442 rjl493456442 force-pushed the freezer-index-validation branch 4 times, most recently from a6ed90b to 1b10ff8 Compare May 20, 2024 07:52
@rjl493456442 rjl493456442 force-pushed the freezer-index-validation branch 3 times, most recently from 9be55a0 to 660e460 Compare May 21, 2024 05:48
@rjl493456442 rjl493456442 marked this pull request as ready for review May 21, 2024 06:43
@rjl493456442 rjl493456442 requested a review from karalabe as a code owner May 21, 2024 06:43
	}
	// ensure two consecutive index items are in order
	if err := t.checkIndexItems(prev, entry); err != nil {
		return truncate(offset)
@fjl (Contributor) commented May 21, 2024

I think we should log the error here. Maybe pass the error into truncate to log it as part of the warning it prints.
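
A minimal sketch of that suggestion, with the consistency-check error passed into the truncation helper so it ends up in the warning; the helper name and signature here are illustrative and not taken from the actual diff:

```go
package rawdbrepair

import (
	"os"

	"github.com/ethereum/go-ethereum/log"
)

// truncateIndex shortens the index file to the given offset and logs the
// consistency-check error that triggered the truncation, so the cause of
// the repair shows up alongside the warning.
func truncateIndex(index *os.File, offset int64, reason error) error {
	if err := index.Truncate(offset); err != nil {
		return err
	}
	log.Warn("Truncated index file", "offset", offset, "reason", reason)
	return nil
}
```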

@rjl493456442 (Member, Author) commented:

Performance-wise, it takes ~0.5 seconds to verify ~20M index items, so the checks introduce a ~2.5s delay when launching geth, but I think that's acceptable.

INFO [05-22|06:35:09.596] Verified index file                      items=19,803,925 elapsed=529.361ms
INFO [05-22|06:35:10.173] Verified index file                      items=19,803,925 elapsed=576.231ms
INFO [05-22|06:35:10.653] Verified index file                      items=19,803,925 elapsed=480.599ms
INFO [05-22|06:35:11.172] Verified index file                      items=19,803,925 elapsed=516.957ms
INFO [05-22|06:35:11.621] Verified index file                      items=19,803,925 elapsed=447.692ms

@rjl493456442 (Member, Author) commented:

After benchmarking the pull request for a while (full sync for more than 24 hours), this branch is slightly faster than master. Specifically:

The average time for state history construction is reduced from 2.32ms to 1.57ms by getting rid of the index file fsync.

[screenshot 2024-05-24 09:58:01: state history construction timing]

This performance difference is negligible on the benchmark machine with its powerful SSD. However, for normal users it can make a big difference, as fsync on an ordinary SSD does take time. For instance, the data file fsync takes ~18ms per block, so removing the index file fsync saves a comparable ~18ms per block.

[screenshot 2024-05-24 09:59:48: data file fsync timing]

@rjl493456442 rjl493456442 force-pushed the freezer-index-validation branch from 63ffdbb to 8381d61 Compare May 28, 2024 12:59
	}
	log.Warn("Truncated index file", "offset", offset, "truncated", size-offset)
	return nil
}
A contributor commented:

These functions could just be methods on freezerTable

@@ -194,15 +194,10 @@ func (batch *freezerTableBatch) commit() error {
dataSize := int64(len(batch.dataBuffer))
A contributor commented:

Could we remove the sync call above as well, if we extend/modify the repair?

@rjl493456442 (Member, Author) replied:

Unfortunately, we have no mechanism to detect whether the data file (batch.t.head) contains garbage after a power failure. That's the reason I left the sync operation for the data file in place.

However, I think we could do a background file sync for the data file at a fixed time interval. For example, MongoDB uses this strategy, flushing its journal every 100 milliseconds:

MongoDB also uses a write-ahead logging via a separate journal. Each write operation is appended to this journal, which is then flushed to disk every 100 milliseconds, taking aspects of the group commit and async commit approaches. If you cannot tolerate losing up to 100 milliseconds of data, you can optionally force an early flush of the journal when issuing a write, increasing the commit latency.
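
A minimal sketch of that idea (not something this PR implements): a background goroutine that fsyncs the data file on a fixed interval, bounding potential loss to roughly one interval of writes. The interval and names below are illustrative.

```go
package main

import (
	"os"
	"time"
)

// backgroundSync periodically fsyncs the given file until stop is closed,
// so that at most roughly one interval's worth of recent writes can be
// lost on a power failure.
func backgroundSync(f *os.File, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			if err := f.Sync(); err != nil {
				return // a real implementation would surface this to the caller
			}
		case <-stop:
			f.Sync() // final flush on shutdown
			return
		}
	}
}

func main() {
	f, err := os.CreateTemp("", "freezer-data-*.rdat")
	if err != nil {
		panic(err)
	}
	defer os.Remove(f.Name())

	stop := make(chan struct{})
	go backgroundSync(f, 100*time.Millisecond, stop)

	f.WriteString("some freezer data") // writes are periodically flushed
	time.Sleep(250 * time.Millisecond)
	close(stop)
}
```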

@fjl (Contributor) commented Aug 8, 2024

Results from triage discussion: we need to optimize the validation process to remove the startup delay. There were some ideas for optimization:

  • Validating in reverse is not possible: a small amount of random data can throw off the validation.
  • We could parallelize the validation, but it would only really help on SSDs, and the freezer is supposed to work on spinning disks as well.
  • @holiman voiced the idea of only checking the last 1M items. There are some safety concerns about this.
  • It would be safe to validate only the offsets pointing into the last data file, because we always flush all files when moving to a new data file. However, we would need a fast algorithm for locating the first index entry of the last data file so that most of the index can be skipped (see the sketch after this list).
  • We could store the last known good offset somewhere (in the KV database) and only verify from that offset onward. Note that this offset would need to move back when the freezer is truncated.
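
A minimal sketch of the prerequisite for the "only validate the last data file" idea: since file numbers are non-decreasing across the index, a binary search over the fixed-size entries can find the first entry of the last data file without scanning the whole index. The names are illustrative and this is not code from the PR; a real implementation would read individual 6-byte entries from disk inside the search predicate rather than hold them all in memory.

```go
package main

import (
	"fmt"
	"sort"
)

// indexEntry is the decoded index item (illustrative).
type indexEntry struct {
	filenum uint32
	offset  uint32
}

// firstEntryOfFile returns the position of the first index entry pointing
// into data file `target`. File numbers are non-decreasing across the
// index, so sort.Search applies directly.
func firstEntryOfFile(entries []indexEntry, target uint32) int {
	return sort.Search(len(entries), func(i int) bool {
		return entries[i].filenum >= target
	})
}

func main() {
	entries := []indexEntry{
		{0, 100}, {0, 250}, {1, 80}, {1, 300}, {2, 40}, {2, 90},
	}
	last := entries[len(entries)-1].filenum
	start := firstEntryOfFile(entries, last)
	fmt.Printf("validate entries[%d:] (first entry of data file %d)\n", start, last)
	// validate entries[4:] (first entry of data file 2)
}
```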

@holiman (Contributor) commented Aug 26, 2024

Results from triage discussion: we need to optimize the validation process to remove the startup delay.

IIRC the reasoning was that if users are using an ancientdir on an HDD, it will take quite a long time to parse all index-files. However, we don't actually have any concrete numbers on that, do we?

I'm slightly leaning towards thinking that a long startup time isn't a big deal. If it takes a minute instead of 15 seconds, does it really matter? Shutdown lag is worse, since it can trigger kill signals if shutdown isn't performed fast enough.

@rjl493456442 rjl493456442 force-pushed the freezer-index-validation branch from 8381d61 to af6398c Compare August 27, 2024 10:48
@holiman (Contributor) left a review comment:

I think this is a good start, even if we may fine-tune it later

@fjl (Contributor) commented Aug 27, 2024

High startup time is not acceptable if it occurs every time. It's not OK if Geth suddenly starts taking a minute to start on mainnet. We'd have to make it async or something for this to be acceptable.

@holiman (Contributor) commented Aug 28, 2024

High startup time is not acceptable if it occurs every time. It's not OK if Geth suddenly starts taking a minute to start on mainnet. We'd have to make it async or something for this to be acceptable.

To follow up on this, here's an index repair performed on an HDD after a power failure (so no hot OS caches):

INFO [08-27|19:12:54.926] Verified index file                      items=20,619,465 elapsed=623.795ms
INFO [08-27|19:12:55.561] Verified index file                      items=20,619,465 elapsed=580.057ms
INFO [08-27|19:12:56.229] Verified index file                      items=20,619,465 elapsed=647.133ms
INFO [08-27|19:12:56.922] Verified index file                      items=20,619,465 elapsed=619.618ms
INFO [08-27|19:12:57.515] Verified index file                      items=20,619,465 elapsed=569.665ms

It seems to total 2-3s of additional overhead. But on a regular restart of geth, with potentially hot OS caches, the numbers can be a lot faster:

INFO [08-27|18:55:00.472] Verified index file                      items=20,618,031 elapsed=168.120ms
INFO [08-27|18:55:00.667] Verified index file                      items=20,618,031 elapsed=194.537ms
INFO [08-27|18:55:00.821] Verified index file                      items=20,618,031 elapsed=153.609ms
INFO [08-27|18:55:00.976] Verified index file                      items=20,618,031 elapsed=154.121ms
INFO [08-27|18:55:01.129] Verified index file                      items=20,618,031 elapsed=152.758ms

Totalling under a second. I don't claim credit for these numbers, which come from @rjl493456442, but I'd consider this acceptable.

@rjl493456442 rjl493456442 force-pushed the freezer-index-validation branch from af6398c to 0fe416d Compare September 9, 2024 06:21
@holiman (Contributor) commented Sep 13, 2024

I think we should merge this. Let's discuss next triage

@holiman holiman added this to the 1.14.12 milestone Oct 1, 2024
@holiman holiman merged commit eff0bed into ethereum:master Oct 1, 2024
3 checks passed
islishude pushed a commit to islishude/go-ethereum that referenced this pull request Oct 6, 2024
holiman pushed a commit that referenced this pull request Nov 19, 2024
zfy0701 pushed a commit to sentioxyz/go-ethereum that referenced this pull request Dec 3, 2024