Set zstd compression level to 1 as it offers fastest compression with small size tradeoff. #2729

Merged (1 commit) on Sep 9, 2024

Conversation

@ksolana ksolana commented Aug 24, 2024

As per https://github.com/facebook/zstd/tree/v1.5.2?tab=readme-ov-file#benchmarks, compression levels can be tuned to our needs. Higher compression ratios typically take more time, and vice versa. zstd has compression levels from 1..19, with a default of 3, which is what zstd-rs uses. zstd-rs also provides a way to set the compression level, so we should set a level that is more suitable for our purposes.

For context: zstd compression runs continuously (for taking snapshots). According to Linux perf, ZSTD_compressBlock_doubleFast_extDict_generic is one of the hottest functions, consuming 2.54% of cycles.

Based on the results below, going from level 3 to level 1 can give us a speedup of almost 60%.
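
For illustration, a minimal sketch of how a compression level can be passed through zstd-rs, using the standard zstd crate's zstd::stream::write::Encoder API. This is not the PR's actual diff; the helper name and call site are assumptions.

use std::fs::File;
use std::io::{self, BufWriter, Write};

// Compress `data` into `path` with an explicit zstd level.
// Level 1 is the fastest standard level; passing 0 falls back to zstd's default (3).
fn write_zstd(path: &str, data: &[u8], level: i32) -> io::Result<()> {
    let file = BufWriter::new(File::create(path)?);
    let mut encoder = zstd::stream::write::Encoder::new(file, level)?;
    encoder.write_all(data)?;
    encoder.finish()?.flush()?;
    Ok(())
}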

@ksolana ksolana changed the title Get higher compression ratio with -1 level Get faster compression level --fast=4 Aug 24, 2024
@ksolana ksolana marked this pull request as ready for review August 24, 2024 21:38
@brooksprumo

Can you share the stats with and without this PR for creating an mnb (mainnet-beta) snapshot? Both the time and the final archive's size, please.

ksolana commented Aug 30, 2024

zstd has an internal benchmarking mechanism, which is probably more reliable. Here are the results:

ls -lh ledger/snapshots/286703340/286703340 
-rw-r--r-- 1 sol users 1.1G Aug 30 00:19 ledger/snapshots/286703340/286703340

zstd -b1 -e4  ledger/snapshots/286703340/286703340
 1#286703340         :1137565680 -> 486726512 (2.337), 397.5 MB/s ,2166.4 MB/s 
 2#286703340         :1137565680 -> 457581143 (2.486), 352.2 MB/s ,2132.8 MB/s 
 3#286703340         :1137565680 -> 455056224 (2.500), 257.7 MB/s ,2106.2 MB/s 
 4#286703340         :1137565680 -> 452603752 (2.513), 226.5 MB/s ,2087.8 MB/s 

ksolana commented Aug 30, 2024

Seems like level 2 might give the best tradeoff: 352.2 MB/s compression speed with only a minimal (~1%) size penalty relative to level 3?

@brooksprumo

> zstd has an internal benchmarking mechanism, which is probably more reliable. Here are the results:
>
> ls -lh ledger/snapshots/286703340/286703340
> -rw-r--r-- 1 sol users 1.1G Aug 30 00:19 ledger/snapshots/286703340/286703340
>
> zstd -b1 -e4  ledger/snapshots/286703340/286703340
>  1#286703340         :1137565680 -> 486726512 (2.337), 397.5 MB/s ,2166.4 MB/s
>  2#286703340         :1137565680 -> 457581143 (2.486), 352.2 MB/s ,2132.8 MB/s
>  3#286703340         :1137565680 -> 455056224 (2.500), 257.7 MB/s ,2106.2 MB/s
>  4#286703340         :1137565680 -> 452603752 (2.513), 226.5 MB/s ,2087.8 MB/s

We need to test the full snapshot package archival, which includes all the account storage files as well. It looks like these numbers are only for archiving the bank snapshot.

ksolana commented Aug 31, 2024

I got the zstd results for a full archive of 302 GB:

for i in {1..10}; do
  zstd -$i -vvvvv snapshot-286698324-ENHWfdq4kedwGiyitnmDAzFRvW6Vzwcmy1qCaGGLj7Nb.tar -o t.$i.zst
done

Filename t.1.zst represents compression at level 1 (zstd -1) from above and so forth. I tried levels 1..10.

Filename Size Time       Cpu load
t.1.zst  87G 621.63 sec  (cpu load : 137%)
t.2.zst  85G 718.25 sec  (cpu load : 135%)
t.3.zst  83G 994.02 sec  (cpu load : 124%)
t.4.zst  82G 1195.79 sec  (cpu load : 120%)
t.5.zst  82G 2274.97 sec  (cpu load : 111%)
t.6.zst  82G 2562.68 sec  (cpu load : 110%)
t.7.zst  81G 2976.86 sec  (cpu load : 108%)
t.8.zst  81G 3288.39 sec  (cpu load : 108%)
t.9.zst  81G 3308.86 sec  (cpu load : 107%)
t.10.zst 80G 4212.76 sec  (cpu load : 106%)

As we can see, the size delta between the highest and lowest levels is less than 10%, but the time difference is quite large.

steviez commented Aug 31, 2024

I realize there is still some testing/conversation/etc. to be done, but can you please give a more descriptive PR title? I.e., mention that this is specific to zstd and/or snapshots in the title.

@ksolana ksolana changed the title Get faster compression level --fast=4 Set zstd compression level that is a good tradeoff between compression ratio and speed Aug 31, 2024
@brooksprumo

Another option: should we provide a CLI arg for setting the zstd compression level?

@brooksprumo

> Filename t.1.zst represents compression at level 1 (zstd -1) from above and so forth. I tried levels 1..10.
>
> Filename Size Time       Cpu load
> t.1.zst  87G 621.63 sec  (cpu load : 137%)
> t.2.zst  85G 718.25 sec  (cpu load : 135%)
> t.3.zst  83G 994.02 sec  (cpu load : 124%)
> t.4.zst  82G 1195.79 sec  (cpu load : 120%)
> t.5.zst  82G 2274.97 sec  (cpu load : 111%)
> t.6.zst  82G 2562.68 sec  (cpu load : 110%)
> t.7.zst  81G 2976.86 sec  (cpu load : 108%)
> t.8.zst  81G 3288.39 sec  (cpu load : 108%)
> t.9.zst  81G 3308.86 sec  (cpu load : 107%)
> t.10.zst 80G 4212.76 sec  (cpu load : 106%)
>
> As we can see, the size delta between the highest and lowest levels is less than 10%, but the time difference is quite large.

My gut tells me that node operators would be willing to eat an additional 4 GB per full snapshot to save > 6 mins on the time to archive.

I'm in favor of setting the default level to 1 then.

ksolana commented Sep 3, 2024

> Another option: should we provide a CLI arg for setting the zstd compression level?

Good point! So maybe we set the default to 1 (best speed) and provide a CLI arg for those who want to customize?
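
For illustration, a rough sketch of what such a flag could look like, assuming a clap v4 builder-style CLI; the flag name, accepted range, and default shown here are hypothetical and not the actual validator CLI.

use clap::{value_parser, Arg, Command};

fn main() {
    // Hypothetical flag; the real CLI wiring may differ.
    let matches = Command::new("validator")
        .arg(
            Arg::new("snapshot_zstd_compression_level")
                .long("snapshot-zstd-compression-level")
                .value_parser(value_parser!(i32).range(1..=19))
                .default_value("1")
                .help("zstd compression level used when archiving snapshots"),
        )
        .get_matches();

    let level = *matches
        .get_one::<i32>("snapshot_zstd_compression_level")
        .expect("flag has a default value");
    println!("archiving snapshots with zstd level {level}");
}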

@brooksprumo

> So maybe we set the default to 1 (best speed) and provide a CLI arg for those who want to customize?

This works for me!

@ksolana ksolana changed the title Set zstd compression level that is a good tradeoff between compression ratio and speed Set zstd compression level to 1 as it offers fastest compression with small size tradeoff. Sep 3, 2024

ksolana commented Sep 3, 2024

> > So maybe we set the default to 1 (best speed) and provide a CLI arg for those who want to customize?
>
> This works for me!

The cli option will be in a separate patch.

@brooksprumo

Code looks good to me. I've restarted two nodes with this PR to observe system metrics and ensure there are no regressions elsewhere.

apfitzge commented Sep 4, 2024

The encoding time is significantly different for the different levels, but what about the decoding/unpacking time? Is that changed at all by the compression level?

@brooksprumo

I have two nodes that were previously running master, then midway switched to running this PR. These nodes have different configurations, so I was interested to see how their overall metrics compared.

FYI, on the graphs below, the cursor marks approximately when the nodes were restarted to run this PR, so the comparison is mainly left half (master) vs. right half (this PR).

First, here's the time to archive snapshot packages:
[graph: time to archive snapshot packages]

for full snapshots:

blue: from ~1700 secs to ~1050 secs, saving ~10 minutes
purple: from ~1450 secs to ~950 secs, saving ~8 minutes

These are quite significant speedups. The archiving of a full snapshot doesn't really block much, other than serializing/archiving the next snapshot packages. But this can make graceful shutdown faster.

(Note that the updating of the latest full snapshot slot used by clean has already been done in AccountsBackgroundService, well before we get here to archiving the snapshot package. So this speedup doesn't improve/impact clean.)

Incremental snapshots are also archived faster, but their absolute savings are much less, as expected.

Second, here's the size of the snapshot archives:
[graph: snapshot archive sizes]

For full snapshots, the size increase looks to be between 4 and 5 GB.
For incremental snapshots, the size increase is negligible.

ksolana commented Sep 6, 2024

> The encoding time is significantly different for the different levels, but what about the decoding/unpacking time? Is that changed at all by the compression level?

Decompression is quite fast compared to compression. Here are the timings for levels 1, 2, and 3:

+ zstd -vvvv -d ../t.1.zst -o t.1.extracted
../t.1.zst          : 323320815104 bytes

real    6m40.757s
user    2m39.600s
sys     3m50.743s

+ zstd -vvvv -d ../t.2.zst -o t.2.extracted

../t.2.zst          : 323320815104 bytes

real    7m15.833s
user    2m41.090s
sys     4m25.834s

+ zstd -vvvv -d ../t.3.zst -o t.3.extracted
../t.3.zst          : 323320815104 bytes

real    6m54.538s
user    2m39.087s
sys     4m5.262s
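
This is consistent with how the decode path works: zstd decompression does not take a level parameter at all, so decode speed is largely a property of the compressed frame rather than a tuning knob, which matches the similar timings above. A minimal sketch using the standard zstd crate's read-side Decoder (the function name is an assumption, not the snapshot-unpacking code):

use std::fs::File;
use std::io;

// Decompress a .zst file to `dst`; note there is no level argument on the decode side.
fn decompress_zstd(src: &str, dst: &str) -> io::Result<u64> {
    let input = File::open(src)?;
    let mut decoder = zstd::stream::read::Decoder::new(input)?;
    let mut output = File::create(dst)?;
    io::copy(&mut decoder, &mut output)
}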

@brooksprumo brooksprumo left a comment

This looks good to me.

Please wait to merge until @jeffwashington also approves.

One additional thing to consider is the bootstrap process. Since this increases the size of snapshot archives, other nodes will take longer to download a snapshot. I think this is OK for the following reasons:

  • Downloading snapshots should be very rare, if ever, especially for staked/experienced validators. We should optimize for the common case, which is validators archiving snapshots, not downloading or booting from snapshot archives.
  • On mnb today, the v1.18 snapshots are ~87 GB. In v2.0+, these sizes have been reduced significantly. With this PR we increase them somewhat, but they remain well below 87 GB. So I think the added size with this PR is OK.

@jeffwashington jeffwashington left a comment

ok. minor tradeoffs

@ksolana ksolana merged commit c7e44c1 into anza-xyz:master Sep 9, 2024
40 checks passed
@ksolana ksolana deleted the zstd branch September 9, 2024 22:30