
Conversation

@sven-ola

I have some ARM boards (32 and 64 bit) that needed massaging...

@nethappen
Owner

Thanks for reporting the issue on ARM and for submitting fixes related to the max file/device size bug and some minor code errors.

You're right — on 32-bit systems, size_t limits the maximum size to 4 GiB. While off_t can be adjusted with _FILE_OFFSET_BITS 64, there's no such option for size_t.
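
For illustration, a quick check one can run on a 32-bit board (just a sketch, not from the repo; assumes gcc is installed):

cat > sizecheck.c <<'EOF'
#include <stdio.h>
#include <sys/types.h>
int main(void)
{
	/* with -D_FILE_OFFSET_BITS=64, off_t is 64-bit even on 32-bit ARM,
	   but size_t stays 32-bit (4 bytes) */
	printf("sizeof(size_t) = %zu, sizeof(off_t) = %zu\n",
	       sizeof(size_t), sizeof(off_t));
	return 0;
}
EOF
gcc -D_FILE_OFFSET_BITS=64 -o sizecheck sizecheck.c && ./sizecheck
# typical output on 32-bit ARM: sizeof(size_t) = 4, sizeof(off_t) = 8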

Your code works well — I’ll review it again. I think using uint64_t for data_size and any other counters (tracking processed bytes) would be a clean and portable solution.

Out of curiosity: what hardware/system are you using? I'd like to emulate ARM and make sure the final version is fully compatible.

Glad to hear blocksync-fast is useful on such platforms — thanks again!

@sven-ola
Author

sven-ola commented May 18, 2025

Hi! Thanks for coming back to this PR. I may have introduced extra bugs, so please review. Note that I use "-O3" because it gives up to a 10x speedup on the XXH benchmarks compared to the default "-O0", which looks a bit suspicious to me. For that reason I added an md5extra feature branch to my repo (not meant for a next PR).

Some background: I'm trying to start a project, "Nextcloud on a small SBC for home users". For this I am currently working with Orange Pi Zeros, which have 32-bit and 64-bit ARM cores. These cute little things use SD cards for storage and are therefore prone to failures, so we obviously need a good backup concept. Here's a photo of my test setup (green = 32-bit, black = 64-bit).

(photo of the test setup)

For the Orange Pis, I currently plan to use blocksync-fast for nightly image backups. This uses the Linux device mapper, LUKS encryption and so on. It may be extended later, e.g. with "dm-era", once I understand how to integrate device-mapper write logging with LVM2. Here's a snippet of a concept doc I wrote earlier, in case you are interested. Best // Sven-Ola

Backups (English)

The SD card holds two partitions:

  1. An unencrypted standard bootable Linux partition. This is basically an Armbian image with some added software packages. Thus, this partition (and the space before this partition) is the same for all Orange Pi gadgets. This partition also stores no user data.
  2. A second partition that is managed by LVM (Logical Volume Manager). The magic starts within the initramfs during boot. The initramfs is basically a small Linux which initializes the LVM partition with LUKS encryption on the very first startup. We use a standard key ("admin") on keyslot 6 and a generated key on keyslot 7. The generated key is stored in an extra storage device (MTD / 2 MB NOR flash) which is available on Orange Pi devices. If there is no writable MTD, the last bytes of the space before the first partition on the SD card are used.

Later, during the user installation, you set a backup/user password on keyslot 0 and delete the standard password in keyslot 6. So if the SD card is removed from the Orange Pi gadget, you can still decrypt the second partition with your user password, e.g. when inserting the SD card into another Linux PC.
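
A rough sketch of that keyslot rotation with cryptsetup (the device path and key handling are placeholders, not the actual installer code):

# set the user/backup password on keyslot 0 (authorise with the standard "admin" key)
cryptsetup luksAddKey --key-slot 0 /dev/vg0/data
# then wipe the standard key from keyslot 6
cryptsetup luksKillSlot /dev/vg0/data 6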

Both the standard Linux partition and the encrypted LUKS partition are bound together as a united file system ("unionfs"). The standard Linux files reside in the lower layer, and the LUKS partition with the user files forms the overlay / upper layer. All user data is thus stored encrypted. Also, we only need to back up the encrypted user data, because we can restore the underlying standard Linux simply by downloading and installing it.

Note that the upper /boot is a bind mount of the lower /boot. With this, updates to the kernel or to the initramfs / armbianEnv.txt are written to the standard Linux partition, ready for the next boot. Also, swap files are incompatible with unionfs, so they need to be stored outside the unionfs but within the encrypted area.
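
Very roughly, the assembly inside the initramfs looks like this (assuming overlayfs; device and path names are placeholders, not the real initramfs code):

# mount both layers after unlocking the LUKS volume
mount /dev/mmcblk0p1 /lower                   # plain Armbian partition
mount /dev/mapper/crypt-data /upper           # unlocked LUKS/LVM volume
# union the two: standard Linux below, user data on top
mount -t overlay overlay \
    -o lowerdir=/lower,upperdir=/upper/root,workdir=/upper/work /newroot
# /boot stays on the plain partition so kernel/initramfs updates land there
mount --bind /lower/boot /newroot/boot
# swap cannot live on the overlay, so the swap file sits directly on the encrypted volume
swapon /upper/swapfile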

The backup is done as an image backup of the LUKS partition. We use the program "blocksync-fast" for this. A nightly backup basically follows these steps (see the sketch after the list):

  1. A cron job runs a backup script in the early morning.
  2. The backup script flushes the Nextcloud database and creates an LVM snapshot of the encrypted LUKS partition. Snapshot creation is a fast operation (~1 second).
  3. It is possible to modify the snapshot after creation, so we remove keyslots 6 and 7 from it before starting the image backup.
  4. Now the snapshot is stored on a friendly storage host via SSH using "blocksync-fast". "blocksync-fast" uses checksums ("digests") of image data blocks to minimize data transfers for subsequent backups. Thus, after a longer initial backup, all further backups should complete in less time.
  5. When the backup is done, the LVM snapshot is removed.
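
A condensed sketch of steps 2-5 (volume names, sizes and the Nextcloud quiescing are examples, not the actual script):

# 2. quiesce Nextcloud (stand-in for the database flush) and snapshot the LUKS LV (~1 s)
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --on
lvcreate --snapshot --size 2G --name data-snap /dev/vg0/data
sudo -u www-data php /var/www/nextcloud/occ maintenance:mode --off

# 3. remove the standard/MTD keyslots from the snapshot copy only
#    (needs a passphrase from a remaining slot, e.g. via --key-file; details omitted)
cryptsetup luksKillSlot /dev/vg0/data-snap 6
cryptsetup luksKillSlot /dev/vg0/data-snap 7

# 4. send only the changed blocks to the backup host over SSH
blocksync-fast --make-delta -s /dev/vg0/data-snap -f /var/cache/backups/data.digest \
    | ssh backup-host 'blocksync-fast --apply-delta -d /mnt/backups/data.img'

# 5. drop the snapshot again
lvremove -y /dev/vg0/data-snap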

Most of the computing resources are needed to check all data blocks of the LVM snapshot image for changes that need to be written to the backup image. This is done on the Orange Pi gadgets, so the load on the central storage device is relatively low. Because we store LVM LUKS images (which can only be decrypted with the user password), no further data security is required on the friendly backup server. With LVM snapshots we can take backups while user data is being written concurrently, so we only have a few seconds of downtime during the night and around 10-20 minutes of higher CPU utilization / slower response times, which should be tolerable anyway.

@nethappen
Owner

Hi,

Your project is very interesting, with lots of challenges! I've read it carefully and I think I understand how it works.

dm-era looks like a useful concept — something I was looking for back when I started working on blocksync-fast. However, from what I see, there aren’t really any ready-to-use backup tools that integrate with it directly, right? I do like the concept, and I’m actually considering integrating dm-era support into blocksync-fast in the future.

As I understand it, dm-era can report which blocks were modified — and that part wouldn’t be too difficult to implement. On the other hand, the current checksum-based approach also works well, with its own pros and cons. I assume dm-era, like any active snapshot-like layer, adds some overhead — though of course it depends on the use case. In most practical setups (e.g. once-daily backups), it’s often acceptable to just scan the entire volume and compare checksums during off-peak hours.

I do have a few trivial but curious questions:

  • Do you store the digest (checksums) on a separate LVM volume?

  • Do you use this kind of SSH-based transfer flow?
    $ blocksync-fast --make-delta -s /dev/vg1/vol1-snap -f /var/cache/backups/vol1.digest | ssh 192.168.1.115 'blocksync-fast --apply-delta -d /mnt/backups/vol1'

  • How do you deal with interrupted syncs (e.g. connection drop, device reboot, etc.)?

  • It’s a good idea to always work on a copy of the digest file during sync — e.g. cp --reflink vol1.digest vol1.digest.work, then replace the original file only after a successful exit (code 0).
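
Something like this, reusing the paths from the flow above (just a sketch; note that in plain sh the if only sees the exit code of the ssh side, so bash with set -o pipefail would also catch a failing --make-delta):

cp --reflink=auto /var/cache/backups/vol1.digest /var/cache/backups/vol1.digest.work
if blocksync-fast --make-delta -s /dev/vg1/vol1-snap -f /var/cache/backups/vol1.digest.work \
	| ssh 192.168.1.115 'blocksync-fast --apply-delta -d /mnt/backups/vol1'
then
	# both sides finished cleanly: promote the updated digest
	mv /var/cache/backups/vol1.digest.work /var/cache/backups/vol1.digest
else
	# keep the last known-good digest for the next attempt
	rm -f /var/cache/backups/vol1.digest.work
fi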

I also have a set of improved examples/* which I haven’t published yet. But I’m thinking about making a separate project dedicated to a full automation script: handling errors, retries, reports, etc.


A quick note on XXH speed:

So far, I noticed one odd thing: since version blocksync-fast-1.0.6, the hashing tests became significantly slower, but only for the XXH3 family (XXH3LOW, XXH3, XXH128):
Algo: XXH32 Hash size: 4 bytes Speed: 1665812 hashes/s Processing: 6.35 GiB/s
Algo: XXH64 Hash size: 8 bytes Speed: 3099911 hashes/s Processing: 11.83 GiB/s
Algo: XXH3LOW Hash size: 4 bytes Speed: 2254051 hashes/s Processing: 8.60 GiB/s
Algo: XXH3 Hash size: 8 bytes Speed: 2234088 hashes/s Processing: 8.52 GiB/s
Algo: XXH128 Hash size: 16 bytes Speed: 2214106 hashes/s Processing: 8.45 GiB/s

Tests from blocksync-fast-1.0.5:
Algo: XXH32 Hash size: 4 bytes Speed: 1672860 hashes/s Processing: 6.38 GiB/s
Algo: XXH64 Hash size: 8 bytes Speed: 3093823 hashes/s Processing: 11.80 GiB/s
Algo: XXH3LOW Hash size: 4 bytes Speed: 6113722 hashes/s Processing: 23.32 GiB/s
Algo: XXH3 Hash size: 8 bytes Speed: 6115811 hashes/s Processing: 23.33 GiB/s
Algo: XXH128 Hash size: 16 bytes Speed: 6154500 hashes/s Processing: 23.48 GiB/s

I'm still not sure what causes that performance regression.

@sven-ola
Author

sven-ola commented May 18, 2025

Hey! Appreciate your feedback 😎 I have used this Sunday to investigate dm-era and to get some experience with my backup strategy. Those little things are bound together via a simple VPS, a couple of WireGuard tunnels and some HAProxy setup, so e.g. you can access them via https://bpi1.privat-in.de or https://opi2-2.privat-in.de (as long as they are switched on). I plan to deploy more than one backup server / NAS gadget, but currently I only have one. The NAS is also a bit flaky (sometimes the HDD throws errors, sometimes the CPU gets too hot), so it's a good proving ground, I presume.

I played with dm-era today, following a blog entry from 2017, and tried it out on a virtual box - see https://www.cloudandheat.com/en/block-level-data-tracking-using-davice-mappers-dm-era/ (recommended as a good read). I'm not sure yet how to use this in a fail-safe manner / in production / with end users switching gadgets off at will.
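
For reference, the basic plumbing I experimented with looks roughly like this (adapted from the kernel dm-era documentation; device names and the block size are placeholders, nothing production-ready):

# stack dm-era between the filesystem and the origin LV
# (last argument = era block size in 512-byte sectors, here 4 MiB)
dmsetup create data-era --table \
    "0 $(blockdev --getsz /dev/vg0/data) era /dev/vg0/era-meta /dev/vg0/data 8192"

# before a backup: checkpoint, which bumps the era counter
dmsetup message data-era 0 checkpoint

# 'dmsetup status data-era' reports metadata usage and the current era;
# era_invalidate from thin-provisioning-tools can list the blocks written since a given era
dmsetup status data-era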

OTOH: I currently use 32 GiB SD cards while utilizing only 50%, to give the wear levelling of the cards a chance to survive. So we get ~14 GiB snapshot images that need around 15 minutes to be read and checksummed at night. Nothing prevents users from buying 256 GiB cards, though, so a 2-hour checksum session lurks there, waiting to be optimized.

To answer your questions: currently the digest is not stored on a separate LV. There are plenty of other changes during a day (writes to /var/log/journal, ...), so an extra LV does not seem to offer an advantage.

Yes, I currently use SSH transfers. The backup gadget has an ~/.ssh/authorized_keys entry that restricts what the key may do, e.g. command="/usr/local/bin/privat-in-backup bpi1" ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAINh1i0f4mHnKniYcfwt/iTbq+HspvPHye8hGvYl8yRfn root@bpi1, together with this script that limits the possible actions:

root@bpi2:~# cat /usr/local/bin/privat-in-backup 
#!/bin/sh

export PATH=/usr/local/bin:/usr/bin:/bin
export IMG_DIR=/images

get_digest_md5()
{
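	# md5 of the digest file with bytes 48-55 of the 512-byte header blanked out
	# (presumably the run timestamp, which would otherwise change every time)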
	case ${1} in "")
		echo "need digest filename" >&2
		exit 1
	;;esac
	if [ -s "${1}" ];then
	(
		dd if="${1}" bs=8 count=6 status=none
		dd if=/dev/zero bs=8 count=1 status=none
		dd if="${1}" bs=8 skip=7 count=57 status=none
		dd if="${1}" skip=1 status=none
	) | md5sum | sed 's, .*,,'
	fi
}

case ${SSH_ORIGINAL_COMMAND} in get-digest-md5)
	get_digest_md5 ${IMG_DIR}/${1}.digest
;;create-digest)
	if [ ! -s ${IMG_DIR}/${1}.digest ] || [ ${IMG_DIR}/${1}.img -nt ${IMG_DIR}/${1}.digest ];then
		blocksync-fast --make-digest --src=${IMG_DIR}/${1}.img --digest=${IMG_DIR}/${1}.digest --force
	fi
;;get-digest)
	if [ -s ${IMG_DIR}/${1}.digest ];then
		cat ${IMG_DIR}/${1}.digest
	fi
;;apply-delta)
	rm -f ${IMG_DIR}/${1}.digest
	blocksync-fast --apply-delta --dst=${IMG_DIR}/${1}.img
;;esac

The plan is to generate a digest on the backup side from time to time, compare both digests, and if necessary download the backup-side digest before sending a new delta. And you are right: I may need better interrupt / disaster handling on the sending side. Working on a copy of the digest sounds like an idea worth checking out.
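
Roughly what I have in mind, using the restricted commands from the script above (just a sketch, error handling omitted; the host alias and local paths are examples, and it assumes the same get_digest_md5 helper is available locally):

# compare the local digest against the one on the backup side
REMOTE_MD5=$(ssh backup get-digest-md5)
LOCAL_MD5=$(get_digest_md5 /var/cache/backups/bpi1.digest)
if [ "${REMOTE_MD5}" != "${LOCAL_MD5}" ]; then
	# out of sync: let the backup side rebuild its digest from the image, then fetch it
	ssh backup create-digest
	ssh backup get-digest > /var/cache/backups/bpi1.digest
fi
# then send the next delta as usual (stdin is piped through to apply-delta)
blocksync-fast --make-delta -s /dev/vg0/data-snap -f /var/cache/backups/bpi1.digest \
	| ssh backup apply-delta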

Regarding the benchmark speeds: I compiled 1.0.5 on arm32 and compared it to the current version (both compiled with -O3). I don't see a difference (first run is 1.0.5, then my "fix32" version):

gcc  -DXXH_INLINE_ALL  -O3   -o blocksync-fast blocksync-fast.o utils.o globals.o common.o init.o benchmark.o digest_info.o -L/usr/lib/arm-linux-gnueabihf -lgcrypt -lxxhash  -lm
make[2]: Leaving directory '/usr/src/blocksync-fast-1.0.5/src'
make[2]: Entering directory '/usr/src/blocksync-fast-1.0.5'
make[2]: Leaving directory '/usr/src/blocksync-fast-1.0.5'
make[1]: Leaving directory '/usr/src/blocksync-fast-1.0.5'
root@bpi2:/usr/src/blocksync-fast-1.0.5# ./src/blocksync-fast --benchmark
Block size: 4096 bytes
Filling buffer with random data ...
Algo: SHA1           	Hash size:  20 bytes		Speed:     17583 hashes/s	Processing:  68.68 MiB/s
Algo: RMD160         	Hash size:  20 bytes		Speed:     11309 hashes/s	Processing:  44.18 MiB/s
Algo: MD5            	Hash size:  16 bytes		Speed:     28128 hashes/s	Processing: 109.88 MiB/s
Algo: TIGER          	Hash size:  24 bytes		Speed:      7772 hashes/s	Processing:  30.36 MiB/s
Algo: TIGER1         	Hash size:  24 bytes		Speed:      7782 hashes/s	Processing:  30.40 MiB/s
Algo: TIGER2         	Hash size:  24 bytes		Speed:      7775 hashes/s	Processing:  30.37 MiB/s
Algo: SHA224         	Hash size:  28 bytes		Speed:      7924 hashes/s	Processing:  30.95 MiB/s
Algo: SHA256         	Hash size:  32 bytes		Speed:      7917 hashes/s	Processing:  30.93 MiB/s
Algo: SHA384         	Hash size:  48 bytes		Speed:      6981 hashes/s	Processing:  27.27 MiB/s
Algo: SHA512         	Hash size:  64 bytes		Speed:      6976 hashes/s	Processing:  27.25 MiB/s
Algo: SHA512_224     	Hash size:  28 bytes		Speed:      6981 hashes/s	Processing:  27.27 MiB/s
Algo: SHA512_256     	Hash size:  32 bytes		Speed:      6978 hashes/s	Processing:  27.26 MiB/s
Algo: SHA3_224       	Hash size:  28 bytes		Speed:      7318 hashes/s	Processing:  28.59 MiB/s
Algo: SHA3_256       	Hash size:  32 bytes		Speed:      6857 hashes/s	Processing:  26.79 MiB/s
Algo: SHA3_384       	Hash size:  48 bytes		Speed:      5333 hashes/s	Processing:  20.83 MiB/s
Algo: SHA3_512       	Hash size:  64 bytes		Speed:      3739 hashes/s	Processing:  14.61 MiB/s
Algo: CRC32          	Hash size:   4 bytes		Speed:     75145 hashes/s	Processing: 293.54 MiB/s
Algo: CRC32_RFC1510  	Hash size:   4 bytes		Speed:     75192 hashes/s	Processing: 293.72 MiB/s
Algo: CRC24_RFC2440  	Hash size:   3 bytes		Speed:     72768 hashes/s	Processing: 284.25 MiB/s
Algo: WHIRLPOOL      	Hash size:  64 bytes		Speed:      1103 hashes/s	Processing:   4.31 MiB/s
Algo: GOSTR3411_94   	Hash size:  32 bytes		Speed:      2004 hashes/s	Processing:   7.83 MiB/s
Algo: STRIBOG256     	Hash size:  32 bytes		Speed:      1095 hashes/s	Processing:   4.28 MiB/s
Algo: STRIBOG512     	Hash size:  64 bytes		Speed:      1095 hashes/s	Processing:   4.28 MiB/s
Algo: BLAKE2B_160    	Hash size:  20 bytes		Speed:      7169 hashes/s	Processing:  28.00 MiB/s
Algo: BLAKE2B_256    	Hash size:  32 bytes		Speed:      7171 hashes/s	Processing:  28.01 MiB/s
Algo: BLAKE2B_384    	Hash size:  48 bytes		Speed:      7176 hashes/s	Processing:  28.03 MiB/s
Algo: BLAKE2B_512    	Hash size:  64 bytes		Speed:      7173 hashes/s	Processing:  28.02 MiB/s
Algo: BLAKE2S_128    	Hash size:  16 bytes		Speed:     10291 hashes/s	Processing:  40.20 MiB/s
Algo: BLAKE2S_160    	Hash size:  20 bytes		Speed:     10283 hashes/s	Processing:  40.17 MiB/s
Algo: BLAKE2S_224    	Hash size:  28 bytes		Speed:     10290 hashes/s	Processing:  40.20 MiB/s
Algo: BLAKE2S_256    	Hash size:  32 bytes		Speed:     10284 hashes/s	Processing:  40.17 MiB/s
Algo: SM3            	Hash size:  32 bytes		Speed:      6941 hashes/s	Processing:  27.11 MiB/s
Algo: XXH32          	Hash size:   4 bytes		Speed:    120638 hashes/s	Processing: 471.24 MiB/s
Algo: XXH64          	Hash size:   8 bytes		Speed:     90452 hashes/s	Processing: 353.33 MiB/s
Algo: XXH3LOW        	Hash size:   4 bytes		Speed:    118628 hashes/s	Processing: 463.39 MiB/s
Algo: XXH3           	Hash size:   8 bytes		Speed:    119028 hashes/s	Processing: 464.95 MiB/s
Algo: XXH128         	Hash size:  16 bytes		Speed:    116309 hashes/s	Processing: 454.33 MiB/s
root@bpi2:/usr/src/blocksync-fast-1.0.5# blocksync-fast --benchmark |grep XXH
Algo: XXH32          	Hash size:   4 bytes		Speed:    120612 hashes/s	Processing: 471.14 MiB/s
Algo: XXH64          	Hash size:   8 bytes		Speed:     90432 hashes/s	Processing: 353.25 MiB/s
Algo: XXH3LOW        	Hash size:   4 bytes		Speed:    118271 hashes/s	Processing: 462.00 MiB/s
Algo: XXH3           	Hash size:   8 bytes		Speed:    118577 hashes/s	Processing: 463.19 MiB/s
Algo: XXH128         	Hash size:  16 bytes		Speed:    116117 hashes/s	Processing: 453.58 MiB/s
root@bpi2:/usr/src/blocksync-fast-1.0.5# 

@nethappen
Owner

Thanks for the links, I’ll study them carefully.
Let me know if you manage to get something working with dm-era.

Regarding the performance issue: I’ve compared all changes between versions, and there’s nothing that could explain the drop in speed, especially for just one algorithm: XXH3.

I even copied individual source files from different versions and tested them, but it still doesn’t make sense.
My guess is it's the compiler acting weird. I'm using GCC 14 on this machine, which seems to optimize differently each time, or maybe there are other random system-related factors. It's hard to figure out. But I'm confident it's not a bug in the code, because on other systems the newer version of blocksync-fast works normally with XXH3.


One tip if you want to compare digest files created on different machines:

$ blocksync-fast --make-digest -s backup-file -f remote.digest

and then compare it against local.digest (created, for example, on an ARM machine).
You need to skip the first 512 bytes of the digest (or delta) file before calculating a checksum:

dd if=remote.digest iflag=skip_bytes skip=512 | md5sum
dd if=local.digest iflag=skip_bytes skip=512 | md5sum

The reason is that the operation timestamp is stored in the header, so the checksums will differ even if the actual data blocks are the same. I caught myself with that one too. 😉
