Description
After talking to Thomas Waldmann on IRC, I'm opening a ticket mostly to document this issue.
Currently, borg's model for storing chunks in the repository leaks information about the size and (potentially) directory structure of small files. Size and directory structure information can often be used to infer information about the content of the files themselves, through various fingerprinting / watermarking attacks.
Files in borg are chunked, compressed, and then the chunks are stored in the repository using PUT commands. These PUT commands necessarily contain the size of the chunk being stored, and the segment files in the repository also contain these sizes in an easily parsed form.
For small files, the size of the chunks prior to compression will typically just be the size of the file itself. Additionally, a great many modern file formats are not compressible, and in any event an attacker can simply compress the data too. In this way, the chunk sizes can leak information about the size of small files.
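To illustrate why compression does not help here, a minimal sketch follows. It is not borg code: `zlib` stands in for whatever compressor the client is configured with, `stored_size_estimate` and its `overhead` parameter are hypothetical, and the sketch assumes any per-chunk header and authentication overhead is a constant the attacker can account for.

```python
import random
import zlib


def stored_size_estimate(plaintext: bytes, overhead: int = 0) -> int:
    """Estimate the stored size of a file small enough to be a single chunk.

    zlib is a stand-in compressor; `overhead` is a hypothetical constant for
    per-chunk headers and authentication data.
    """
    return len(zlib.compress(plaintext)) + overhead


# Pseudo-random bytes stand in for an already-compressed file (JPEG, MP3, ...).
rng = random.Random(0)
incompressible = bytes(rng.getrandbits(8) for _ in range(50_000))

# Compression barely changes the size, so the stored size tracks the file size.
delta = stored_size_estimate(incompressible) - len(incompressible)
print(delta)
```

For compressible files the attacker can run the same estimate over candidate plaintexts and compare the result with the sizes observed in the repository.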
Additionally, chunks will tend to be stored in temporal order in their associated segment files. Since borg processes files in some directory traversal order, the temporal order in the repository will typically match a directory sort order, which leaks additional information.
As a hypothetical example of how size and directory structure information can be used as a data fingerprint, consider the common method of fingerprinting CDs based on their track lengths. Even if a CD was encoded to MP3, it's typically possible to convert file size back to track length by assuming a standard bitrate.
It should then be evident that a database of CD track lengths is sufficient to show that a directory of encrypted files contains the CD tracks with high confidence, just by comparing the ratios of the file sizes to the ratios of track lengths. This can be done without reading any of the data, just the file sizes.
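A rough sketch of the ratio comparison, to make the attack concrete. Everything here is illustrative: `best_match`, the `track_db` layout, the album names, the tolerance, and the assumed 128 kbit/s bitrate are all hypothetical, and a real attack would also need to handle variable bitrates and files split across several chunks.

```python
from typing import Dict, List, Optional, Sequence


def normalize(values: Sequence[float]) -> List[float]:
    """Scale values so they sum to 1, leaving only their ratios."""
    total = float(sum(values))
    return [v / total for v in values]


def best_match(
    observed_sizes: Sequence[int],
    track_db: Dict[str, Sequence[float]],
    tolerance: float = 0.01,
) -> Optional[str]:
    """Return the album whose track-length ratios best fit the observed sizes.

    observed_sizes: chunk sizes in bytes, in the order seen in the repository.
    track_db: hypothetical mapping of album name -> track lengths in seconds.
    tolerance: largest allowed difference between any pair of normalized ratios.
    """
    observed = normalize(observed_sizes)
    best_name, best_error = None, tolerance
    for name, lengths in track_db.items():
        if len(lengths) != len(observed):
            continue
        expected = normalize(lengths)
        error = max(abs(o - e) for o, e in zip(observed, expected))
        if error <= best_error:
            best_name, best_error = name, error
    return best_name


# Hypothetical database and observations: three tracks encoded at roughly
# 128 kbit/s (16000 bytes per second) plus a small fixed per-file overhead.
track_db = {"album_a": [180, 240, 200], "album_b": [300, 120, 200]}
observed = [seconds * 16000 + 4096 for seconds in track_db["album_a"]]
print(best_match(observed, track_db))  # album_a
```

Because only ratios are compared, the attacker does not even need to guess the bitrate correctly, as long as it is roughly constant across the tracks.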
Many other more advanced sorts of watermarking attacks are possible.
There are various potential ways to solve this information leak. A few ideas:
- Pad small files up to the chunk size.
- Bundle multiple files into a chunk somehow.
- More generally, differentiate between segment records (known to the remote server) and chunks (known to the client). Make the records all the same size.
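As a minimal sketch of the padding idea (not a proposal for borg's actual record format), data could be framed with its true length and padded up to a size bucket before encryption, so that the server only ever sees bucket sizes. The power-of-two buckets, the 4096-byte minimum, the 8-byte length prefix, and the names `pad` / `unpad` are all assumptions made for illustration.

```python
import struct

HEADER = struct.Struct(">Q")  # 8-byte big-endian length prefix (assumed layout)


def bucket_size(n: int, minimum: int = 4096) -> int:
    """Round n up to the next power-of-two multiple of `minimum`."""
    size = minimum
    while size < n:
        size *= 2
    return size


def pad(data: bytes, minimum: int = 4096) -> bytes:
    """Prefix data with its true length and zero-pad it to a bucket size.

    This must happen before encryption so the padding is not distinguishable.
    """
    framed = HEADER.pack(len(data)) + data
    return framed.ljust(bucket_size(len(framed), minimum), b"\0")


def unpad(padded: bytes) -> bytes:
    """Recover the original data from a padded record (after decryption)."""
    (length,) = HEADER.unpack_from(padded)
    return padded[HEADER.size:HEADER.size + length]
```

With buckets like these an observer still learns the rough order of magnitude of each record, and storage overhead can approach a factor of two in the worst case, so the bucket scheme is a trade-off between leakage and space.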
Regardless of whether this is fixed any time soon, it seemed worth documenting better. The current documentation states "[...] all your files/dirs data and metadata are stored in their encrypted form into the repository" so a leak of file size and directory order information seems notable.