Add delta file creation support #44

Open
xezon opened this issue Jun 23, 2024 · 0 comments
Labels
enhancement New feature or request

Comments

Contributor

xezon commented Jun 23, 2024

To avoid potential copyright issues and reduce the size of distributions, one may want to distribute delta files instead of full files. A delta file is useless on its own, but it can be turned into a usable file when combined with its intended companion file on the target computer.

Let us say we have file B, derived from source file A. We match derivative file B against source file A and build a diff file that references the data B shares with A instead of containing it. We then distribute only that diff file. On the target computer, the diff file is rebuilt into a fully functional file B against source file A, assuming A exists there.
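To make the idea concrete, here is a minimal sketch of what such a diff file could contain and how it would be rebuilt. The `DeltaOp` layout and the names `copyFromSource`, `insertLiteral` and `applyDelta` are assumptions made up for illustration, not an existing format or API. The delta only stores copy references into A plus the bytes that are new in B.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical delta instruction: either copy a byte range from source file A,
// or insert literal bytes that exist only in derived file B.
struct DeltaOp {
    enum class Kind { CopyFromSource, InsertLiteral };
    Kind kind = Kind::InsertLiteral;
    std::size_t sourceOffset = 0;  // used by CopyFromSource
    std::size_t length = 0;        // used by CopyFromSource
    std::string literal;           // used by InsertLiteral
};

DeltaOp copyFromSource(std::size_t offset, std::size_t length) {
    DeltaOp op;
    op.kind = DeltaOp::Kind::CopyFromSource;
    op.sourceOffset = offset;
    op.length = length;
    return op;
}

DeltaOp insertLiteral(const std::string& bytes) {
    DeltaOp op;
    op.kind = DeltaOp::Kind::InsertLiteral;
    op.literal = bytes;
    return op;
}

// Rebuilds B from A and the delta. Data shared with A never travels in the delta.
std::string applyDelta(const std::string& source, const std::vector<DeltaOp>& delta) {
    std::string out;
    for (const DeltaOp& op : delta) {
        if (op.kind == DeltaOp::Kind::CopyFromSource) {
            // A copy must stay inside the source file.
            assert(op.sourceOffset <= source.size());
            assert(op.length <= source.size() - op.sourceOffset);
            out.append(source, op.sourceOffset, op.length);
        } else {
            out.append(op.literal);
        }
    }
    return out;
}
```

With A = "HEADER-original-payload-FOOTER", a delta of copy(0, 7), literal "modified", copy(15, 15) rebuilds "HEADER-modified-payload-FOOTER", while the delta itself carries only the word "modified" as file data.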

Algorithm

If we were to implement a custom tool for this, it should be written in C++ for speed. It can make heavy use of threads to speed up binary matching by assigning sections of a file to different threads. It can then perform fuzzy binary matching in suitable steps to find matches between the files. The search can start with small patterns from one file, look for them in the other file, and then extend left and right once a starter pattern is found. Where several matches overlap in a range, the biggest one wins. This algorithm can then be tuned for performance.
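A rough single-threaded sketch of the search described above, under some assumptions: seeds are exact fixed-length byte windows (not fuzzy), the source is indexed with a plain hash map, and a match that loses an overlap is dropped entirely rather than trimmed. All names (`Match`, `buildSeedIndex`, `extendSeed`, `findMatches`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

struct Match {
    std::size_t sourcePos;  // start in source file A
    std::size_t targetPos;  // start in derived file B
    std::size_t length;     // number of identical bytes
};

using SeedIndex = std::unordered_map<std::string, std::vector<std::size_t>>;

// Index every seedLen-byte window of the source by its content.
SeedIndex buildSeedIndex(const std::string& source, std::size_t seedLen) {
    SeedIndex index;
    if (seedLen == 0 || source.size() < seedLen) return index;
    for (std::size_t i = 0; i + seedLen <= source.size(); ++i)
        index[source.substr(i, seedLen)].push_back(i);
    return index;
}

// Grow a seed hit to the left and to the right while the bytes agree.
Match extendSeed(const std::string& source, const std::string& target,
                 std::size_t sourcePos, std::size_t targetPos, std::size_t seedLen) {
    assert(sourcePos + seedLen <= source.size());
    assert(targetPos + seedLen <= target.size());
    std::size_t s = sourcePos, t = targetPos, len = seedLen;
    while (s > 0 && t > 0 && source[s - 1] == target[t - 1]) { --s; --t; ++len; }
    while (s + len < source.size() && t + len < target.size() &&
           source[s + len] == target[t + len]) ++len;
    return Match{s, t, len};
}

// Collect extended matches, then let the longest win where target ranges overlap.
std::vector<Match> findMatches(const std::string& source, const std::string& target,
                               std::size_t seedLen) {
    std::vector<Match> candidates;
    if (seedLen == 0) return candidates;
    const SeedIndex index = buildSeedIndex(source, seedLen);
    for (std::size_t t = 0; t + seedLen <= target.size();) {
        auto hit = index.find(target.substr(t, seedLen));
        if (hit == index.end()) { ++t; continue; }
        Match best{0, 0, 0};
        for (std::size_t s : hit->second) {
            Match m = extendSeed(source, target, s, t, seedLen);
            if (m.length > best.length) best = m;
        }
        candidates.push_back(best);
        t = best.targetPos + best.length;  // continue after the matched region
    }
    // Longest first, so bigger matches claim their target range before smaller ones.
    std::sort(candidates.begin(), candidates.end(),
              [](const Match& x, const Match& y) { return x.length > y.length; });
    std::vector<Match> accepted;
    for (const Match& c : candidates) {
        bool overlaps = false;
        for (const Match& a : accepted)
            if (c.targetPos < a.targetPos + a.length && a.targetPos < c.targetPos + c.length)
                overlaps = true;
        if (!overlaps) accepted.push_back(c);
    }
    std::sort(accepted.begin(), accepted.end(),
              [](const Match& x, const Match& y) { return x.targetPos < y.targetPos; });
    return accepted;
}
```

Since the seed index is read-only once built, sections of the target could be scanned by separate threads and their candidates merged before the overlap step. That part is not shown here, and a real tool would likely use a rolling hash instead of storing substrings.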

Asking ChatGPT for existing solutions

begin question

Does a tool exist that can binary match any two files A and B, so that if B is a revision of A, it builds a new smaller file C with sufficient information, to be able to recreate file B with files A and C?

end question

ChatGPT reply

begin reply

Yes, there are tools designed to perform this kind of binary diff and patch operation. These tools compare two files, A and B, and generate a smaller file C (often called a delta or patch) that contains the differences between A and B. Using A and the patch file C, one can reconstruct B. Here are some of the tools that perform this function:

bsdiff:

  • Description: bsdiff is a binary diff tool that generates binary patches. It is efficient and particularly well-suited for large binaries.
  • Reconstruction: You can use bspatch to apply the patch and recreate file B from A and C.
  • Command Example:
    bsdiff fileA fileB fileC
    bspatch fileA fileC fileB_reconstructed

xdelta:

  • Description: xdelta is another binary diff and patch tool that works well with binary files and is often used for compressing data by only storing changes between files.
  • Reconstruction: The xdelta command can apply the delta to the original file.
  • Command Example:
    xdelta3 -e -s fileA fileB fileC
    xdelta3 -d -s fileA fileC fileB_reconstructed

zsync:

  • Description: zsync is primarily used for synchronizing files over a network and is very efficient in terms of bandwidth usage. It creates a small metadata file that can be used to update the original file to the new version.
  • Reconstruction: Typically used in network file synchronization, but can also be used locally.
  • Command Example:
    zsyncmake fileB -o fileC
    zsync fileC

rsync:

  • Description: While rsync is commonly used for remote synchronization, it can also be used locally to generate and apply binary diffs.
  • Reconstruction: rsync uses a rolling checksum algorithm to efficiently transfer changes.
  • Command Example:
    rsync --only-write-batch=fileC fileB fileA
    cp fileA fileB_reconstructed
    rsync --read-batch=fileC fileB_reconstructed

Conclusion

These tools are highly effective in scenarios where you need to minimize the storage or transmission size of file updates, such as in software distribution, backup systems, and version control systems. Each tool has its own strengths, so the choice of tool may depend on your specific requirements, such as the size of the files, the typical size of changes, and the environment in which you're working.

end reply

xezon added the enhancement label on Jun 23, 2024