b3sum: digest checksum for multiple files #171
That said, there might not be any particular reason to use extended output internally here. It probably makes more sense to just use the default 32-byte output from each individual file, and to only apply the requested `--length` to the final combined output.

A higher-level design question: should this sort of thing include the file names in the hash? I think it probably has to. For one thing, the order of the hashes depends on the filenames, so at least some filename changes will change the final output. Also, users of this feature presumably would want to use it to prove that an entire directory tree is as it should be, but being able to change (some) filenames without changing the hash would seriously break that use case.

Thinking about how to unambiguously represent a tree structure as a stream of bytes brings up some interesting questions. There are simple ways to do it, JSON for example. But with hashing in the picture, we often want to be sure that we're going to get a stable, canonical byte stream, which isn't something that JSON provides. We've had a couple of threads before about this "structured hashing" question (or whatever I should be calling it).

Overall I think it's a very interesting and useful problem to solve, but that it's much trickier than it seems at first glance. It brings up security questions, as well as questions of backwards compatibility and protocol design. I'd like to try to tackle it at some point, but I think that'll end up being a nontrivial project in its own right. So, coming back down to earth a bit, we probably don't want to solve that general problem in `b3sum`. Thoughts?
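The "canonical byte stream that commits to filenames" idea above can be sketched quickly. This is only an illustration of one possible encoding, not anything `b3sum` implements; `hashlib.blake2b` stands in for BLAKE3 (the `blake3` module is a third-party package), and the length-prefix scheme is my own choice to keep the stream unambiguous.

```python
# Sketch: a "tree hash" that commits to both file names and contents.
# Assumptions: blake2b stands in for BLAKE3; the encoding is illustrative.
import hashlib
from pathlib import Path

def file_hash(path: Path) -> bytes:
    h = hashlib.blake2b(digest_size=32)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.digest()

def tree_hash(root: Path) -> bytes:
    # Sort by relative path so the stream is canonical, and length-prefix
    # each name so "ab" + "c" cannot collide with "a" + "bc".
    h = hashlib.blake2b(digest_size=32)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        name = path.relative_to(root).as_posix().encode()
        h.update(len(name).to_bytes(8, "little"))
        h.update(name)
        h.update(file_hash(path))
    return h.digest()
```

Because the file names are part of the hashed stream, renaming a file changes the final digest even when no contents change, which is the property the directory-verification use case needs.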
I can see this making sense both ways: on one hand a 32-byte output is probably sufficient, but there will undoubtedly be times where project requirements dictate that a larger length is used (I'm imagining governmental security-related projects here). Of course, there would be no way to know what the internal length of the digest is, so perhaps the point is moot.
Again, I can see it both ways. In most cases the file names will be important and therefore should be included in the digest, but what if one of the times
Why do we need to consider it as a tree structure? My thought was to simply add the files to the digest in the order they're generated. Perhaps with the parallelism issue from #170 the hashes would internally be stored in a tree map as they're generated in order to preserve the original ordering, then the tree walked and file hashes appended to the digest sequentially after all files had individually been hashed.
My use-case would be satisfied with being able to pipe the output of one `b3sum` invocation into another.
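The tree-map idea above can be sketched without an explicit tree, because Python's `executor.map` already yields results in submission order regardless of completion order. This is only an illustration of the ordering property, not `b3sum`'s implementation, and `hashlib.blake2b` again stands in for BLAKE3.

```python
# Sketch: hash inputs in parallel, but combine the per-item digests in
# the original order so the final digest is deterministic.
import hashlib
from concurrent.futures import ThreadPoolExecutor

def combine_in_order(blobs: list[bytes]) -> bytes:
    def one(blob: bytes) -> bytes:
        return hashlib.blake2b(blob, digest_size=32).digest()
    # executor.map preserves input order even when workers finish
    # out of order, so no tree map is strictly required here.
    with ThreadPoolExecutor() as ex:
        digests = list(ex.map(one, blobs))
    outer = hashlib.blake2b(digest_size=32)
    for d in digests:
        outer.update(d)
    return outer.digest()
```

Swapping two inputs changes the combined digest, which shows the final hash depends on the ordering, not just on the set of per-file hashes.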
This is a common point of confusion, and maybe not documented well enough. Output lengths larger than the 32-byte default actually don't make the hash more secure. The underlying reason for this is that output size is one of the caps on security (thus making it shorter will reduce security), but it's not the only cap. Security is also capped by the amount of internal state we carry forward between invocations of the compression function, which is always 32 bytes, regardless of the target output length (which might not even be known at compression time).
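One way to see why a longer output adds no independent security: in an extendable-output function, a longer output is just a longer read of the same stream, so the 32-byte output is a prefix of the 64-byte one. BLAKE3's XOF behaves this way, but since the `blake3` module is a third-party package, the sketch below uses the standard library's SHAKE128 purely as an illustration of the same property.

```python
# Illustration with SHAKE128 (stdlib): asking for more output bytes
# extends the same stream rather than mixing in new entropy.
import hashlib

def xof(data: bytes, n: int) -> bytes:
    return hashlib.shake_128(data).digest(n)

short = xof(b"some input", 32)
longer = xof(b"some input", 64)
assert longer[:32] == short  # the 64-byte output begins with the 32-byte one
```

So an attacker who can break the 32-byte output has already broken every longer output's prefix; the extra bytes don't raise the internal-state cap on security.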
Why not add the files to an archive and hash that?
For my original use-case it's to prove that files have not been modified during a transfer between systems. Adding them to an archive obviously does change the files; as such we need to generate the checksums in the original file structure.
I see now. Thank you for clarifying.
Just in case it interests anyone landing here, to get a checksum over a range of files' individual checksums you can use:

```shell
find ./target-folder -type f -exec b3sum {} + | sort | b3sum --no-names
```

That will recursively search for all files in the relative path `./target-folder`.

Original version (more verbose):

```shell
find ./target-folder -type f -exec b3sum file1 another/file {} + \
  | LC_ALL=C sort \
  | b3sum \
  | awk '{ print $1 }'
```
This example also demonstrates inclusion of two static filepaths (`file1` and `another/file`) already added from other locations outside of `./target-folder`. The rest of the functionality is described well in the article that I referenced, especially the `LC_ALL=C` sort for a stable byte order.
I had a similar requirement but only needed the final hash, so I wrote paq. It supports Linux, MacOS, and Windows.
Very practical usage, thank you! But is there a similar command on Windows? @polarathene
I don't use Windows, but you can probably find a similar command to do the same:
You could maybe try with WSL? Or paq, mentioned above, which supports Windows.
From the future here, to the next future. The double-space requirement between checksum and file name got me here. I have BLAKE3 checksums from TeraCopy, and the files were transferred from Windows to Linux over the internet. In this case I used Syncthing: no rsync, no robocopy, and no Samba/Windows share. It is inconvenient to be in between two tools that just don't work with the same data, so some glue has to be constructed.
From pwsh, a Windows-generated `.blake3` file (TeraCopy) fails the check with unexpected input. On Windows, a filesystem name entry containing `*` or `/` `\` should not be legal anyway.
Syntax note:

```powershell
$blakefile = "windows.teracopy.blake3"

# Convert; for each line, send to file.
# Note: ' *' becomes two spaces, the separator b3sum --check expects.
Get-Content $blakefile | % { $_.Replace('\','/').Replace(' *','  ') } | Out-File -Append -FilePath "$blakefile.b3"

# Convert; for each line, pipe to b3sum.
Get-Content $blakefile | % { $_.Replace('\','/').Replace(' *','  ') } | b3sum --check

# Pipe direct to b3sum, reading the entire $blakefile using -Raw.
Get-Content $blakefile -Raw | % { $_.Replace('\','/').Replace(' *','  ') } | b3sum --check

# If you want, there is Tee-Object: send to file and to b3sum.
Get-Content $blakefile -Raw | % { $_.Replace('\','/').Replace(' *','  ') } | Tee-Object -FilePath "$blakefile.b3" | b3sum --check --quiet
```

In case you need to tinker with checksum files that point at the wrong location for b3sum to find, there is:

```powershell
Get-ChildItem -File -Recurse | % { $_.FullName }
```

with the `$_.FullName` property, and there is a way to resolve relative paths:

```powershell
Resolve-Path -Path "/home/username/mydir/myfile" -RelativeBasePath "/home/username/" -Relative

# For convenience.
$cwd = "/home/username/"
$fullpath = "/home/username/mydir/myfile"
$remove = $cwd
Resolve-Path -Path "$fullpath" -RelativeBasePath "$remove" -Relative
# ./mydir/myfile
```
Sorry, could you be a bit more clear on what your comment was about? I can only infer the Windows adaptation? I mean, I understand the path separator difference with Windows.

But was your concern that the checksums you provided stay the same while having different paths? I haven't gone back through the prior history of this issue, but IIRC you have a content hash of the file, and a reference to the associated file in the second column. So I don't see an issue with paths differing while keeping the same digest; could you clarify the specific problem there? Are you trying to use those paths afterwards? Path separators differing between Windows and Linux aren't always a concern; at least with Rust I recall it having good support to process and convert both. On Windows you can have WSL2 and access content on Windows or WSL2 from either end, where obviously the path separator differs despite filesystem boundaries being crossed.
This is a tangent from recent comments, but I want to point out to folks who might be following along that v1.5.0 of the `b3sum` crate is out.
I have a project with a requirement to output a hash for each file and an overall hash of the whole set of files at the end. This could be achieved with:

```shell
b3sum file1 file2 file3 > filesums; b3sum filesums
```

but a built-in option would be nice too. I started work on this in daviessm@d6be566, but as a Rust newbie I don't know if I'm going down the right path with this.