-
Notifications
You must be signed in to change notification settings - Fork 469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
are 128 bit murmurhash3 hashes of data shorter than 127 bits collision free? #73
Comments
Not that I can speak authoritatively on the matter, but I would hazard a guess that the output is not necessarily collision free. In 2012, some researchers scrutinized MurmurHash 3 as if it was a cryptographically strong hashing algorithm. They were able to "break" the hashing algorithm such that they could produce collisions within the output regardless of the seed used. This is noted on the MurmurHash Wikipedia page, which cites a blog post that documents the vulnerability. Of course, though, MurmurHash was never designed to be a cryptographic hashing algorithm, so one would not expect it to hold up to that level of security. I have not read over the details of the vulnerability myself to see what input sizes they were using to construct the collisions. Nonetheless, perhaps that source can help you find an answer to your question. |
It isnt, but its an easy enough fix.
…On Wed, 4 Sep 2019, 00:23 Spencer, ***@***.***> wrote:
Not that I can speak authoritatively on the matter, but I would hazard a
guess not. In 2012, some researchers scrutinized MurmurHash 3 as if it was
a cryptographically strong hashing algorithm. They were able to "break" the
hashing algorithm such that they could produce collisions within the output
regardless of the seed used. This is noted on the MurmurHash Wikipedia
page <https://en.wikipedia.org/wiki/MurmurHash#Vulnerabilities>, which
cites a blog post that documents the vulnerability
<https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/>.
Of course, though, MurmurHash was never designed to be a cryptographic
hashing algorithm, so one would not expect it to hold up to that level of
security. I have not read over the details of the vulnerability myself to
see what input sizes they were using to construct the collisions.
Nonetheless, perhaps that source can help you find an answer to your
question.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#73?email_source=notifications&email_token=ABJQYJPWTKXSUFVYP3NJVXTQH3PXRA5CNFSM4ITEFHE2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5ZYGUA#issuecomment-527663952>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABJQYJN2KHO7MJ6O6HLL26LQH3PXRANCNFSM4ITEFHEQ>
.
|
Murmur's not a crypto hash, so it won't resist intentionally trying to generate collisions. That said, its mixing is thorough enough that in general use you should be able to use any subset of the output bits and get uniform distributions. |
That is no reason why the checksum should have collisions for short
strings.
Or why any string size 16n should collide with the same string + up to any
15 bytes.
This isn't about cryptography, this is about minimizing collisions when
using the hash as a checksum. It can be trivially invertible, just as long
as small changes make it as likely to collide as large changes.
Fixing it is trivial, at least for up to 124 bits. Just put the data first,
then write the number of bytes written at the end. Add a rotate to avoid
confusing people who don't expect the hash to be an identity mapping.
If I wanted a random uniform distribution checksum I'd just take the full
size string modulo the random seed size enough times to match the number of
bits in the checksum using std random. The murmurhash looks alot like a
mersenne twister after all. But the point is that checksums are faster.
Also that the standard does not guarantee that the random engines remain
seed consistent over versions or implementations.
…On Wed, 4 Sep 2019, 01:38 aappleby, ***@***.***> wrote:
Murmur's not a crypto hash, so it won't resist intentionally trying to
generate collisions.
That said, its mixing is thorough enough that in general use you should be
able to use any subset of the output bits and get uniform distributions.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#73?email_source=notifications&email_token=ABJQYJJJESB4YQWISUXISN3QH3YONA5CNFSM4ITEFHE2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5Z4IXY#issuecomment-527680607>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABJQYJLQEMOKEHOKULZDNCLQH3YONANCNFSM4ITEFHEQ>
.
|
There is no reasons to hash something shorter as 128 bit using 128-bit hash, unless your subset contains also longer strings, but then you should take care about the collisions in the whole subset and not about the short strings only.
Well, it is not (really). But again, then the count of possible collisions is interesting for the whole subset and the hashing with such large hash on subset containing only short strings does not make sense at all, no matter for which purposes one would do that.
I don't think such "fix" is expected at all, because "stealing" (overwriting with length) of some bits would not necessary improve the algorithm in sense of distribution quantity. But the question is rather, if you nevertheless need it: Lines 31 to 33 in 92cf370
Additionally note:
|
I tried to test but it might take a while...
The text was updated successfully, but these errors were encountered: