are 128 bit murmurhash3 hashes of data shorter than 127 bits collision free? #73

midjji · 2019-09-03T09:42:27Z

I tried to test but it might take a while...

Spencer-Doak · 2019-09-03T22:23:46Z

Not that I can speak authoritatively on the matter, but I would hazard a guess that the output is not necessarily collision free. In 2012, some researchers scrutinized MurmurHash 3 as if it was a cryptographically strong hashing algorithm. They were able to "break" the hashing algorithm such that they could produce collisions within the output regardless of the seed used. This is noted on the MurmurHash Wikipedia page, which cites a blog post that documents the vulnerability. Of course, though, MurmurHash was never designed to be a cryptographic hashing algorithm, so one would not expect it to hold up to that level of security. I have not read over the details of the vulnerability myself to see what input sizes they were using to construct the collisions. Nonetheless, perhaps that source can help you find an answer to your question.

midjji · 2019-09-03T23:28:33Z

It isnt, but its an easy enough fix.

…

On Wed, 4 Sep 2019, 00:23 Spencer, ***@***.***> wrote: Not that I can speak authoritatively on the matter, but I would hazard a guess not. In 2012, some researchers scrutinized MurmurHash 3 as if it was a cryptographically strong hashing algorithm. They were able to "break" the hashing algorithm such that they could produce collisions within the output regardless of the seed used. This is noted on the MurmurHash Wikipedia page <https://en.wikipedia.org/wiki/MurmurHash#Vulnerabilities>, which cites a blog post that documents the vulnerability <https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/>. Of course, though, MurmurHash was never designed to be a cryptographic hashing algorithm, so one would not expect it to hold up to that level of security. I have not read over the details of the vulnerability myself to see what input sizes they were using to construct the collisions. Nonetheless, perhaps that source can help you find an answer to your question. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#73?email_source=notifications&email_token=ABJQYJPWTKXSUFVYP3NJVXTQH3PXRA5CNFSM4ITEFHE2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5ZYGUA#issuecomment-527663952>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABJQYJN2KHO7MJ6O6HLL26LQH3PXRANCNFSM4ITEFHEQ> .

aappleby · 2019-09-03T23:38:08Z

Murmur's not a crypto hash, so it won't resist intentionally trying to generate collisions.

That said, its mixing is thorough enough that in general use you should be able to use any subset of the output bits and get uniform distributions.

midjji · 2019-09-04T00:21:26Z

That is no reason why the checksum should have collisions for short strings. Or why any string size 16n should collide with the same string + up to any 15 bytes. This isn't about cryptography, this is about minimizing collisions when using the hash as a checksum. It can be trivially invertible, just as long as small changes make it as likely to collide as large changes. Fixing it is trivial, at least for up to 124 bits. Just put the data first, then write the number of bytes written at the end. Add a rotate to avoid confusing people who don't expect the hash to be an identity mapping. If I wanted a random uniform distribution checksum I'd just take the full size string modulo the random seed size enough times to match the number of bits in the checksum using std random. The murmurhash looks alot like a mersenne twister after all. But the point is that checksums are faster. Also that the standard does not guarantee that the random engines remain seed consistent over versions or implementations.

…

On Wed, 4 Sep 2019, 01:38 aappleby, ***@***.***> wrote: Murmur's not a crypto hash, so it won't resist intentionally trying to generate collisions. That said, its mixing is thorough enough that in general use you should be able to use any subset of the output bits and get uniform distributions. — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub <#73?email_source=notifications&email_token=ABJQYJJJESB4YQWISUXISN3QH3YONA5CNFSM4ITEFHE2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD5Z4IXY#issuecomment-527680607>, or mute the thread <https://github.com/notifications/unsubscribe-auth/ABJQYJLQEMOKEHOKULZDNCLQH3YONANCNFSM4ITEFHEQ> .

sebres · 2019-09-04T12:56:13Z

That is no reason why the checksum should have collisions for short
strings.

There is no reasons to hash something shorter as 128 bit using 128-bit hash, unless your subset contains also longer strings, but then you should take care about the collisions in the whole subset and not about the short strings only.

This isn't about cryptography, this is about minimizing collisions when using the hash as a checksum.

Well, it is not (really). But again, then the count of possible collisions is interesting for the whole subset and the hashing with such large hash on subset containing only short strings does not make sense at all, no matter for which purposes one would do that.
Usage of 128-bit hash as a checksum for values up to 128-bit is irrational.
Unless we'd speak about that in cryptographic sense (and as already said above mm3 is not a matter for that).

Fixing it is trivial, at least for up to 124 bits. Just put the data first, then write the number of bytes written at the end.

I don't think such "fix" is expected at all, because "stealing" (overwriting with length) of some bits would not necessary improve the algorithm in sense of distribution quantity.

But the question is rather, if you nevertheless need it:
Why you cannot do this by yourself before supply your string (buffer) to MM3 hashing?
Or either why you cannot use your length as a seed (all MM3 algorithms have parameter seed)?

smhasher/src/MurmurHash3.h

Lines 31 to 33 in 92cf370

    
           void MurmurHash3_x86_128 ( const void * key, int len, uint32_t seed, void * out ); 
        
           void MurmurHash3_x64_128 ( const void * key, int len, uint32_t seed, void * out );

Additionally note:

it would be then another algorithm;
it can produce some unexpected impact on some other strings (you'd simply relocate the collisions from one strings to another) so previously good distribution on some subsets could get worse (remember due to birthday paradox you've only 2⁶⁴ to catch a collision on directly distributed 128-bit hash if we speak about random data).
if you don't have random data (e. g. something continuously growing) you don't need a hash but rather some mixer algorithm; but again 128-bit checksum makes few sense here, either it depends more on your subset.
mm3-128 has basically good distribution also on short values (and the length of string is involved indirectly), for example here is a result of hashing \x00 bytes (with length N [1 .. 16], without using of length as seed):

 N |                            Bytes | MM3-128 (x86)
---|----------------------------------|---------------------------------
 1 |                               00 | 88c4adec54d201b954d201b954d201b9
 2 |                             0000 | 04a872bbedcd774bedcd774bedcd774b
 3 |                           000000 | e0d93642acf40e87acf40e87acf40e87
 4 |                         00000000 | cc066f1f9e5178409e5178409e517840
 5 |                       0000000000 | 50a68ecfd01a6609d01a6609d01a6609
 6 |                     000000000000 | 777fa95660bde92360bde92360bde923
 7 |                   00000000000000 | 0d45d85efb848988fb848988fb848988
 8 |                 0000000000000000 | e028ae414772b0844772b0844772b084
 9 |               000000000000000000 | 5ad58a7e543371085433710854337108
10 |             00000000000000000000 | 64010da262e8bc1762e8bc1762e8bc17
11 |           0000000000000000000000 | 2f35ebd169f8166569f8166569f81665
12 |         000000000000000000000000 | 332d18d156b5986456b5986456b59864
13 |       00000000000000000000000000 | 583cbe60ca53c80fca53c80fca53c80f
14 |     0000000000000000000000000000 | a8e046b5855ca909855ca909855ca909
15 |   000000000000000000000000000000 | 3553d0af909796639097966390979663
16 | 00000000000000000000000000000000 | 5a4075d66b2d3d27d3926c2feb228a07

 N |                            Bytes | MM3-128 (x64)
---|----------------------------------|---------------------------------
 1 |                               00 | 4610abe56eff5cb551622daa78f83583
 2 |                             0000 | 3044b81a706c5de818f96bcc37e8a35b
 3 |                           000000 | 79d54dd1bf7137480af5e7f1b766291d
 4 |                         00000000 | cfa0f7ddd84c76bc589623161cf526f1
 5 |                       0000000000 | 3df460ff3e17b53a17874fba56e69767
 6 |                     000000000000 | 7d480f9fa80ec469719af4070b74d89d
 7 |                   00000000000000 | f402c55ac5dec98f2de586f681711c02
 8 |                 0000000000000000 | 28df63b7cc57c3cbf2557dfcc4e8fe52
 9 |               000000000000000000 | 73269217e5476f20f1fa3fc86728ca0c
10 |             00000000000000000000 | 5b3d684f8c57ce161ba63bef94931146
11 |           0000000000000000000000 | 056e0d6c8921404673c2da0104c39955
12 |         000000000000000000000000 | a4d8ece9d7c0dfe3803bbf8eb6f0853f
13 |       00000000000000000000000000 | a10ea8b22762995abb1575409cfb7dc6
14 |     0000000000000000000000000000 | 028b7708fcbbed1e8393f0698afe46ea
15 |   000000000000000000000000000000 | 6ce113b115a56871195953c2230f8db2
16 | 00000000000000000000000000000000 | 4bbd1bf27da918d6b465a9eccd791cb6

karanlyons mentioned this issue Jan 10, 2020

hash128 not matching python karanlyons/murmurHash3.js#8

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

are 128 bit murmurhash3 hashes of data shorter than 127 bits collision free? #73

are 128 bit murmurhash3 hashes of data shorter than 127 bits collision free? #73

midjji commented Sep 3, 2019

Spencer-Doak commented Sep 3, 2019 •

edited

Loading

midjji commented Sep 3, 2019 via email

aappleby commented Sep 3, 2019

midjji commented Sep 4, 2019 via email

sebres commented Sep 4, 2019

are 128 bit murmurhash3 hashes of data shorter than 127 bits collision free? #73

are 128 bit murmurhash3 hashes of data shorter than 127 bits collision free? #73

Comments

midjji commented Sep 3, 2019

Spencer-Doak commented Sep 3, 2019 • edited Loading

midjji commented Sep 3, 2019 via email

aappleby commented Sep 3, 2019

midjji commented Sep 4, 2019 via email

sebres commented Sep 4, 2019

Spencer-Doak commented Sep 3, 2019 •

edited

Loading