Discrete-Incremental Hashing #49
This is due to Java's UTF-16-like in-memory representation of strings, which uses 2 bytes per character.
I will do some benchmarks to check if this makes a difference. This is an optimization the Java compiler could probably do automatically.
Thanks. I didn't notice that the char values are UTF-16.
The incremental approach might be an interesting alternative. If I find the time, I will do an alternative implementation and look at the benchmarks. I am not an expert in designing hash functions, but I wonder whether this incremental approach might lead to worse hash quality. My gut feeling is that it is better to serialize an object and then apply a hash function once than to apply the hash function recursively on chunks. After all, there must be a reason why most hash functions use an internal state larger than 8 bytes.
The quality is fine. Komihash has a 128-bit "state space", so collisions do not happen. Anyway, I've tested the approach in SMHasher without issues. You can optimize incremental hashing for char, byte, short, and long inputs quite a bit: instead of calling a single static hash function, you can add "truncated" implementations for specific input lengths.
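For illustration, a hedged sketch of such a length-specialized ("truncated") round next to the general one; mix64() is a generic placeholder mixer, not komihash's actual math:

```java
// Hypothetical length-specialized hashing rounds, illustration only.
final class HashRounds {

    // General round: handles arbitrary-length input, so it needs a loop
    // and has to fold in the length.
    static long hashBytes(byte[] data, long seed) {
        long h = seed ^ data.length;
        for (byte b : data) h = mix64(h ^ (b & 0xFFL));
        return h;
    }

    // "Truncated" round for an exactly-8-byte input: no loop and no length
    // bookkeeping, just one mixing step, which is noticeably cheaper per call.
    static long hashLong(long value, long seed) {
        return mix64(seed ^ value);
    }

    private static long mix64(long x) { // placeholder finalizer-style mixer
        x ^= x >>> 33; x *= 0xFF51AFD7ED558CCDL; x ^= x >>> 33;
        return x;
    }
}
```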
What I mean is that only 64 bits of information are transferred from one chunk to another with the incremental approach. If everything is hashed at once, there is much more information that can interact due to the larger internal state of the hash function.
Well, look at the hash value as a "starting point". The next hashing round moves this "starting point" to another position. This may not work with other hashes, but it works with komihash.
Understood. Anyway, it looks cleaner that way, I think.
Reopening this issue as it was unintentionally closed.
@avaneev, after implementing the streaming variant (see avaneev/komihash#8), do you think the incremental approach is still worth considering? I tend to keep the current approach, which is also safer for other hash algorithms that do not have the required properties for the incremental approach.
The discrete-incremental approach is very useful for database hashing, not only because it is faster, but also because it implicitly separates the data items. For example, if you hash the two string pairs "aaa"+"bbbb" and "aaab"+"bbb" in the usual streamed manner, you'll get two identical hashes. Discrete-incremental hashing will produce two different hashes in this case.
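To make the example concrete, here is a small self-contained Java sketch (toyHash is an illustrative stand-in, not komihash): the plain concatenations are byte-identical, so any streamed hash of them must collide, while per-item seeded rounds keep the pairs apart.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ConcatCollisionDemo {
    public static void main(String[] args) {
        // Both pairs flatten to the same byte stream "aaabbbb", so any hash
        // of the plain concatenation is guaranteed to collide.
        byte[] streamed1 = ("aaa" + "bbbb").getBytes(StandardCharsets.UTF_8);
        byte[] streamed2 = ("aaab" + "bbb").getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(streamed1, streamed2)); // true

        // Discrete-incremental style: each item is hashed in its own round,
        // seeded with the previous hash, so the item boundary enters the result.
        long h1 = toyHash(bytes("bbbb"), toyHash(bytes("aaa"), 0));
        long h2 = toyHash(bytes("bbb"), toyHash(bytes("aaab"), 0));
        System.out.println(h1 == h2); // false
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for a seeded one-shot hash. Folding in the length makes each
    // round boundary-sensitive; a plain byte-at-a-time chain would not be.
    static long toyHash(byte[] data, long seed) {
        long h = seed ^ data.length;
        for (byte b : data) {
            h = (h ^ (b & 0xFFL)) * 0x9E3779B97F4A7C15L;
            h ^= h >>> 32;
        }
        return h;
    }
}
```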
This is why hash4j provides methods that append the lengths of variable-length fields to make hashes unique. See, for example, hash4j/src/main/java/com/dynatrace/hash4j/hashing/HashSink.java, lines 287 to 306 at commit 17bfe55.
Yes, I understand. But the approach is not ideal, because the length is a binary value, and in some cases, if you append an independent binary value after the string, a collision may happen anyway.
Using the binary representation after serializing the object avoids hash collisions by definition. Appending the lengths of all variable-length fields of an object is equivalent to serialization, since the object can be reconstructed from this information. Therefore, the hash4j approach should be fine.
I think this leaves a chance for programming error. Besides that, appending the string's length is unnecessary overhead compared to the discrete-incremental approach.
For nested data structures like a list of lists (e.g. […]), how would this work?
For nested structures, it will be enough to add a zero-length hashing round.
Do you mean that […] would be hashed as […], where the […]? But then […], with an empty string in the middle, would give a hash collision.
It should be more like […]. But you are right, this is not perfect. Unfortunately, there seems to be no good solution, since if you put array lengths instead of empty strings, the same counter-example can be given.
So, a sub-hasher is probably needed, which would look like: […]
In my proposal I've assumed zero-terminated strings, so their lengths are non-zero even when they are empty. But that's also not perfect, because individual zero bytes can also happen in a hash stream.
When lengths (including those of the strings) are always appended, it should be fine: […] would lead to a binary representation corresponding to […], whereas […] would give […]. Of course, for tiny strings there is a lot of overhead, but in general it is a safe strategy which works with just a single hasher instance.
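A minimal sketch of that length-appending strategy, using a StringBuilder as a stand-in sink so the resulting layout stays visible (hash4j's real HashSink writes to the hasher directly; this code is illustrative only):

```java
import java.util.List;

// Illustrative length-appending layout for nested string lists. A single
// "hasher" (here a StringBuilder, to keep the layout readable) suffices.
final class LengthAppendingDemo {
    private final StringBuilder sink = new StringBuilder();

    void putString(String s) {
        sink.append(s);                  // variable-length content first...
        sink.append((char) s.length());  // ...then its length
    }

    void putStringList(List<String> list) {
        for (String s : list) putString(s);
        sink.append((char) list.size()); // the list length closes the list
    }

    public static void main(String[] args) {
        LengthAppendingDemo a = new LengthAppendingDemo();
        a.putStringList(List.of("ab"));
        a.putStringList(List.of("c"));

        LengthAppendingDemo b = new LengthAppendingDemo();
        b.putStringList(List.of("a"));
        b.putStringList(List.of("bc"));

        // Without lengths both inputs would flatten to "abc"; with lengths
        // appended the layouts differ, so their hashes differ as well.
        System.out.println(a.sink.toString().equals(b.sink.toString())); // false
    }
}
```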
This is what I meant by "another hasher instance": this strategy requires creating further instances, which also comes with some overhead (especially in Java). Hashing a nested data structure requires as many hasher instances as there are nesting levels.
I do not think you need "another instance". A simple hash value stack will suffice, because the hashing function uses local variables only. Your example above collides with […]
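A rough sketch of that hash-value-stack idea, assuming a seeded one-shot hash is available; seededHash() and all names here are hypothetical stand-ins, not komihash:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical hash-value stack for nested structures: entering a nested
// level pushes the running hash, leaving it folds the child's hash back in.
final class StackedHasher {
    private final Deque<Long> stack = new ArrayDeque<>();
    private long current;

    void beginNested() {
        stack.push(current); // save parent state; no new hasher object needed
        current = 0;
    }

    void endNested() {
        long child = current;
        current = seededHash(child, stack.pop()); // fold child into parent
    }

    void put(long value) {
        current = seededHash(value, current);
    }

    long get() { return current; }

    private static long seededHash(long value, long seed) {
        long x = value ^ seed; // placeholder mixer, not komihash
        x ^= x >>> 33; x *= 0xFF51AFD7ED558CCDL; x ^= x >>> 33;
        return x;
    }
}
```

Note that a Deque<Long> boxes each pushed value; a primitive long[] stack would avoid that allocation overhead.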
Agree, but this could be difficult to realize in Java without additional overhead or object allocations.
This is different, as this is a list of some polymorphic type (able to represent integers and strings). If you hash a polymorphic object, you always have to include the type information. Then there will be no collision.
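For illustration, a hedged sketch of including type information, reusing the hypothetical StackedHasher from the sketch above; the tag values are arbitrary, not hash4j's actual scheme:

```java
// Hypothetical type-tagged hashing for elements of a polymorphic list.
final class PolymorphicHashing {
    static void putPolymorphic(StackedHasher h, Object element) {
        if (element instanceof Long v) {
            h.put(0x01L);      // type tag for "integer"
            h.put(v);
        } else if (element instanceof String s) {
            h.put(0x02L);      // type tag for "string"
            for (int i = 0; i < s.length(); i++) h.put(s.charAt(i));
            h.put(s.length()); // length delimits the char sequence
        } else {
            throw new IllegalArgumentException("unsupported type");
        }
    }
}
```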
I think the additional overhead of a hash value stack would be much lower than the overhead associated with storing string lengths and types (for polymorphic objects), and buffering, of course. Discrete-incremental capability could be a novelty in high-performance database hashing, since it is fast, universal, and type-agnostic.
I think type information is also necessary for the incremental approach in the case of polymorphic types. Since, as you mentioned, Komihash might be the only algorithm that has the properties required for this incremental approach, I still think it is better to use the current approach by default. However, I see the potential and will consider adding this incremental approach as an alternative in the future. I will keep this issue open in the meantime. You mentioned that there are statistical tests proving that Komihash produces equally good hashes when run incrementally (e.g. when only one byte is added in each hashing round). Are such tests already part of SMHasher?
No, it is not in SMHasher. Here's the code you can try if you have SMHasher; simply change the komihash() call to komihash_di() in the SMHasher code (its hash identifier differs, though): […]
I've noticed that your Java implementation of `komihash` uses 4-byte values when hashing a string (`hashCharsToLong`). This looks like a slow approach. Why is that? You can hash using 8-byte values.

But anyway, please take note that I've found and tested an "incremental" hashing approach with `komihash`; it's described on the project page. You may consider switching your hash streamer to this incremental approach. It does not require pre-buffering and is thus quite a lot more efficient for strings, and even for small values like char, int, long. Also note that for register-usage efficiency it is preferable to save the `r1l`, `r2l`, `r3l`, `r4l` multiplication results into the corresponding `Seed1..4` values, and then XOR them with the `Seed5..8` values.

To adopt the "incremental" hashing you'll need to implement a single "static" hashing function, without loading and storing the internal state. You'll just need to store a single "lastHash" value. It will likely be much more efficient than your current implementation.