Discrete-Incremental Hashing #49
This is due to Java's UTF-16-like in-memory representation of strings, which uses 2 bytes per character.
I will do some benchmarks to check if this makes a difference. This is an optimization the Java compiler could probably do automatically.
Thanks. I didn't notice that the char values are UTF-16.
The incremental approach might be an interesting alternative. If I find the time, I will do an alternative implementation and look at the benchmarks. I am not an expert in designing hash functions, but I wonder whether this incremental approach might lead to worse hash quality. My gut feeling is that it is better to serialize an object and then apply a hash function once than to apply the hash function recursively on chunks. After all, there must be a reason why most hash functions use an internal state larger than 8 bytes.
The quality is fine. Komihash has a 128-bit "state space", so collisions do not happen. Anyway, I've tested the approach in SMHasher without issues. You can optimize incremental hashing for char, byte, short, and long inputs quite a bit: instead of calling a single static hash function, you can add "truncated" implementations for specific input lengths.
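For illustration, a hedged sketch of such a length-specialized ("truncated") round next to the general one; mix64() is a generic placeholder mixer, not komihash's actual math:

```java
// Hypothetical length-specialized hashing rounds, illustration only.
final class HashRounds {

    // General round: handles arbitrary-length input, so it needs a loop
    // and has to fold in the length.
    static long hashBytes(byte[] data, long seed) {
        long h = seed ^ data.length;
        for (byte b : data) h = mix64(h ^ (b & 0xFFL));
        return h;
    }

    // "Truncated" round for an exactly-8-byte input: no loop and no length
    // bookkeeping, just one mixing step, which is noticeably cheaper per call.
    static long hashLong(long value, long seed) {
        return mix64(seed ^ value);
    }

    private static long mix64(long x) { // placeholder finalizer-style mixer
        x ^= x >>> 33; x *= 0xFF51AFD7ED558CCDL; x ^= x >>> 33;
        return x;
    }
}
```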
What I mean is that only 64 bits of information are transferred from one chunk to another with the incremental approach. If everything is hashed at once, there is much more information that can interact due to the larger internal state of the hash function.
Well, look at the hash value as a "starting point". The next hashing round moves this "starting point" to another position. This may not work with other hashes, but it works with komihash.
Understood. Anyway, it looks cleaner that way, I think.
Reopening this issue as it was unintentionally closed.
@avaneev, after implementing the streaming variant (see avaneev/komihash#8), do you think the incremental approach is still worth considering? I tend to keep the current approach, which is also safer for other hash algorithms that do not have the required properties for the incremental approach.
The discrete-incremental approach is very useful for database hashing, not only because it is faster, but also because it implicitly separates the data items. For example, if you hash the two string pairs "aaa"+"bbbb" and "aaab"+"bbb" in the usual streamed manner, you'll get two identical hashes. Discrete-incremental hashing will produce two different hashes in this case.
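To make the example concrete, here is a small self-contained Java sketch (toyHash is an illustrative stand-in, not komihash): the plain concatenations are byte-identical, so any streamed hash of them must collide, while per-item seeded rounds keep the pairs apart.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ConcatCollisionDemo {
    public static void main(String[] args) {
        // Both pairs flatten to the same byte stream "aaabbbb", so any hash
        // of the plain concatenation is guaranteed to collide.
        byte[] streamed1 = ("aaa" + "bbbb").getBytes(StandardCharsets.UTF_8);
        byte[] streamed2 = ("aaab" + "bbb").getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(streamed1, streamed2)); // true

        // Discrete-incremental style: each item is hashed in its own round,
        // seeded with the previous hash, so the item boundary enters the result.
        long h1 = toyHash(bytes("bbbb"), toyHash(bytes("aaa"), 0));
        long h2 = toyHash(bytes("bbb"), toyHash(bytes("aaab"), 0));
        System.out.println(h1 == h2); // false
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for a seeded one-shot hash. Folding in the length makes each
    // round boundary-sensitive; a plain byte-at-a-time chain would not be.
    static long toyHash(byte[] data, long seed) {
        long h = seed ^ data.length;
        for (byte b : data) {
            h = (h ^ (b & 0xFFL)) * 0x9E3779B97F4A7C15L;
            h ^= h >>> 32;
        }
        return h;
    }
}
```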
This is why hash4j provides methods that append the lengths of variable-length fields to make hashes unique. See, for example, hash4j/src/main/java/com/dynatrace/hash4j/hashing/HashSink.java, lines 287 to 306 at commit 17bfe55.
Yes, I understand. But the approach is not ideal, because the length is a binary value, and in some cases, if you append an independent binary value after the string, a collision may happen anyway.
Using the binary representation after serializing the object avoids hash collisions by definition. Appending the lengths of all variable-length fields of an object is equivalent to serialization, since the object can be reconstructed from this information. Therefore, the hash4j approach should be fine.
I think this leaves a chance for programming error. Besides that, appending the string's length is unnecessary overhead compared to the discrete-incremental approach.
For nested data structures like a list of lists (e.g. […]), how would this work?
For nested structures, it will be enough to add a zero-length hashing round.
Do you mean that […] would be hashed as […], where the […]? But then […], with an empty string in the middle, would give a hash collision.
It should be more like […]. But you are right, this is not perfect. Unfortunately, there seems to be no good solution, since if you put array lengths instead of empty strings, the same counter-example can be given.
So, a sub-hasher is probably needed, which would look like: […]
In my proposal I've assumed zero-terminated strings, so their lengths are non-zero even when they are empty. But that's also not perfect, because individual zero bytes can also happen in a hash stream.
When lengths (including those of the strings) are always appended, it should be fine: […] would lead to a binary representation corresponding to […], whereas […] would give […]. Of course, for tiny strings there is a lot of overhead, but in general it is a safe strategy which works with just a single hasher instance.
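A minimal sketch of that length-appending strategy, using a StringBuilder as a stand-in sink so the resulting layout stays visible (hash4j's real HashSink writes to the hasher directly; this code is illustrative only):

```java
import java.util.List;

// Illustrative length-appending layout for nested string lists. A single
// "hasher" (here a StringBuilder, to keep the layout readable) suffices.
final class LengthAppendingDemo {
    private final StringBuilder sink = new StringBuilder();

    void putString(String s) {
        sink.append(s);                  // variable-length content first...
        sink.append((char) s.length());  // ...then its length
    }

    void putStringList(List<String> list) {
        for (String s : list) putString(s);
        sink.append((char) list.size()); // the list length closes the list
    }

    public static void main(String[] args) {
        LengthAppendingDemo a = new LengthAppendingDemo();
        a.putStringList(List.of("ab"));
        a.putStringList(List.of("c"));

        LengthAppendingDemo b = new LengthAppendingDemo();
        b.putStringList(List.of("a"));
        b.putStringList(List.of("bc"));

        // Without lengths both inputs would flatten to "abc"; with lengths
        // appended the layouts differ, so their hashes differ as well.
        System.out.println(a.sink.toString().equals(b.sink.toString())); // false
    }
}
```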
This is what I meant by "another hasher instance": this strategy requires creating further instances, which also comes with some overhead (especially in Java). Hashing a nested data structure requires as many hasher instances as there are nesting levels.
I do not think you need "another instance". A simple hash value stack will suffice, because the hashing function uses local variables only. Your example above collides with […]
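A rough sketch of that hash-value-stack idea, assuming a seeded one-shot hash is available; seededHash() and all names here are hypothetical stand-ins, not komihash:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical hash-value stack for nested structures: entering a nested
// level pushes the running hash, leaving it folds the child's hash back in.
final class StackedHasher {
    private final Deque<Long> stack = new ArrayDeque<>();
    private long current;

    void beginNested() {
        stack.push(current); // save parent state; no new hasher object needed
        current = 0;
    }

    void endNested() {
        long child = current;
        current = seededHash(child, stack.pop()); // fold child into parent
    }

    void put(long value) {
        current = seededHash(value, current);
    }

    long get() { return current; }

    private static long seededHash(long value, long seed) {
        long x = value ^ seed; // placeholder mixer, not komihash
        x ^= x >>> 33; x *= 0xFF51AFD7ED558CCDL; x ^= x >>> 33;
        return x;
    }
}
```

Note that a Deque<Long> boxes each pushed value; a primitive long[] stack would avoid that allocation overhead.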
Agree, but this could be difficult to realize in Java without additional overhead or object allocations.
This is different, as this is a list of some polymorphic type (able to represent integers and strings). If you hash a polymorphic object, you always have to include the type information. Then there will be no collision.
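For illustration, a hedged sketch of including type information, reusing the hypothetical StackedHasher from the sketch above; the tag values are arbitrary, not hash4j's actual scheme:

```java
// Hypothetical type-tagged hashing for elements of a polymorphic list.
final class PolymorphicHashing {
    static void putPolymorphic(StackedHasher h, Object element) {
        if (element instanceof Long v) {
            h.put(0x01L);      // type tag for "integer"
            h.put(v);
        } else if (element instanceof String s) {
            h.put(0x02L);      // type tag for "string"
            for (int i = 0; i < s.length(); i++) h.put(s.charAt(i));
            h.put(s.length()); // length delimits the char sequence
        } else {
            throw new IllegalArgumentException("unsupported type");
        }
    }
}
```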
I think the additional overhead of a hash value stack would be much lower than the overhead associated with storing string lengths and types (for polymorphic objects), and buffering, of course. Discrete-incremental capability could be a novelty in high-performance database hashing, since it is fast, universal, and type-agnostic.
I think type information is also necessary for the incremental approach in the case of polymorphic types. Since, as you mentioned, Komihash might be the only algorithm that has the properties required for this incremental approach, I still think it is better to use the current approach by default. However, I see the potential and will consider adding this incremental approach as an alternative in the future. I will keep this issue open in the meantime. You mentioned that there are statistical tests proving that Komihash produces equally good hashes when run incrementally (e.g. when only one byte is added in each hashing round). Are such tests already part of SMHasher?
No, it is not in SMHasher. Here's the code you can try if you have SMHasher; simply change the komihash() call to komihash_di() in the SMHasher code (its hash identifier differs, though): […]
I've noticed that your Java implementation of `komihash` uses 4-byte values when hashing a string (`hashCharsToLong`). This looks like a slow approach. Why is that? You can hash using 8-byte values.

But anyway, please take note that I've found and tested an "incremental" hashing approach with `komihash`; it's described on the project page. You may consider switching your hash streamer to this incremental approach. It does not require pre-buffering and is thus quite a lot more efficient for strings, and even for small values like char, int, long. Also note that for register-usage efficiency it is preferable to save the `r1l`, `r2l`, `r3l`, `r4l` multiplication results into the corresponding `Seed1..4` values, and then XOR them with the `Seed5..8` values.

To adopt the "incremental" hashing you'll need to implement a single "static" hashing function, without loading and storing the internal state. You'll just need to store a single "lastHash" value. It will likely be much more efficient than your current implementation.