Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use better hashing in cache key calcuation #2162

Open
t-b opened this issue Jul 5, 2024 · 0 comments
Open

Use better hashing in cache key calcuation #2162

t-b opened this issue Jul 5, 2024 · 0 comments
Labels
bug Something isn't working CodeQuality

Comments

@t-b
Copy link
Collaborator

t-b commented Jul 5, 2024

Using StringCRC is actually quite bad for calculating a hash of the data, as 1 states

CRC-32 poses the highest risk for hash collisions. This hash function is generally not recommended for use. If a hub were to contain 77,163 hash values, the chance of a hash collision occurring is 50%, which is extremely high compared to other methods.

AllenInstitute/mies-utils-xop#1 has a WIP branch with XXH32 from 2 which has the following performance characteristics.

Function dostuff()

	variable ref, i, numLoops = 1e5
	string str = "abcdefghijklmnopqrstuvwxyz"
	str = ReplicateString(str, 10)
	variable result
	string resultStr
	
	ref = stopmstimer(-2)
	for(i = 0; i < numLoops; i += 1)
		result = StringCrC(0, str)
	endfor

	printf "CRC32:    %.15d, time: %f micro seconds\r", result, (stopmstimer(-2) - ref) / numLoops

	ref = stopmstimer(-2)
	for(i = 0; i < numLoops; i += 1)
		result = MU_CalcHash(str, 0)
	endfor

	printf "XXHash32: %.15d, time: %f micro seconds\r", result, (stopmstimer(-2) - ref) / numLoops

	ref = stopmstimer(-2)
	for(i = 0; i < numLoops; i += 1)
		resultStr = Hash(str, 3)
	endfor

	printf "MD5:      %.15s, time: %f micro seconds\r", resultStr, (stopmstimer(-2) - ref) / numLoops

	ref = stopmstimer(-2)
	for(i = 0; i < numLoops; i += 1)
		resultStr = Hash(str, 1)
	endfor

	printf "SHA256:   %.15s, time: %f micro seconds\r", resultStr, (stopmstimer(-2) - ref) / numLoops
End

gives

•dostuff()
  CRC32:    000001268438376, time: 0.198013 mikro seconds
  XXHash32: 000001044162042, time: 0.467863 mikro seconds
  MD5:      4e6405697346169, time: 1.934640 mikro seconds
  SHA256:   6f6a4460cbd9241, time: 5.233524 mikro seconds

so while a tiny bit slower than CRC32 this is much faster as the other hash algorithms.

Adding this new hash would also allow us to move to XXhash from SHA* where we don't need cryptographic hashes.

@t-b t-b added bug Something isn't working CodeQuality labels Jul 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working CodeQuality
Projects
None yet
Development

No branches or pull requests

1 participant