-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Use Set hash in Distinct (2nd attempt) #7175
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7175 +/- ##
=====================================
Coverage 85.4% 85.4%
=====================================
Files 269 271 +2
Lines 24297 24391 +94
=====================================
+ Hits 20762 20843 +81
- Misses 3158 3170 +12
- Partials 377 378 +1
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to consider what is an acceptable level of hash collision here. We may want to use a larger bit size for the hash or look into ways to avoid the collision if we determine the collision rate too high. We may also consider this approach acceptable.
For reference, Prometheus has used as similar size hash for their implementation. There haven't been any known issues related to collisions there and it has been out for ~ 10 years. |
Some input from the sidelines (and yay on the performance improvements! 🥳):
It looks like collision handling was added to Besides the client library, we have seen some real collisions over the years in the Prometheus server that have all been addressed:
We still use hashes without collision checking in some places, which are hopefully unlikely enough to ever cause a collision. See also prometheus/prometheus#12519 for more discussion on hash collision potential in Prometheus. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amazing, thanks! Looks good modulo previous comments around runes/bytes on hash string and unmarshalling questions 👍🏽
a8037e9
to
6bf5e93
Compare
@dashpole are we going to look into collision remediation techniques based on the Prometheus groups experience? |
Yes, I will look into what was done in prometheus/client_golang#221. |
The TL;DR is that handling collisions just means using a I think the biggest difference for the attributes package is that we would need to ensure that |
162f87d
to
6ffffec
Compare
Forked from #7175 ``` $ go test -timeout 60s -run=xxxxxMatchNothingxxxxx -test.benchtime=10ms -count 6 -cpu 1 -bench=BenchmarkEquals ./... goos: linux goarch: amd64 pkg: go.opentelemetry.io/otel/attribute cpu: Intel(R) Xeon(R) CPU @ 2.20GHz BenchmarkEquals/Empty 1314884 9.008 ns/op BenchmarkEquals/Empty 1875915 8.461 ns/op BenchmarkEquals/Empty 1461613 7.749 ns/op BenchmarkEquals/Empty 1636912 9.359 ns/op BenchmarkEquals/Empty 1863039 6.355 ns/op BenchmarkEquals/Empty 1789053 6.336 ns/op BenchmarkEquals/1_string_attribute 674168 16.92 ns/op BenchmarkEquals/1_string_attribute 701983 16.42 ns/op BenchmarkEquals/1_string_attribute 692001 16.52 ns/op BenchmarkEquals/1_string_attribute 687970 16.29 ns/op BenchmarkEquals/1_string_attribute 751766 16.58 ns/op BenchmarkEquals/1_string_attribute 703534 16.88 ns/op BenchmarkEquals/10_string_attributes 85400 137.1 ns/op BenchmarkEquals/10_string_attributes 91045 136.1 ns/op BenchmarkEquals/10_string_attributes 90973 150.7 ns/op BenchmarkEquals/10_string_attributes 62877 177.5 ns/op BenchmarkEquals/10_string_attributes 90780 194.2 ns/op BenchmarkEquals/10_string_attributes 91058 144.6 ns/op BenchmarkEquals/1_int_attribute 624625 18.72 ns/op BenchmarkEquals/1_int_attribute 689478 16.03 ns/op BenchmarkEquals/1_int_attribute 719173 15.68 ns/op BenchmarkEquals/1_int_attribute 707005 16.18 ns/op BenchmarkEquals/1_int_attribute 752048 15.94 ns/op BenchmarkEquals/1_int_attribute 752034 16.23 ns/op BenchmarkEquals/10_int_attributes 90302 132.5 ns/op BenchmarkEquals/10_int_attributes 89929 131.9 ns/op BenchmarkEquals/10_int_attributes 86578 135.2 ns/op BenchmarkEquals/10_int_attributes 90482 133.1 ns/op BenchmarkEquals/10_int_attributes 90255 132.0 ns/op BenchmarkEquals/10_int_attributes 87615 134.6 ns/op PASS ok go.opentelemetry.io/otel/attribute 0.578s PASS ok go.opentelemetry.io/otel/attribute/internal 0.017s ``` --------- Co-authored-by: Flc゛ <four_leaf_clover@foxmail.com> Co-authored-by: Robert Pająk <pellared@hotmail.com>
a69ecb9
to
431c5ed
Compare
Forked from #7175 (comment) This adds a test for JSON marshaling of attribute sets. --------- Co-authored-by: Flc゛ <four_leaf_clover@foxmail.com> Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
a3fb9c7
to
ebbb889
Compare
There is a significant performance cost (the cost of set.Equals from #7262) to check for duplicates, but it is still a huge improvement over the current state. I'm trying to add a unit test with a collision, but it is taking a while for the fuzzer to generate a collision for me... Even with a cardinality of 100 million attributes, I didn't hit one :). I updated the benchmark results in the description. |
5804461
to
d60622b
Compare
Julius already cited this, but I too want to note there were issues: Prometheus has generally moved from
|
Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>
Co-authored-by: Robert Pająk <pellared@hotmail.com>
70744f0
to
3add646
Compare
c8d11bd
to
4ed4e0d
Compare
I plan to investigate xxHash in a separate issue. From my brief research, it is primarily designed for efficient hashing of large amounts of input data (e.g. checksums for files). It should still be faster than fnv64-1a for our inputs, but the difference may or may not be significant. I've seen conflicting statements for which algorithm has a better distribution of hash values, although most seem to favor xxHash as well. I don't think that should block this PR considering the substantial performance benefits it brings. |
Re-opening #5028 with new benchmarks. For cases with 10 attributes, this reduces the overhead of metric measurements by ~80-90% (depending on lock contention). It introduces a small probability of collision for attribute sets in the metrics SDK. For an instrument with 1 million different attribute sets, the probability of a collision is approximately 2 * 10^-8. For a more "normal" cardinality of 1000 on an instrument, it is approximately 2 * 10^-17.
Parallel benchmarks:
Single-threaded benchmarks: