Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Metrics SDK] Performance improvement in measurement processing #1993

Merged
merged 33 commits into from
Mar 4, 2023

Conversation

lalitb
Copy link
Member

@lalitb lalitb commented Feb 21, 2023

This PR attempts to improve the measurement processing throughput - with benchmark as the time taken by SDK to process 1 million measurements with multiple dimensions and high cardinality.

The change is to remove the unnecessary memory copy operations done for each input measurement. The measurement is now copied to OrderedAttributeMap only if it is not already existing in the AttributesHashMap.

Existing flow:
instrument::Add(value, attributes) -> Copy attributes to OrderedAttributeMap -> Calculate hash for attributes in OrderedAttributeMap -> Add attributes to AttributesHashMap if not existing (by comparing hash calculated in step 3) -> Aggregate(value)

New flow:
instrument::Add(value, attributes) -> Calculate hash for attributes in KeyValueIterable -> Copy attributes to OrderedAttributeMap, and then add to AttributesHashMap if not existing (by comparing hash calculated in step 2) -> Aggregate(value)

The processing time for 1 million measurements is reduced from 3.7 secs to 1.9 secs ( on CPU: Intel(R) Core(TM) i9-10900 CPU @ 2.80GHz, 20 processors, Memory: 32GB)

The test results are for below scenario:

Number of dimensions/attributes : 3
Cardinality of each dimension : 10
Total possible dimensions: 10 * 10 * 10 = 1000
Total number of measurements : 1 million

(The benchmark code is added as part of this PR - measurements_benchmark.cc:

Benchmark with this PR changes:

Running ./measurements_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 8.23, 3.09, 1.54
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_MeasurementsTest 1988578400 ns   1988543700 ns            1

Benchmark with existing code:

2023-02-20 22:23:53
Running ./measurements_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 2.84, 2.51, 1.45
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_MeasurementsTest 3794032100 ns   3794023000 ns            1

Changes

Please provide a brief description of the changes here.

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

@lalitb lalitb requested a review from a team February 21, 2023 08:13
@codecov
Copy link

codecov bot commented Feb 21, 2023

Codecov Report

Merging #1993 (7d7755b) into main (649829f) will increase coverage by 0.01%.
The diff coverage is 89.42%.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1993      +/-   ##
==========================================
+ Coverage   87.32%   87.32%   +0.01%     
==========================================
  Files         166      166              
  Lines        4673     4723      +50     
==========================================
+ Hits         4080     4124      +44     
- Misses        593      599       +6     
Impacted Files Coverage Δ
...ntelemetry/sdk/metrics/view/attributes_processor.h 90.00% <50.00%> (-10.00%) ⬇️
...clude/opentelemetry/sdk/common/attributemap_hash.h 83.88% <84.22%> (+7.41%) ⬆️
sdk/src/metrics/state/temporal_metric_storage.cc 98.31% <87.50%> (+0.06%) ⬆️
...ntelemetry/sdk/metrics/state/sync_metric_storage.h 86.37% <88.89%> (-1.13%) ⬇️
...entelemetry/sdk/metrics/state/attributes_hashmap.h 95.24% <96.67%> (-0.59%) ⬇️
...telemetry/sdk/metrics/state/async_metric_storage.h 86.49% <100.00%> (+0.38%) ⬆️

@lalitb
Copy link
Member Author

lalitb commented Feb 22, 2023

This is ready for review now. @ThomsonTan , @esigo please help in review this PR. We can enable exemplar once this is merged, as exemplar has it's own performance cost.

@lalitb
Copy link
Member Author

lalitb commented Feb 25, 2023

Added few more changes to reduce the lock contention (by removing AttributeHashMap hash calculation outside of mutex lock). Gained few more seconds, and better scaling with threads:

Number of threads : 1

$ ./measurements_benchmark
2023-02-24 23:09:48
Running ./measurements_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 0.33, 1.90, 1.20
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_MeasurementsTest 1582235960 ns       104430 ns           10

Number of threads: 5

$ ./measurements_benchmark
2023-02-24 23:15:19
Running ./measurements_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 0.20, 0.70, 0.87
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_MeasurementsTest  991755670 ns       306610 ns           10

Number of threads: 19

$ ./measurements_benchmark
2023-02-24 23:18:25
Running ./measurements_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 0.65, 0.67, 0.81
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_MeasurementsTest  926599920 ns       878030 ns           10

So, 1.5 secs to process 1 million measurements from single thread, .98 secs to process 1 million measurements from 5 threads, .92 secs to process 1 million measurements from 19 threads. This is on machine with 20 processors. So, while the library scales well from 1 to 5 threads, it reaches its performance limit beyond that. As valgrind/callgrind cpu profiling result shows, the limit is reached for the synchronized access to std::unordered_map used inside AttributesHashMap across all these threads. I think we can live with current performance for now unless there are any cross-platform lock-free hashmap data structure we can easily used in our code.

@esigo esigo added the size/L Denotes a PR that changes 100-499 lines. label Feb 26, 2023
@lalitb
Copy link
Member Author

lalitb commented Mar 1, 2023

The results are more promising while building the application in Release mode (similar results on Windows and Linux).
So to process 1 million measurements:
Time taken (1 thread) : ~130 ms
Time taken (5 threads): ~147 ms
Time taken (19 threads): ~165 ms

Number of threads : 1

$ ./measurements_benchmark
2023-02-28 23:32:31
Running ./measurements_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 0.16, 0.98, 0.88
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_MeasurementsTest  133399288 ns       132125 ns          100

Number of threads: 5

$ ./measurements_benchmark
2023-02-28 23:36:06
Running ./measurements_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 0.62, 0.71, 0.78
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_MeasurementsTest  147010309 ns       359316 ns          100

Number of threads: 19

$ ./measurements_benchmark
2023-02-28 23:37:36
Running ./measurements_benchmark
Run on (20 X 2808.01 MHz CPU s)
CPU Caches:
  L1 Data 32K (x10)
  L1 Instruction 32K (x10)
  L2 Unified 256K (x10)
  L3 Unified 20480K (x1)
Load Average: 0.33, 0.60, 0.73
--------------------------------------------------------------
Benchmark                    Time             CPU   Iterations
--------------------------------------------------------------
BM_MeasurementsTest  165881775 ns      1205977 ns          100

Copy link
Member

@esigo esigo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
Thanks for the PR :)
sorry for the delay

@esigo esigo added the ok-to-merge The PR is ok to merge (has two approves or raised by a maintainer/approver and has one approve) label Mar 3, 2023
@lalitb lalitb enabled auto-merge (squash) March 4, 2023 04:11
@lalitb lalitb merged commit da333f8 into open-telemetry:main Mar 4, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ok-to-merge The PR is ok to merge (has two approves or raised by a maintainer/approver and has one approve) size/L Denotes a PR that changes 100-499 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants