Concurrency HTTP sessions for OtlpHttpClient #1209

Closed
wants to merge 9 commits into open-telemetry:main from owent:cocurrency_otlp_http_session

Conversation

owent
Member

@owent owent commented Feb 13, 2022

Fixes #1176

Changes

  • Allow multiple concurrently running HTTP sessions in OtlpHttpClient
  • Add `concurrency_sessions` to control the maximum number of concurrent HTTP sessions (see the usage sketch below)
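
A rough usage sketch of the new option follows. The field and header names here are assumptions based on the PR description; the final option name and exporter API may differ, so verify against the merged headers.

#include <memory>

#include "opentelemetry/exporters/otlp/otlp_http_exporter.h"

int main()
{
  namespace otlp = opentelemetry::exporter::otlp;

  otlp::OtlpHttpExporterOptions options;
  options.url = "http://localhost:4318/v1/traces";
  // Cap the number of in-flight HTTP sessions. The field name below follows
  // this PR's description and is an assumption; check the merged API.
  // options.concurrency_sessions = 8;

  auto exporter =
      std::unique_ptr<otlp::OtlpHttpExporter>(new otlp::OtlpHttpExporter(options));
  return 0;
}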

For significant contributions please make sure you have completed the following items:

  • CHANGELOG.md updated for non-trivial changes
  • Unit tests have been added
  • Changes in public API reviewed

@owent owent requested a review from a team February 13, 2022 11:05
@owent owent changed the title \[WIP\] Cocurrency http sessions for OtlpHttpClient [WIP] Cocurrency http sessions for OtlpHttpClient Feb 13, 2022
@codecov

codecov bot commented Feb 13, 2022

Codecov Report

Merging #1209 (075bc27) into main (31d888a) will decrease coverage by 0.30%.
The diff coverage is 80.72%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #1209      +/-   ##
==========================================
- Coverage   91.99%   91.69%   -0.29%     
==========================================
  Files         205      205              
  Lines        7395     7528     +133     
==========================================
+ Hits         6802     6902     +100     
- Misses        593      626      +33     
Impacted Files Coverage Δ
...lemetry/exporters/memory/in_memory_span_exporter.h 75.00% <0.00%> (-16.30%) ⬇️
...de/opentelemetry/exporters/ostream/span_exporter.h 100.00% <ø> (ø)
exporters/ostream/src/span_exporter.cc 87.26% <0.00%> (-3.56%) ⬇️
...nclude/opentelemetry/ext/http/client/http_client.h 93.34% <ø> (ø)
sdk/include/opentelemetry/sdk/trace/exporter.h 100.00% <ø> (ø)
sdk/test/trace/simple_processor_test.cc 77.42% <0.00%> (-8.29%) ⬇️
api/include/opentelemetry/common/timestamp.h 81.25% <25.00%> (-18.75%) ⬇️
...ntelemetry/ext/http/client/curl/http_client_curl.h 87.81% <35.72%> (-4.50%) ⬇️
...include/opentelemetry/sdk/trace/simple_processor.h 86.96% <57.15%> (-13.04%) ⬇️
sdk/src/trace/batch_span_processor.cc 93.44% <94.74%> (+0.41%) ⬆️
... and 3 more

@owent owent changed the title [WIP] Cocurrency http sessions for OtlpHttpClient Cocurrency http sessions for OtlpHttpClient Feb 14, 2022
@owent owent force-pushed the cocurrency_otlp_http_session branch from 1cafb98 to 073c598 Compare February 15, 2022 03:06
@lalitb
Member

lalitb commented Feb 17, 2022

Thanks for the PR. I will review this early next week.

@owent
Member Author

owent commented Feb 19, 2022

Thanks for the PR. I will review this early next week.

After #1185 was merged, there is no unit test for concurrent sessions now. I wonder whether I should add a unit test for this; if so, the connection problem (in CI / Bazel valgrind) may happen again.

session.second->CancelSession();
std::map<uint64_t, std::shared_ptr<Session>> sessions;
sessions.swap(sessions_);
for (auto &session : sessions)
Member

Why is iterating over sessions_ not OK here, while iterating over sessions is OK?

Member Author

The call stack HttpClient::CancelAllSessions -> Session::CancelSession -> HttpClient::CleanupSession would crash here:
CleanupSession modifies sessions_ while we are iterating over it.
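
A minimal sketch of the swap-then-iterate idiom being discussed, with types and names simplified (not the PR's exact code): swapping sessions_ into a local map first means a reentrant CleanupSession() only touches the now-empty member, so the loop's iterators stay valid.

#include <cstdint>
#include <map>
#include <memory>
#include <mutex>

struct Session
{
  // In the real client this calls back into HttpClient::CleanupSession().
  void CancelSession() { cancelled = true; }
  bool cancelled = false;
};

class HttpClient
{
public:
  void CancelAllSessions()
  {
    // Take ownership of the sessions under the lock, leaving sessions_ empty.
    std::map<uint64_t, std::shared_ptr<Session>> sessions;
    {
      std::lock_guard<std::mutex> guard{mutex_};
      sessions.swap(sessions_);
    }
    // Safe: reentrant CleanupSession() calls only touch the (empty) member map.
    for (auto &session : sessions)
    {
      session.second->CancelSession();
    }
  }

  void CleanupSession(uint64_t id)
  {
    std::lock_guard<std::mutex> guard{mutex_};
    sessions_.erase(id);  // no longer invalidates the loop above
  }

private:
  std::mutex mutex_;
  std::map<uint64_t, std::shared_ptr<Session>> sessions_;
};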

@@ -256,6 +259,8 @@ TEST_F(OtlpHttpLogExporterTestPeer, ExportBinaryIntegrationTest)
http_client::nosend::Response response;
callback.OnResponse(response);
Member

don't we need response.Finish(callback); here?

Member Author

I want to reuse it to trigger OnEvent and then call OnResponse. The previous code did not trigger OnEvent.

request->SetMethod(http_client::Method::Post);
request->SetBody(body_vec);
request->ReplaceHeader("Content-Type", content_type);
std::lock_guard<std::recursive_mutex> guard{session_manager_lock_};
Member

@esigo esigo Feb 19, 2022

Why do we need to do all the processing before checking for Shutdown?

Member Author

It's used to protect isShutdown(), is_shutdown_, running_sessions_ and http_client_. None of them is thread-safe.

Member

I think we should make isShutdown thread safe, and short circuit here (e.g. return if isShutdown() is true).

bool OtlpHttpClient::isShutdown() const noexcept
{
  const std::lock_guard<opentelemetry::common::SpinLockMutex> locked(lock_);
  return is_shutdown_;
Member

Without the lock, this read would have a race condition.

Member Author

is_shutdown_ and isShutdown() are protected by session_manager_lock_ now.
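
For illustration, a simplified sketch of the short-circuit suggested above; a plain std::mutex stands in for the SpinLockMutex / session_manager_lock_ used in the real client, and the names are assumptions, not the PR's code.

#include <mutex>

class OtlpHttpClientSketch
{
public:
  bool isShutdown() const noexcept
  {
    std::lock_guard<std::mutex> guard{lock_};
    return is_shutdown_;
  }

  bool Export(/* const payload to send */) noexcept
  {
    if (isShutdown())  // return early instead of building the request first
    {
      return false;
    }
    // ... serialize the payload, create the session and send it ...
    return true;
  }

private:
  mutable std::mutex lock_;
  bool is_shutdown_ = false;
};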

@owent
Member Author

owent commented Feb 20, 2022

The timeout test does not seem to be a problem with this PR

(//sdk/test/common:circular_buffer_benchmark_smoketest under valgrind)

@owent
Member Author

owent commented Feb 23, 2022

We have an internal benchmark report which shows this PR increases the QPS of OtlpHttpLogExporter from 4k/second to 13k/second.

std::unique_lock<std::mutex> lock(session_waker_lock_);
bool wait_successful = session_waker_.wait_for(lock, options_.timeout, [this] {
  std::lock_guard<std::recursive_mutex> guard{session_manager_lock_};
  return running_sessions_.size() <= options_.concurrent_sessions;
Member

What will happen if the number of currently running sessions is less than the configured maximum? Will Export return success even though there has been no real successful export?

Member Author

@owent owent Feb 23, 2022

Yes.
We have a service that needs to export a lot of logs. When the QPS is greater than 4k, the otel-cpp thread costs only 7% of CPU time but starts to drop data, because while the exporter is waiting for the HTTP response, the batch log processor keeps receiving more logs and its internal queue grows quickly and soon becomes full.
In our test, after we set the concurrent sessions to 8, it can send more than 13k requests/second and drops nothing.
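
For readers following along, here is a self-contained sketch of the bounded-concurrency wait shown in the diff above, with names and types reduced; the real client uses session_waker_, session_manager_lock_ and options_.concurrent_sessions rather than this standalone helper.

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>

class SessionLimiter
{
public:
  explicit SessionLimiter(std::size_t max_sessions) : max_sessions_(max_sessions) {}

  // Block (up to `timeout`) until a slot is free, then claim it.
  bool Acquire(std::chrono::milliseconds timeout)
  {
    std::unique_lock<std::mutex> lock(mutex_);
    bool ok = waker_.wait_for(lock, timeout,
                              [this] { return running_sessions_ < max_sessions_; });
    if (ok)
    {
      ++running_sessions_;
    }
    return ok;
  }

  // Called when an HTTP session finishes; wakes up one waiting exporter.
  void Release()
  {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      --running_sessions_;
    }
    waker_.notify_one();
  }

private:
  std::mutex mutex_;
  std::condition_variable waker_;
  std::size_t running_sessions_ = 0;
  std::size_t max_sessions_;
};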

Member

@lalitb lalitb Feb 23, 2022

Yes, I can see the improvement we can achieve with these changes. The concern I have is that this deviates from the spec (https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#interface-definition-1), which states that the Export method should return the result of the transfer of the data over the wire. This allows the Span Processor to act accordingly, e.g. retry the failed transactions.
We may want to change the Export method to return the result via a callback to make these changes spec-compliant?

Member

Also, I had a quick look into how other SIGs are handling this. Java returns the export status to the processor asynchronously using a CompletableFuture, while JS returns the status using a callback.

Member Author

@owent owent Feb 24, 2022

Also, I had a quick look into how other SIGs are handling this. Java returns the export status to the processor asynchronously using a CompletableFuture, while JS returns the status using a callback.

Good idea. Maybe we can use nostd::variant<std::future<sdk::common::ExportResult>> here. But it's a breaking change for the exporters' APIs. I think it needs more discussion.

Member Author

FYI https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/protocol/otlp.md#otlpgrpc-concurrent-requests

The OtlpGrpcExporter and OtlpGrpcLogExporter have the same problem, but I haven't tried the asynchronous API or tested the performance yet.

Member

@lalitb lalitb Feb 24, 2022

But it's a breaking change for the exporters' APIs. I think it needs more discussion.

Yes, I was thinking about that too. We have to keep supporting the existing Exporter::Export() interface and add a new one using a callback/future, plus a new config option for SpanProcessor to select which interface to use. But it needs some thought and discussion.
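
Purely as a discussion aid, here is one hypothetical shape for such a dual interface; none of these names or signatures exist in opentelemetry-cpp, and the real SDK uses nostd::span and sdk::common::ExportResult rather than the placeholders below.

#include <functional>
#include <memory>
#include <vector>

enum class ExportResult
{
  kSuccess,
  kFailure
};

struct Recordable
{
};  // placeholder for the SDK's recordable type

class AsyncSpanExporter
{
public:
  virtual ~AsyncSpanExporter() = default;

  // Existing synchronous contract: block and return the wire result.
  virtual ExportResult Export(
      const std::vector<std::unique_ptr<Recordable>> &spans) noexcept = 0;

  // Hypothetical additional overload: return immediately and report the wire
  // result via a callback once the HTTP session completes.
  virtual void Export(const std::vector<std::unique_ptr<Recordable>> &spans,
                      std::function<void(ExportResult)> on_done) noexcept = 0;
};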

Member

The OtlpGrpcExporter and OtlpGrpcLogExporter have the same problem, but I haven't tried the asynchronous API or tested the performance yet.

Agree, they too suffer from the same issue.

@lalitb
Member

lalitb commented Feb 23, 2022

We have an internal benchmark report which shows this PR increases the QPS of OtlpHttpLogExporter from 4k/second to 13k/second.

Good to know. I had a quick glance at the Export() function and will review it more thoroughly tomorrow. Sorry for the delay on this.

@reyang reyang changed the title Cocurrency http sessions for OtlpHttpClient Concurrency HTTP sessions for OtlpHttpClient Feb 23, 2022
*
* @return true if there are more sessions to delete
*/
bool cleanupGCSessions() noexcept;
Member

What is "GC"?

Member Author

GC here means garbage collection of HTTP sessions. When an HTTP session is finished, we cannot destroy it immediately, because it may still be accessed by other code (e.g. when we remove a session in OnResponse, it will still be accessed in OnEvent later).
So I move finished sessions into gc_sessions_ and actually destroy them a little later.
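
A minimal sketch of that deferred-destruction idea with simplified names (not the PR's code): a finished session is parked in gc_sessions_ instead of being destroyed, and cleanupGCSessions() frees the parked sessions later, outside the lock.

#include <cstdint>
#include <list>
#include <map>
#include <memory>
#include <mutex>

struct HttpSession
{
  bool finished = false;
};

class SessionOwner
{
public:
  // Called from OnResponse: park the session instead of destroying it, since
  // OnEvent may still need to touch it.
  void FinishSession(uint64_t id)
  {
    std::lock_guard<std::mutex> guard{mutex_};
    auto it = running_sessions_.find(id);
    if (it != running_sessions_.end())
    {
      gc_sessions_.push_back(std::move(it->second));
      running_sessions_.erase(it);
    }
  }

  // Called later (e.g. before the next export), once no callback can still
  // reference the parked sessions. Returns true if anything was collected.
  bool cleanupGCSessions() noexcept
  {
    std::list<std::shared_ptr<HttpSession>> doomed;
    {
      std::lock_guard<std::mutex> guard{mutex_};
      doomed.swap(gc_sessions_);
    }
    return !doomed.empty();  // `doomed` is destroyed here, outside the lock
  }

private:
  std::mutex mutex_;
  std::map<uint64_t, std::shared_ptr<HttpSession>> running_sessions_;
  std::list<std::shared_ptr<HttpSession>> gc_sessions_;
};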

@@ -99,6 +109,8 @@ class OtlpHttpClient
*/
explicit OtlpHttpClient(OtlpHttpClientOptions &&options);

~OtlpHttpClient();
Contributor

Does this need to be explicitly declared?

Member Author

@owent owent Feb 24, 2022

Yes, we need to hold the lock and then clean up all sessions. When removing sessions, we must make sure not to modify gc_sessions_ while we iterate over it to call FinishSession on all running sessions.

@lalitb
Member

lalitb commented Feb 23, 2022

@owent - This is a great initiative, thanks once again for the PR. Do you think it would make sense to have the exporter return the export status via a callback and let the processor manage the concurrent-export magic? This would ensure we are spec-compliant, and not many changes would be required to support the Zipkin exporter. And later, we can make it truly asynchronous by adding support for multi_handle in the curl client.

@owent owent force-pushed the cocurrency_otlp_http_session branch from 6c8838e to 6ae64e5 Compare February 24, 2022 12:44
@owent owent force-pushed the cocurrency_otlp_http_session branch from 6ae64e5 to fca159b Compare March 4, 2022 03:13
owent added a commit to owent/opentelemetry-cpp that referenced this pull request Mar 5, 2022
+ Temporarily fix `Export()` being called too many times when we shut down or just wake the worker thread up once. This problem is completely fixed in open-telemetry#1209, which is not merged yet.
+ Fix the http client possibly reporting a different error code when the network is behind a proxy.
owent added a commit to owent/opentelemetry-cpp that referenced this pull request Mar 7, 2022
+ Temporarily fix `Export()` being called too many times when we shut down or just wake the worker thread up once. This problem is completely fixed in open-telemetry#1209, which is not merged yet.
+ Fix the http client possibly reporting a different error code when the network is behind a proxy.
owent added a commit to owent/opentelemetry-cpp that referenced this pull request Mar 8, 2022
+ Temporarily fix `Export()` being called too many times when we shut down or just wake the worker thread up once. This problem is completely fixed in open-telemetry#1209, which is not merged yet.
+ Fix the http client possibly reporting a different error code when the network is behind a proxy.
@owent owent force-pushed the cocurrency_otlp_http_session branch 2 times, most recently from 659e11f to e47da76 Compare March 13, 2022 05:31
@owent owent force-pushed the cocurrency_otlp_http_session branch from e47da76 to 90942be Compare March 21, 2022 05:40
@owent owent mentioned this pull request Mar 21, 2022
@owent owent force-pushed the cocurrency_otlp_http_session branch from 90942be to 521831f Compare March 22, 2022 08:12
@owent owent force-pushed the cocurrency_otlp_http_session branch from 521831f to 4174407 Compare March 23, 2022 04:32
…pHttpLogExporter`.

Add tests for both sync and async exporting.

Signed-off-by: owent <admin@owent.net>
@owent owent force-pushed the cocurrency_otlp_http_session branch from 4174407 to 8ed3424 Compare March 23, 2022 04:35
Signed-off-by: owent <admin@owent.net>
…alled without any span and log in async mode

Signed-off-by: owent <admin@owent.net>
…ecords when we shut down the batch log/span processor.

Signed-off-by: owent <admin@owent.net>
@lalitb
Member

lalitb commented Mar 23, 2022

@owent - can we close this PR if the changes are already in the feature branch?

@owent
Member Author

owent commented Mar 24, 2022

@owent - can we close this PR if the changes are already in the feature branch?

Yes, this has already been moved to async-changes

@lalitb
Member

lalitb commented May 3, 2022

Successfully merging this pull request may close these issues.

Do not waitForResponse in OtlpHttpClient::Export ?