Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ISSUE #10377] Nacos client observability enhancement - trace part #11138

Merged
merged 54 commits into from
Oct 30, 2023

Conversation

FAWC438
Copy link

@FAWC438 FAWC438 commented Sep 15, 2023

Please do not create a Pull Request without creating an issue first.

What is the purpose of the change

The purpose of this PR is to enhance the observability of Nacos client. This pr ensures that users can obtain metric through OpenTelemetry and Prometheus; obtain trace through OpenTelemetry.

This PR is only related to the trace part.

Related issue: #10377

Brief changelog

Trace

  • Using OpenTelemetry API to create traces that fully complies with the OpenTelemetry standard
  • All Nacos client spans are divided into 3 levels
    1. Service: The span buried points at this level are mainly concentrated in functions/classes that Nacos users can directly call.
    2. Work: The span buried points at this level are mainly concentrated in the core code logic of the Nacos client, including but not limited to listener processing, server request processing, etc.
    3. HTTP/RPC: The span buried points at this level are mainly focused on the network communication level of the Nacos client, that is, the observation of HTTP/RPC entrances and exits.
  • Users can obtain trace indicators using any method recommended by OpenTelemetry, including automatic Instrumentation and manual Instrumentation
  • Related unit tests. It also includes test cases that can export traces directly to jaeger (jaeger requires additional deployment)

Verifying this change

Follow this checklist to help us incorporate your contribution quickly and easily:

  • Make sure there is a Github issue filed for the change (usually before you start working on it). Trivial changes like typos do not require a Github issue. Your pull request should address just this issue, without pulling in other changes - one PR resolves one issue.
  • Format the pull request title like [ISSUE #123] Fix UnknownException when host config not exist. Each commit in the pull request should have a meaningful subject line and body.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Write necessary unit-test to verify your logic correction, more mock a little better when cross module dependency exist. If the new feature or significant change is committed, please remember to add integration-test in test module.

Copy link
Contributor

@pixystone pixystone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to discuss.

client/pom.xml Outdated
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-semconv</artifactId>
<version>1.29.0-alpha</version>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any stable version?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The opentelemetry-semconv library is actually designed to make the span field naming comply with otlp regulations. It itself can be regarded as a collection of global variables. However, all versions of this library have the alpha suffix, which is the case in the official examples and also in the maven repository https://mvnrepository.com/artifact/io.opentelemetry/opentelemetry-semconv

Span addListenerSpan = ConfigTrace.getClientConfigServiceSpan("addTenantListeners");
try (Scope ignored = addListenerSpan.makeCurrent()) {

worker.addTenantListenersWithContent(dataId, group, content, encryptedDataKey, Arrays.asList(listener));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracing process may disturb adding listeners. It is expected that nothing can interrupt the main behaviors.
Make sure that span.makeCurrent() and span.isRecording() may not throw any exceptions in any situations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All my trace-related operations are implemented through the OpenTelemetry API, including the span life cycle. OpenTelemetry clearly states it in the documentation: Unless your application imports the OpenTelemetry SDK, your instrumentation does nothing and does not impact application performance. https://opentelemetry.io/docs/concepts/instrumentation/libraries/#opentelemetry-api

.queryConfig(dataId, group, worker.getAgent().getTenant(), timeoutMs, false);

if (queryConfigSpan.isRecording()) {
queryConfigSpan.setAttribute("function.current.name",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use constants instead of magic value here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix it soon

@@ -101,11 +105,60 @@ public String getConfig(String dataId, String group, long timeoutMs) throws Naco
public String getConfigAndSignListener(String dataId, String group, long timeoutMs, Listener listener)
throws NacosException {
group = StringUtils.isBlank(group) ? Constants.DEFAULT_GROUP : group.trim();
ConfigResponse configResponse = worker.getAgent()
.queryConfig(dataId, group, worker.getAgent().getTenant(), timeoutMs, false);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an enhanced subclass of ClientWorker or AOP processing is a better way to wrap or decorate these methods.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I will think about it carefully and discuss it with you

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An enhanced subclass may be better, since a span could include more than one method

if (content != null) {
LOGGER.warn("[{}] [get-config] get failover ok, dataId={}, group={}, tenant={}, config={}",
worker.getAgentName(), dataId, group, tenant, ContentUtils.truncateContent(content));
cr.setContent(content);
String encryptedDataKey = LocalEncryptedDataKeyProcessor
.getEncryptDataKeyFailover(agent.getName(), dataId, group, tenant);
String encryptedDataKey = LocalEncryptedDataKeyProcessor.getEncryptDataKeyFailover(agent.getName(), dataId,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracing the update of local failover cache is also important.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No problem, I will do it here

requestToServer(request, BatchInstanceResponse.class);
redoService.instanceRegistered(serviceName, groupName);

if (span.isRecording()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some attributes can be set before doing actions in order to record more information for exceptions.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some attributes, especially those that are expensive to calculate ( eg. some instance states), should be added if span is recording. https://opentelemetry.io/docs/concepts/instrumentation/libraries/#performance However, some attributes in the Nacos client may not need to be operated in this way (in fact, I have made the version number of the nacos client a must-have attribute for each span). I will discuss it with you in detail.

@FAWC438 FAWC438 changed the title [OSPP #10377] Nacos client observability enhancement [ISSUE #10377] Nacos client observability enhancement Sep 21, 2023
@FAWC438 FAWC438 changed the title [ISSUE #10377] Nacos client observability enhancement [ISSUE #10377] Nacos client observability enhancement - trace part Sep 21, 2023
Copy link
Contributor

@pixystone pixystone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Request some further changes in another PR.

@@ -1098,10 +1158,19 @@ public ConfigResponse queryConfig(String dataId, String group, String tenant, lo
}

private Response requestProxy(RpcClient rpcClientInner, Request request) throws NacosException {
return requestProxy(rpcClientInner, request, 3000L);
return agent.requestProxy(rpcClientInner, request, 3000L);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a method with some default values. Please use this.requestProxy to keep modification of the origin method like metrics or tracing enhancement.

@@ -477,8 +501,11 @@ private void refreshContentAndCheck(String groupKey, boolean notify) {

private void refreshContentAndCheck(CacheData cacheData, boolean notify) {
try {
ConfigResponse response = getServerConfig(cacheData.dataId, cacheData.group, cacheData.tenant, 3000L,
notify);
String group = cacheData.group;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose getServerConfig is same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no reason to use dynamic proxy. It would be enough if just delegating all ClientWorker methods.

@KomachiSion KomachiSion merged commit e955313 into alibaba:summer-ospp#10377 Oct 30, 2023
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants