Invalid UTF-8 can be generated by span attribute limits #4740

Closed
jmacd opened this issue Sep 9, 2022 · 1 comment

jmacd commented Sep 9, 2022

Describe the bug

The following truncation logic can produce invalid UTF-8. OTLP requires valid UTF-8. We have seen this issue in otel-go as well: open-telemetry/opentelemetry-go#3021

  /**
   * Apply the {@code lengthLimit} to the attribute {@code value}. Strings and strings in lists
   * which exceed the length limit are truncated.
   */
  public static Object applyAttributeLengthLimit(Object value, int lengthLimit) {
    if (lengthLimit == Integer.MAX_VALUE) {
      return value;
    }
    if (value instanceof List) {
      List<?> values = (List<?>) value;
      List<Object> response = new ArrayList<>(values.size());
      for (Object entry : values) {
        response.add(applyAttributeLengthLimit(entry, lengthLimit));
      }
      return response;
    }
    if (value instanceof String) {
      String str = (String) value;
      return str.length() < lengthLimit ? value : str.substring(0, lengthLimit);
    }
    return value;
  }

Steps to reproduce
Set an attribute whose value contains multi-byte characters and apply an attribute length limit so that truncation occurs. Then export using a protobuf library that validates UTF-8 in its string fields (not all libraries do). The Golang protobuf library performs UTF-8 validation, which makes it impossible for the OTel collector to receive OTLP data containing invalid UTF-8. Where the SDK has control over this, it should avoid creating invalid UTF-8.
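The steps above can be checked without an exporter. The following is a minimal sketch, not code from the SDK: the class name `TruncationRepro` and both helper methods are hypothetical, and `truncate` only mirrors the string branch of the snippet quoted in this issue. It assumes the relevant failure mode in Java is a cut between the two UTF-16 code units of a surrogate pair, since `String.substring` indexes `char` values rather than code points.

```java
import java.nio.charset.StandardCharsets;

public class TruncationRepro {
  /** Hypothetical helper mirroring the string branch of the quoted snippet. */
  static String truncate(String str, int lengthLimit) {
    return str.length() < lengthLimit ? str : str.substring(0, lengthLimit);
  }

  /** Returns true if the string can be encoded as well-formed UTF-8 (no unpaired surrogates). */
  static boolean isEncodableAsUtf8(String s) {
    // canEncode reports malformed input instead of silently replacing it.
    return StandardCharsets.UTF_8.newEncoder().canEncode(s);
  }

  public static void main(String[] args) {
    // U+00E9 is multi-byte in UTF-8 but a single char in Java (Basic Multilingual Plane).
    String bmp = "h\u00E9llo";
    // U+1F600 is outside the BMP and occupies two chars (a surrogate pair).
    String supplementary = "a\uD83D\uDE00b";

    System.out.println(isEncodableAsUtf8(truncate(bmp, 2)));           // true
    System.out.println(isEncodableAsUtf8(truncate(supplementary, 2))); // false
  }
}
```

Under these assumptions, multi-byte characters inside the BMP survive truncation intact, while a limit that lands in the middle of a surrogate pair leaves an unpaired high surrogate. What an exporter does with that string depends on the encoder it uses, which this sketch does not cover.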

What did you expect to see?
UTF-8-aware truncation logic.
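One possible shape for such logic in Java is sketched below. It is an illustration under stated assumptions, not a proposed patch: the class name `SafeTruncation` and its `truncate` method are hypothetical, and the limit is assumed to stay expressed in `char` units (UTF-16 code units), as in the quoted snippet. The only change is to back up by one `char` when the cut would fall between a high and a low surrogate.

```java
public class SafeTruncation {
  /**
   * Truncates to at most lengthLimit chars without splitting a surrogate pair.
   * Assumes lengthLimit is non-negative.
   */
  static String truncate(String str, int lengthLimit) {
    if (str.length() <= lengthLimit) {
      return str;
    }
    int end = lengthLimit;
    // str.length() > lengthLimit here, so charAt(end) is in range.
    if (end > 0
        && Character.isHighSurrogate(str.charAt(end - 1))
        && Character.isLowSurrogate(str.charAt(end))) {
      end--; // drop the dangling high surrogate rather than emit half a pair
    }
    return str.substring(0, end);
  }

  public static void main(String[] args) {
    String s = "a\uD83D\uDE00b"; // 'a', U+1F600 (two chars), 'b'
    System.out.println(truncate(s, 2).length()); // 1
    System.out.println(truncate(s, 3).length()); // 3
  }
}
```

With this approach the result can be one `char` shorter than the limit. A limit defined in code points or in encoded UTF-8 bytes would need different logic; which unit the limit should use is a specification question this sketch does not settle.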

What did you see instead?
The code snippet above.

What version and what artifacts are you using?
This is a hypothetical bug report based on reviewing source.

Additional context
See the bigger question: open-telemetry/opentelemetry-specification#3421, open-telemetry/opentelemetry-specification#504

jmacd added the Bug label Sep 9, 2022
jkwatson (Contributor) commented

Java strings and characters are inherently multibyte, and substring is aware of that (it works on characters, not bytes). I don't believe this is an actual bug.

jmacd closed this as completed Sep 12, 2022