Invalid UTF-8 can be generated by span attribute limits #4740

jmacd · 2022-09-09T17:38:02Z

Describe the bug

The following truncation logic can produce invalid UTF-8. OTLP requires valid UTF-8. We have seen this issue in otel-go as well: open-telemetry/opentelemetry-go#3021

  /**
   * Apply the {@code lengthLimit} to the attribute {@code value}. Strings and strings in lists
   * which exceed the length limit are truncated.
   */
  public static Object applyAttributeLengthLimit(Object value, int lengthLimit) {
    if (lengthLimit == Integer.MAX_VALUE) {
      return value;
    }
    if (value instanceof List) {
      List<?> values = (List<?>) value;
      List<Object> response = new ArrayList<>(values.size());
      for (Object entry : values) {
        response.add(applyAttributeLengthLimit(entry, lengthLimit));
      }
      return response;
    }
    if (value instanceof String) {
      String str = (String) value;
      return str.length() < lengthLimit ? value : str.substring(0, lengthLimit);
    }
    return value;
  }

Steps to reproduce
Set an attribute whose value contains multi-byte characters and apply an attribute length limit so that truncation occurs. Then export using a protobuf library that validates UTF-8 in its string fields (not all libraries do). The Go protobuf library performs UTF-8 validation, so the OTel collector cannot receive OTLP data that contains invalid UTF-8. Where the SDK has control over this, it should avoid creating invalid UTF-8.

What did you expect to see?
UTF-8-aware truncation logic.
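
A minimal sketch of what such truncation might look like in Java, assuming the limit still counts UTF-16 code units as in the snippet above. The class and method names (`SafeTruncate`, `truncate`) are hypothetical and not part of the SDK. The idea is to back off by one code unit when the cut would otherwise fall between the two halves of a surrogate pair, since a lone surrogate has no valid UTF-8 encoding.

```java
// Hypothetical helper for illustration only; not the SDK's implementation.
final class SafeTruncate {
  private SafeTruncate() {}

  /**
   * Returns at most {@code limit} UTF-16 code units of {@code str},
   * never ending in the first half of a surrogate pair.
   */
  static String truncate(String str, int limit) {
    if (limit < 0) {
      throw new IllegalArgumentException("limit must be non-negative");
    }
    if (str.length() <= limit) {
      return str;
    }
    int end = limit;
    // str.length() > limit here, so charAt(end) is in bounds.
    // If the cut separates a high surrogate from its low surrogate,
    // drop the high surrogate as well.
    if (end > 0
        && Character.isHighSurrogate(str.charAt(end - 1))
        && Character.isLowSurrogate(str.charAt(end))) {
      end--;
    }
    return str.substring(0, end);
  }
}
```

With this sketch, truncating `"a\uD83D\uDE00b"` to a limit of 2 yields `"a"` rather than `"a"` followed by an unpaired high surrogate. The result can be one code unit shorter than the limit, which seems an acceptable trade-off for a length cap.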

What did you see instead?
The code snippet above.

What version and what artifacts are you using?
This is a hypothetical bug report based on reviewing source.

Additional context
See the bigger question: open-telemetry/opentelemetry-specification#3421, open-telemetry/opentelemetry-specification#504


jkwatson · 2022-09-12T19:34:02Z

Java strings and characters are inherently multibyte, and substring is aware of that (it works on characters, not bytes). I don't believe this is an actual bug.
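
One way to check this against a concrete string is sketched below. `String.substring` indexes UTF-16 code units (`char` values), so for text within the Basic Multilingual Plane a cut always lands on a character boundary, while a character outside it occupies two code units (a surrogate pair). The helper name `isEncodableAsUtf8` is hypothetical; it relies only on the standard `CharsetEncoder.canEncode` method.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8Check {
  private Utf8Check() {}

  /** True if {@code s} can be encoded as UTF-8 without replacement or error. */
  static boolean isEncodableAsUtf8(String s) {
    // A fresh encoder per call; CharsetEncoder instances are not thread-safe.
    return StandardCharsets.UTF_8.newEncoder().canEncode(s);
  }
}
```

For example, `isEncodableAsUtf8("a\uD83D\uDE00b")` is true, while the same call on `"a\uD83D\uDE00b".substring(0, 2)` is false, because the substring ends in an unpaired high surrogate.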

jmacd added the Bug label Sep 9, 2022

jmacd closed this as completed Sep 12, 2022
