Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify the valid content in primitive type string #3950

Open
XSAM opened this issue Mar 20, 2024 · 16 comments
Open

Clarify the valid content in primitive type string #3950

XSAM opened this issue Mar 20, 2024 · 16 comments
Labels
spec:miscellaneous For issues that don't match any other spec label triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:deciding:needs-info Not enough information. It is left open to provide the author with time to add more details

Comments

@XSAM
Copy link
Member

XSAM commented Mar 20, 2024

What are you trying to achieve?

A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not, for primitive type string of attribute on the common page.

A primitive type: string, boolean, double precision floating point (IEEE 754-1985) or signed 64 bit integer.

Additional context.

Baggage value only allows UTF-8 string.

Baggage **values** are any valid UTF-8 strings. Language API MUST accept

Attribute type mapping defines a way to handle non-UTF-8 in a string, but its doc status is Experimental.

String values which are not valid Unicode sequences SHOULD be converted to
AnyValue's
[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37)
with the bytes representing the string in the original order and format of the
source string.

Log data model defines the string represents as a Unicode string (UTF-8).

text_payload | string | The log entry payload, represented as a Unicode string (UTF-8). | Body

There is no clear guideline to determine whether a string type allows arbitrary bytes or not.


Related: open-telemetry/opentelemetry-go#5089 (review)

@XSAM XSAM added the spec:miscellaneous For issues that don't match any other spec label label Mar 20, 2024
@XSAM
Copy link
Member Author

XSAM commented Mar 26, 2024

If we allow arbitrary bytes in the string, we need to support the bytes type. Otherwise, the arbitrary bytes value won't be accessible when SDK reads the date from OTLP. (protobuf won't allow arbitrary bytes in string)

@pyohannes
Copy link
Contributor

A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not

I'm not sure if that's desirable, the vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs.

For example, if the specification would require to only allow valid UTF-8 characters, than this might cause issues with Go, where the string type can hold arbitrary bytes. If the specification would require to support arbitrary bytes, then this would cause problems in Rust, where the native string type is always UTF-8 encoded.

The current specification gives some freedom to languages, but gives clear rules of how to map language-specific string encodings to OTLP.

@pyohannes
Copy link
Contributor

There is also this: #780

@pellared
Copy link
Member

The vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs

I think it would be good to explicitly describe it in the specification.

@trask trask added triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:deciding:tc-inbox Needs attention from the TC in order to move forward labels Mar 26, 2024
@XSAM
Copy link
Member Author

XSAM commented Mar 27, 2024

From today's SIG meeting:

  • We won’t expose the bytes type in API before having discussions with vendors, as they might not ready to process bytes attributes from OTLP

Since we won't accept bytes outside of the log body, I think we could clarify we only allow valid UTF-8 in string. So our users can have a clear understanding of the behavior after passing bytes into a string value, instead of not knowing the result until configuring it out it would be dropped by the processing chain (like protobuf lib).

This change wouldn't cause issues in languages like Go, where the string type can hold arbitrary bytes, as the protobuf lib will drop the arbitrary bytes anyway.


I also think we could consider dropping this. This brings inconsistent features to languages, where some languages that allow the string type can hold arbitrary bytes can add bytes to OTLP (because it can map string to bytes), but others cannot do this because the native string type is always UTF-8 encoded.

String values which are not valid Unicode sequences SHOULD be converted to
AnyValue's
[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37)
with the bytes representing the string in the original order and format of the
source string.

It also provides a less straightforward user experience, as an attribute's type can implicitly change from string to bytes in OTLP.

@jack-berg
Copy link
Member

Discussed in the 3/27/24 TC meeting. There are two points to consider:

  • Whether additional specification is needed around the encoding of attribute string value types such that attribute APIs wouldn't accept non-UTF-8 strings. We think the existing language is sufficient, and reflects the reality that languages are different in how they represent strings and encodings. Limiting to UTF-8 seems overly restrictive.
  • Whether the spec for encoding non-UTF-8 strings is what we want. On one hand, it is does describe what to do quite clearly. On the other hand, consumers don't have information to determine the encoding, so its not very useful. We discussed options to drop non-UTF-8 encoded strings instead of sending as bytes, or to update the language such that the string encoded is sent as well.

Personally, I'm in favor of dropping non-UTF-8 bytes for the reason @XSAM mentions:

It also provides a less straightforward user experience, as an attribute's type can implicitly change from string to bytes in OTLP.

@XSAM
Copy link
Member Author

XSAM commented Mar 31, 2024

Created a PR for dropping non-UTF-8 string. #3970

@pellared
Copy link
Member

pellared commented Apr 2, 2024

It is language-specific how the string is encoded. For instance .NET encodes string as UTF-16. It is not a an issue as OTLP exporters are encoding them to UTF-8.

@jack-berg jack-berg added triage:deciding:needs-info Not enough information. It is left open to provide the author with time to add more details and removed triage:deciding:tc-inbox Needs attention from the TC in order to move forward labels Apr 10, 2024
@jack-berg
Copy link
Member

This has already been discussed in several spec SIGs and more work is needed to decide what direction we should go in. Options have included:

  • Drop non-UTF-8 strings
  • Keep current encoding, which converts non-UTF-8 strings to bytes without any metadata describing their original encoding
  • Adjust encoding, so that there is additional metadata describing the encoding of non-UTF-8 strings

Need more info about the issue and options before accepting.

@XSAM
Copy link
Member Author

XSAM commented Apr 15, 2024

Based on the comments from #3970, which was trying to drop non-UTF-8 converting and mandate Unicode in the string, dropping non-UTF-8 converting and mandating Unicode in the string are considered as breaking changes.

If dropping non-UTF-8 converting is not feasible, then it means we will have bytes type in attributes even if users originally used the string byte.

String values which are not valid Unicode sequences SHOULD be converted to
AnyValue's
[bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37)
with the bytes representing the string in the original order and format of the
source string.

For that, I suggest we could go for option 2, Keep current encoding, which converts non-UTF-8 strings to bytes without any metadata describing their original encoding, but we should also expose the bytes type in API (extending the list of supported primitives to include bytes type).

Reasons:

@jack-berg
Copy link
Member

Discussed in the 4/16/24 spec SIG.

In languages like go, strings are nothing but a byte array so the idea of being able to include encoding information doesn't really work because its unavailable. The benefit of the current spec'd approach of sending non-UTF-8 strings as byte values is that something is better than nothing: even if the encoding of the bytes isn't present, the information is still there and you can probably make sense of it with some a priori knowledge about the instrumentation which recorded it. Yet there are potential security considerations with letting raw unvalidated byte arrays escape. I won't try to articulate what that might look like but folks on the call expressed potential concerns.

A good compromise seems to be:

  • Keep the current behavior of encoding non-UTF-8 strings as byte value
  • Encourage SDKs to offer an opt-out, where non-UTF-8 strings are dropped

@jmacd
Copy link
Contributor

jmacd commented Apr 16, 2024

Keep the current behavior of encoding non-UTF-8 strings as byte value

Can we also specify that the OpenTelemetry Collector have an opt-in to support passing through invalid UTF-8? (This is a technical challenge, because protobuf libraries.) That's the problem with the current behavior -- I don't care if invalid UTF-8 arrives because it's better than nothing. However, I need a Collector that will support passing me that data.

@XSAM
Copy link
Member Author

XSAM commented Apr 16, 2024

I just want to provide more info here. Sorry if I didn't make this clear in the meeting, I think sending non-UTF-8 strings as byte is bonded with exposing bytes API.

If we provide sending non-UTF-8 strings as byte, like as a stable behavior that is not removable, it makes sense to me to provide the bytes type because they are basically the same feature (allow users to push bytes data). But, just providing sending non-UTF-8 strings as byte without providing bytes API just gives languages like Go a privilege to push bytes.

I think the point of having sending non-UTF-8 strings as byte is not because of a certain use case in which users want to push bytes; it is because we don't want to drop users' data. And, the bytes API is not tight to certain use cases; it allows us to have a design to provide uniform features (push bytes).

BTW, the potential security considerations are also challenging the sending non-UTF-8 strings as byte since we have a way to push bytes. However, I am not sure about the details of the potential security case, like what exactly the case is.

@pauldraper
Copy link

pauldraper commented Apr 25, 2024

Are we talking about a conceptual model, a language API, or a network protocol?

I find this project constantly confuses these.


The answer is that it is a conceptual model (which is then translated into APIs in various languages, and is the basis for a network protocol.)

And the conceptual model is Unicode text (or Unicode character string). Each language has it own prescription or convention of representing Unicode. Java has a native Unicode string type (java.lang.String), as does JavaScript and C#. Go uses a UTF-8 encoded byte array. C++ uses a UTF-8 encoded byte sequence (std::string). C uses UTF-8 encoded NUL-terminated byte arrays (if U+0000 is a prohibited character).

A prescribed "encoding" is irrelevant to the spec here and should be omitted. (But should be included in the OTel Go API definition, the OTLP specification, places where converting Unicode characters into octets is relevant.) If we want to specify requirements on the Unicode text (e.g. invalid characters) or specify that language SDKs should validate inputs, that would be relevant.

Let's stop leaking implementation details everywhere.

@jmacd
Copy link
Contributor

jmacd commented May 8, 2024

@pyohannes Would like to respond to #3950 (comment) which specifically uses Rust as an example. I have just been oncall this week and responded to an issue for telemetry data dropping because a Rust SDK submitted invalid utf8 strings to a collector, which was the first in the pipeline to validate utf8.

Please see my proposal here: #3421 (comment)

@jmacd
Copy link
Contributor

jmacd commented May 10, 2024

@XSAM I would like to hear your thoughts on open-telemetry/oteps#257, thanks.

@trask trask removed the triage:deciding:needs-info Not enough information. It is left open to provide the author with time to add more details label Jul 16, 2024
@trask trask added triage:deciding:needs-info Not enough information. It is left open to provide the author with time to add more details and removed triage:followup labels Sep 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:miscellaneous For issues that don't match any other spec label triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:deciding:needs-info Not enough information. It is left open to provide the author with time to add more details
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants