Clarify the valid content in primitive type `string` #3950

XSAM · 2024-03-20T22:19:14Z

What are you trying to achieve?

A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not, for primitive type string of attribute on the common page.

A primitive type: string, boolean, double precision floating point (IEEE 754-1985) or signed 64 bit integer.

Additional context.

Baggage value only allows UTF-8 string.

opentelemetry-specification/specification/baggage/api.md

Line 48 in 7145a5d

Baggage **values** are any valid UTF-8 strings. Language API MUST accept

Attribute type mapping defines a way to handle non-UTF-8 in a string, but its doc status is Experimental.

opentelemetry-specification/specification/common/attribute-type-mapping.md

Lines 109 to 113 in 7145a5d

    
           String values which are not valid Unicode sequences SHOULD be converted to 
        
           AnyValue's 
        
           [bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37) 
        
           with the bytes representing the string in the original order and format of the 
        
           source string.

Log data model defines the string represents as a Unicode string (UTF-8).

opentelemetry-specification/specification/logs/data-model-appendix.md

Line 492 in 7145a5d

    
           text_payload     | string             | The log entry payload, represented as a Unicode string (UTF-8). | Body

There is no clear guideline to determine whether a string type allows arbitrary bytes or not.

Related: open-telemetry/opentelemetry-go#5089 (review)

The text was updated successfully, but these errors were encountered:

XSAM · 2024-03-26T03:54:34Z

If we allow arbitrary bytes in the string, we need to support the bytes type. Otherwise, the arbitrary bytes value won't be accessible when SDK reads the date from OTLP. (protobuf won't allow arbitrary bytes in string)

pyohannes · 2024-03-26T13:13:01Z

A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not

I'm not sure if that's desirable, the vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs.

For example, if the specification would require to only allow valid UTF-8 characters, than this might cause issues with Go, where the string type can hold arbitrary bytes. If the specification would require to support arbitrary bytes, then this would cause problems in Rust, where the native string type is always UTF-8 encoded.

The current specification gives some freedom to languages, but gives clear rules of how to map language-specific string encodings to OTLP.

pyohannes · 2024-03-26T15:11:32Z

There is also this: #780

pellared · 2024-03-26T19:10:06Z

The vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs

I think it would be good to explicitly describe it in the specification.

XSAM · 2024-03-27T04:41:32Z

From today's SIG meeting:

We won’t expose the bytes type in API before having discussions with vendors, as they might not ready to process bytes attributes from OTLP

Since we won't accept bytes outside of the log body, I think we could clarify we only allow valid UTF-8 in string. So our users can have a clear understanding of the behavior after passing bytes into a string value, instead of not knowing the result until configuring it out it would be dropped by the processing chain (like protobuf lib).

This change wouldn't cause issues in languages like Go, where the string type can hold arbitrary bytes, as the protobuf lib will drop the arbitrary bytes anyway.

I also think we could consider dropping this. This brings inconsistent features to languages, where some languages that allow the string type can hold arbitrary bytes can add bytes to OTLP (because it can map string to bytes), but others cannot do this because the native string type is always UTF-8 encoded.

opentelemetry-specification/specification/common/attribute-type-mapping.md

Lines 109 to 113 in 7145a5d

    
           String values which are not valid Unicode sequences SHOULD be converted to 
        
           AnyValue's 
        
           [bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37) 
        
           with the bytes representing the string in the original order and format of the 
        
           source string.

It also provides a less straightforward user experience, as an attribute's type can implicitly change from string to bytes in OTLP.

jack-berg · 2024-03-27T16:50:58Z

Discussed in the 3/27/24 TC meeting. There are two points to consider:

Whether additional specification is needed around the encoding of attribute string value types such that attribute APIs wouldn't accept non-UTF-8 strings. We think the existing language is sufficient, and reflects the reality that languages are different in how they represent strings and encodings. Limiting to UTF-8 seems overly restrictive.
Whether the spec for encoding non-UTF-8 strings is what we want. On one hand, it is does describe what to do quite clearly. On the other hand, consumers don't have information to determine the encoding, so its not very useful. We discussed options to drop non-UTF-8 encoded strings instead of sending as bytes, or to update the language such that the string encoded is sent as well.

Personally, I'm in favor of dropping non-UTF-8 bytes for the reason @XSAM mentions:

It also provides a less straightforward user experience, as an attribute's type can implicitly change from string to bytes in OTLP.

XSAM · 2024-03-31T04:51:56Z

Created a PR for dropping non-UTF-8 string. #3970

pellared · 2024-04-02T15:13:29Z

It is language-specific how the string is encoded. For instance .NET encodes string as UTF-16. It is not a an issue as OTLP exporters are encoding them to UTF-8.

jack-berg · 2024-04-10T16:08:07Z

This has already been discussed in several spec SIGs and more work is needed to decide what direction we should go in. Options have included:

Drop non-UTF-8 strings
Keep current encoding, which converts non-UTF-8 strings to bytes without any metadata describing their original encoding
Adjust encoding, so that there is additional metadata describing the encoding of non-UTF-8 strings

Need more info about the issue and options before accepting.

XSAM · 2024-04-15T20:52:24Z

Based on the comments from #3970, which was trying to drop non-UTF-8 converting and mandate Unicode in the string, dropping non-UTF-8 converting and mandating Unicode in the string are considered as breaking changes.

If dropping non-UTF-8 converting is not feasible, then it means we will have bytes type in attributes even if users originally used the string byte.

opentelemetry-specification/specification/common/attribute-type-mapping.md

Lines 109 to 113 in 700f513

    
           String values which are not valid Unicode sequences SHOULD be converted to 
        
           AnyValue's 
        
           [bytes_value](https://github.com/open-telemetry/opentelemetry-proto/blob/38b5b9b6e5257c6500a843f7fdacf89dd95833e8/opentelemetry/proto/common/v1/common.proto#L37) 
        
           with the bytes representing the string in the original order and format of the 
        
           source string.

For that, I suggest we could go for option 2, Keep current encoding, which converts non-UTF-8 strings to bytes without any metadata describing their original encoding, but we should also expose the bytes type in API (extending the list of supported primitives to include bytes type).

Reasons:

It provides uniformity: Every language can push bytes into OTLP instead of certain languages that allow arbitrary data in the language's native string type.
Extending the list of supported primitives to include the bytes type may not be a breaking change.
The reason for rejecting exposing bytes type in the API in a previous SIG meeting is our vendor might not ready to process bytes attributes from OTLP. But this is not valid if dropping non-UTF-8 converting is not feasible, it means we will have bytes in attributes in OTLP anyway regardless whether to expose the bytes type in the API.

jack-berg · 2024-04-16T15:56:56Z

Discussed in the 4/16/24 spec SIG.

In languages like go, strings are nothing but a byte array so the idea of being able to include encoding information doesn't really work because its unavailable. The benefit of the current spec'd approach of sending non-UTF-8 strings as byte values is that something is better than nothing: even if the encoding of the bytes isn't present, the information is still there and you can probably make sense of it with some a priori knowledge about the instrumentation which recorded it. Yet there are potential security considerations with letting raw unvalidated byte arrays escape. I won't try to articulate what that might look like but folks on the call expressed potential concerns.

A good compromise seems to be:

Keep the current behavior of encoding non-UTF-8 strings as byte value
Encourage SDKs to offer an opt-out, where non-UTF-8 strings are dropped

jmacd · 2024-04-16T16:09:07Z

Keep the current behavior of encoding non-UTF-8 strings as byte value

Can we also specify that the OpenTelemetry Collector have an opt-in to support passing through invalid UTF-8? (This is a technical challenge, because protobuf libraries.) That's the problem with the current behavior -- I don't care if invalid UTF-8 arrives because it's better than nothing. However, I need a Collector that will support passing me that data.

XSAM · 2024-04-16T16:44:10Z

I just want to provide more info here. Sorry if I didn't make this clear in the meeting, I think sending non-UTF-8 strings as byte is bonded with exposing bytes API.

If we provide sending non-UTF-8 strings as byte, like as a stable behavior that is not removable, it makes sense to me to provide the bytes type because they are basically the same feature (allow users to push bytes data). But, just providing sending non-UTF-8 strings as byte without providing bytes API just gives languages like Go a privilege to push bytes.

I think the point of having sending non-UTF-8 strings as byte is not because of a certain use case in which users want to push bytes; it is because we don't want to drop users' data. And, the bytes API is not tight to certain use cases; it allows us to have a design to provide uniform features (push bytes).

BTW, the potential security considerations are also challenging the sending non-UTF-8 strings as byte since we have a way to push bytes. However, I am not sure about the details of the potential security case, like what exactly the case is.

pauldraper · 2024-04-25T17:47:03Z

Are we talking about a conceptual model, a language API, or a network protocol?

I find this project constantly confuses these.

The answer is that it is a conceptual model (which is then translated into APIs in various languages, and is the basis for a network protocol.)

And the conceptual model is Unicode text (or Unicode character string). Each language has it own prescription or convention of representing Unicode. Java has a native Unicode string type (java.lang.String), as does JavaScript and C#. Go uses a UTF-8 encoded byte array. C++ uses a UTF-8 encoded byte sequence (std::string). C uses UTF-8 encoded NUL-terminated byte arrays (if U+0000 is a prohibited character).

A prescribed "encoding" is irrelevant to the spec here and should be omitted. (But should be included in the OTel Go API definition, the OTLP specification, places where converting Unicode characters into octets is relevant.) If we want to specify requirements on the Unicode text (e.g. invalid characters) or specify that language SDKs should validate inputs, that would be relevant.

Let's stop leaking implementation details everywhere.

jmacd · 2024-05-08T18:26:59Z

@pyohannes Would like to respond to #3950 (comment) which specifically uses Rust as an example. I have just been oncall this week and responded to an issue for telemetry data dropping because a Rust SDK submitted invalid utf8 strings to a collector, which was the first in the pipeline to validate utf8.

Please see my proposal here: #3421 (comment)

jmacd · 2024-05-10T23:30:18Z

@XSAM I would like to hear your thoughts on open-telemetry/oteps#257, thanks.

XSAM added the spec:miscellaneous For issues that don't match any other spec label label Mar 20, 2024

github-actions bot assigned yurishkuro Mar 20, 2024

XSAM mentioned this issue Mar 20, 2024

Add ByteSlice Value open-telemetry/opentelemetry-go#5089

Closed

pellared mentioned this issue Mar 26, 2024

Support for attribute.ByteSlice to avoid invalid-UTF-8 issues in instrumentation open-telemetry/opentelemetry-go#2314

Open

trask added triage:deciding:community-feedback Open to community discussion. If the community can provide sufficient reasoning, it may be accepted triage:deciding:tc-inbox Needs attention from the TC in order to move forward labels Mar 26, 2024

trask unassigned yurishkuro Mar 26, 2024

XSAM mentioned this issue Mar 31, 2024

Clarify string content must be valid Unicode sequences #3970

Closed

jack-berg added triage:deciding:needs-info Not enough information. It is left open to provide the author with time to add more details and removed triage:deciding:tc-inbox Needs attention from the TC in order to move forward labels Apr 10, 2024

austinlparker mentioned this issue Apr 23, 2024

Define restrictions for attribute and label keys #504

Closed

pellared mentioned this issue May 8, 2024

Clarify that API users SHOULD use valid UTF-8 for exportable data #3421

Open

jmacd mentioned this issue May 8, 2024

Create attribute.Value type for []byte values open-telemetry/opentelemetry-go#5324

Closed

jmacd mentioned this issue May 10, 2024

Invalid UTF-8 error handling policy open-telemetry/oteps#257

Open

trask removed the triage:deciding:needs-info Not enough information. It is left open to provide the author with time to add more details label Jul 16, 2024

github-actions bot added the triage:followup label Sep 19, 2024

trask added triage:deciding:needs-info Not enough information. It is left open to provide the author with time to add more details and removed triage:followup labels Sep 24, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify the valid content in primitive type `string` #3950

Clarify the valid content in primitive type `string` #3950

XSAM commented Mar 20, 2024

XSAM commented Mar 26, 2024

pyohannes commented Mar 26, 2024

pyohannes commented Mar 26, 2024

pellared commented Mar 26, 2024

XSAM commented Mar 27, 2024 •

edited

Loading

jack-berg commented Mar 27, 2024

XSAM commented Mar 31, 2024

pellared commented Apr 2, 2024

jack-berg commented Apr 10, 2024

XSAM commented Apr 15, 2024 •

edited

Loading

jack-berg commented Apr 16, 2024

jmacd commented Apr 16, 2024

XSAM commented Apr 16, 2024 •

edited

Loading

pauldraper commented Apr 25, 2024 •

edited

Loading

jmacd commented May 8, 2024

jmacd commented May 10, 2024

Clarify the valid content in primitive type string #3950

Clarify the valid content in primitive type string #3950

Comments

XSAM commented Mar 20, 2024

XSAM commented Mar 26, 2024

pyohannes commented Mar 26, 2024

pyohannes commented Mar 26, 2024

pellared commented Mar 26, 2024

XSAM commented Mar 27, 2024 • edited Loading

jack-berg commented Mar 27, 2024

XSAM commented Mar 31, 2024

pellared commented Apr 2, 2024

jack-berg commented Apr 10, 2024

XSAM commented Apr 15, 2024 • edited Loading

jack-berg commented Apr 16, 2024

jmacd commented Apr 16, 2024

XSAM commented Apr 16, 2024 • edited Loading

pauldraper commented Apr 25, 2024 • edited Loading

jmacd commented May 8, 2024

jmacd commented May 10, 2024

Clarify the valid content in primitive type `string` #3950

Clarify the valid content in primitive type `string` #3950

XSAM commented Mar 27, 2024 •

edited

Loading

XSAM commented Apr 15, 2024 •

edited

Loading

XSAM commented Apr 16, 2024 •

edited

Loading

pauldraper commented Apr 25, 2024 •

edited

Loading