-
Notifications
You must be signed in to change notification settings - Fork 888
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Clarify the valid content in primitive type string
#3950
Comments
If we allow arbitrary bytes in the string, we need to support the |
I'm not sure if that's desirable, the vagueness in the spec seems to give languages the freedom to use language-specific string types in OTel APIs. For example, if the specification would require to only allow valid UTF-8 characters, than this might cause issues with Go, where the string type can hold arbitrary bytes. If the specification would require to support arbitrary bytes, then this would cause problems in Rust, where the native string type is always UTF-8 encoded. The current specification gives some freedom to languages, but gives clear rules of how to map language-specific string encodings to OTLP. |
There is also this: #780 |
I think it would be good to explicitly describe it in the specification. |
From today's SIG meeting:
Since we won't accept This change wouldn't cause issues in languages like Go, where the string type can hold arbitrary bytes, as the protobuf lib will drop the arbitrary bytes anyway. I also think we could consider dropping this. This brings inconsistent features to languages, where some languages that allow the string type can hold arbitrary bytes can add bytes to OTLP (because it can map string to bytes), but others cannot do this because the native string type is always UTF-8 encoded. opentelemetry-specification/specification/common/attribute-type-mapping.md Lines 109 to 113 in 7145a5d
It also provides a less straightforward user experience, as an attribute's type can implicitly change from |
Discussed in the 3/27/24 TC meeting. There are two points to consider:
Personally, I'm in favor of dropping non-UTF-8 bytes for the reason @XSAM mentions:
|
Created a PR for dropping non-UTF-8 string. #3970 |
It is language-specific how the string is encoded. For instance .NET encodes string as UTF-16. It is not a an issue as OTLP exporters are encoding them to UTF-8. |
This has already been discussed in several spec SIGs and more work is needed to decide what direction we should go in. Options have included:
Need more info about the issue and options before accepting. |
Based on the comments from #3970, which was trying to drop non-UTF-8 converting and mandate Unicode in the string, dropping non-UTF-8 converting and mandating Unicode in the string are considered as breaking changes.
If dropping non-UTF-8 converting is not feasible, then it means we will have opentelemetry-specification/specification/common/attribute-type-mapping.md Lines 109 to 113 in 700f513
For that, I suggest we could go for option 2, Reasons:
|
Discussed in the 4/16/24 spec SIG. In languages like go, strings are nothing but a byte array so the idea of being able to include encoding information doesn't really work because its unavailable. The benefit of the current spec'd approach of sending non-UTF-8 strings as byte values is that something is better than nothing: even if the encoding of the bytes isn't present, the information is still there and you can probably make sense of it with some a priori knowledge about the instrumentation which recorded it. Yet there are potential security considerations with letting raw unvalidated byte arrays escape. I won't try to articulate what that might look like but folks on the call expressed potential concerns. A good compromise seems to be:
|
Can we also specify that the OpenTelemetry Collector have an opt-in to support passing through invalid UTF-8? (This is a technical challenge, because protobuf libraries.) That's the problem with the current behavior -- I don't care if invalid UTF-8 arrives because it's better than nothing. However, I need a Collector that will support passing me that data. |
I just want to provide more info here. Sorry if I didn't make this clear in the meeting, I think If we provide I think the point of having BTW, the potential security considerations are also challenging the |
Are we talking about a conceptual model, a language API, or a network protocol? I find this project constantly confuses these. The answer is that it is a conceptual model (which is then translated into APIs in various languages, and is the basis for a network protocol.) And the conceptual model is Unicode text (or Unicode character string). Each language has it own prescription or convention of representing Unicode. Java has a native Unicode string type (java.lang.String), as does JavaScript and C#. Go uses a UTF-8 encoded byte array. C++ uses a UTF-8 encoded byte sequence (std::string). C uses UTF-8 encoded NUL-terminated byte arrays (if U+0000 is a prohibited character). A prescribed "encoding" is irrelevant to the spec here and should be omitted. (But should be included in the OTel Go API definition, the OTLP specification, places where converting Unicode characters into octets is relevant.) If we want to specify requirements on the Unicode text (e.g. invalid characters) or specify that language SDKs should validate inputs, that would be relevant. Let's stop leaking implementation details everywhere. |
@pyohannes Would like to respond to #3950 (comment) which specifically uses Rust as an example. I have just been oncall this week and responded to an issue for telemetry data dropping because a Rust SDK submitted invalid utf8 strings to a collector, which was the first in the pipeline to validate utf8. Please see my proposal here: #3421 (comment) |
@XSAM I would like to hear your thoughts on open-telemetry/oteps#257, thanks. |
What are you trying to achieve?
A clear definition of valid content, whether it allows arbitrary bytes (non-UTF-8) in string or not, for primitive type
string
of attribute on the common page.Additional context.
Baggage value only allows UTF-8 string.
opentelemetry-specification/specification/baggage/api.md
Line 48 in 7145a5d
Attribute type mapping defines a way to handle non-UTF-8 in a string, but its doc status is Experimental.
opentelemetry-specification/specification/common/attribute-type-mapping.md
Lines 109 to 113 in 7145a5d
Log data model defines the string represents as a Unicode string (UTF-8).
opentelemetry-specification/specification/logs/data-model-appendix.md
Line 492 in 7145a5d
There is no clear guideline to determine whether a string type allows arbitrary bytes or not.
Related: open-telemetry/opentelemetry-go#5089 (review)
The text was updated successfully, but these errors were encountered: