Support parsing a headerless jarray (expected behavior is similar to csv_parser) #30321
Pinging code owners:
See Adding Labels via Comments if you do not have permissions to add labels yourself.
Thanks for the suggestion @RoeiDimi. Technically, I believe you could solve this by placing a regex parser in front of the csv parser. However, removing prefixes and suffixes could arguably be a general enough case where we could make the functionality available independently. Generally speaking, our operators are designed with composability in mind because this allows us to solve many more use cases, so either way I don't think it should be part of the csv parser only. Have you tried removing the brackets with a regex parser yet? If not, can we start there and see how it performs?
Thanks @djaglowski for the quick reply! After giving it some thought and looking at examples, I think my initial description of the problem was a bit misleading. The difference between a CSV line and a JArray is not actually just the surrounding brackets; it's also a bunch of parsing and escaping differences. A few very simple examples:
Theoretically, I'm sure this can somehow be achieved using a regex, but I think that would amount to partially implementing a JSON parsing library. It may be error-prone and would probably hurt performance, what do you think? Also, what do you think about the alternative suggestion of implementing a new operator, JArray_parser?
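To make the parsing and escaping differences concrete, here is a minimal sketch (not from the thread; the input line is invented for illustration) showing how the same line comes apart differently under a CSV reader versus a JSON decoder:

```python
import csv
import io
import json

# A line that is a valid JSON array but not a well-formed CSV record:
# it contains a quoted comma, a number, and a null.
line = '["hello, world", 42, null, true]'

# JSON parsing preserves both the field boundaries and the types.
fields = json.loads(line)
print(fields)           # ['hello, world', 42, None, True]
print(type(fields[1]))  # <class 'int'>

# A CSV reader treats the embedded quotes as literal characters
# (the field does not start with a quote), so it splits on the
# comma inside the string and yields five plain-string fields,
# with brackets and quote marks leaking into the values.
csv_fields = next(csv.reader(io.StringIO(line)))
print(csv_fields)
```

A regex front-end could strip the brackets, but it could not recover the quoted-comma boundaries or the value types without reimplementing this kind of JSON-aware tokenization.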
I think I understand better why this requires new functionality.
The one thing that stands out to me here is that we'd be duplicating a lot of functionality from the csv parser, so I see why you proposed making it part of the same operator. That said, I still don't think it would be a great idea to pack other non-csv formats into the same operator. Here's another idea - what if we introduce a
Thanks @djaglowski
Hi @djaglowski, in my case the input data looks like this:
The exporter then uses these attributes, creates a JSON, and sends it to our Microsoft log analytics workspace (which expects a JSON with a matching key-value per column in the destination table and has a type definition for each column). The output JSON in that case is:
But it should be (notice the numbers). I think the issue here is that csv_parser does two jobs:
If we take the second job out into a different parser (or transformer, I'm not sure) and call it something like "headers_matching_parser", then csv_parser can handle only parsing a CSV line into a list of strings, and a new jarray_parser can be introduced that only parses a JArray into a list of strings. Both could use headers_matching_parser after that if needed. What would be the best next step in your eyes?
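The two-job split described above can be sketched in a few lines of Python. The function names mirror the proposal (`jarray_parser`, `headers_matching_parser`) but are purely illustrative, not the eventual operator API:

```python
import json

def parse_json_array(line: str) -> list:
    """Job 1: parse a headerless JSON array into a list of typed values."""
    values = json.loads(line)
    if not isinstance(values, list):
        raise ValueError("expected a JSON array")
    return values

def assign_keys(headers: list, values: list) -> dict:
    """Job 2: match a pre-configured header list to the parsed values."""
    if len(headers) != len(values):
        raise ValueError(
            f"got {len(values)} values for {len(headers)} headers")
    return dict(zip(headers, values))

# Chaining the two jobs: parse first, then attach the headers.
record = assign_keys(["name", "age", "active"],
                     parse_json_array('["roei", 30, true]'))
print(record)  # {'name': 'roei', 'age': 30, 'active': True}
```

Note that the numeric and boolean values survive as typed values rather than strings, which is the behavior the log analytics destination needs.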
@RoeiDimi, I appreciate you going through the effort to prototype the transformer. Losing the type information does appear to be a good enough reason to make this more than a conversion to csv. I think it's probably not worth the complexity to users to change the functionality of the current csv parser, so a standalone parser makes sense to me at this point.
**Description:** We are using otel-collector as an infrastructure and receive many types of data from a client. The client's data is always a form of JSON, and in one use case the JSON is a simple headerless JArray, so we need a way to parse it and match headers to each field (something similar to what csv_parser does, but also supporting the types a JSON format supports, as well as nested objects).

**Link to tracking Issue:** #30321

**Testing:**
* Unit tests: all the tests found in csv_parser were copied and adjusted, adding test scenarios for different types (numbers, booleans, null) as well as a test for parsing a nested object as part of the JArray.
* End-to-end tests: used generated traffic on a running otel collector that's using the parser and verified the data is as expected in the end table.

**Documentation:**
* [json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/beacea489ff4ae61c0bac4f477c04748944c9054/pkg/stanza/docs/operators/json_array_parser.md)
* [assign_keys.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/beacea489ff4ae61c0bac4f477c04748944c9054/pkg/stanza/docs/operators/assign_keys.md)

Co-authored-by: Daniel Jaglowski <jaglows3@gmail.com>
Closed by #30644
**Description:** Adding a feature following #30644. This feature allows the json_array_parser to accept a comma-delimited header; for every JSON array it parses, it outputs a map containing the header fields as keys, with the matching values being the ones parsed from the input JSON array. This feature was added mainly for performance reasons, as from a functional POV it is mostly equivalent to chaining the two operators: `json_array_parser -> assign_keys`.

**Link to tracking Issue:** #30321

**Testing:**
- Unit tests
- End-to-end tests: used generated traffic on a running otel collector that's using the parser and verified the data is as expected in the end table, and performance looks good.

**Documentation:**
- [json_array_parser.md](https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/58cc91ca30eabbd35c074d79db8630fc474164d9/pkg/stanza/docs/operators/json_array_parser.md)
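For orientation, a config using the header feature described above might look like the following sketch. The field names are taken from the linked json_array_parser.md and should be verified against that doc; the header names themselves are invented for illustration:

```yaml
- type: json_array_parser
  parse_from: body
  parse_to: attributes
  header: id,severity,message
```

With this config, an input like `[1, "info", "hello"]` would be parsed directly into a keyed map on `attributes`, avoiding the separate `assign_keys` pass.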
Component(s)
pkg/stanza
Is your feature request related to a problem? Please describe.
The csv_parser operator supports getting a list of headers and parsing an input CSV line into attributes. There is no such support for JArray lines (i.e. receiving a similar comma-separated line that is wrapped in brackets).
Describe the solution you'd like
Adding a flag 'is_jarray' to csv_parser. To demonstrate it, the operator config could then look something like this:
and then add a suitable generate-parse function (like generateJarrayParseFunc in this draft I created, for example).
This would allow maximal code reuse, as the expected behavior is basically the same as for a CSV line apart from the parsing itself (receiving headerless data, getting a pre-determined list of headers in the config, and parsing it into attributes).
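A hypothetical config with the suggested flag might look like this. Both the flag name and the layout are illustrative only, and as the rest of the thread shows, this alternative was ultimately not adopted in favor of a standalone json_array_parser:

```yaml
# Hypothetical sketch of the proposed flag; never implemented.
- type: csv_parser
  is_jarray: true
  header: id,severity,message
```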
Describe alternatives you've considered
Implementing a jarray_parser, with the downside being that most of csv_parser's code would be duplicated.
Additional context
This is part of a bigger project in which we are using otel-collector as an infrastructure and receive many types of data from a client. The client's data is always a form of JSON, and this use case is a subset in which the JSON is a simple headerless JArray, so we need a way to parse it in this manner.