
[DOC] Clarification regarding Data Prepper sinks #7762

Open · 2 of 4 tasks
nateynateynate opened this issue Jul 17, 2024 · 4 comments
Labels: 1 - Backlog (issue is unassigned or assigned but not started), data-prepper

Comments

@nateynateynate (Member)

What do you want to do?

  • Request a change to existing documentation
  • Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. Provide a summary of the request.

Someone asked whether Data Prepper can "handle" Apache Avro data and found that the documentation wasn't entirely clear. Avro is listed as a codec for Data Prepper, but the documentation describes it as "most efficiently being used" in an S3 sink. Could we add a paragraph or so about how it can be used outside of an S3 sink?

Also, the page has some formatting oddities that make it a little hard to skim. See the screenshots below.

Version: List the OpenSearch version to which this issue applies, e.g., 2.14, 2.12–2.14, or all.

2.15

What other resources are available? Provide links to related issues, POCs, steps for testing, etc.
[Screenshots attached showing the formatting issues on the documentation page]

@hdhalter (Contributor)

@dlvenable (Member)

Regarding the original question, Data Prepper can read Avro from S3 and write Avro to S3.

Regarding the documentation, we should revisit this page. The original intention was to clarify when a user should use a codec versus a processor for parsing input data.

I might reword this as:

Apache Avro is an open-source serialization format for record data. When reading Avro data, you should use the avro codec.
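
For example, a pipeline along these lines would read Avro from S3 and write Avro back to S3. This is only a hedged sketch, not documentation: the queue URL, bucket name, and schema are placeholders, and the option names should be verified against the Data Prepper S3 source and sink references for your version.

```yaml
# Hypothetical names throughout: the queue URL, bucket, and schema are placeholders.
avro-s3-pipeline:
  source:
    s3:
      notification_type: "sqs"
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/avro-notifications"
      codec:
        avro:                       # parse incoming S3 objects as Avro
      aws:
        region: "us-east-1"
  sink:
    - s3:
        bucket: "my-output-bucket"
        codec:
          avro:
            # The Avro sink codec is schema-driven; this schema is a placeholder.
            schema: >
              {"type": "record", "name": "Event",
               "fields": [{"name": "message", "type": "string"}]}
        threshold:
          event_count: 1000
        aws:
          region: "us-east-1"
```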

@dlvenable (Member)

I also noticed some questions about the Parquet wording. The current text reads:

Apache Parquet is a columnar storage format built for Hadoop. It is most efficient without the use of a codec. Positive results, however, can be achieved when it’s configured with S3 Select.

Perhaps this should say:

Apache Parquet is a columnar storage format built for Hadoop. Pipeline authors can use the parquet codec to read Parquet data directly from an S3 object; this retrieves all of the data in the file. An alternative is to use S3 Select instead of the codec, in which case S3 Select parses the Parquet file directly (additional S3 charges apply). This can be more efficient if you are filtering or loading only a subset of the data.
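
To make the codec-versus-S3 Select distinction concrete, here is a rough sketch of both approaches. The queue URLs and the SQL expression are placeholders, and the option names (e.g., s3_select, input_serialization) are based on my reading of the S3 source documentation; please verify them against your Data Prepper version.

```yaml
# Option 1: parquet codec. Data Prepper downloads each object and
# retrieves all of the data in the Parquet file.
parquet-codec-pipeline:
  source:
    s3:
      notification_type: "sqs"
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/parquet-notifications"
      codec:
        parquet:
      aws:
        region: "us-east-1"
  sink:
    - stdout:

# Option 2: S3 Select. S3 parses the Parquet file server-side, so only
# the selected subset is transferred (additional S3 charges apply).
s3-select-pipeline:
  source:
    s3:
      notification_type: "sqs"
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/parquet-notifications"
      s3_select:
        # Placeholder query: pull only error records instead of the whole file.
        expression: "SELECT s.status, s.message FROM s3object s WHERE s.status = 'ERROR'"
        input_serialization: parquet
      aws:
        region: "us-east-1"
  sink:
    - stdout:
```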

@hdhalter (Contributor)

@nateynateynate - Do you want to take a stab at pushing up the changes?

hdhalter added the 1 - Backlog and data-prepper labels and removed the untriaged label on Jul 19, 2024.