
[DOC] Clarification regarding Data Prepper sinks #7762

Open · 2 of 4 tasks
nateynateynate opened this issue Jul 17, 2024 · 4 comments
Labels: 1 - Backlog (issue is unassigned or assigned but not started), data-prepper

Comments

@nateynateynate (Member)

What do you want to do?

  • Request a change to existing documentation
  • Add new documentation
  • Report a technical problem with the documentation
  • Other

Tell us about your request. Provide a summary of the request.

Someone asked whether Data Prepper can "handle" Apache Avro data and found that the documentation wasn't entirely clear. Avro is listed as a codec for Data Prepper, but the documentation describes it as "most efficiently being used" in an S3 sink. Could we add a paragraph or so about how it can be used outside of an S3 sink?

Also, the page has some formatting oddities that make it a little hard to skim. See the screenshots below.

Version: List the OpenSearch version to which this issue applies, e.g., 2.14, 2.12–2.14, or all.

2.15

What other resources are available? Provide links to related issues, POCs, steps for testing, etc.
[Screenshots attached showing the formatting issues on the documentation page]

@hdhalter (Contributor)

@dlvenable (Member)

Regarding the original question, Data Prepper can read Avro from S3 and write Avro to S3.

Regarding the documentation, we should revisit this page. The original intention was to clarify when a user should use a codec versus a processor for parsing input data.

I might reword this as:

Apache Avro is an open-source serialization format for record data. When reading Avro data, you should use the avro codec.
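
For example, a pipeline along these lines would read Avro from S3 and write Avro back to S3. This is only a hedged sketch, not documentation: the queue URL, bucket name, and schema are placeholders, and the option names should be verified against the Data Prepper S3 source and sink references for your version.

```yaml
# Hypothetical names throughout: the queue URL, bucket, and schema are placeholders.
avro-s3-pipeline:
  source:
    s3:
      notification_type: "sqs"
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/avro-notifications"
      codec:
        avro:                       # parse incoming S3 objects as Avro
      aws:
        region: "us-east-1"
  sink:
    - s3:
        bucket: "my-output-bucket"
        codec:
          avro:
            # The Avro sink codec is schema-driven; this schema is a placeholder.
            schema: >
              {"type": "record", "name": "Event",
               "fields": [{"name": "message", "type": "string"}]}
        threshold:
          event_count: 1000
        aws:
          region: "us-east-1"
```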

@dlvenable (Member)

I also noticed some questions about the Parquet wording. The current text reads:

Apache Parquet is a columnar storage format built for Hadoop. It is most efficient without the use of a codec. Positive results, however, can be achieved when it’s configured with S3 Select.

Perhaps this should say:

Apache Parquet is a columnar storage format built for Hadoop. Pipeline authors can use the parquet codec to read Parquet data directly from an S3 object; this retrieves all of the data in the file. An alternative is to use S3 Select instead of the codec, in which case S3 Select parses the Parquet file directly (additional S3 charges apply). This can be more efficient if you are filtering or loading only a subset of the data.
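
To make the codec-versus-S3 Select distinction concrete, here is a rough sketch of both approaches. The queue URLs and the SQL expression are placeholders, and the option names (e.g., s3_select, input_serialization) are based on my reading of the S3 source documentation; please verify them against your Data Prepper version.

```yaml
# Option 1: parquet codec. Data Prepper downloads each object and
# retrieves all of the data in the Parquet file.
parquet-codec-pipeline:
  source:
    s3:
      notification_type: "sqs"
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/parquet-notifications"
      codec:
        parquet:
      aws:
        region: "us-east-1"
  sink:
    - stdout:

# Option 2: S3 Select. S3 parses the Parquet file server-side, so only
# the selected subset is transferred (additional S3 charges apply).
s3-select-pipeline:
  source:
    s3:
      notification_type: "sqs"
      sqs:
        queue_url: "https://sqs.us-east-1.amazonaws.com/123456789012/parquet-notifications"
      s3_select:
        # Placeholder query: pull only error records instead of the whole file.
        expression: "SELECT s.status, s.message FROM s3object s WHERE s.status = 'ERROR'"
        input_serialization: parquet
      aws:
        region: "us-east-1"
  sink:
    - stdout:
```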

@hdhalter (Contributor)

@nateynateynate - Do you want to take a stab at pushing up the changes?

hdhalter added the 1 - Backlog and data-prepper labels and removed the untriaged label on Jul 19, 2024.