
[filebeat][awss3] - Added support for parquet decoding and decoder config #35578

Merged · 56 commits · Jun 20, 2023

Conversation

ShourieG
Contributor

@ShourieG ShourieG commented May 25, 2023

Type of change

  • Enhancement
  • Docs

What does this PR do?

This PR adds support for a new decoding config option inside the readerConfig struct, along with support for
parquet file decoding using the libbeat parquet reader. The decoding config is structured so that in the future
we can add more decoding codecs, as well as migrate the decoding of JSON and NDJSON files, which currently
happens based on the contentType config option.

An example of the new decoding config:

  decoding.codec.parquet.enabled: true
  decoding.codec.parquet.process_parallel: true
  decoding.codec.parquet.batch_size: 1000

Why is it important?

This change allows us to officially support parquet decoding for the S3 input, and also enables integrations like
Amazon Security Lake.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
    - [ ] I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc.

Related issues

@ShourieG
Contributor Author

ShourieG commented Jun 5, 2023

@andrewkroh, regarding the questions you had:

  1. If we were designing a decoder interface that we wanted to be able to apply to other input sources, what would that API be? Imagine we wanted to be able to decode parquet files in GCS or simply from disk (filestream).

    • I think if we were to have a generic decoder interface applicable to all inputs, I would have the decoder API accept a decoder-specific config and an io.Reader stream for initialisation. The decode func would ideally return a data output channel and an error channel, so that any input-specific logic is kept out of the decoder and the input can easily listen on the channels and extract the data along with any errors.
  2. Should it allow decoder chaining? Why or why not?

    • I feel chaining would not be a necessity; the decoder config should be extensive enough to define the exact behaviour. This also helps keep the API simple to use.
  3. What's the relationship (if any) between parsers and decoders?

    • If it's a generic decoder then it should not be concerned with parsing, as that is the job of the input. Hence the output is channelled to keep any parsing-specific logic out of the decoder itself.

For the current implementation, since it's a decoder that is specific to the input, these design decisions were not made. (A rough sketch of the interface shape from answer 1 is given below.)
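
A minimal Go sketch of the channel-based decoder shape described in answer 1 — Config, Decoder, and New here are hypothetical illustrations, not the API that was merged:

  package decoder

  import "io"

  // Config stands in for a codec-specific configuration, e.g. the
  // parquet options shown in the PR description.
  type Config struct {
      ProcessParallel bool
      BatchSize       int
  }

  // Decoder is initialised from a codec-specific config and the raw
  // input stream, keeping input-specific logic out of the decoder.
  type Decoder interface {
      // Decode returns a channel of decoded documents and a channel
      // of errors; the calling input simply listens on both.
      Decode() (<-chan []byte, <-chan error)
      Close() error
  }

  // New would select a concrete decoder based on cfg and wrap r.
  func New(cfg Config, r io.Reader) (Decoder, error) {
      // codec selection elided in this sketch
      return nil, nil
  }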

@ShourieG ShourieG removed the blocked label Jun 13, 2023
@ShourieG
Copy link
Contributor Author

ShourieG commented Jun 13, 2023

@andrewkroh I've refactored the decoder to be more generic, updated the tests and docs, and added the necessary comments. For now I've kept the implementation simple and basic, suited to the current use case, but it can easily be extended in the future. I've not added a no-op decoder here since I need the nil value returned in order to branch to the legacy path (a sketch of that branching follows below). I've made as few modifications as possible to the legacy logic so that complications are avoided.
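
A hedged illustration of that nil-based branching — decoder, newDecoder, and readLegacy are hypothetical stand-ins, not the identifiers used in the PR:

  package awss3

  // decoder stands in for the generic decoder interface.
  type decoder interface {
      decode() error
  }

  // newDecoder returns a nil decoder (and no error) when no decoding
  // config is set; the nil signals the caller to take the legacy path.
  func newDecoder(cfg any) (decoder, error) { return nil, nil }

  func process(cfg any) error {
      dec, err := newDecoder(cfg)
      if err != nil {
          return err
      }
      if dec != nil {
          return dec.decode() // new path, e.g. the parquet decoder
      }
      return readLegacy() // legacy JSON/NDJSON path keyed on content type
  }

  func readLegacy() error { return nil }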

@ShourieG
Contributor Author

@andrewkroh could you review the current implementation while we wait for the CI pipeline to be fixed? If we could merge before FF that would be really great, as Crest will be taking up the Security Lake integration after this merge.

I opened a public issue, apache/arrow#36052, for the cross-build errors on 32-bit systems, and they have been resolved by a recent PR, but I'm still not updating the library since there is no stable release out with these fixes.

@faec
Contributor

faec commented Jun 15, 2023

Build fix is in review: #35789

@narph narph requested a review from andrewkroh June 20, 2023 08:17
@yago82

yago82 commented Dec 14, 2023

Hello everyone,

I'm facing an issue while trying to retrieve a Parquet file from S3 using Filebeat. Below, I've included configuration details:

filebeat.inputs:
- type: aws-s3
  bucket_arn: ${BUCKET_ARN}
  bucket_list_prefix: ${BUCKET_LIST_PREFIX}
  bucket_list_interval: 60s
  region: eu-west-1
  default_region: eu-west-1
  number_of_workers: 5
  access_key_id: ${ACCESS_KEY_ID}
  secret_access_key: ${SECRET_ACCESS_KEY}
  decoding.codec.parquet.enabled: true
  decoding.codec.parquet.process_parallel: true
  decoding.codec.parquet.batch_size: 1000

setup.template.enabled: false

processors:
  - add_fields:
      target: '@metadata'
      fields:
        op_type: "index"

output.elasticsearch:
  hosts: ["${ELASTICSEARCH_HOSTS}"]
  username: ${ELASTICSEARCH_USERNAME}
  password: ${ELASTICSEARCH_PASSWORD}
  protocol: https
  index: utenti123
  allow_older_versions: true

I have tried various bucket_list_prefix values, including:

emr-serverless/user-output/
emr-serverless/user--output//
emr-serverless/user-output/*
emr-serverless/user-output/*/

However, we consistently encounter the following error:

failed processing S3 event for object key "emr-serverless/user-output/" in bucket "root-content": failed to create parquet decoder: failed to create parquet reader: parquet: file too small (size=0)

Any insights or suggestions on troubleshooting steps would be highly appreciated. Please let me know if additional information is needed.

Thank you

@andrewkroh
Member

Try using file_selectors to selectively apply the parquet decoding to specific files.

https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-aws-s3.html#_file_selectors
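
For example, a selector along these lines (the bucket settings and regex are illustrative) applies the parquet decoding only to matching objects; anything that does not match a selector — such as the zero-byte "directory" placeholder keys behind the "file too small (size=0)" error — is skipped:

  filebeat.inputs:
  - type: aws-s3
    bucket_arn: ${BUCKET_ARN}
    bucket_list_prefix: emr-serverless/user-output/
    number_of_workers: 5
    file_selectors:
      # only keys ending in .parquet are decoded as parquet
      - regex: '\.parquet$'
        decoding.codec.parquet.enabled: true
        decoding.codec.parquet.process_parallel: true
        decoding.codec.parquet.batch_size: 1000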

@kdHub
Copy link

kdHub commented Oct 4, 2024

Thanks for the suggestion on file_selectors. I was already using them, but did not realize my issue until seeing this last comment; moving the decoding block into the file_selectors section solved my problem.


Successfully merging this pull request may close these issues.

[Filebeat S3 Input] Add support for Apache Parquet files