Description
When using Filebeat with the S3 input plugin (beta), Filebeat fails to process files that contain JSON lines and are GZIP-compressed. This is the output format produced by AWS GuardDuty when S3 export is enabled, which means Filebeat cannot process logs as written by AWS GuardDuty. The issue occurs specifically when the object contains JSON lines, is GZIP-compressed, and has the following metadata on the S3 object:
Content-Encoding: gzip
Content-Type: application/json
Note: I originally posted this as a thread on the discussion forum, but now I am confident it is a bug in Filebeat so I'm creating an issue here.
Actual Results
When using Filebeat 7.6.2 I get the error messages below. In this case it seems that Filebeat is attempting to decompress the GZIP stream, but the stream has already been decompressed automatically by the transport in aws-sdk-go based on the object metadata.
```
2020-05-20T22:42:27.973Z ERROR [s3] s3/input.go:447 gzip.NewReader failed: gzip: invalid header
2020-05-20T22:42:27.974Z ERROR [s3] s3/input.go:386 createEventsFromS3Info failed for AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/15/659b5608-a71c-3b42-8979-f851e61d9098.jsonl.gz: gzip.NewReader failed: gzip: invalid header
```
And when using Filebeat 7.7.0 I get slightly different error messages. This seems to stem from the fact that GuardDuty incorrectly assigns the application/json MIME type to files that actually contain JSON lines/newline-delimited JSON.
```
2020-05-21T19:41:28.122Z ERROR [s3] s3/input.go:434 expand_event_list_from_field parameter is missing in config for application/json content-type file
2020-05-21T19:41:28.122Z ERROR [s3] s3/input.go:387 createEventsFromS3Info failed for AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/14/8b55ad23-fe05-3f4e-8ff7-d61365cb2ad3.jsonl.gz: expand_event_list_from_field parameter is missing in config for application/json content-type file
```
Expected Results
Presumably GuardDuty should be using a MIME type such as application/json-seq, application/jsonstream, application/x-json-stream, application/x-ndjson, or application/x-jsonlines, but there doesn't seem to be a standard here. So it seems like Filebeat should be able to handle cases where JSON lines/NDJSON files are saved with the application/json MIME type, perhaps with automatic detection or a configuration override.
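Automatic detection would be straightforward: a file is NDJSON if it consists of multiple lines, each of which is a complete JSON value. A minimal sketch of such a check in Go (the `isNDJSON` helper is hypothetical, not part of Filebeat):

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"strings"
)

// isNDJSON reports whether s looks like newline-delimited JSON:
// more than one non-empty line, each a complete JSON value on its own.
// A pretty-printed single JSON document fails this test, because its
// individual lines (e.g. "{" or "\"a\": 1") are not valid JSON.
func isNDJSON(s string) bool {
	sc := bufio.NewScanner(strings.NewReader(s))
	lines := 0
	for sc.Scan() {
		line := bytes.TrimSpace(sc.Bytes())
		if len(line) == 0 {
			continue
		}
		if !json.Valid(line) {
			return false
		}
		lines++
	}
	return lines > 1
}

func main() {
	fmt.Println(isNDJSON("{\"a\":1}\n{\"b\":2}\n")) // true: two JSON values, one per line
	fmt.Println(isNDJSON("{\n\"a\": 1\n}\n"))       // false: one pretty-printed object
}
```

This heuristic would let the s3 input fall back to line-by-line processing even when the object is labeled application/json.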
The S3 input plugin should also be careful not to attempt GZIP decompression twice (once automatically in the transport layer, and once explicitly in the s3 input code).
Additional information
- I'm testing Filebeat inside a Docker container based on ubuntu:18.04
- I'm downloading Filebeat directly from artifacts.elastic.co and installing it with dpkg
My Filebeat Configuration
```yaml
filebeat.inputs:
- type: s3
  enabled: true
  queue_url: https://sqs.ca-central-1.amazonaws.com/123456789123/awslogs-guardduty

processors:
- decode_json_fields:
    fields: ['message']
    target: "aws.guardduty"
- timestamp:
    field: "aws.guardduty.updatedAt"
    layouts:
      - '2006-01-02T15:04:05Z'
- add_fields:
    target: "event"
    fields:
      dataset: "aws.guardduty"
```