Fails processing jsonl+gzip when using S3 Input plugin #18696

Closed
@nhnicwaller

Description

When using the S3 input plugin (beta), Filebeat fails to process files that contain JSON lines and are GZIP-compressed. This is the output format of AWS GuardDuty with S3 export enabled, which means Filebeat cannot process logs as written by AWS GuardDuty. The issue occurs specifically when the object contains JSON lines, is GZIP-compressed, and carries the following metadata on the S3 object:

Content-Encoding: gzip
Content-Type: application/json

Note: I originally posted this as a thread on the discussion forum, but now I am confident it is a bug in Filebeat so I'm creating an issue here.

Actual Results

When using Filebeat 7.6.2 I get these error messages. In this case Filebeat appears to attempt GZIP decompression itself, but the stream has already been decompressed automatically by the transport in aws-sdk-go based on the object's Content-Encoding metadata.

2020-05-20T22:42:27.973Z	ERROR	[s3]	s3/input.go:447	gzip.NewReader failed: gzip: invalid header
2020-05-20T22:42:27.974Z	ERROR	[s3]	s3/input.go:386	createEventsFromS3Info failed for AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/15/659b5608-a71c-3b42-8979-f851e61d9098.jsonl.gz: gzip.NewReader failed: gzip: invalid header
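
The double-decompression failure could be avoided by checking for the gzip magic bytes before handing the stream to gzip.NewReader. This is a minimal sketch, not Filebeat's actual code; `isGzipped` is a hypothetical helper:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// isGzipped reports whether the payload begins with the gzip magic
// bytes 0x1f 0x8b. A body already decompressed by the transport
// (because of Content-Encoding: gzip) will fail this check.
func isGzipped(b []byte) bool {
	return len(b) >= 2 && b[0] == 0x1f && b[1] == 0x8b
}

func main() {
	// JSON lines as delivered when aws-sdk-go already decompressed them.
	plain := []byte(`{"severity":5}` + "\n")
	fmt.Println(isGzipped(plain)) // false: skip gzip.NewReader

	// A genuinely compressed payload still carries the magic bytes.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	zw.Write(plain)
	zw.Close()
	fmt.Println(isGzipped(buf.Bytes())) // true: safe to decompress

	zr, _ := gzip.NewReader(&buf)
	out, _ := io.ReadAll(zr)
	fmt.Println(bytes.Equal(out, plain)) // true
}
```

With a guard like this, "gzip: invalid header" on an already-decompressed body would simply mean "treat the bytes as plaintext" rather than a hard failure.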

And when using Filebeat 7.7.0 I get slightly different error messages. These seem to stem from the fact that GuardDuty assigns the application/json MIME type to files that actually contain JSON lines (newline-delimited JSON).

2020-05-21T19:41:28.122Z	ERROR	[s3]	s3/input.go:434	expand_event_list_from_field parameter is missing in config for application/json content-type file
2020-05-21T19:41:28.122Z	ERROR	[s3]	s3/input.go:387	createEventsFromS3Info failed for AWSLogs/123456789123/GuardDuty/ca-central-1/2020/05/14/8b55ad23-fe05-3f4e-8ff7-d61365cb2ad3.jsonl.gz: expand_event_list_from_field parameter is missing in config for application/json content-type file
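
For context, the 7.7.0 error refers to the s3 input's expand_event_list_from_field option, which is meant for objects that really are single JSON documents wrapping an array of events. A sketch of how it would be set (the field name `Records` is the CloudTrail convention, used here purely as an illustration; GuardDuty's .jsonl.gz files have no such wrapper, so this option does not actually fit this case):

```yaml
filebeat.inputs:
  - type: s3
    queue_url: https://sqs.ca-central-1.amazonaws.com/123456789123/awslogs-guardduty
    expand_event_list_from_field: Records
```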

Expected Results

Presumably GuardDuty should be using application/json-seq, application/jsonstream, application/x-json-stream, application/x-ndjson, or application/x-jsonlines, but there is no agreed standard MIME type for newline-delimited JSON. So Filebeat should be able to handle JSON lines/NDJSON files saved with the application/json MIME type, perhaps via automatic detection or a configuration override.

The S3 input plugin should also be careful not to attempt GZIP decompression twice (once automatically in the transport layer, and once explicitly in the s3 input code).

Additional information

  • I'm testing Filebeat inside a Docker container based on ubuntu:18.04
  • I'm downloading Filebeat directly from artifacts.elastic.co and installing it with dpkg

My Filebeat Configuration

filebeat.inputs:
  - type: s3
    enabled: true
    queue_url: https://sqs.ca-central-1.amazonaws.com/123456789123/awslogs-guardduty

processors:
  - decode_json_fields:
      fields: ['message']
      target: "aws.guardduty"
  - timestamp:
      field: "aws.guardduty.updatedAt"
      layouts:
        - '2006-01-02T15:04:05Z'
  - add_fields:
      target: "event"
      fields:
        dataset: "aws.guardduty"

Labels

Team:Platforms (label for the Integrations - Platforms team)
