
Filebeat S3 input plugin cannot parse jsonl file with content-type set as application/json #19902

Closed
@lag13

Description

Hey! I created this topic two weeks ago https://discuss.elastic.co/t/filebeat-s3-cannot-parse-jsonl-file-whos-content-type-is-set-to-application-json/239374 but no one has responded, and I'm convinced this is a bug or at least something that should be fixed, so I thought I'd open an issue. I appreciate any feedback, and sorry if this is just extra noise for y'all.

The problem is that the Filebeat S3 input plugin cannot process an S3 object whose content-type is application/json AND whose content is a separate JSON object per line (i.e. jsonl). Processing such an object was possible until v7.7.0, when the S3 input plugin started enforcing JSON parsing whenever it saw a content-type of application/json: https://github.com/elastic/beats/blob/v7.7.0/x-pack/filebeat/input/s3/input.go#L432
Before that, JSON processing was controlled ONLY by the expand_event_list_from_field configuration, as you can see here: https://github.com/elastic/beats/blob/v7.6.2/x-pack/filebeat/input/s3/input.go#L431
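To make the behavior change concrete, here is a minimal Go sketch of the gating as I understand it from those two links. The function `processMode` and the mode strings are my own illustration, not the plugin's actual names or structure:

```go
package main

import "fmt"

// processMode is a hypothetical sketch of the v7.7.0 gating: an
// application/json object without expand_event_list_from_field set is
// rejected outright instead of falling back to line-by-line processing.
// Before v7.7.0, only the expand_event_list_from_field setting mattered.
func processMode(contentType, expandEventListFromField string) (string, error) {
	if expandEventListFromField != "" {
		// Treat the object as a JSON document and expand events from a field.
		return "expand-json-array", nil
	}
	if contentType == "application/json" {
		// This is the branch that breaks jsonl objects with that content-type.
		return "", fmt.Errorf("expand_event_list_from_field parameter is missing in config for application/json content-type file")
	}
	// Default: read the object line by line.
	return "line-by-line", nil
}

func main() {
	_, err := processMode("application/json", "")
	fmt.Println(err) // the error from the logs below

	mode, _ := processMode("text/plain", "")
	fmt.Println(mode) // line-by-line
}
```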

Now, it's probably the case that the content-type on the S3 object should NOT be application/json in the first place, but sadly I do not have control over that 😦. I'm essentially dealing with the same problem as #18696, except that Cloudflare is the entity pushing the logs to S3 (in that issue it was AWS GuardDuty) and I don't control how Cloudflare sets the content-type.

I reproduced this issue two weeks ago by compiling filebeat off of master (the filebeat version output was filebeat version 8.0.0 (amd64), libbeat 8.0.0 [3341c1bca5626d1ee90af692617f10f58695ed1c built 2020-06-30 20:15:28 +0000 UTC]) and, looking at the source code today (2297636), the problem is still there. Here are the steps I followed to reproduce it:

$ cat s3filebeat.log 
{"id": "0001", "hey": "there", "how": {"are": "you"}}
{"id": "0002", "hope": "you", "are": {"doing": "well"}}
{"id": "0003", "I": "am", "doing": {"O": "K"}}

$ gzip s3filebeat.log 

$ aws --profile PROFILE s3api put-object --body ./s3filebeat.log.gz --bucket lucas-test-filebeat-s3 --content-encoding gzip --content-type application/json --key s3filebeat.log.gz
{
    "ETag": "\"955ed9f01b6ee38dbba167daab9ebbbb\""
}

$ cat filebeat.yml 
filebeat.inputs:
- type: s3
  queue_url: "https://sqs.us-east-1.amazonaws.com/ACCTNUM/cloudflare_logs_dev"
  role_arn: "arn:aws:iam::ACCTNUM:role/cloudflare_logs_filebeat_access_s3_sqs_dev"

output.console:
  pretty: true

$ ./filebeat -e
2020-06-30T15:39:04.180-0500	INFO	instance/beat.go:628	Home path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Config path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Data path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data] Logs path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs]
2020-06-30T15:39:04.180-0500	INFO	instance/beat.go:636	Beat ID: 2ef98d6a-ef7c-4885-a258-b4341a63b43b
2020-06-30T15:39:04.181-0500	INFO	[seccomp]	seccomp/seccomp.go:124	Syscall filter successfully installed
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:964	Beat info	{"system_info": {"beat": {"path": {"config": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "data": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data", "home": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "logs": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs"}, "type": "filebeat", "uuid": "2ef98d6a-ef7c-4885-a258-b4341a63b43b"}}}
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:973	Build info	{"system_info": {"build": {"commit": "3341c1bca5626d1ee90af692617f10f58695ed1c", "libbeat": "8.0.0", "time": "2020-06-30T20:15:28.000Z", "version": "8.0.0"}}}
2020-06-30T15:39:04.181-0500	INFO	[beat]	instance/beat.go:976	Go runtime info	{"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":12,"version":"go1.13.3"}}}
2020-06-30T15:39:04.182-0500	INFO	[beat]	instance/beat.go:980	Host info	{"system_info": {"host": {"architecture":"x86_64","boot_time":"2020-06-22T10:53:41-05:00","containerized":false,"name":"lgroenendaal-XPS-15-9570","ip":["127.0.0.1/8","::1/128","192.168.86.24/24","fe80::c38c:a654:ca3e:d0dd/64","10.20.211.109/32","fe80::d2ee:dd77:2943:aa45/64","172.17.0.1/16"],"kernel_version":"5.3.0-59-generic","mac":["9c:b6:d0:c6:01:39","02:42:e2:23:9b:84"],"os":{"family":"debian","platform":"ubuntu","name":"Ubuntu","version":"18.04.2 LTS (Bionic Beaver)","major":18,"minor":4,"patch":2,"codename":"bionic"},"timezone":"CDT","timezone_offset_sec":-18000,"id":"5a8843fa712d481595ebd41926cda45f"}}}
2020-06-30T15:39:04.183-0500	INFO	[beat]	instance/beat.go:1009	Process info	{"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":null,"effective":null,"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "exe": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/filebeat", "name": "filebeat", "pid": 5578, "ppid": 4156, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2020-06-30T15:39:03.640-0500"}}}
2020-06-30T15:39:04.183-0500	INFO	instance/beat.go:298	Setup Beat: filebeat; Version: 8.0.0
2020-06-30T15:39:04.183-0500	INFO	[publisher]	pipeline/module.go:113	Beat name: lgroenendaal-XPS-15-9570
2020-06-30T15:39:04.184-0500	WARN	beater/filebeat.go:151	Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500	INFO	[monitoring]	log/log.go:118	Starting metrics logging every 30s
2020-06-30T15:39:04.184-0500	INFO	instance/beat.go:449	filebeat start running.
2020-06-30T15:39:04.184-0500	WARN	beater/filebeat.go:251	Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500	INFO	registrar/registrar.go:145	Loading registrar data from /home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data/registry/filebeat/data.json
2020-06-30T15:39:04.184-0500	INFO	registrar/registrar.go:152	States Loaded from registrar: 0
2020-06-30T15:39:04.184-0500	INFO	[crawler]	beater/crawler.go:71	Loading Inputs: 1
2020-06-30T15:39:04.185-0500	WARN	[cfgwarn]	s3/input.go:131	BETA: s3 input type is used
2020-06-30T15:39:04.185-0500	INFO	[crawler]	beater/crawler.go:141	Starting input (ID: 18222034013403473169)
2020-06-30T15:39:04.185-0500	INFO	[crawler]	beater/crawler.go:108	Loading and starting Inputs completed. Enabled inputs: 1
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:173	visibility timeout is set to 300 seconds
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:174	aws api timeout is set to 2m0s
2020-06-30T15:39:04.186-0500	INFO	[s3]	s3/input.go:196	s3 input worker has started. with queueURL: https://sqs.us-east-1.amazonaws.com/374144443638/cloudflare_logs_dev
2020-06-30T15:39:14.633-0500	ERROR	[s3]	s3/input.go:458	expand_event_list_from_field parameter is missing in config for application/json content-type file
2020-06-30T15:39:14.633-0500	ERROR	[s3]	s3/input.go:393	createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file

The important error is the last line, which tells me that my object cannot be processed even though I think it should be processable: s3/input.go:393 createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file.

If I try, just for fun, to include the expand_event_list_from_field config, it will, understandably, fail to parse, and we get the WARN log: s3/input.go:542 decode json failed for 's3filebeat.log.gz' from S3 bucket 'lucas-test-filebeat-s3', skipping this file: json: cannot unmarshal string into Go value of type []interface {}.
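For contrast, the same content decodes cleanly when treated as jsonl. Here's a minimal Go sketch parsing the sample file from the reproduction above line by line; `countJSONLEvents` is my own illustration of what line-by-line processing would do, not Filebeat code:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// countJSONLEvents decodes one JSON object per line and returns how many
// events it found. This is the treatment the object needs, regardless of
// its content-type header.
func countJSONLEvents(body string) (int, error) {
	scanner := bufio.NewScanner(strings.NewReader(body))
	n := 0
	for scanner.Scan() {
		var event map[string]interface{}
		if err := json.Unmarshal(scanner.Bytes(), &event); err != nil {
			return n, err
		}
		n++
	}
	return n, scanner.Err()
}

func main() {
	// The same three lines as s3filebeat.log above.
	body := `{"id": "0001", "hey": "there", "how": {"are": "you"}}
{"id": "0002", "hope": "you", "are": {"doing": "well"}}
{"id": "0003", "I": "am", "doing": {"O": "K"}}`

	n, err := countJSONLEvents(body)
	fmt.Println(n, err) // 3 <nil>
}
```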

Any advice on how to proceed here? Personally I think the appropriate change would be to ignore content-type: application/json and have expand_event_list_from_field be the ONLY thing controlling whether or not to parse the object content as JSON (which is how the logic used to be) but I defer to you maintainers because you have the vision for how you want the code to behave.

I really appreciate anyone who took the time to read this, and I appreciate the creation of this feature; it's a nifty one. Cheers.

Metadata

Labels

Team:Platforms (label for the Integrations - Platforms team)