Description
Hey! I created this topic two weeks ago (https://discuss.elastic.co/t/filebeat-s3-cannot-parse-jsonl-file-whos-content-type-is-set-to-application-json/239374) but no one has responded, and I'm convinced this is a bug or at least something that should be fixed, so I thought I'd try opening an issue. I appreciate any feedback, and sorry if this is just extra noise for y'all.
The problem is that the filebeat S3 input plugin cannot process an S3 object whose content-type is `application/json` AND whose content is a separate JSON object per line (i.e. JSONL). Processing such an object used to be possible until v7.7.0, when the S3 input plugin started enforcing JSON parsing whenever it saw a content-type of `application/json`: https://github.com/elastic/beats/blob/v7.7.0/x-pack/filebeat/input/s3/input.go#L432

Before that, JSON processing was controlled ONLY by the `expand_event_list_from_field` configuration, as you can see here: https://github.com/elastic/beats/blob/v7.6.2/x-pack/filebeat/input/s3/input.go#L431
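To make the behavioral change concrete, here is a rough, standalone paraphrase of the two code paths as I read them (the identifiers and structure are mine, not the actual beats code):

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"io"
	"strings"
)

// decodeEvents is a rough paraphrase of the gating difference between the two
// versions; the identifiers here are mine, not the actual beats code.
func decodeEvents(r io.Reader, contentType, expandEventListFromField string, enforceContentType bool) error {
	// v7.7.0+ path: a content-type of application/json alone forces JSON
	// handling, and without expand_event_list_from_field the object is rejected.
	if enforceContentType && contentType == "application/json" && expandEventListFromField == "" {
		return errors.New("expand_event_list_from_field parameter is missing in config for application/json content-type file")
	}
	// v7.6.2 path (and the non-JSON path in v7.7.0+): one event per line,
	// which is exactly what a JSONL object needs.
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fmt.Println("event:", sc.Text())
	}
	return sc.Err()
}

func main() {
	body := `{"id": "0001", "hey": "there", "how": {"are": "you"}}`
	// Old behavior: the JSONL object is processed line by line.
	_ = decodeEvents(strings.NewReader(body), "application/json", "", false)
	// New behavior: the same object is rejected outright.
	fmt.Println(decodeEvents(strings.NewReader(body), "application/json", "", true))
}
```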
Now, it's probably the case that the content-type on the S3 object should NOT be `application/json` in the first place, but sadly I do not have control over that 😦. I'm essentially dealing with the same problem as #18696, except that Cloudflare is the entity pushing the logs to S3 (for that reporter it was AWS GuardDuty) and I don't have control over how Cloudflare sets the content-type.
I reproduced this issue by compiling filebeat off of master two weeks ago (the `filebeat version` output was `filebeat version 8.0.0 (amd64), libbeat 8.0.0 [3341c1bca5626d1ee90af692617f10f58695ed1c built 2020-06-30 20:15:28 +0000 UTC]`) and, looking at the source code today (2297636), it seems the problem is still there. Here are the steps I followed to reproduce it:
```
$ cat s3filebeat.log
{"id": "0001", "hey": "there", "how": {"are": "you"}}
{"id": "0002", "hope": "you", "are": {"doing": "well"}}
{"id": "0003", "I": "am", "doing": {"O": "K"}}
$ gzip s3filebeat.log
$ aws --profile PROFILE s3api put-object --body ./s3filebeat.log.gz --bucket lucas-test-filebeat-s3 --content-encoding gzip --content-type application/json --key s3filebeat.log.gz
{
"ETag": "\"955ed9f01b6ee38dbba167daab9ebbbb\""
}
$ cat filebeat.yml
filebeat.inputs:
- type: s3
  queue_url: "https://sqs.us-east-1.amazonaws.com/ACCTNUM/cloudflare_logs_dev"
  role_arn: "arn:aws:iam::ACCTNUM:role/cloudflare_logs_filebeat_access_s3_sqs_dev"
output.console:
  pretty: true
$ ./filebeat -e
2020-06-30T15:39:04.180-0500 INFO instance/beat.go:628 Home path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Config path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat] Data path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data] Logs path: [/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs]
2020-06-30T15:39:04.180-0500 INFO instance/beat.go:636 Beat ID: 2ef98d6a-ef7c-4885-a258-b4341a63b43b
2020-06-30T15:39:04.181-0500 INFO [seccomp] seccomp/seccomp.go:124 Syscall filter successfully installed
2020-06-30T15:39:04.181-0500 INFO [beat] instance/beat.go:964 Beat info {"system_info": {"beat": {"path": {"config": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "data": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data", "home": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "logs": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/logs"}, "type": "filebeat", "uuid": "2ef98d6a-ef7c-4885-a258-b4341a63b43b"}}}
2020-06-30T15:39:04.181-0500 INFO [beat] instance/beat.go:973 Build info {"system_info": {"build": {"commit": "3341c1bca5626d1ee90af692617f10f58695ed1c", "libbeat": "8.0.0", "time": "2020-06-30T20:15:28.000Z", "version": "8.0.0"}}}
2020-06-30T15:39:04.181-0500 INFO [beat] instance/beat.go:976 Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":12,"version":"go1.13.3"}}}
2020-06-30T15:39:04.182-0500 INFO [beat] instance/beat.go:980 Host info {"system_info": {"host": {"architecture":"x86_64","boot_time":"2020-06-22T10:53:41-05:00","containerized":false,"name":"lgroenendaal-XPS-15-9570","ip":["127.0.0.1/8","::1/128","192.168.86.24/24","fe80::c38c:a654:ca3e:d0dd/64","10.20.211.109/32","fe80::d2ee:dd77:2943:aa45/64","172.17.0.1/16"],"kernel_version":"5.3.0-59-generic","mac":["9c:b6:d0:c6:01:39","02:42:e2:23:9b:84"],"os":{"family":"debian","platform":"ubuntu","name":"Ubuntu","version":"18.04.2 LTS (Bionic Beaver)","major":18,"minor":4,"patch":2,"codename":"bionic"},"timezone":"CDT","timezone_offset_sec":-18000,"id":"5a8843fa712d481595ebd41926cda45f"}}}
2020-06-30T15:39:04.183-0500 INFO [beat] instance/beat.go:1009 Process info {"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":null,"effective":null,"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat", "exe": "/home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/filebeat", "name": "filebeat", "pid": 5578, "ppid": 4156, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2020-06-30T15:39:03.640-0500"}}}
2020-06-30T15:39:04.183-0500 INFO instance/beat.go:298 Setup Beat: filebeat; Version: 8.0.0
2020-06-30T15:39:04.183-0500 INFO [publisher] pipeline/module.go:113 Beat name: lgroenendaal-XPS-15-9570
2020-06-30T15:39:04.184-0500 WARN beater/filebeat.go:151 Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500 INFO [monitoring] log/log.go:118 Starting metrics logging every 30s
2020-06-30T15:39:04.184-0500 INFO instance/beat.go:449 filebeat start running.
2020-06-30T15:39:04.184-0500 WARN beater/filebeat.go:251 Filebeat is unable to load the Ingest Node pipelines for the configured modules because the Elasticsearch output is not configured/enabled. If you have already loaded the Ingest Node pipelines or are using Logstash pipelines, you can ignore this warning.
2020-06-30T15:39:04.184-0500 INFO registrar/registrar.go:145 Loading registrar data from /home/lgroenendaal/src/github.com/elastic/beats/x-pack/filebeat/data/registry/filebeat/data.json
2020-06-30T15:39:04.184-0500 INFO registrar/registrar.go:152 States Loaded from registrar: 0
2020-06-30T15:39:04.184-0500 INFO [crawler] beater/crawler.go:71 Loading Inputs: 1
2020-06-30T15:39:04.185-0500 WARN [cfgwarn] s3/input.go:131 BETA: s3 input type is used
2020-06-30T15:39:04.185-0500 INFO [crawler] beater/crawler.go:141 Starting input (ID: 18222034013403473169)
2020-06-30T15:39:04.185-0500 INFO [crawler] beater/crawler.go:108 Loading and starting Inputs completed. Enabled inputs: 1
2020-06-30T15:39:04.186-0500 INFO [s3] s3/input.go:173 visibility timeout is set to 300 seconds
2020-06-30T15:39:04.186-0500 INFO [s3] s3/input.go:174 aws api timeout is set to 2m0s
2020-06-30T15:39:04.186-0500 INFO [s3] s3/input.go:196 s3 input worker has started. with queueURL: https://sqs.us-east-1.amazonaws.com/374144443638/cloudflare_logs_dev
2020-06-30T15:39:14.633-0500 ERROR [s3] s3/input.go:458 expand_event_list_from_field parameter is missing in config for application/json content-type file
2020-06-30T15:39:14.633-0500 ERROR [s3] s3/input.go:393 createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file
```
The important error is the last line, which tells me that my object cannot be processed even though I think it should be processable: `s3/input.go:393 createEventsFromS3Info failed processing file from s3 bucket "lucas-test-filebeat-s3" with name "s3filebeat.log.gz": expand_event_list_from_field parameter is missing in config for application/json content-type file`.
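For what it's worth, the object body itself is perfectly well-formed JSONL; a minimal standalone Go check (my own sketch, not beats code) decodes the sample lines one at a time without complaint:

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

func main() {
	// The same three lines that were gzipped and uploaded above.
	body := `{"id": "0001", "hey": "there", "how": {"are": "you"}}
{"id": "0002", "hope": "you", "are": {"doing": "well"}}
{"id": "0003", "I": "am", "doing": {"O": "K"}}`

	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		var doc map[string]interface{}
		if err := json.Unmarshal(sc.Bytes(), &doc); err != nil {
			fmt.Println("decode failed:", err)
			return
		}
		fmt.Println("decoded event", doc["id"])
	}
}
```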
If I try, just for fun, to include the `expand_event_list_from_field` config, it will, understandably, fail to parse, and we get the WARN log: `s3/input.go:542 decode json failed for 's3filebeat.log.gz' from S3 bucket 'lucas-test-filebeat-s3', skipping this file: json: cannot unmarshal string into Go value of type []interface {}`.
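That failure makes sense to me: with `expand_event_list_from_field` set, the input apparently expects every top-level value in the document to be an array of events, which a JSONL line like mine can never satisfy. A tiny sketch that reproduces the same unmarshal error (my guess at the expected shape, not the actual beats decode path):

```go
package main

import (
	"encoding/json"
	"fmt"
)

func main() {
	// The first line of the uploaded object.
	line := []byte(`{"id": "0001", "hey": "there", "how": {"are": "you"}}`)

	// Decoding it into a map whose values must be event arrays fails, because
	// "id" holds a plain string rather than an array.
	var fields map[string][]interface{}
	err := json.Unmarshal(line, &fields)
	fmt.Println(err)
	// Output: json: cannot unmarshal string into Go value of type []interface {}
}
```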
Any advice on how to proceed here? Personally, I think the appropriate change would be to ignore `content-type: application/json` entirely and have `expand_event_list_from_field` be the ONLY thing controlling whether or not the object content is parsed as JSON (which is how the logic used to work), but I defer to you maintainers because you have the vision for how you want the code to behave.
I really appreciate anyone who took the time to read this, and I really appreciate the creation of this feature; it's a nifty one. Cheers.