Description
Elasticsearch version (bin/elasticsearch --version):
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
Version: 7.3.1, Build: default/deb/4749ba6/2019-08-19T20:19:25.651794Z, JVM: 12.0.2
Plugins installed: []
# curl -sS http://localhost:9200/_cat/plugins |awk '{print $2 " " $3}' |sort |uniq |sort
discovery-ec2 7.3.1
repository-s3 7.3.1
JVM version (java -version):
/usr/share/elasticsearch/jdk/bin/java --version
openjdk 12.0.2 2019-07-16
OpenJDK Runtime Environment (build 12.0.2+10)
OpenJDK 64-Bit Server VM (build 12.0.2+10, mixed mode, sharing)
OS version (uname -a if on a Unix-like system):
Linux obs6esmaster-i-YYYYYYYobsstore01.XXXXX.awn 4.4.0-1079-aws #89-Ubuntu SMP Tue Mar 26 15:25:52 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Filebeat version (while the issue seems Elasticsearch-centric, noting the filebeat version for repro):
root@obs6esmaster-i-094de1714b4abf10f:/usr/share/java# filebeat version
filebeat version 7.3.1 (amd64), libbeat 7.3.1 [a4be71b90ce3e3b8213b616adfcd9e455513da45 built 2019-08-19 19:30:50 +0000 UTC]
Description of the problem including expected versus actual behaviour:
When running filebeat locally on master-only nodes with version 6.7.1 of both Elasticsearch and filebeat, metrics and log events were sent into Elasticsearch itself without issue.
After deploying version 7.3.1 of both Elasticsearch and filebeat, we noted that master-only nodes throw java.lang.NullPointerException for _bulk payloads sent from filebeat to its logging ingest pipeline.
If we configure the filebeat running on the master node to send its events to any other non-master node in the cluster, the issue is avoided. The clusters above are configured with dedicated ingest nodes (so masters cannot execute pipelines directly).
One key point of interest: if we ENABLE the master node to also act as an ingest node, the java.lang.NullPointerException does NOT occur. Hopefully this helps narrow down the bug.
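For reference, the role layout can be confirmed from the master itself; on 7.x a master-only node shows only m (no i) in the node.role column:
curl -sS 'http://localhost:9200/_cat/nodes?v&h=name,node.role,master'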
Problem Statement Summary
Filebeat 7.3.1 sending logging events via _bulk through the ingest pipeline named filebeat-7.3.1-elasticsearch-server-pipeline causes the master node filebeat is pointed at to return a java.lang.NullPointerException / HTTP 500 error. This did not occur with identical deployments of version 6.7.1 of both Elasticsearch and filebeat. And as noted above, if we enable ingest on the masters, it does not occur.
Other nodes in our cluster also do NOT have ingest enabled, yet they do not have this issue. We have only found this on MASTER-only nodes thus far, and only with version 7.3.1 versus our prior 6.7.1.
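The pipeline named above is the one filebeat's elasticsearch module loads; if it helps, its presence on the cluster can be checked with:
curl -sS 'http://localhost:9200/_ingest/pipeline/filebeat-7.3.1-elasticsearch-server-pipeline?pretty'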
Steps to reproduce:
1. Configure a cluster with dedicated ingest nodes, so that master nodes cannot execute ingest pipelines. We have confirmed the issue with version 7.3.1, and confirmed it does not occur with version 6.7.1, where we previously ran the same configuration.
2. Configure filebeat to run on the master nodes and send events via localhost:9200.
3. You should see the same errors we do (logs below).
4. Workaround: if we send events to non-master nodes instead, _bulk ingestion from the filebeat module works (see the sketch after this list).
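A minimal sketch of the workaround output section from step 4, with the master-local filebeat pointed at the dedicated ingest nodes instead of localhost (host names below are placeholders, not our real node names):
output.elasticsearch:
  hosts: ["obs6esingest-1:9200", "obs6esingest-2:9200"]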
Example Configurations used:
Filebeat
root@obs6esmaster-i-XXXXXXXXXX:/etc/filebeat# cat filebeat.yml |grep -v "#\|^$"
filebeat.inputs:
- type: log
  enabled: false
  paths:
    - /var/log/*.log
filebeat.modules:
- module: system
- module: elasticsearch
  gc:
    enabled: false
setup.template.settings:
  index.number_of_shards: 3
  index.codec: best_compression
  _source.enabled: true
output.elasticsearch:
  hosts: ["localhost:9200"]
processors:
- add_host_metadata: ~
- add_cloud_metadata: ~
logging.level: debug
logging.to_files: true
logging.files:
  path: /var/log/filebeat
  name: filebeat
  keepfiles: 7
  permissions: 0644
xpack.monitoring.enabled: true
xpack.monitoring.elasticsearch:
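For completeness, the module ingest pipelines referenced in the _bulk requests below are the ones filebeat loads for its enabled modules; they can be (re)loaded explicitly with something along these lines (exact flags per the filebeat 7.x docs):
filebeat setup --pipelines --modules system,elasticsearch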
Elasticsearch Cluster
root@obs6esmaster-i-XXXXXXXXXX:/etc/filebeat# cat /etc/elasticsearch/elasticsearch.yml |grep -v "#\|^$"
cluster.name: obsstore01.YYYYY.ZZZZZ
cluster.initial_master_nodes: [ larry, curly, moe ]
node.master: true
node.data: false
node.ingest: false
node.ml: false
path:
  data: /mnt/disk1/elasticsearch/data
  logs: /var/log/elasticsearch
bootstrap.memory_lock: true
network.host: [ 10.113.89.170, _local_ ]
http.port: 9200
transport.port: 9300
transport.publish_port: 9300
xpack.monitoring.enabled: true
xpack.monitoring.collection.enabled: true
xpack.monitoring.collection.cluster.stats.timeout : 20s
xpack.monitoring.collection.node.stats.timeout: 20s
xpack.monitoring.collection.index.stats.timeout: 30s
xpack.watcher.enabled: false
xpack.graph.enabled: false
xpack.ml.enabled: false
xpack.security.enabled: false
discovery:
  seed_providers: ec2
  ec2:
    any_group: false
    endpoint: ec2.<some_region_seeeecret>.amazonaws.com
    tag:
      Cell: not_telling_you_sorry
      Environment: some_super_secret_env_name
gateway.expected_nodes: 1
gateway.recover_after_time: 2m
gateway.recover_after_nodes: 1
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 2
reindex.remote.whitelist: "obs7esingest*:*, obs7esquery*:*"
cluster.routing.allocation.disk.threshold_enabled: true
cluster.routing.allocation.disk.watermark.low: 95%
cluster.routing.allocation.disk.watermark.high: 98%
cluster.routing.allocation.disk.watermark.flood_stage: 99.5%
action.destructive_requires_name: false
action.auto_create_index: .security,.monitoring*,.watches,.triggered_watches,.watcher-history-*,.ml*, filebeat-*
http.cors.enabled: true
http.cors.allow-origin: /.*/
http.cors.allow-credentials: true
http.cors.allow-headers: X-Requested-With, Content-Type, Content-Length
indices.memory.index_buffer_size: 10%
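For contrast, the dedicated ingest nodes in the same cluster run with roughly the inverse role settings (a sketch of the relevant lines only, assuming our standard layout):
node.master: false
node.data: false
node.ingest: true
node.ml: false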
Provide logs (if relevant):
Example exception log from a master receiving the _bulk payload from the localhost instance of filebeat:
[2019-09-11T20:24:10,815][WARN ][r.suppressed ] [obs6esmaster-i-0ee5f1df872af9b2f.obsstore01.gjacobs.dev2.awn] path: /_bulk, params: {}
org.elasticsearch.transport.RemoteTransportException: [obs6esingest-i-0257e445e9599463d.obsstore01.gjacobs.dev2.awn][10.113.147.73:9300][indices:data/write/bulk]
Caused by: java.lang.NullPointerException
at org.elasticsearch.common.io.stream.StreamOutput.writeString(StreamOutput.java:403) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.DocWriteResponse.writeTo(DocWriteResponse.java:283) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.update.UpdateResponse.writeTo(UpdateResponse.java:82) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.bulk.BulkItemResponse.writeTo(BulkItemResponse.java:510) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.bulk.BulkResponse.writeTo(BulkResponse.java:148) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.OutboundMessage.writeMessage(OutboundMessage.java:70) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.OutboundMessage.serialize(OutboundMessage.java:53) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:169) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.OutboundHandler$MessageSerializer.get(OutboundHandler.java:155) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.OutboundHandler$SendContext.get(OutboundHandler.java:202) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.OutboundHandler.internalSend(OutboundHandler.java:132) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.OutboundHandler.sendMessage(OutboundHandler.java:127) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.OutboundHandler.sendResponse(OutboundHandler.java:107) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TcpTransportChannel.sendResponse(TcpTransportChannel.java:64) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:54) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:47) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.support.ChannelActionListener.onResponse(ChannelActionListener.java:30) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$1(TransportBulkAction.java:675) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:112) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:473) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onResponse(TransportBulkAction.java:454) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onResponse(TransportBulkAction.java:443) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishOnSuccess(TransportReplicationAction.java:846) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleResponse(TransportReplicationAction.java:765) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleResponse(TransportReplicationAction.java:749) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:224) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:216) [elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:141) [elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) [elasticsearch-7.3.1.jar:7.3.1]
at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.3.1.jar:7.3.1]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar:4.1.36.Final]
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar:4.1.36.Final]
at java.lang.Thread.run(Thread.java:835) [?:?]
The localhost instance of filebeat reports the corresponding error on its side:
2019-09-11T20:33:03 obs6esmaster-i-0ee5f1df872af9b2f.obsstore01.gjacobs.dev2.awn filebeat[20238]: 2019-09-11T20:33:03.717Z#011ERROR#011elasticsearch/client.go:343#011Failed to perform any bulk index operations: 500 Internal Server Error: {"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[obs6esingest-i-0257e445e9599463d.obsstore01.gjacobs.dev2.awn][10.113.147.73:9300][indices:data/write/bulk]"}],"type":"null_pointer_exception","reason":null},"status":500}
A packet capture of the same failure shows the request as:
POST /_bulk HTTP/1.1
Host: localhost:9200
User-Agent: Go-http-client/1.1
Content-Length: 71073
Accept: application/json
Content-Type: application/json; charset=UTF-8
Accept-Encoding: gzip
{"index":{"_index":"filebeat-7.3.1","pipeline":"filebeat-7.3.1-elasticsearch-server-pipeline"}}
{"@timestamp":"2019-09-11T20:32:23.520Z","message":"[2019-09-11T20:23:07.523+0000][18816][safepoint ] Application time: 0.0326183 seconds","fileset":{"name":"server"},"input":{"type":"log"},"agent":{"type":"filebeat","ephemeral_id":"ddc04c98-fe46-41bf-934e-165abc3b2a25","hostname":"obs6esmaster-i-0ee5f1df872af9b2f","id":"5de46c1a-080a-4c23-8962-ec876ecfdee6","version":"7.3.1"},"ecs":{"version":"1.0.1"},"cloud":{"provider":"aws","instance":{"id":"i-0ee5f1df872af9b2f"},"machine":{"type":"m5.large"},"region":"us-west-2","availability_zone":"us-west-2a","account":{"id":"099078926092"},"image":{"id":"ami-0eb23bc37ff79dad8"}},"log":{"file":{"path":"/var/log/elasticsearch/gc.log"},"offset":68058},"event":{"module":"elasticsearch","dataset":"elasticsearch.server","timezone":"+00:00"},"host":{"name":"obs6esmaster-i-0ee5f1df872af9b2f","containerized":false,"hostname":"obs6esmaster-i-0ee5f1df872af9b2f","architecture":"x86_64","os":{"version":"16.04.6 LTS (Xenial Xerus)","family":"debian","name":"Ubuntu","kernel":"4.4.0-1079-aws","codename":"xenial","platform":"ubuntu"},"id":"95d49d0d589d4c9d9923414a4fb3d908"},"service":{"type":"elasticsearch"}}
---- Omitted rest of events to reduce the size of this GitHub issue ----
And the corresponding response from Elasticsearch back to filebeat:
HTTP/1.1 500 Internal Server Error
content-type: application/json; charset=UTF-8
access-control-allow-credentials: true
content-encoding: gzip
content-length: 204
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[obs6esingest-i-0f6a4745df63266f5.obsstore01.gjacobs.dev2.awn][10.113.123.205:9300][indices:data/write/bulk]"}],"type":"null_pointer_exception","reason":null},"status":500}
We can see that x-pack metrics from the filebeat running locally on this master-only node are received fine, but, rate-wise, the logging events are failing.
I have done further tests to try to isolate and reproduce the issue, but it seems very specific to the filebeat content/payload itself (or else related to gzip or content size).
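One way to exercise the gzip/content-size theory by hand would be to replay a captured filebeat NDJSON body with the same Content-Encoding (bulk.ndjson here is a hypothetical file holding the captured payload):
gzip -c bulk.ndjson | curl -sS -H 'Content-Type: application/json' -H 'Content-Encoding: gzip' --data-binary @- http://localhost:9200/_bulk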
I could not reproduce the issue by hand using non-filebeat data sent to the same ingest pipeline; see the curl example below.
root@obs6esmaster-i-XXXXXXX:/etc/filebeat# curl -H 'Content-Type: application/json' -vvvv -XPUT http://localhost:9200/_bulk -d $'{\"index\":{\"_index\":\"filebeat-7.3.1\",\"pipeline\":\"filebeat-7.3.1-elasticsearch-server-pipeline\"}} \n{\"field\": \"value\"}\n'
* Trying 127.0.0.1...
* Connected to localhost (127.0.0.1) port 9200 (#0)
> PUT /_bulk HTTP/1.1
> Host: localhost:9200
> User-Agent: curl/7.47.0
> Accept: */*
> Content-Type: application/json
> Content-Length: 118
>
* upload completely sent off: 118 out of 118 bytes
< HTTP/1.1 200 OK
< content-type: application/json; charset=UTF-8
< content-length: 281
< access-control-allow-credentials: true
<
* Connection #0 to host localhost left intact
{"took":169,"ingest_took":0,"errors":false,"items":[{"index":{"_index":"filebeat-7.3.1-2019.09.11-000001","_type":"_doc","_id":"2ZlCIm0BZsr1j6UGjRqc","_version":1,"result":"created","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":333417,"_primary_term":1,"status":201}}]}
I can provide a localhost:9200 pcap of the data from filebeat that causes the NullPointerExceptions to fire, if required.
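If useful, the capture can be taken on the loopback interface of the affected master with something like (output filename arbitrary):
tcpdump -i lo -s 0 -w filebeat_bulk_localhost9200.pcap 'tcp port 9200'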