Bytes get corrupted on network layer #54425

Closed
@danielmitterdorfer

Description

Elasticsearch version (bin/elasticsearch --version): master (see below for details)

Plugins installed: none

JVM version (java -version): OpenJDK Runtime Environment 18.9 (build 11.0.2+7)

OS version (uname -a if on a Unix-like system): Linux 4.13.0-45-generic

Description of the problem including expected versus actual behavior:

When running a bulk benchmark against a three-node cluster we have been getting errors like the following since Friday, March 27:

  • HTTP status: 500, message: Invalid vInt ((fffffffd & 0x7f) << 28) | fffffff'
  • HTTP status: 500, message: Invalid string; unexpected character: 255 hex: ff
  • HTTP status: 500, message: tried to read: 264241151 bytes but only 130117 remaining
  • HTTP status: 500, message: Failed to deserialize response from handler [org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler]
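
These messages all come from the transport-layer deserialization code: a request is encoded as variable-length integers ("vInts") and length-prefixed strings, so a single corrupted byte is enough to flip a continuation bit or inflate a length prefix and derail the whole read. The minimal sketch below illustrates this; it is not the actual StreamInput implementation, and the class and method names are made up for the example, but it shows why corruption surfaces as "Invalid vInt" or "tried to read ... bytes but only ... remaining" rather than as a plain I/O error. The "Invalid string; unexpected character" variant is the same failure one step further along, when the bytes read for a string no longer form valid character data.

// Illustrative sketch only; not the Elasticsearch StreamInput implementation.
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class VIntDecodingSketch {

    // Decode up to five bytes, seven payload bits each; the high bit of a byte
    // signals that another byte follows.
    static int readVInt(DataInputStream in) throws IOException {
        byte b = in.readByte();
        int i = b & 0x7F;
        if ((b & 0x80) == 0) return i;
        b = in.readByte();
        i |= (b & 0x7F) << 7;
        if ((b & 0x80) == 0) return i;
        b = in.readByte();
        i |= (b & 0x7F) << 14;
        if ((b & 0x80) == 0) return i;
        b = in.readByte();
        i |= (b & 0x7F) << 21;
        if ((b & 0x80) == 0) return i;
        b = in.readByte();
        // Only four more bits fit into an int; anything else means the bytes are garbage.
        if ((b & 0xF0) != 0) {
            throw new IOException("Invalid vInt: unexpected trailing byte 0x" + Integer.toHexString(b & 0xFF));
        }
        return i | ((b & 0x0F) << 28);
    }

    // A string is a vInt length followed by that many bytes; a corrupted length
    // prefix makes the reader ask for far more data than the buffer holds.
    static String readString(DataInputStream in) throws IOException {
        int length = readVInt(in);
        if (length > in.available()) {
            throw new IOException("tried to read: " + length + " bytes but only " + in.available() + " remaining");
        }
        byte[] bytes = new byte[length];
        in.readFully(bytes);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Well-formed input: length 3, then "foo".
        byte[] ok = {3, 'f', 'o', 'o'};
        System.out.println(readString(new DataInputStream(new ByteArrayInputStream(ok))));

        // The same payload with a single corrupted length byte (0xFF sets the
        // continuation bit) turns the length prefix into 13183 and decoding fails.
        byte[] corrupt = {(byte) 0xFF, 'f', 'o', 'o'};
        try {
            readString(new DataInputStream(new ByteArrayInputStream(corrupt)));
        } catch (IOException e) {
            System.out.println("decode failed: " + e.getMessage());
        }
    }
}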

Here is the stack trace from one of these errors:

[2020-03-28T01:37:43,598][WARN ][o.e.c.r.a.AllocationService] [rally-node-0] failing shard [failed shard, shard [osmgeopoints][3], node[177SLf7WTDCnrmB--Ryr2A], [R], s[STARTED], a[id=scJR6Wm3TmyemaIAAAnRcA], message [failed to perform indices:data/write/bulk[s] on replica [osmgeopoints][3], node[177SLf7WTDCnrmB--Ryr2A], [R], s[STARTED], a[id=scJR6Wm3TmyemaIAAAnRcA]], markAsStale [true], failure [org.elasticsearch.transport.RemoteTransportException: [rally-node-1][192.168.20.32:39300][indices:data/write/bulk[s][r]]
Caused by: java.io.IOException: Invalid string; unexpected character: 146 hex: 92
        at org.elasticsearch.common.io.stream.StreamInput.throwOnBrokenChar(StreamInput.java:543)
        at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:502)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteShardRequest.<init>(TransportReplicationAction.java:1044)
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteReplicaRequest.<init>(TransportReplicationAction.java:1112)
        at org.elasticsearch.action.support.replication.TransportReplicationAction.lambda$new$1(TransportReplicationAction.java:146)
        at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:56)
        at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:185)
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:129)
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:106)
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:672)
        at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:136)
        at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:68)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:227)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at java.lang.Thread.run(Thread.java:834)
]]
org.elasticsearch.transport.RemoteTransportException: [rally-node-1][192.168.20.32:39300][indices:data/write/bulk[s][r]]
Caused by: java.io.IOException: Invalid string; unexpected character: 146 hex: 92
        at org.elasticsearch.common.io.stream.StreamInput.throwOnBrokenChar(StreamInput.java:543) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readString(StreamInput.java:502) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteShardRequest.<init>(TransportReplicationAction.java:1044) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$ConcreteReplicaRequest.<init>(TransportReplicationAction.java:1112) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.lambda$new$1(TransportReplicationAction.java:146) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:56) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.handleRequest(InboundHandler.java:185) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:129) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:106) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:672) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:136) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:110) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:68) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355) ~[?:?]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:227) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:355) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:377) ~[?:?]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:363) ~[?:?]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919) ~[?:?]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:714) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:615) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:578) ~[?:?]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493) ~[?:?]
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) ~[?:?]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[?:?]
        at java.lang.Thread.run(Thread.java:834) [?:?]

After some bisecting I was able to narrow it down to #48263.

Steps to reproduce:

Run a benchmark with Rally against a three-node cluster built from the affected commit 42150d3 and observe the failure after a while:

esrally --target-host="HOST_IP_A:39200,HOST_IP_B:39200,HOST_IP_C:39200"  --track="geopoint" --challenge="append-no-conflicts-index-only" --car="4gheap" --client-options="timeout:240" --runtime-jdk="11" --track-params="{\"number_of_replicas\": 1}" --pipeline="from-sources-complete" --revision="42150d3674089f415e0b8c54993aef2a03d016b5" --on-error=abort

Notes:

  • replace HOST_IP_A, HOST_IP_B and HOST_IP_C with the actual IP addresses of the three hosts
  • we did not observe the issue with single-node clusters
