
BufferedTokenizer doesn't dice the payload correctly when restarting processing after a buffer-full error #16483

Closed
@andsel

Description


Logstash information:

Please include the following information:

  1. Logstash version (e.g. bin/logstash --version): any
  2. Logstash installation source (e.g. built from source, with a package manager: DEB/RPM, expanded from tar or zip archive, docker)
  3. How is Logstash being run (e.g. as a service/service manager: systemd, upstart, etc. Via command line, docker/kubernetes)

Plugins installed: (bin/logstash-plugin list --verbose)

JVM (e.g. java -version):

If the affected version of Logstash is 7.9 (or earlier), or if it is NOT using the bundled JDK or using the 'no-jdk' version in 7.10 (or higher), please provide the following information:

  1. JVM version (java -version)
  2. JVM installation source (e.g. from the Operating System's package manager, from source, etc).
  3. Value of the LS_JAVA_HOME environment variable if set.

OS version (uname -a if on a Unix-like system):

Description of the problem including expected versus actual behavior:
When the BufferedTokenizer is used to dice the input, then after a buffer-full error the input should be consumed up to the next separator, and tokenizing should restart correctly with the data that follows that separator.
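
For illustration, here is a minimal sketch of that contract against the tokenizer as the json_lines codec uses it (runs under JRuby inside Logstash; the constructor arguments and the rescued exception mirror the codec's usage, but treat the exact API as an assumption of this sketch):

# Sketch only: FileWatch::BufferedTokenizer is the Ruby name Logstash core
# registers for org.logstash.common.BufferedTokenizerExt.
tok = FileWatch::BufferedTokenizer.new("\n", 10)   # separator, size limit in bytes

tok.extract("aaaaa")               # => [] - no separator yet, 5 bytes buffered
begin
  tok.extract("aaaaaaa\nbb\n")     # the buffered "a" payload reaches 12 bytes > 10
rescue java.lang.IllegalStateException => e
  puts e.message                   # the buffer-full error surfaced to the codec
end

# Expected: the oversized "aaaaaaaaaaaa" payload is dropped up to its "\n" and
# tokenizing resumes after that separator, so only "bb" comes out.
# Actual (this issue): the oversized payload is not skipped correctly and is
# later emitted as if it were a valid token.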

Steps to reproduce:
Mostly inspired by logstash-plugins/logstash-codec-json_lines#45 (comment)

  1. Configure Logstash to use the json_lines codec from PR logstash-plugins/logstash-codec-json_lines#45 ("Re-established previous behaviour without a default limit for 'decode_size_limit_bytes'").
In the Gemfile add:
gem "logstash-codec-json_lines", :path => "/path/to/logstash-codec-json_lines"
  2. From a shell, run bin/logstash-plugin install --no-verify
  3. Start Logstash with the following pipeline:
input {
  tcp {
    port => 1234

    codec => json_lines {
      decode_size_limit_bytes => 100000
    }
  }
}

output {
  stdout {
    codec => rubydebug
  }
}
  4. Use the following script to generate some load:
require 'socket'
require 'json'

hostname = 'localhost'
port = 1234

socket = TCPSocket.open(hostname, port)

# First write: only the first 90_000 bytes of an oversized (~105 KB) JSON line,
# so the tokenizer buffers it without hitting the 100_000-byte limit yet.
data = {"a" => "a"*105_000}.to_json + "\n"; socket.write(data[0...90_000])
# Second write: the rest of the oversized line plus a small valid JSON line,
# which pushes the buffered payload past the limit.
data = {"a" => "a"*105_000}.to_json + "\n"; socket.write(data[90_000..] + "{\"b\": \"bbbbbbbbbbbbbbbbbbb\"}\n")

socket.close
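
For reference, the payload sizes the script produces; the second write is what pushes the buffered payload past the 100_000-byte limit:

require 'json'

data = {"a" => "a" * 105_000}.to_json + "\n"
data.bytesize               # => 105_009 (JSON framing plus the trailing newline)
data[0...90_000].bytesize   # => 90_000  - first write, still under the limit
data[90_000..].bytesize     # => 15_009  - second write completes the oversized line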

Provide logs (if relevant):
Logstash generates 3 events:

{
  "message" => "Payload bigger than 100000 bytes",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.755601Z,
  "tags" => [
    [0] "_jsonparsetoobigfailure"
  ]
}
{
  "b" => "bbbbbbbbbbbbbbbbbbb",
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.774574Z
}
{
  "a" => "aaaaa......a"
  "@version" => "1",
  "@timestamp" => 2024-10-01T10:49:55.774376Z
}

Instead, only 2 events should be generated: one with the _jsonparsetoobigfailure error for the oversized message made of a characters, and then a valid event for the message made of bs.
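
Illustratively, leaving timestamps aside, the expected output is just the first two of the events above:

{
  "message" => "Payload bigger than 100000 bytes",
  "@version" => "1",
  "tags" => [
    [0] "_jsonparsetoobigfailure"
  ]
}
{
  "b" => "bbbbbbbbbbbbbbbbbbb",
  "@version" => "1"
}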
The extended motivation is explained in logstash-plugins/logstash-codec-json_lines#45 (comment)
