Skip to content

Multi-byte line delimiters split across buffer boundaries cause log event merging #24027

@lfrancke

Description

@lfrancke

A note for the community

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Problem

Summary

When using \r\n (or any other multi-char delimiter) as the line_delimiter in the file source, Vector can incorrectly merge separate log events into a single event when the \r\n delimiter happens to be split across a buffer boundary (\r before the buffer ends, \n already in the new buffer -> not recognized as delimiter). This causes unpredictable parsing failures at "random" and it was a pain to track down :)

Root Cause

The bug is in lib/file-source-common/src/buffer.rs in the read_until_with_max_size function. When searching for a multi-byte delimiter (e.g., \r\n). This is my current understanding but ... I'm no expert in the Vector internals so I might be wrong.

  1. AsyncBufRead::fill_buf() returns data from the internal buffer (it's a default BufReader so probably 8192 bytes)
  2. If the buffer ends with \r and the next buffer starts with \n, the Finder::find() call doesn't find the complete \r\n delimiter
  3. The old code would consume all bytes including the trailing \r
  4. On the next iteration, the buffer starts with \n, which doesn't match \r\n
  5. Result: The delimiter is never found, and content from both sides gets merged

Reproduction

Create a file where a \r\n delimiter falls at byte positions that are multiples of 8192 (the default BufReader capacity):

Here is a repository with test data, a script to test it and example vector configuration showing the bug: https://github.com/lfrancke/vector-repro-24027

Fix idea

Track partial delimiter matches across loop iterations using a buffer. The buffer is tiny (at most the length of the delimiter being used).

  • Makes the logic slightly more complicated on buffer boundaries
  • Should handle all edge cases on buffer boundaries
  • Minimal performance impact as we only need to check once on a buffer boundary

Bad ideas (subjective :) )

  • We could somehow continue reading when we are in the middle of a partial match but that's probably complex with fill_buf etc.
  • Increase buffer size to make it less likely, also more memory usage

Next steps

Unless anyone has a better idea I'd

  • [DONE] Create a repository with a test script, test config and data generator to show the bug
  • [DONE] Create a PR implementing the state tracking approach (I have a working version here but it's littered with debug statements and not-nice code but it DOES work which means I'm reasonably confident that this is actually the bug - fix(file source) Fix a data corruption bug with multi-char delimiters #24028

I have never looked into the Vector code in detail before so this PR definitely needs careful review so I don't forget any side-effects, also it's been a while since I've coded Rust in anger.

Version

I tried it on 0.46.0, 0.50.0 and the current master branch

Metadata

Metadata

Assignees

No one assigned

    Labels

    source: fileAnything `file` source relatedtype: bugA code related bug.

    Type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions