Multi-byte line delimiters split across buffer boundaries cause log event merging

### A note for the community


* Please vote on this issue by adding a 👍 [reaction](https://blog.github.com/2016-03-10-add-reactions-to-pull-requests-issues-and-comments/) to the original issue to help the community and maintainers prioritize this request
* If you are interested in working on this issue or have submitted a pull request, please leave a comment


### Problem

## Summary

When using `\r\n` (or any other multi-char delimiter) as the `line_delimiter` in the file source, Vector can incorrectly merge separate log events into a single event when the `\r\n` delimiter happens to be split across a buffer boundary (`\r` before the buffer ends, `\n` already in the new buffer -> not recognized as delimiter). This causes unpredictable parsing failures at "random" and it was a pain to track down :)

### Root Cause

The bug is in `lib/file-source-common/src/buffer.rs` in the `read_until_with_max_size` function. When searching for a multi-byte delimiter (e.g., `\r\n`). This is my current understanding but ... I'm no expert in the Vector internals so I might be wrong.

1. `AsyncBufRead::fill_buf()` returns data from the internal buffer (it's a default `BufReader` so probably 8192 bytes)
2. If the buffer ends with `\r` and the next buffer starts with `\n`, the `Finder::find()` call doesn't find the complete `\r\n` delimiter
3. The old code would consume all bytes including the trailing `\r`
4. On the next iteration, the buffer starts with `\n`, which doesn't match `\r\n`
5. Result: The delimiter is never found, and content from both sides gets merged

### Reproduction

Create a file where a `\r\n` delimiter falls at byte positions that are multiples of 8192 (the default `BufReader` capacity):

Here is a repository with test data, a script to test it and example vector configuration showing the bug: https://github.com/lfrancke/vector-repro-24027

### Fix idea

Track partial delimiter matches across loop iterations using a buffer. The buffer is tiny (at most the length of the delimiter being used).

- Makes the logic slightly more complicated on buffer boundaries
- Should handle all edge cases on buffer boundaries
- Minimal performance impact as we only need to check once on a buffer boundary

#### Bad ideas (subjective :) )

- We could somehow continue reading when we are in the middle of a partial match but that's probably complex with `fill_buf` etc.
- Increase buffer size to make it less likely, also more memory usage

### Next steps

Unless anyone has a better idea I'd

- [DONE] ~Create a repository with a test script, test config and data generator to show the bug~
- [DONE] ~Create a PR implementing the state tracking approach (I have a working version here but it's littered with debug statements and not-nice code but it DOES work which means I'm reasonably confident that this is actually the bug~ - https://github.com/vectordotdev/vector/pull/24028

I have never looked into the Vector code in detail before so this PR definitely needs careful review so I don't forget any side-effects, also it's been a while since I've coded Rust in anger.

### Version

I tried it on 0.46.0, 0.50.0 and the current master branch

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Multi-byte line delimiters split across buffer boundaries cause log event merging #24027

A note for the community

Problem

Summary

Root Cause

Reproduction

Fix idea

Bad ideas (subjective :) )

Next steps

Version

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Multi-byte line delimiters split across buffer boundaries cause log event merging #24027

Description

A note for the community

Problem

Summary

Root Cause

Reproduction

Fix idea

Bad ideas (subjective :) )

Next steps

Version

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions