Description
A note for the community
- Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
- If you are interested in working on this issue or have submitted a pull request, please leave a comment
Problem
Summary
When using \r\n (or any other multi-char delimiter) as the line_delimiter in the file source, Vector can incorrectly merge separate log events into a single event when the \r\n delimiter happens to be split across a buffer boundary (\r is the last byte of one buffer, \n is the first byte of the next, so the pair is not recognized as a delimiter). This causes seemingly random parsing failures, and it was a pain to track down :)
Root Cause
The bug is in lib/file-source-common/src/buffer.rs in the read_until_with_max_size function, in the way it searches for a multi-byte delimiter (e.g., \r\n). This is my current understanding, but I'm no expert in the Vector internals so I might be wrong.
- AsyncBufRead::fill_buf() returns data from the internal buffer (it's a default BufReader so probably 8192 bytes)
- If the buffer ends with \r and the next buffer starts with \n, the Finder::find() call doesn't find the complete \r\n delimiter
- The old code would consume all bytes including the trailing \r
- On the next iteration, the buffer starts with \n, which doesn't match \r\n
- Result: The delimiter is never found, and content from both sides gets merged
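The failure mode described above can be shown in isolation with a minimal sketch. This is not Vector's code: `split_naive` and `find` are hypothetical helpers written only for illustration, the chunks stand in for successive fill_buf() results, and a plain std search replaces Finder::find().

```rust
/// Naive substring search using only std (stand-in for a real finder).
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return None;
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Deliberately buggy: each chunk is searched in isolation, and when no
/// delimiter is found the whole chunk is consumed, including a trailing '\r'.
fn split_naive(chunks: &[&[u8]], delim: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    let mut current = Vec::new();
    for chunk in chunks {
        let mut rest: &[u8] = chunk;
        while let Some(pos) = find(rest, delim) {
            current.extend_from_slice(&rest[..pos]);
            records.push(std::mem::take(&mut current));
            rest = &rest[pos + delim.len()..];
        }
        // No (further) delimiter in this chunk: everything is appended to
        // the record in progress, so a split "\r" + "\n" is never matched.
        current.extend_from_slice(rest);
    }
    if !current.is_empty() {
        records.push(current);
    }
    records
}
```

Feeding it `aaa\r\nbbb\r\nccc` in 4-byte chunks puts the first \r at the end of one chunk and the \n at the start of the next, and the first two records come back merged into one.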
Reproduction
Create a file where a \r\n delimiter straddles a byte offset that is a multiple of 8192 (the default BufReader capacity), so that \r is the last byte of one buffer and \n is the first byte of the next.
Here is a repository with test data, a script to test it and example vector configuration showing the bug: https://github.com/lfrancke/vector-repro-24027
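For a quick local check without the repository, test data of this shape can be generated with a few lines. This is a rough sketch, not the generator from the repository: `build_repro` and `ASSUMED_CAPACITY` are hypothetical names, and it assumes reads start at offset 0 and each buffer fill returns a full 8192 bytes.

```rust
/// Assumed buffer capacity; 8192 is the figure quoted above for the
/// default BufReader and may not match every build or code path.
const ASSUMED_CAPACITY: usize = 8192;

/// Builds test data whose first "\r\n" straddles `capacity`:
/// '\r' is the last byte of the first buffer, '\n' the first byte of the second.
fn build_repro(capacity: usize) -> Vec<u8> {
    assert!(capacity >= 1);
    // First record, padded so that it ends one byte before the boundary.
    let mut data = vec![b'a'; capacity - 1];
    // Delimiter split across the boundary.
    data.extend_from_slice(b"\r\n");
    // A second record that should arrive as a separate event.
    data.extend_from_slice(b"second record\r\n");
    data
}
```

Writing the result to disk, for example with `std::fs::write("repro.log", build_repro(ASSUMED_CAPACITY))` (the file name is arbitrary), gives a file that can be pointed at by a file source configured with \r\n as the line_delimiter.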
Fix idea
Track partial delimiter matches across loop iterations using a buffer. The buffer is tiny (at most the length of the delimiter being used).
- Makes the logic slightly more complicated on buffer boundaries
- Should handle all edge cases on buffer boundaries
- Minimal performance impact as we only need to check once on a buffer boundary
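The idea can be sketched roughly as follows. This is not the code from the PR: `split_with_carry`, `partial_match_len` and `find` are hypothetical names, the chunks stand in for successive fill_buf() results, and the real function also has to deal with async I/O and the max-size limit, which are left out here. The carry buffer holds the trailing bytes of the previous chunk that could be the start of a delimiter, so it never grows beyond the delimiter length minus one.

```rust
/// Naive substring search using only std (stand-in for a real finder).
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return None;
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Length of the longest suffix of `data` that is a proper prefix of `delim`.
fn partial_match_len(data: &[u8], delim: &[u8]) -> usize {
    let max = data.len().min(delim.len().saturating_sub(1));
    (1..=max)
        .rev()
        .find(|&k| data.ends_with(&delim[..k]))
        .unwrap_or(0)
}

/// Splits a stream delivered in chunks, withholding a possible delimiter
/// prefix at the end of each chunk until the next chunk has been seen.
fn split_with_carry(chunks: &[&[u8]], delim: &[u8]) -> Vec<Vec<u8>> {
    let mut records = Vec::new();
    let mut current = Vec::new(); // record in progress
    let mut carry: Vec<u8> = Vec::new(); // withheld partial delimiter
    for chunk in chunks {
        // Search the withheld bytes together with the new chunk.
        let mut window = std::mem::take(&mut carry);
        window.extend_from_slice(chunk);
        let mut rest = window.as_slice();
        while let Some(pos) = find(rest, delim) {
            current.extend_from_slice(&rest[..pos]);
            records.push(std::mem::take(&mut current));
            rest = &rest[pos + delim.len()..];
        }
        // Keep back any trailing bytes that might begin a delimiter.
        let keep = partial_match_len(rest, delim);
        let split = rest.len() - keep;
        current.extend_from_slice(&rest[..split]);
        carry = rest[split..].to_vec();
    }
    // End of input: withheld bytes were ordinary data after all.
    current.extend_from_slice(&carry);
    if !current.is_empty() {
        records.push(current);
    }
    records
}
```

With `aaa\r\nbbb\r\nccc` delivered in 4-byte chunks, the \r at the end of the first chunk is held back and matched together with the \n from the second, so all three records come out separately. The sketch copies the carry and the chunk into a temporary window for simplicity; a real implementation would only need to do extra work at the boundary itself.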
Bad ideas (subjective :) )
- We could somehow continue reading when we are in the middle of a partial match but that's probably complex with fill_buf etc.
- Increase buffer size to make it less likely, also more memory usage
Next steps
Unless anyone has a better idea I'd:
- [DONE] Create a repository with a test script, test config and data generator to show the bug
- [DONE] Create a PR implementing the state tracking approach (I have a working version here, but it's littered with debug statements and not-nice code; it DOES work, though, which means I'm reasonably confident that this is actually the bug)
  - fix(file source) Fix a data corruption bug with multi-char delimiters #24028
I have never looked into the Vector code in detail before, so this PR definitely needs careful review to make sure I haven't missed any side effects. It's also been a while since I've coded Rust in anger.
Version
I tried it on 0.46.0, 0.50.0 and the current master branch