**Describe the bug**
A recent comment regarding incrementing UTF-8 encoded characters got me wondering about the implementation in parquet-rs. It turns out that the `increment_utf8` function in the `parquet::column::writer` module isn't optimal.
Take for example the Unicode character U+00FF ('ÿ'), which is encoded as `0xC3BF`, or `0b11000011 0b10111111`. Incrementing this code point by one should yield U+0100 (`0xC480`, or 'Ā'). `increment_utf8`, however, will instead yield U+013F ('Ŀ'), due to how overflow is handled.
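At the byte level (a standalone sketch for illustration, not code from the parquet crate), the expected and actual results can be checked directly:

```rust
fn main() {
    // U+00FF 'ÿ' is encoded in UTF-8 as the two bytes 0xC3 0xBF.
    assert_eq!("\u{ff}".as_bytes(), b"\xC3\xBF");
    // Its successor U+0100 'Ā' is encoded as 0xC4 0x80.
    assert_eq!("\u{100}".as_bytes(), b"\xC4\x80");
    // The bytes produced by increment_utf8, 0xC4 0xBF, decode to U+013F 'Ŀ'.
    assert_eq!(std::str::from_utf8(b"\xC4\xBF").unwrap(), "\u{13f}");
}
```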
arrow-rs/parquet/src/column/writer/mod.rs, lines 1444 to 1451 in 08ac587:

```rust
let original = data[idx];
let (byte, overflow) = original.overflowing_add(1);
if !overflow {
    data[idx] = byte;
    if str::from_utf8(&data).is_ok() {
        return Some(data);
    }
    data[idx] = original;
}
```
What happens in this case is that the lower byte `0xBF` is incremented to `0xC0`, which is not a valid UTF-8 continuation byte. That byte is then left alone and the next higher byte `0xC3` is incremented to `0xC4`, yielding a final UTF-8 encoding of `0xC4BF`, which is Unicode U+013F. This is analogous to incrementing the high nybble of an ASCII character, for instance 'O' (`0x4F`) becoming '_' (`0x5F`) rather than 'P' (`0x50`).
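The failing path can be traced by hand (again a standalone sketch, mirroring the loop above):

```rust
let mut data = vec![0xC3u8, 0xBF];
// idx = 1: 0xBF + 1 = 0xC0, but [0xC3, 0xC0] is not valid UTF-8,
// so the byte is restored and the loop moves to the next higher byte.
data[1] = 0xC0;
assert!(std::str::from_utf8(&data).is_err());
data[1] = 0xBF;
// idx = 0: 0xC3 + 1 = 0xC4, and [0xC4, 0xBF] happens to be valid UTF-8,
// so it is returned even though it decodes to U+013F rather than U+0100.
data[0] = 0xC4;
assert_eq!(std::str::from_utf8(&data).unwrap(), "\u{13f}");
```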
The end result is still a valid upper bound, but it is not as tight as it could be, which can result in some row groups or pages being read when they don't actually need to be.
**To Reproduce**
Add the following to `test_increment_utf8` and it will pass:

```rust
let s = "\u{ff}\u{ff}";
let s1 = "\u{ff}\u{100}";
let s2 = "\u{ff}\u{13f}";
let v = increment_utf8(s.as_bytes().to_vec()).unwrap();
assert_eq!(v, s2.as_bytes());
assert_ne!(v, s1.as_bytes());
```
**Expected behavior**
The above test should fail, i.e. the incremented string should be "ÿĀ" rather than "ÿĿ".
**Additional context**
One possible fix would be to do something more akin to what is being done in apache/datafusion#12978, where the string data is first converted to `char`s, incremented, and then converted back to UTF-8. This can result in some characters increasing in size, which may not be desired here, considering we're trying to truncate to less than a certain number of bytes. Another alternative is to set overflowed continuation bytes to `0x80` (the minimum valid continuation byte) when moving on to the next higher byte. If the character is at a byte-width boundary, the bytes for that character can be set to `0` and the next higher character can be incremented. This would not increase the byte count, but can result in less optimal upper bounds (though still better than the current ones).
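For illustration, here is a minimal sketch of the first alternative (the function name and code are hypothetical, not the actual DataFusion or parquet implementation):

```rust
/// Increment a string to the next larger one by bumping the last
/// character to the next valid Unicode scalar value, carrying into the
/// preceding character on overflow.
fn increment_utf8_chars(s: &str) -> Option<String> {
    let mut chars: Vec<char> = s.chars().collect();
    while let Some(c) = chars.pop() {
        // Step over the surrogate gap U+D800..=U+DFFF, which `char`
        // cannot represent.
        let next = match c as u32 + 1 {
            0xD800 => 0xE000,
            n => n,
        };
        if let Some(next) = char::from_u32(next) {
            chars.push(next);
            return Some(chars.into_iter().collect());
        }
        // `c` was char::MAX: drop it and carry into the previous character.
    }
    None // every character was char::MAX
}
```

This yields `"\u{ff}\u{100}"` ("ÿĀ") for the example above, but it also shows the size growth mentioned earlier: incrementing `"\u{7f}"` (one byte) produces `"\u{80}"` (two bytes). Dropping an overflowed trailing character still gives a valid upper bound, since incrementing a more significant character dominates the byte-wise comparison.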