**Describe the bug**
A recent comment regarding incrementing UTF-8 encoded characters got me wondering about the implementation in parquet-rs. It turns out that the `increment_utf8` function in the `parquet::column::writer` module isn't optimal.
Take for example the Unicode character U+00FF ('ÿ'), which is encoded as `0xC3BF`, or `0b11000011 0b10111111`. Incrementing this code point by one should yield U+0100 (`0xC480`, or 'Ā'). `increment_utf8`, however, will instead yield U+013F ('Ŀ'), due to how overflow is handled.
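At the byte level (a standalone sketch for illustration, not code from the parquet crate), the expected and actual results can be checked directly:

```rust
fn main() {
    // U+00FF 'ÿ' is encoded in UTF-8 as the two bytes 0xC3 0xBF.
    assert_eq!("\u{ff}".as_bytes(), b"\xC3\xBF");
    // Its successor U+0100 'Ā' is encoded as 0xC4 0x80.
    assert_eq!("\u{100}".as_bytes(), b"\xC4\x80");
    // The bytes produced by increment_utf8, 0xC4 0xBF, decode to U+013F 'Ŀ'.
    assert_eq!(std::str::from_utf8(b"\xC4\xBF").unwrap(), "\u{13f}");
}
```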
arrow-rs/parquet/src/column/writer/mod.rs, lines 1444 to 1451 in 08ac587:

```rust
let original = data[idx];
let (byte, overflow) = original.overflowing_add(1);
if !overflow {
    data[idx] = byte;
    if str::from_utf8(&data).is_ok() {
        return Some(data);
    }
    data[idx] = original;
}
```
What happens in this case is that the lower byte `0xBF` is incremented to `0xC0`, which is not a valid UTF-8 continuation byte. That byte is then left alone and the next higher byte `0xC3` is incremented to `0xC4`, yielding a final UTF-8 encoding of `0xC4BF`, which is Unicode U+013F. This is analogous to incrementing the high nybble of an ASCII character, for instance 'O' (`0x4F`) becoming '_' (`0x5F`) rather than 'P' (`0x50`).
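The failing path can be traced by hand (again a standalone sketch, mirroring the loop above):

```rust
let mut data = vec![0xC3u8, 0xBF];
// idx = 1: 0xBF + 1 = 0xC0, but [0xC3, 0xC0] is not valid UTF-8,
// so the byte is restored and the loop moves to the next higher byte.
data[1] = 0xC0;
assert!(std::str::from_utf8(&data).is_err());
data[1] = 0xBF;
// idx = 0: 0xC3 + 1 = 0xC4, and [0xC4, 0xBF] happens to be valid UTF-8,
// so it is returned even though it decodes to U+013F rather than U+0100.
data[0] = 0xC4;
assert_eq!(std::str::from_utf8(&data).unwrap(), "\u{13f}");
```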
The end result is still a valid upper bound, but it is not as tight as it could be, which can result in some row groups or pages being read when they don't actually need to be.
**To Reproduce**
Add the following to `test_increment_utf8` and it will pass:

```rust
let s = "\u{ff}\u{ff}";
let s1 = "\u{ff}\u{100}";
let s2 = "\u{ff}\u{13f}";
let v = increment_utf8(s.as_bytes().to_vec()).unwrap();
assert_eq!(v, s2.as_bytes());
assert_ne!(v, s1.as_bytes());
```
**Expected behavior**
The above test should fail, i.e. the incremented string should be "ÿĀ" rather than "ÿĿ".
**Additional context**
One possible fix would be to do something more akin to what is being done in apache/datafusion#12978, where the string data is first converted to `char`s, incremented, and then converted back to UTF-8. This can result in some characters increasing in size, which may not be desired here, considering we're trying to truncate to less than a certain number of bytes. Another alternative is to set overflowed continuation bytes to `0x80` (the minimum valid continuation byte) when moving on to the next higher byte. If the character is at a byte-width boundary, the bytes for that character can be set to `0` and the next higher character can be incremented. This would not increase the byte count, but can result in less optimal upper bounds (though still better than the current ones).
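For illustration, here is a minimal sketch of the first alternative (the function name and code are hypothetical, not the actual DataFusion or parquet implementation):

```rust
/// Increment a string to the next larger one by bumping the last
/// character to the next valid Unicode scalar value, carrying into the
/// preceding character on overflow.
fn increment_utf8_chars(s: &str) -> Option<String> {
    let mut chars: Vec<char> = s.chars().collect();
    while let Some(c) = chars.pop() {
        // Step over the surrogate gap U+D800..=U+DFFF, which `char`
        // cannot represent.
        let next = match c as u32 + 1 {
            0xD800 => 0xE000,
            n => n,
        };
        if let Some(next) = char::from_u32(next) {
            chars.push(next);
            return Some(chars.into_iter().collect());
        }
        // `c` was char::MAX: drop it and carry into the previous character.
    }
    None // every character was char::MAX
}
```

This yields `"\u{ff}\u{100}"` ("ÿĀ") for the example above, but it also shows the size growth mentioned earlier: incrementing `"\u{7f}"` (one byte) produces `"\u{80}"` (two bytes). Dropping an overflowed trailing character still gives a valid upper bound, since incrementing a more significant character dominates the byte-wise comparison.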