How to enable statistics for string columns? #5270
Perhaps you might provide a reproducer for this, or at the very least an example file exhibiting this property. I wonder if this might be a quirk of pyarrow... The following test passes
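As a minimal stand-alone illustration of what such a check looks like with parquet-rs (a hypothetical roundtrip sketch, not the test referenced above; it assumes the arrow-array, bytes, and parquet crates and writes an in-memory file with ArrowWriter using default writer properties):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, StringArray};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;

// Hypothetical roundtrip: write a string column with default writer properties
// and assert that min/max statistics are readable from the resulting file.
fn roundtrip_string_statistics() {
    let strings: ArrayRef = Arc::new(StringArray::from(vec!["aaa", "zzz"]));
    let batch = RecordBatch::try_from_iter([("a", strings)]).unwrap();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    let reader = SerializedFileReader::new(Bytes::from(buffer)).unwrap();
    let stats = reader.metadata().row_group(0).column(0).statistics().unwrap();
    assert_eq!("aaa", std::str::from_utf8(stats.min_bytes()).unwrap());
    assert_eq!("zzz", std::str::from_utf8(stats.max_bytes()).unwrap());
}

fn main() {
    roundtrip_string_statistics();
    println!("string statistics roundtrip OK");
}
```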
I'm pretty sure it's not a quirk of PyArrow. I noticed that filtering with Polars/DuckDB on string columns of Parquet files created with the odbc2parquet tool is much slower than on files created with PyArrow.
Ok, so the statistics are present in the file.
So the question is now why pyarrow is unhappy with those statistics, I vaguely remember some bug/limitation in pyarrow related to this - let me see if I can dig it out.
This could be for a very large number of reasons. Edit: I can't find the exact issue I am looking for, but these are all related.
Perhaps @mapleFU you might have the relevant information to hand?
Hello @tustvold, I do not know if this helps to clear things up or causes more confusion, but here is my attempt to reproduce the issue directly:

```rust
#[test]
fn write_statistics_for_text_columns() {
    // Setup table for test (TableMssql and MSSQL are helpers from the
    // odbc2parquet integration test suite)
    let table_name = "WriteStatisticsForTextColumns";
    let mut table = TableMssql::new(table_name, &["VARCHAR(10)"]);
    table.insert_rows_as_text(&[["aaa"], ["zzz"]]);
    let query = format!("SELECT a FROM {table_name}");

    let command = Command::cargo_bin("odbc2parquet")
        .unwrap()
        .args([
            "query",
            "--connection-string",
            MSSQL,
            "-", // Use `-` to explicitly write to stdout
            &query,
        ])
        .assert()
        .success();

    // Then
    let bytes = Bytes::from(command.get_output().stdout.clone());
    let reader = SerializedFileReader::new(bytes).unwrap();
    let stats = reader.metadata().row_group(0).column(0).statistics().unwrap();
    assert_eq!("aaa", str::from_utf8(stats.min_bytes()).unwrap());
    assert_eq!("zzz", str::from_utf8(stats.max_bytes()).unwrap());
}
```

The above code executes and the test passes. This hints that, when reading the file with the Rust parquet crate, the statistics are present. Yet they seem to be written differently from what the Python stack expects.

Best, Markus
Yes, I think we need the python/arrow-cpp developers to weigh in here as to what is going on.
I think this is actually due to #5158 only having recently been merged in. Some reproduction code:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use std::fs::File;

fn main() -> Result<()> {
    println!("checking file from issue");
    let no_stats_path = "/home/jeffrey/Downloads/no_stats.parquet";
    let file = File::open(no_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    println!("checking file after rewritten by pyarrow");
    let with_stats_path = "/home/jeffrey/Downloads/with_stats.parquet";
    let file = File::open(with_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    println!("rewriting file from issue with latest parquet-rs");
    let file = File::open(no_stats_path)?;
    let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let batch = reader.next().unwrap()?;
    let new_with_stats_path = "/home/jeffrey/Downloads/new_with_stats.parquet";
    let file = File::create(new_with_stats_path)?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    println!("checking file after rewritten by latest parquet-rs");
    let file = File::open(new_with_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    Ok(())
}
```

And the output:
We can see that the original file from the issue has an undefined column order. Now I run with the latest master of parquet-rs, rewriting that original file as was done for pyarrow, and can see that the column order is now defined. When I check this new Parquet file written by the latest parquet-rs master branch in pyarrow, the statistics are coming through now:

```
>>> pq.ParquetFile("/home/jeffrey/Downloads/new_with_stats.parquet").metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f2bf82d92b0>
file_offset: 61
file_path:
physical_type: BYTE_ARRAY
num_values: 1
path_in_schema: x
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x7f2bf82d94e0>
has_min_max: True
min: 01
max: 01
null_count: None
distinct_count: None
num_values: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: UNCOMPRESSED
encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 24
total_compressed_size: 57
total_uncompressed_size: 57
>>>
```

Could you test again from the arrow-rs master branch, to see if this resolves the issue? This fix should come in arrow-rs release 50.0.0 (it is not included in 49.0.0, which is the current latest on crates.io); see the tracking issue for the release: #5234
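For anyone wanting to verify this locally, a minimal sketch along these lines (a hypothetical helper, assuming a parquet dependency at 50.0.0 or the git master branch) checks whether a file declares a type-defined column order, which is what pyarrow appears to key off in this thread before surfacing min/max statistics:

```rust
use parquet::basic::ColumnOrder;
use parquet::errors::Result;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use std::fs::File;

/// Hypothetical helper: returns true if the first column's order is declared as
/// TYPE_DEFINED_ORDER in the file metadata (rather than UNDEFINED).
fn column_order_is_defined(path: &str) -> Result<bool> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let order = reader.metadata().file_metadata().column_order(0);
    Ok(matches!(order, ColumnOrder::TYPE_DEFINED_ORDER(_)))
}

fn main() -> Result<()> {
    // Placeholder path; point this at a file written by the new version.
    println!("{}", column_order_is_defined("example.parquet")?);
    Ok(())
}
```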
I believe this is now fixed and released in arrow 50.0.0. Please feel free to reopen if I am mistaken.
Hello @tustvold, thanks for the information. I'll try to see if the behavior has now vanished in odbc2parquet.
Yep, I can now also read statistics in Python with the files written by odbc2parquet.
Thanks!
Describe the bug
I'm using the https://github.com/pacman82/odbc2parquet library that is based on this crate.
I observe that statistics like min/max are not written for string columns:
Relevant code: https://github.com/pacman82/odbc2parquet/blob/b571cad6fae1b58e1aab8348f14b32f20d6ec165/src/query/parquet_writer.rs#L47
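For context, statistics in parquet-rs are controlled through WriterProperties. The sketch below is generic (it is not the odbc2parquet code linked above); statistics are written by default, but the level can also be requested explicitly:

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

// Generic sketch, not odbc2parquet's actual configuration: request chunk-level
// statistics explicitly when building the writer properties.
fn writer_props_with_stats() -> WriterProperties {
    WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .build()
}
```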
To Reproduce
Use odbc2parquet to download any table that contains a string column
Expected behavior
Should have min/max statistics.