How to enable statistics for string columns? #5270
Perhaps you might provide a reproducer for this, or at the very least an example file exhibiting this property. I wonder if this might be a quirk of pyarrow... The following test passes
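As a minimal stand-alone illustration of what such a check looks like with parquet-rs (a hypothetical roundtrip sketch, not the test referenced above; it assumes the arrow-array, bytes, and parquet crates and writes an in-memory file with ArrowWriter using default writer properties):

```rust
use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, StringArray};
use bytes::Bytes;
use parquet::arrow::ArrowWriter;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;

// Hypothetical roundtrip: write a string column with default writer properties
// and assert that min/max statistics are readable from the resulting file.
fn roundtrip_string_statistics() {
    let strings: ArrayRef = Arc::new(StringArray::from(vec!["aaa", "zzz"]));
    let batch = RecordBatch::try_from_iter([("a", strings)]).unwrap();

    let mut buffer = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buffer, batch.schema(), None).unwrap();
    writer.write(&batch).unwrap();
    writer.close().unwrap();

    let reader = SerializedFileReader::new(Bytes::from(buffer)).unwrap();
    let stats = reader.metadata().row_group(0).column(0).statistics().unwrap();
    assert_eq!("aaa", std::str::from_utf8(stats.min_bytes()).unwrap());
    assert_eq!("zzz", std::str::from_utf8(stats.max_bytes()).unwrap());
}

fn main() {
    roundtrip_string_statistics();
    println!("string statistics roundtrip OK");
}
```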
I'm pretty sure it's not a quirk of PyArrow. I noticed that filtering with Polars/DuckDB on string columns of Parquet files created with the odbc2parquet tool is much slower than on files created with PyArrow.
Ok, so the statistics are present in the file.
So the question is now why pyarrow is unhappy with those statistics, I vaguely remember some bug/limitation in pyarrow related to this - let me see if I can dig it out.
This could be for a very large number of reasons. Edit: I can't find the exact issue I am looking for, but these are all related.
Perhaps @mapleFU you might have the relevant information to hand?
Hello @tustvold, I do not know if this helps to clear things up or causes more confusion, but here is my attempt to reproduce the issue directly:

```rust
#[test]
fn write_statistics_for_text_columns() {
    // Setup table for test (TableMssql and MSSQL are helpers from the
    // odbc2parquet integration test suite)
    let table_name = "WriteStatisticsForTextColumns";
    let mut table = TableMssql::new(table_name, &["VARCHAR(10)"]);
    table.insert_rows_as_text(&[["aaa"], ["zzz"]]);
    let query = format!("SELECT a FROM {table_name}");

    let command = Command::cargo_bin("odbc2parquet")
        .unwrap()
        .args([
            "query",
            "--connection-string",
            MSSQL,
            "-", // Use `-` to explicitly write to stdout
            &query,
        ])
        .assert()
        .success();

    // Then
    let bytes = Bytes::from(command.get_output().stdout.clone());
    let reader = SerializedFileReader::new(bytes).unwrap();
    let stats = reader.metadata().row_group(0).column(0).statistics().unwrap();
    assert_eq!("aaa", str::from_utf8(stats.min_bytes()).unwrap());
    assert_eq!("zzz", str::from_utf8(stats.max_bytes()).unwrap());
}
```

The above code executes and the test passes. This hints that, when reading the file with the Rust parquet crate, the statistics are present. Yet they seem to be written differently from what the Python stack expects.

Best, Markus
Yes, I think we need the python/arrow-cpp developers to weigh in here as to what is going on.
I think this is actually due to #5158 only having recently been merged in. Some reproduction code:

```rust
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use std::fs::File;

fn main() -> Result<()> {
    println!("checking file from issue");
    let no_stats_path = "/home/jeffrey/Downloads/no_stats.parquet";
    let file = File::open(no_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    println!("checking file after rewritten by pyarrow");
    let with_stats_path = "/home/jeffrey/Downloads/with_stats.parquet";
    let file = File::open(with_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    println!("rewriting file from issue with latest parquet-rs");
    let file = File::open(no_stats_path)?;
    let mut reader = ParquetRecordBatchReaderBuilder::try_new(file)?.build()?;
    let batch = reader.next().unwrap()?;
    let new_with_stats_path = "/home/jeffrey/Downloads/new_with_stats.parquet";
    let file = File::create(new_with_stats_path)?;
    let mut writer = ArrowWriter::try_new(file, batch.schema(), None)?;
    writer.write(&batch)?;
    writer.close()?;

    println!("checking file after rewritten by latest parquet-rs");
    let file = File::open(new_with_stats_path)?;
    let reader = SerializedFileReader::new(file)?;
    dbg!(reader.metadata().file_metadata().column_order(0));

    Ok(())
}
```

And the output:
We can see that the original file from the issue has an undefined column order. Now I run with the latest master of parquet-rs, rewriting that original file as was done for pyarrow, and can see that the column order is now defined. When I check this new Parquet file written by the latest parquet-rs master branch in pyarrow, the statistics are coming through now:

```
>>> pq.ParquetFile("/home/jeffrey/Downloads/new_with_stats.parquet").metadata.row_group(0).column(0)
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f2bf82d92b0>
file_offset: 61
file_path:
physical_type: BYTE_ARRAY
num_values: 1
path_in_schema: x
is_stats_set: True
statistics:
<pyarrow._parquet.Statistics object at 0x7f2bf82d94e0>
has_min_max: True
min: 01
max: 01
null_count: None
distinct_count: None
num_values: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
compression: UNCOMPRESSED
encodings: ('PLAIN', 'RLE', 'RLE_DICTIONARY')
has_dictionary_page: True
dictionary_page_offset: 4
data_page_offset: 24
total_compressed_size: 57
total_uncompressed_size: 57
>>>
```

Could you test again from the arrow-rs master branch, to see if this resolves the issue? This fix should come in arrow-rs release 50.0.0 (it is not included in 49.0.0, which is the current latest on crates.io); see the tracking issue for the release: #5234
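For anyone wanting to verify this locally, a minimal sketch along these lines (a hypothetical helper, assuming a parquet dependency at 50.0.0 or the git master branch) checks whether a file declares a type-defined column order, which is what pyarrow appears to key off in this thread before surfacing min/max statistics:

```rust
use parquet::basic::ColumnOrder;
use parquet::errors::Result;
use parquet::file::reader::FileReader;
use parquet::file::serialized_reader::SerializedFileReader;
use std::fs::File;

/// Hypothetical helper: returns true if the first column's order is declared as
/// TYPE_DEFINED_ORDER in the file metadata (rather than UNDEFINED).
fn column_order_is_defined(path: &str) -> Result<bool> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let order = reader.metadata().file_metadata().column_order(0);
    Ok(matches!(order, ColumnOrder::TYPE_DEFINED_ORDER(_)))
}

fn main() -> Result<()> {
    // Placeholder path; point this at a file written by the new version.
    println!("{}", column_order_is_defined("example.parquet")?);
    Ok(())
}
```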
I believe this is now fixed and released in arrow 50.0.0. Please feel free to reopen if I am mistaken.
Hello @tustvold, thanks for the information. I'll try to see if the behavior has now vanished in odbc2parquet.
Yep, I can now also read statistics in Python with the files written by odbc2parquet.
Thanks!
Describe the bug
I'm using the https://github.com/pacman82/odbc2parquet library that is based on this crate.
I observe that statistics like min/max are not written for string columns:
Relevant code: https://github.com/pacman82/odbc2parquet/blob/b571cad6fae1b58e1aab8348f14b32f20d6ec165/src/query/parquet_writer.rs#L47
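For context, statistics in parquet-rs are controlled through WriterProperties. The sketch below is generic (it is not the odbc2parquet code linked above); statistics are written by default, but the level can also be requested explicitly:

```rust
use parquet::file::properties::{EnabledStatistics, WriterProperties};

// Generic sketch, not odbc2parquet's actual configuration: request chunk-level
// statistics explicitly when building the writer properties.
fn writer_props_with_stats() -> WriterProperties {
    WriterProperties::builder()
        .set_statistics_enabled(EnabledStatistics::Chunk)
        .build()
}
```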
To Reproduce
Use odbc2parquet to download any table that contains a string column
Expected behavior
Should have min/max statistics.