Closed
Labels: bug, good first issue, help wanted, parquet
Description
Describe the bug
The Arrow writer does not track memory size correctly: it appears to treat FixedSizeList columns as having a fixed memory usage. That is, the reported memory usage does not grow even though the buffers actually grow in memory.
To Reproduce
[package]
name = "repro"
version = "0.1.0"
edition = "2021"
[dependencies]
arrow = "53.3.0"
parquet = "53.3.0"
rand = "0.8.5"
use arrow::array::{FixedSizeListBuilder, UInt8Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use rand::Rng;
use std::fs::File;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define the field and schema for a single column that is a fixed-size list of u8 values.
    let list_length = 1_048_576;
    let field = Field::new(
        "mylist",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::UInt8, true)), list_length),
        true,
    );
    let schema = Arc::new(Schema::new(vec![field]));

    // Create a writer for the Parquet file
    let file = File::create("output_randomized.parquet")?;
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    let iterations = 10000;
    let values_per_batch = list_length;
    let mut list_arr_builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);

    for _ in 0..iterations {
        // Generate random data for the values array
        let mut rng = rand::thread_rng();
        let values: Vec<u8> = (0..values_per_batch)
            .map(|_| rng.gen())
            .collect();

        // Each batch holds a single row containing `list_length` u8 values.
        list_arr_builder.values().append_slice(&values);
        list_arr_builder.append(true);
        let output = list_arr_builder.finish();
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(output)])?;

        // Reported memory usage (in MiB) before and after writing the batch.
        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let before_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
        writer.write(&batch)?;
        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let after_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));

        // Positive when the reported usage grew during this write.
        let change_in_usage = after_in_memory_size_mb - before_in_memory_size_mb;
        dbg!(change_in_usage, after_in_memory_size_mb, before_in_memory_size_mb);
    }

    writer.close()?;
    Ok(())
}
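For a sense of scale, the raw value bytes the repro hands to the writer can be estimated with plain arithmetic. This is only a rough sketch: `raw_value_bytes` is a hypothetical helper, not part of arrow or parquet, and it ignores validity bitmaps, encoding, and compression, so it will not match what the writer reports exactly.

```rust
/// Hypothetical helper: raw bytes of child values in `rows` rows of a
/// FixedSizeList whose elements are `elem_size` bytes wide.
/// Ignores validity bitmaps, encoding, and compression.
fn raw_value_bytes(rows: u64, list_length: u64, elem_size: u64) -> u64 {
    rows * list_length * elem_size
}

fn main() {
    let list_length = 1_048_576; // same list length as the repro
    let per_batch = raw_value_bytes(1, list_length, 1); // one row of u8 values per batch
    let total = raw_value_bytes(10_000, list_length, 1); // all iterations of the loop
    println!("per batch: {} MiB", per_batch / (1024 * 1024)); // prints "per batch: 1 MiB"
    println!("total: {:.2} GiB", total as f64 / 1024f64.powi(3)); // prints "total: 9.77 GiB"
}
```

So each call to `write` adds roughly 1 MiB of raw values, which is the order of growth one would expect to see in the reported size.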
Expected behavior
The reported memory usage should rise as batches are written, drop to around zero when a flush is triggered, and then repeat that cycle.
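The expected pattern can be illustrated with a toy model. This is a sketch under stated assumptions, not the parquet implementation: `ToyWriter`, its fields, and the row-based `flush_rows` threshold are all hypothetical and exist only to show the rise-then-reset shape of the reported size.

```rust
/// Toy stand-in for a buffering writer; NOT the parquet `ArrowWriter`.
struct ToyWriter {
    buffered_bytes: usize,
    buffered_rows: usize,
    flush_rows: usize, // hypothetical flush threshold, in rows
}

impl ToyWriter {
    fn new(flush_rows: usize) -> Self {
        Self { buffered_bytes: 0, buffered_rows: 0, flush_rows }
    }

    /// Buffer one batch and return the size reported after the write.
    fn write(&mut self, rows: usize, bytes: usize) -> usize {
        self.buffered_bytes += bytes;
        self.buffered_rows += rows;
        if self.buffered_rows >= self.flush_rows {
            // Flush: buffered data leaves memory, so the reported size resets.
            self.buffered_bytes = 0;
            self.buffered_rows = 0;
        }
        self.buffered_bytes
    }
}

fn main() {
    const MIB: usize = 1_048_576;
    let mut writer = ToyWriter::new(4);
    // Write eight single-row batches of 1 MiB each and record the reported size in MiB.
    let sizes: Vec<usize> = (0..8).map(|_| writer.write(1, MIB) / MIB).collect();
    println!("{:?}", sizes); // prints "[1, 2, 3, 0, 1, 2, 3, 0]"
}
```

With the bug described above, the reported size would instead stay roughly flat between flushes.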