parquet arrow writer doesn't track memory size correctly for fixed sized lists #6839

@kszlim

Describe the bug
The arrow writer doesn't track memory size correctly: it seems to treat FixedSizeList columns as having a fixed memory usage. I.e., the reported memory usage (`memory_size()` + `in_progress_size()`) doesn't grow even though the buffers are actually growing in memory.
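
For a rough sense of scale: each row of a FixedSizeList of 1_048_576 UInt8 items carries about 1 MiB of raw values, so I'd expect the reported size to grow by something on that order per buffered row. The sketch below is just that arithmetic. `expected_raw_bytes` and `to_mib` are made-up helpers (not part of arrow or parquet), and the estimate ignores validity bitmaps, page encoding, and compression.

```rust
/// Hypothetical helper (not an arrow/parquet API): raw value bytes held by
/// `rows` rows of a fixed-size list column.
/// Assumption: ignores validity bitmaps, page encoding, and compression.
fn expected_raw_bytes(list_length: usize, bytes_per_item: usize, rows: usize) -> usize {
    list_length * bytes_per_item * rows
}

/// Convert a byte count to MiB, matching the units printed by the repro.
fn to_mib(bytes: usize) -> f64 {
    bytes as f64 / (1024.0 * 1024.0)
}
```

With the numbers used in the repro (`list_length = 1_048_576`, one byte per item, one row per batch), this gives roughly 1 MiB per `write` call.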

To Reproduce

Cargo.toml:

[package]
name = "repro"
version = "0.1.0"
edition = "2021"

[dependencies]
arrow = "53.3.0"
parquet = "53.3.0"
rand = "0.8.5"

src/main.rs:

use arrow::array::{FixedSizeListBuilder, UInt8Builder};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;
use rand::Rng;
use std::fs::File;
use std::sync::Arc;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Define the field and schema for a single column that is a fixed-size list of UInt8 values.
    let list_length = 1_048_576;
    let field = Field::new(
        "mylist",
        DataType::FixedSizeList(Arc::new(Field::new("item", DataType::UInt8, true)), list_length),
        true,
    );
    let schema = Arc::new(Schema::new(vec![field]));

    // Create a writer for the Parquet file
    let file = File::create("output_randomized.parquet")?;
    let props = WriterProperties::builder().build();
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    let iterations = 10000;
    let values_per_batch = list_length;

    let mut list_arr_builder = FixedSizeListBuilder::new(UInt8Builder::new(), list_length);
    for _ in 0..iterations {
        // Generate random data for the values array
        let mut rng = rand::thread_rng();
        let values: Vec<u8> = (0..values_per_batch)
            .map(|_| rng.gen())
            .collect();

        list_arr_builder.values().append_slice(&values);
        list_arr_builder.append(true);
        let output = list_arr_builder.finish();
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(output)])?;
        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let before_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
        writer.write(&batch)?;
        let in_memory_size = writer.memory_size() + writer.in_progress_size();
        let after_in_memory_size_mb = (in_memory_size as f64) / (1024f64.powi(2));
        // Positive when the reported usage grew across this write.
        let change_in_usage = after_in_memory_size_mb - before_in_memory_size_mb;
        dbg!(change_in_usage, after_in_memory_size_mb, before_in_memory_size_mb);
    }

    writer.close()?;

    Ok(())
}

Expected behavior
We should see the reported memory usage rise as batches are written, drop to around zero when a flush is triggered, and then repeat.
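
To illustrate the sawtooth shape I mean, here is a toy model. It is not the parquet writer's real accounting: `ToyTracker`, `simulate`, and the byte-based `flush_threshold` are all hypothetical stand-ins for whatever actually triggers a flush.

```rust
/// Toy model of the expected behavior. NOT the parquet writer's real logic.
struct ToyTracker {
    /// Bytes currently buffered and not yet flushed.
    buffered: usize,
    /// Hypothetical threshold at which buffered data is flushed.
    flush_threshold: usize,
}

impl ToyTracker {
    fn new(flush_threshold: usize) -> Self {
        Self { buffered: 0, flush_threshold }
    }

    /// Record a write of `bytes` and return the size reported afterwards.
    /// Once the threshold is reached, everything is flushed and the
    /// reported size drops back to zero.
    fn write(&mut self, bytes: usize) -> usize {
        self.buffered += bytes;
        if self.buffered >= self.flush_threshold {
            self.buffered = 0;
        }
        self.buffered
    }
}

/// Reported size after each of `batches` writes of `batch_bytes` each.
fn simulate(batch_bytes: usize, flush_threshold: usize, batches: usize) -> Vec<usize> {
    let mut tracker = ToyTracker::new(flush_threshold);
    (0..batches).map(|_| tracker.write(batch_bytes)).collect()
}
```

The point is only the shape: a steady climb between flushes followed by a drop, rather than the flat line the repro reports.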
