Reading UTF-8/JSON/ENUM field results in a lot of vec allocation #58

@alamb

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-7252

Reading a very large parquet file (430MB gzipped) in which basically all fields are strings was very slow. After profiling with OSX Instruments, I noticed that a lot of time is spent in "convert_byte_array", in particular in "reserving" and allocating via Vec::with_capacity, which is done before String::from_utf8_unchecked.

It seems that using String as the underlying storage is causing this (String uses a Vec for its underlying storage). It also requires copying from the slice into the Vec.
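To make the cost concrete, here is a minimal sketch contrasting the two paths. The function names `copy_to_string` and `borrow_as_str` are hypothetical and are not taken from the parquet crate; the sketch also uses the checked `String::from_utf8` rather than `from_utf8_unchecked` so that it stays free of `unsafe`.

```rust
use std::str::Utf8Error;

/// Hypothetical stand-in for the copying path: one heap allocation
/// plus one memcpy for every string value that is read.
fn copy_to_string(bytes: &[u8]) -> String {
    let mut buf = Vec::with_capacity(bytes.len()); // allocation per value
    buf.extend_from_slice(bytes); // copy from the slice into the Vec
    // Checked conversion (assumption: used here only to avoid `unsafe`).
    String::from_utf8(buf).expect("input must be valid UTF-8")
}

/// Hypothetical borrowing path: validates the bytes but neither
/// allocates nor copies; the result points into the input buffer.
fn borrow_as_str(bytes: &[u8]) -> Result<&str, Utf8Error> {
    std::str::from_utf8(bytes)
}

fn main() {
    // Pretend this is a decoded page holding two string values back to back.
    let page: &[u8] = b"helloworld";

    let owned = copy_to_string(&page[..5]);
    let borrowed = borrow_as_str(&page[5..]).expect("valid UTF-8");

    // The owned value lives in its own allocation ...
    assert_ne!(owned.as_ptr(), page.as_ptr());
    // ... while the borrowed value still points into the page buffer.
    assert_eq!(borrowed.as_ptr(), page[5..].as_ptr());

    println!("{owned} {borrowed}");
}
```

With millions of string values, the first path performs millions of small allocations, which would be consistent with the profile described above.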

"Field::Str" is part of a pub enum, so I am not sure how "refactorable" the String part is. One option would be converting it into a &str (we could then perhaps defer the conversion from &[u8] to Vec until the user really needs a String).

But of course, changing it to &str could require quite a few interface changes. So I am wondering whether there are already plans or a solution on the way to improve the handling of the "Field::Str" case?
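For discussion, a rough sketch of what deferring the conversion might look like. The type `StrRef` and its methods are hypothetical and do not exist in the parquet crate; the sketch only illustrates the idea of keeping the `&[u8]` and allocating a `String` on request.

```rust
use std::str::Utf8Error;

/// Hypothetical borrowed string value: it only keeps a reference to
/// the bytes in the decoded buffer and does not allocate on creation.
#[derive(Debug, Clone, Copy)]
struct StrRef<'a> {
    bytes: &'a [u8],
}

impl<'a> StrRef<'a> {
    fn new(bytes: &'a [u8]) -> Self {
        StrRef { bytes }
    }

    /// View the bytes as `&str` without copying (validates UTF-8).
    fn as_str(&self) -> Result<&'a str, Utf8Error> {
        std::str::from_utf8(self.bytes)
    }

    /// Allocate an owned `String` only when the caller asks for one.
    fn to_owned_string(&self) -> Result<String, Utf8Error> {
        self.as_str().map(str::to_owned)
    }
}

fn main() {
    let page: &[u8] = b"alphabeta";
    let values = [StrRef::new(&page[..5]), StrRef::new(&page[5..])];

    // Filtering or comparing needs no allocation at all.
    let hits = values
        .iter()
        .filter(|v| v.as_str().ok() == Some("beta"))
        .count();
    assert_eq!(hits, 1);

    // Only the value that is actually kept pays for a `String`.
    let kept: String = values[1].to_owned_string().expect("valid UTF-8");
    assert_eq!(kept, "beta");
}
```

The lifetime parameter is where the interface change shows up: a borrowed value cannot outlive the buffer it points into, so any public type holding it would need to carry that lifetime as well.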

Labels: parquet (Changes to the parquet crate)