[C++, Parquet] dictionary(..., large_string) type not preserved when writing to Parquet #37875

Open
mattaubury opened this issue Sep 26, 2023 · 3 comments

Comments

@mattaubury

mattaubury commented Sep 26, 2023

(tested on pyarrow-13.0.0, Linux x64)

When a dictionary-encoded column with large_string values is written to a Parquet file and read back, the dictionary values come back as plain string rather than large_string.

Repro in Python but I see the same in C++:

>>> import pyarrow as pa
>>> import pyarrow.compute as pc
>>> import pyarrow.parquet as pq

>>> strings = pc.dictionary_encode(pa.array(["foo", "bar", "foo"], pa.large_string()))
>>> table = pa.table([strings], ["strings"])
>>> table.schema
strings: dictionary<values=large_string, indices=int32, ordered=0>

>>> pq.write_table(table, "table.parquet")
>>> pq.read_table("table.parquet").schema
strings: dictionary<values=string, indices=int32, ordered=0>

I'd expect to get it back as a dictionary<values=large_string, indices=int32, ordered=0> type.

Component(s)

C++, Parquet, Python

@mapleFU
Member

mapleFU commented Sep 26, 2023

Currently we're not able to read large dictionaries; the patch for that, #35825, has not been merged yet.

@mattaubury
Author

> Currently we're not able to read large dictionaries; the patch for that, #35825, has not been merged yet.

Hmm, it seems to read fine, just as a non-large string.

@mapleFU
Copy link
Member

mapleFU commented Nov 19, 2023

@mattaubury Internally, Parquet doesn't regard dictionary as a "type"; it regards it as an "encoding". That is to say, writing a dictionary array to Parquet doesn't mean the column's encoding is dictionary. On read, because the patch is still unmerged, we're currently unable to read a large-binary dictionary from Parquet.
