This repository has been archived by the owner on Feb 18, 2024. It is now read-only.

Support more parquet encoding #274

Open
ives9638 opened this issue Aug 11, 2021 · 2 comments
Labels
question Further information is requested

Comments

@ives9638

At present, all data types are written with the plain encoding.

For example, arrow-rs has a DeltaBitPackEncoder:

https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding.rs#L624

In arrow2:

pub fn can_encode(data_type: &DataType, encoding: Encoding) -> bool {

Is there a plan to implement the full set of encodings?

@jorgecarleitao jorgecarleitao added the question Further information is requested label Aug 11, 2021
@jorgecarleitao
Owner

Thanks for the issue!

We have some encoders implemented in parquet2 here. arrow2 exposes them for some data types, since the encoding is an argument of the write APIs, like the function you pointed to.
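To illustrate the idea of gating encodings per data type, here is a minimal sketch. The `DataType` and `Encoding` enums below are simplified, hypothetical stand-ins rather than arrow2's real types, and the specific rules are assumptions for illustration only, not arrow2's actual table.

```rust
// Hypothetical, simplified stand-ins for arrow2's `DataType` and
// parquet's `Encoding`; the real enums have many more variants.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    Int32,
    Int64,
    Utf8,
    Binary,
}

#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Encoding {
    Plain,
    DeltaBinaryPacked,
    DeltaLengthByteArray,
}

/// Returns whether the writer is willing to write `data_type` with `encoding`.
/// The rules here are illustrative assumptions: plain is always allowed, and
/// delta-length byte array is allowed only for variable-length byte types.
pub fn can_encode(data_type: &DataType, encoding: Encoding) -> bool {
    matches!(
        (encoding, data_type),
        (Encoding::Plain, _)
            | (Encoding::DeltaLengthByteArray, DataType::Utf8 | DataType::Binary)
    )
}
```

With a gate like this, exposing a new encoding for writing amounts to adding one more pattern once its interoperability has been checked.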

I have not implemented the remaining ones, primarily because it has been difficult for me to find parquet readers that support them, which makes it hard to prove interoperability. For example,

  • pyarrow: does not support DeltaLengthByteArray yet (ARROW-13388)
  • pyarrow: reading dictionary-encoded data has been challenging (ARROW-13487)
  • spark: only the non-vectorized reader supports DeltaLengthByteArray (see here and this thread on parquet's mailing list)
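For context on the encoding named above: a delta-length byte array page stores all value lengths first, followed by the concatenated value bytes. The sketch below shows only that logical split and its inverse. The function names are hypothetical, and the delta bit-packing of the lengths that the real format applies is deliberately omitted.

```rust
/// Splits byte-array values into (lengths, concatenated bytes), the two
/// logical parts of a delta-length byte array page. In the real format the
/// lengths are additionally delta bit-packed; that step is omitted here.
pub fn split_lengths_and_data(values: &[&[u8]]) -> (Vec<i32>, Vec<u8>) {
    let lengths = values.iter().map(|v| v.len() as i32).collect();
    let data = values.iter().flat_map(|v| v.iter().copied()).collect();
    (lengths, data)
}

/// Inverse of `split_lengths_and_data`: rebuilds the individual values.
/// Assumes the lengths are non-negative and sum to `data.len()`.
pub fn join_lengths_and_data(lengths: &[i32], data: &[u8]) -> Vec<Vec<u8>> {
    let mut offset = 0usize;
    lengths
        .iter()
        .map(|&len| {
            let len = len as usize;
            let value = data[offset..offset + len].to_vec();
            offset += len;
            value
        })
        .collect()
}
```

A reader that does not know this layout cannot locate value boundaries at all, which is why an unsupported encoding makes the whole column unreadable.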

It is also a bit difficult for me to reproduce parquet's current behavior, because the parquet crate has no integration tests against e.g. pyarrow or spark. I.e. we have to trust that it is well implemented and that consumers can read what it writes.

Since parquet is a storage format, and not being able to read stored data is not a pleasant experience, I am defensive and require integration tests against at least one well-known consumer before exposing an encoding for writing.

Do you have a specific combination (parquet version, physical type, encoding) in mind that we should support?

@jorgecarleitao jorgecarleitao changed the title For parquet encoding Support more parquet encoding Aug 11, 2021
@ives9638
Author

First of all, thank you very much for replying.
