ARROW-92: Arrow to Parquet Schema conversion #68
We'll have to make some decisions about type mappings. For example:
For List types, we should use the 3-level array encoding as described here: https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema/types.h#L49
What other parts do you think are underspecified?
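For reference, a minimal sketch of what the 3-level layout looks like when rendered as Parquet schema text. The helper `ThreeLevelListSchema` is hypothetical and not part of parquet-cpp; the `list` and `element` names are assumed from the Parquet format's LIST convention.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a parquet-cpp function): renders the 3-level LIST
// layout for a list column as Parquet schema text.
//   level 1: outer group annotated with LIST (carries list nullability)
//   level 2: repeated group named "list"
//   level 3: the "element" field (carries element nullability)
std::string ThreeLevelListSchema(const std::string& name,
                                 const std::string& element_type,
                                 bool list_nullable, bool element_nullable) {
  assert(!name.empty() && !element_type.empty());
  const std::string outer = list_nullable ? "optional" : "required";
  const std::string inner = element_nullable ? "optional" : "required";
  return outer + " group " + name + " (LIST) {\n"
         "  repeated group list {\n"
         "    " + inner + " " + element_type + " element;\n"
         "  }\n"
         "}\n";
}
```

Calling `ThreeLevelListSchema("values", "int32", true, true)` yields an optional outer group wrapping a repeated `list` group with an optional `int32 element`, which keeps "null list", "empty list", and "list containing nulls" distinguishable.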
For
I see. For decimals, I agree we either need multiple Arrow types or need to add metadata indicating the physical storage type to the DecimalType. I would say it's better to make this explicit in the Arrow data type; let me know what you think.
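One way to picture the "explicit in the data type" option is sketched below. All names here (`DecimalStorage`, `DecimalTypeSketch`, `NarrowestStorage`, `MakeDecimal`) are hypothetical and are not the Arrow or parquet-cpp API. The precision thresholds assume the Parquet DECIMAL rules, where int32 covers up to 9 digits and int64 up to 18.

```cpp
#include <cassert>

// Hypothetical names for illustration only; not the Arrow or parquet-cpp API.
enum class DecimalStorage { INT32, INT64, FIXED_LEN_BYTE_ARRAY };

// A decimal type that carries its physical storage explicitly.
struct DecimalTypeSketch {
  int precision;
  int scale;
  DecimalStorage storage;
};

// Picks the narrowest storage that can hold `precision` decimal digits.
DecimalStorage NarrowestStorage(int precision) {
  assert(precision >= 1);
  if (precision <= 9) return DecimalStorage::INT32;
  if (precision <= 18) return DecimalStorage::INT64;
  return DecimalStorage::FIXED_LEN_BYTE_ARRAY;
}

// Builds a decimal type whose storage is derived from its precision.
DecimalTypeSketch MakeDecimal(int precision, int scale) {
  assert(scale >= 0 && scale <= precision);
  return DecimalTypeSketch{precision, scale, NarrowestStorage(precision)};
}
```

With the storage recorded on the type, a schema converter would not have to guess which Parquet physical type a decimal column should use.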
Probably a simple
The PR is now in a state that provides a minimal schema conversion basis for Pandas<->Parquet.
  break;
case Type::CHAR:
  type = ParquetType::FIXED_LEN_BYTE_ARRAY;
  logical_type = LogicalType::UTF8;
Aside: we'll need to visit the string encoding question, as logical unicode characters won't map neatly onto a char(n) type
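To illustrate the mismatch, here is a small sketch showing that the number of Unicode code points in a UTF-8 string differs from its byte length, so a fixed byte width does not correspond to a fixed character count. The function names are hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a well-formed UTF-8 string by counting every
// byte that is not a continuation byte (continuation bytes look like 10xxxxxx).
std::size_t CountCodePoints(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}

// Upper bound on the bytes needed to store `num_chars` code points in UTF-8,
// since a single code point takes at most 4 bytes.
std::size_t MaxUtf8Bytes(std::size_t num_chars) {
  const std::size_t kMaxBytesPerCodePoint = 4;
  assert(num_chars <= static_cast<std::size_t>(-1) / kMaxBytesPerCodePoint);
  return num_chars * kMaxBytesPerCodePoint;
}
```

For example, the single character U+00E9 is encoded as the two bytes `"\xC3\xA9"`, so a char(1) value would not fit in a 1-byte FIXED_LEN_BYTE_ARRAY.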
This looks good apart from the exception handling question.
+1, thank you
* Fix missing set the include directory of gtest * Fix to use same format as other dependencies
Author: Aliaksei Sandryhaila <aliaksei.sandryhaila@hp.com> Closes apache#68 from asandryh/PARQUET-537 and squashes the following commits: 18dca87 [Aliaksei Sandryhaila] Added a unit test. dfb7a0b [Aliaksei Sandryhaila] PARQUET-537: Ensure that LocalFileSource is properly closed.
Author: Aliaksei Sandryhaila <aliaksei.sandryhaila@hp.com> Closes apache#68 from asandryh/PARQUET-537 and squashes the following commits: 18dca87 [Aliaksei Sandryhaila] Added a unit test. dfb7a0b [Aliaksei Sandryhaila] PARQUET-537: Ensure that LocalFileSource is properly closed. Change-Id: I9f2544a51e350464983f7ca511970b434d009f3a
* Offset buffer can be pre-grown in Parquet ByteArray reader * nit
* Initial commit * Introduce TranslateHolder * Remove unused header
* Add translate expression support (apache#68) * Initial commit * Introduce TranslateHolder * Remove unused header * Return 1 if empty string is given as substring (apache#69) * Add two math operations: floor & ceil (apache#72) * Inital commit * Add ceil function Co-authored-by: PHILO-HE <feilong.he@intel.com>
This is my current WIP state. To make the schema conversion complete, we probably need the physical structure too, since Arrow schemas only describe logical types whereas a Parquet schema describes both logical and physical types.
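The logical/physical split can be sketched as a mapping from one Arrow-side type id to a pair on the Parquet side. Everything below is a hypothetical illustration, not the code in this PR: the enums and `ToParquetFieldType` are made-up names, and only the CHAR case is taken from the diff under review. The function returns `false` for unsupported types instead of throwing, which is just one possible choice.

```cpp
#include <cassert>

// Hypothetical stand-ins for illustration; not the Arrow or parquet-cpp enums.
enum class ArrowTypeId { INT32, DOUBLE, STRING, CHAR, UNSUPPORTED };
enum class PhysicalType { INT32, DOUBLE, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };
enum class LogicalAnnotation { NONE, UTF8 };

// A Parquet field needs both pieces: how bytes are stored and what they mean.
struct ParquetFieldType {
  PhysicalType physical;
  LogicalAnnotation logical;
};

// Maps an Arrow-side type id to a (physical, logical) pair.
// Returns false and leaves *out untouched when no mapping is defined.
bool ToParquetFieldType(ArrowTypeId id, ParquetFieldType* out) {
  assert(out != nullptr);
  switch (id) {
    case ArrowTypeId::INT32:
      *out = {PhysicalType::INT32, LogicalAnnotation::NONE};
      return true;
    case ArrowTypeId::DOUBLE:
      *out = {PhysicalType::DOUBLE, LogicalAnnotation::NONE};
      return true;
    case ArrowTypeId::STRING:
      *out = {PhysicalType::BYTE_ARRAY, LogicalAnnotation::UTF8};
      return true;
    case ArrowTypeId::CHAR:
      // Mirrors the CHAR case in the diff: fixed-width bytes tagged as UTF8.
      *out = {PhysicalType::FIXED_LEN_BYTE_ARRAY, LogicalAnnotation::UTF8};
      return true;
    case ArrowTypeId::UNSUPPORTED:
      return false;
  }
  return false;
}
```

Note that STRING and CHAR share a logical annotation but differ in physical type, which is exactly the information a purely logical Arrow schema does not carry.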