ARROW-92: Arrow to Parquet Schema conversion #68

Closed
wants to merge 6 commits into from


xhochy
Member

@xhochy xhochy commented Apr 24, 2016

This is my current WIP state. To make the schema conversion complete, we probably need the physical structure too, as Arrow schemas only describe logical types whereas a Parquet schema describes both logical and physical types.

@wesm
Member

wesm commented Apr 24, 2016

We'll have to make some decisions about type mappings. For example:

  • arrow::StringType becomes BYTE_ARRAY with UTF8 annotation
  • arrow::BinaryType (needs to be implemented) becomes BYTE_ARRAY with no ConvertedType
  • arrow::CharType (if ever used, we can skip it for now) becomes FIXED_LEN_BYTE_ARRAY
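
As a rough illustration of the three mappings listed above, here is a minimal, self-contained sketch. The enums and the `ToParquetLeaf` helper are hypothetical stand-ins invented for this example; they are not the actual arrow:: or parquet:: classes. The list above does not say which annotation a char(n) column should carry, so the sketch leaves it unannotated as an assumption.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration only; NOT the real arrow:: or
// parquet:: types.
enum class ArrowKind { STRING, BINARY, CHAR };
enum class PhysicalType { BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY };
enum class Annotation { NONE, UTF8 };

struct ParquetLeaf {
  PhysicalType physical;
  Annotation annotation;
  int32_t type_length;  // only meaningful for FIXED_LEN_BYTE_ARRAY, else -1
};

// Maps the three byte-like Arrow kinds from the list above.
// `char_width` is the n in char(n); it is ignored for the other kinds.
inline ParquetLeaf ToParquetLeaf(ArrowKind kind, int32_t char_width = -1) {
  switch (kind) {
    case ArrowKind::STRING:
      return {PhysicalType::BYTE_ARRAY, Annotation::UTF8, -1};
    case ArrowKind::BINARY:
      return {PhysicalType::BYTE_ARRAY, Annotation::NONE, -1};
    case ArrowKind::CHAR:
      assert(char_width > 0 && "char(n) needs a positive width");
      // Assumption: no annotation, since the list above does not name one.
      return {PhysicalType::FIXED_LEN_BYTE_ARRAY, Annotation::NONE, char_width};
  }
  // Not reachable for valid enum values; keeps compilers quiet.
  return {PhysicalType::BYTE_ARRAY, Annotation::NONE, -1};
}
```

For example, `ToParquetLeaf(ArrowKind::CHAR, 8)` yields a FIXED_LEN_BYTE_ARRAY leaf with `type_length == 8`.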

For List types, we should use the 3-level array encoding as described here https://github.com/apache/parquet-cpp/blob/master/src/parquet/schema/types.h#L49
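
To make the 3-level shape concrete, the sketch below renders the schema text for a list column: an outer group annotated as LIST, a repeated middle group, and the element field. The function name `ThreeLevelList` and the node names `list` and `element` are assumptions taken from the naming used in the Parquet format documentation; the linked header may use different names.

```cpp
#include <string>

enum class Repetition { REQUIRED, OPTIONAL, REPEATED };

inline std::string RepetitionName(Repetition r) {
  switch (r) {
    case Repetition::REQUIRED: return "required";
    case Repetition::OPTIONAL: return "optional";
    case Repetition::REPEATED: return "repeated";
  }
  return "";
}

// Renders the three levels of a list column as schema text (hypothetical
// helper for illustration):
//   level 1: the list itself (nullable or not), annotated LIST
//   level 2: a repeated group holding one entry per element
//   level 3: the element, which may itself be nullable
inline std::string ThreeLevelList(const std::string& name, bool list_nullable,
                                  const std::string& element_type,
                                  bool element_nullable) {
  const Repetition outer =
      list_nullable ? Repetition::OPTIONAL : Repetition::REQUIRED;
  const Repetition inner =
      element_nullable ? Repetition::OPTIONAL : Repetition::REQUIRED;
  return RepetitionName(outer) + " group " + name + " (LIST) {\n" +
         "  repeated group list {\n" +
         "    " + RepetitionName(inner) + " " + element_type + " element;\n" +
         "  }\n" +
         "}\n";
}
```

Keeping the middle repeated group separate from the element is what lets a null list, an empty list, and a list containing null elements be told apart.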

What other parts do you think are underspecified?

@xhochy
Member Author

xhochy commented Apr 25, 2016

For Decimal, we need to decide whether using the smallest possible physical type is the correct strategy.
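
For reference, one way the "smallest possible physical type" strategy could look is sketched below. It relies on the Parquet rule that a DECIMAL may annotate INT32 for precision up to 9 and INT64 for precision up to 18, and falls back to a fixed-length byte array otherwise. The type and function names are hypothetical, and precision is assumed to be at least 1 (no validation is done).

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical stand-in for the Parquet physical types relevant to decimals.
enum class DecimalPhysical { INT32, INT64, FIXED_LEN_BYTE_ARRAY };

struct DecimalStorageChoice {
  DecimalPhysical physical;
  int32_t byte_width;  // bytes per stored value
};

// Smallest two's-complement width (in bytes) that can hold any signed
// integer with `precision` decimal digits: we need 10^p <= 2^(8n - 1).
inline int32_t MinBytesForPrecision(int32_t precision) {
  const double bits = precision * std::log2(10.0) + 1.0;  // +1 for the sign bit
  return static_cast<int32_t>(std::ceil(bits / 8.0));
}

// Picks the narrowest storage for a given precision (assumed >= 1).
inline DecimalStorageChoice SmallestDecimalStorage(int32_t precision) {
  if (precision <= 9) return {DecimalPhysical::INT32, 4};
  if (precision <= 18) return {DecimalPhysical::INT64, 8};
  return {DecimalPhysical::FIXED_LEN_BYTE_ARRAY,
          MinBytesForPrecision(precision)};
}
```

The drawback of such an implicit rule is that the storage type changes whenever the precision crosses a threshold, so a reader cannot tell from the logical type alone how the values were stored.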

@wesm
Member

wesm commented Apr 25, 2016

I see. For decimals, I agree we either need multiple Arrow types or need to add metadata to the DecimalType indicating the physical storage type. I would say it's better to make this explicit in the Arrow data type. Let me know what you think.

@xhochy
Member Author

xhochy commented May 1, 2016

A simple storage_type field would probably be enough for the DecimalType. As this probably needs to go into the spec, I made separate issues for it: https://issues.apache.org/jira/browse/ARROW-183 and https://issues.apache.org/jira/browse/ARROW-184
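
A minimal sketch of what an explicit storage_type field might look like follows. Only the field name `storage_type` comes from the comment above; the struct layout, the enum, and the `StorageFits` check are assumptions made for illustration and do not reflect what the linked issues eventually specified.

```cpp
#include <cstdint>

// Hypothetical set of storage choices for a decimal column.
enum class DecimalStorage { INT32, INT64, FIXED_LEN_BYTE_ARRAY };

// Hypothetical decimal type carrying its physical storage explicitly.
struct DecimalType {
  int32_t precision;
  int32_t scale;
  DecimalStorage storage_type;  // explicit, instead of derived from precision
};

// Checks that the declared storage can hold every value of the declared
// precision: 9 digits fit in 32 bits, 18 digits fit in 64 bits.
inline bool StorageFits(const DecimalType& t) {
  if (t.precision < 1) return false;
  switch (t.storage_type) {
    case DecimalStorage::INT32:
      return t.precision <= 9;
    case DecimalStorage::INT64:
      return t.precision <= 18;
    case DecimalStorage::FIXED_LEN_BYTE_ARRAY:
      return true;  // width is chosen separately to match the precision
  }
  return false;
}
```

With the storage made explicit, a schema converter can map the type directly and only needs a validation step like `StorageFits` rather than a policy decision.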

@xhochy xhochy changed the title from "[WIP] ARROW-92: Arrow to Parquet Schema conversion" to "ARROW-92: Arrow to Parquet Schema conversion" May 1, 2016
@xhochy
Member Author

xhochy commented May 1, 2016

The PR is now in a state that provides a minimal schema conversion basis for Pandas<->Parquet.

    break;
  case Type::CHAR:
    type = ParquetType::FIXED_LEN_BYTE_ARRAY;
    logical_type = LogicalType::UTF8;
Member

@wesm wesm May 1, 2016

Aside: we'll need to visit the string encoding question, as logical unicode characters won't map neatly onto a char(n) type
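
To illustrate the mismatch, the sketch below counts UTF-8 code points and compares that to the byte length. The helper `CountCodePoints` is hypothetical and assumes its input is valid UTF-8; it only shows that a fixed byte width does not correspond to a fixed number of characters.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a UTF-8 string by skipping continuation bytes,
// which always have the bit pattern 10xxxxxx. Assumes valid UTF-8 input.
inline std::size_t CountCodePoints(const std::string& utf8) {
  std::size_t count = 0;
  for (unsigned char byte : utf8) {
    if ((byte & 0xC0) != 0x80) ++count;
  }
  return count;
}

// "naive" written with U+00EF (i with diaeresis), which takes two bytes in
// UTF-8. The literal is split so the hex escapes cannot absorb later letters.
inline std::string NaiveWithDiaeresis() {
  return std::string("na") + "\xC3\xAF" + "ve";
}
```

Here `NaiveWithDiaeresis()` has 5 code points but occupies 6 bytes, so a 5-byte FIXED_LEN_BYTE_ARRAY could not hold a value that a char(5) column counted in characters would accept.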

@wesm
Member

wesm commented May 1, 2016

This looks good apart from the exception handling question.

@wesm
Member

wesm commented May 1, 2016

+1, thank you

@asfgit asfgit closed this in 355f7c9 May 1, 2016
@xhochy xhochy deleted the arrow-92 branch March 7, 2017 16:16
praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Aug 30, 2018
* Fix missing set the include directory of gtest
* Fix to use same format as other dependencies
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 2, 2018
Author: Aliaksei Sandryhaila <aliaksei.sandryhaila@hp.com>

Closes apache#68 from asandryh/PARQUET-537 and squashes the following commits:

18dca87 [Aliaksei Sandryhaila] Added a unit test.
dfb7a0b [Aliaksei Sandryhaila] PARQUET-537: Ensure that LocalFileSource is properly closed.
wesm pushed a commit to wesm/arrow that referenced this pull request Sep 4, 2018
Author: Aliaksei Sandryhaila <aliaksei.sandryhaila@hp.com>

Closes apache#68 from asandryh/PARQUET-537 and squashes the following commits:

18dca87 [Aliaksei Sandryhaila] Added a unit test.
dfb7a0b [Aliaksei Sandryhaila] PARQUET-537: Ensure that LocalFileSource is properly closed.

Change-Id: I9f2544a51e350464983f7ca511970b434d009f3a
praveenbingo pushed a commit to praveenbingo/arrow that referenced this pull request Sep 10, 2018
* Fix missing set the include directory of gtest
* Fix to use same format as other dependencies
xuechendi pushed a commit to xuechendi/arrow that referenced this pull request Aug 4, 2020
* Offset buffer can be pre-grown in Parquet ByteArray reader

* nit
zhouyuan pushed a commit to zhouyuan/arrow that referenced this pull request Jan 6, 2022
* Initial commit

* Introduce TranslateHolder

* Remove unused header
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Jan 7, 2022
* Initial commit

* Introduce TranslateHolder

* Remove unused header
zhouyuan added a commit to zhouyuan/arrow that referenced this pull request Jan 9, 2022
* Add translate expression support (apache#68)

* Initial commit

* Introduce TranslateHolder

* Remove unused header

* Return 1 if empty string is given as substring (apache#69)

* Add two math operations: floor & ceil (apache#72)

* Inital commit

* Add ceil function

Co-authored-by: PHILO-HE <feilong.he@intel.com>
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Feb 8, 2022
* Initial commit

* Introduce TranslateHolder

* Remove unused header
zhztheplayer pushed a commit to zhztheplayer/arrow-1 that referenced this pull request Mar 3, 2022
* Initial commit

* Introduce TranslateHolder

* Remove unused header
rui-mo pushed a commit to rui-mo/arrow-1 that referenced this pull request Mar 23, 2022
* Initial commit

* Introduce TranslateHolder

* Remove unused header
2 participants