ARROW-1976: [Python] Handling unicode pandas columns on parquet.read_table #1476

Licht-T · 2018-01-14T13:00:36Z

This closes ARROW-1976.

xhochy · 2018-01-14T13:03:10Z

python/pyarrow/pandas_compat.py

This should also be six.text_type so that we get unicode in Python 2. Using .decode('utf-8') might be the better option, though.

There is also frombytes in pyarrow.compat

Licht-T · 2018-01-25T00:18:56Z

Thanks @xhochy, fixed!

xhochy · 2018-01-25T22:33:08Z

python/pyarrow/pandas_compat.py

frombytes works only on bytes. Thus the above code is valid in Python 2 but breaks the unit tests on Python 3. Removing the str should fix this.
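To illustrate the point about a bytes-only helper, here is a minimal sketch. It does not import pyarrow.compat; frombytes_sketch is a hypothetical stand-in that assumes the helper simply decodes UTF-8 bytes on Python 3.

```python
def frombytes_sketch(obj):
    """Hypothetical stand-in for a bytes-only helper: decode UTF-8 bytes to text."""
    return obj.decode('utf-8')


raw_name = u'あ'.encode('utf-8')  # a column name stored as UTF-8 bytes

# Passing bytes works and yields text.
decoded = frombytes_sketch(raw_name)

# Wrapping the value in str() first hands the helper a text object on
# Python 3, which has no .decode method, so the call fails.
try:
    frombytes_sketch(str('col'))
    str_call_failed = False
except AttributeError:
    str_call_failed = True
```

Under this assumption, dropping the str() wrapper keeps the input as bytes, which is what the helper expects.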

wesm · 2018-01-30T20:04:56Z

I'm looking at this patch to get it passing

Licht-T · 2018-01-30T20:15:26Z

Sorry @wesm! This totally slipped my mind!

wesm · 2018-02-01T13:57:00Z

Can someone look at getting this ready to merge? I'm drowning trying to get on top of the PR queue and e-mail firehose. @cpcloud @xhochy

ylogx · 2018-02-01T14:37:31Z

I'd like to help. If it's fine with @Licht-T, I can pull his branch and create a new PR to add the required changes there.

simnyatsanga · 2018-02-01T15:13:49Z

@Licht-T @wesm @ylogx I made a PR against this PR's branch that addresses the requested feedback: Licht-T#1. Hopefully this is ok.

xhochy · 2018-02-01T18:16:06Z

Together with the changes from @simnyatsanga this is good to go.

Removing additional instances of using frombytes with str.

wesm · 2018-02-01T22:29:05Z

Cool, I just merged the changes and will wait for the CI to run.

wesm · 2018-02-02T05:14:51Z

This is still failing

simnyatsanga · 2018-02-02T14:32:35Z

I'm looking at getting the failing tests to pass. Specifically, one of the failing tests looks like this:
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_convert_pandas.py#L141

def test_multiindex_columns(self):
    columns = pd.MultiIndex.from_arrays([
        ['one', 'two'], ['X', 'Y']
    ])
    df = pd.DataFrame([(1, 'a'), (2, 'b'), (3, 'c')], columns=columns)
    _check_pandas_roundtrip(df, preserve_index=True)

The pandas roundtrip is failing in this branch on this line: https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L166
This happens because the column_name (which is part of a MultiIndex) is coming in as a tuple instead of a string. On master the column_name does come in as a string. I'm trying to figure out which changes on this branch would cause this.
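The tuple-versus-string mismatch can be sketched without pandas, assuming (as reported above) that each MultiIndex column label arrives as a tuple such as ('one', 'X'). The helper label_to_text is hypothetical and is not the code in pandas_compat.py; it only shows one way a label could be normalized to text regardless of its type.

```python
def label_to_text(label):
    """Hypothetical normalizer: return a text representation of a column label."""
    if isinstance(label, bytes):
        return label.decode('utf-8')  # bytes -> text
    if isinstance(label, str):
        return label                  # already text
    return str(label)                 # tuples, ints, and other label types


flat_label = 'one'
multi_label = ('one', 'X')  # assumed shape of a MultiIndex column label

flat_text = label_to_text(flat_label)
multi_text = label_to_text(multi_label)
```

A bytes-only decode step would reject multi_label outright, since a tuple has no .decode method; checking the type first avoids that.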

cpcloud · 2018-02-02T17:30:07Z

@Licht-T @wesm @simnyatsanga I'm taking a look at this now.

cpcloud · 2018-02-02T18:17:56Z

This is failing because of an assumption about the behavior of frombytes. Pushing up a fix shortly.

cpcloud · 2018-02-03T01:39:18Z

@Licht-T closing in favor of #1553

xhochy reviewed Jan 14, 2018

xhochy reviewed Jan 25, 2018

wesm force-pushed the fix-unicode-serde-in-py27 branch from 97f003b to e95b5f1 Compare January 30, 2018 16:45

simnyatsanga mentioned this pull request Feb 1, 2018

ARROW-1976: [Python] PR Feedback for Fix Pandas data SerDe with Unicode column names in Python 2.7 Licht-T/arrow#1

Merged

Licht-T and others added 4 commits February 1, 2018 17:28

BUG: Fix Pandas data SerDe with Unicode column names in Python 2.7

4fc6743

TST: Add tests for Pandas data SerDe with Unicode column names

3f5416d

BUG: Convert str by frombytes on pandas_compat.py

1f163a0

Not using str with frombytes to ensure Python3 tests pass.

0a12652

Removing additional instances of using frombytes with str.

wesm force-pushed the fix-unicode-serde-in-py27 branch from 15a2366 to 0a12652 Compare February 1, 2018 22:28

cpcloud changed the title ~~ARROW-1976: [Python] Fix Pandas data SerDe with Unicode column names in Python 2.7~~ ARROW-1976: [Python] Handling unicode pandas columns on parquet.read_table Feb 2, 2018

cpcloud closed this Feb 3, 2018

asfimport mentioned this pull request Feb 6, 2018

[Python] Handling unicode pandas columns on parquet.read_table #15643

Closed

wesm Jan 17, 2018

6 participants
