Skip to content

Conversation

@Licht-T
Copy link
Contributor

@Licht-T Licht-T commented Jan 14, 2018

This closes ARROW-1976.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should also be six.text_type so that we get unicode in Python 2. Probably, using .decode('utf-8') might also be the better option.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is also frombytes in pyarrow.compat

@Licht-T
Copy link
Contributor Author

Licht-T commented Jan 25, 2018

Thanks @xhochy, fixed!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

frombytes works only on bytes. Thus the above code is valid in Python 2 but breaks the unittests for Python 3. Removing the str should fix this.

@wesm wesm force-pushed the fix-unicode-serde-in-py27 branch from 97f003b to e95b5f1 Compare January 30, 2018 16:45
@wesm
Copy link
Member

wesm commented Jan 30, 2018

I'm looking at this patch to get it passing

@Licht-T
Copy link
Contributor Author

Licht-T commented Jan 30, 2018

Sorry @wesm! Ths totally slipped my mind!

@wesm
Copy link
Member

wesm commented Feb 1, 2018

Can someone look at getting this ready to merge? I'm drowning trying to get on top of the PR queue and e-mail firehose. @cpcloud @xhochy

@ylogx
Copy link

ylogx commented Feb 1, 2018

I'd like to help. If it's fine with @Licht-T, I can pull his branch and create a new PR to add the required changes there.

@simnyatsanga
Copy link
Contributor

@Licht-T @wesm @ylogx I made a PR with the requested feedback in this main PR here: Licht-T#1 . Hopefully this is ok.

@xhochy
Copy link
Member

xhochy commented Feb 1, 2018

Together with the changes from @simnyatsanga this is good to go.

@wesm wesm force-pushed the fix-unicode-serde-in-py27 branch from 15a2366 to 0a12652 Compare February 1, 2018 22:28
@wesm
Copy link
Member

wesm commented Feb 1, 2018

Cool, I just merged the changes, will await the CI to run

@wesm
Copy link
Member

wesm commented Feb 2, 2018

This is still failing

@simnyatsanga
Copy link
Contributor

I'm looking at getting the failing tests to pass. Specifically one of the failing tests looks like this
https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_convert_pandas.py#L141

def test_multiindex_columns(self):
        columns = pd.MultiIndex.from_arrays([
            ['one', 'two'], ['X', 'Y']
        ])
        df = pd.DataFrame([(1, 'a'), (2, 'b'), (3, 'c')], columns=columns)
        _check_pandas_roundtrip(df, preserve_index=True)

The pandas roundtrip is failing in this branch on this line: https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py#L166
Because the column_name (which is part of a MultiIndex) is coming in as a tuple instead of a string. On master the column_name is indeed coming in as a string. I'm trying to figure out, what changes on this branch would cause this.

@cpcloud
Copy link
Contributor

cpcloud commented Feb 2, 2018

@Licht-T @wesm @simnyatsanga I'm taking a look at this now.

@cpcloud
Copy link
Contributor

cpcloud commented Feb 2, 2018

This is failing because of an assumption about the behavior of frombytes. Pushing up a fix shortly.

@cpcloud cpcloud changed the title ARROW-1976: [Python] Fix Pandas data SerDe with Unicode column names in Python 2.7 ARROW-1976: [Python] Handling unicode pandas columns on parquet.read_table Feb 2, 2018
@cpcloud
Copy link
Contributor

cpcloud commented Feb 3, 2018

@Licht-T closing in favor of #1553

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants