Skip to content

ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata #1397

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 9 commits into from

Conversation

cpcloud
Copy link
Contributor

@cpcloud cpcloud commented Dec 6, 2017

No description provided.

@wesm
Copy link
Member

wesm commented Dec 6, 2017

@jorisvandenbossche would you mind reviewing and making sure this jives with the discussion so far?

@jorisvandenbossche
Copy link
Member

One special case that I encountered in #1386 is a DataFrame with column name None (from ipc when serializing a Series without name).
This case is not yet handled here:

In [6]: pa.Table.from_pandas(pd.DataFrame({None: [1,2,3]}))
Out[6]: 
pyarrow.Table
None: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "pandas_type": "mixed", "numpy_type": "object", "meta'
            b'data": null}], "columns": [{"name": null, "field_name": null, "p'
            b'andas_type": "int64", "numpy_type": "int64", "metadata": null}, '
            b'{"name": null, "field_name": "__index_level_0__", "pandas_type":'
            b' "int64", "numpy_type": "int64", "metadata": null}], "pandas_ver'
            b'sion": "0.22.0.dev0+260.g5da3759"}'}

So for the column, "name": null, "field_name": null, are both null, while field_name should be "None"

@cpcloud
Copy link
Contributor Author

cpcloud commented Dec 6, 2017

@jorisvandenbossche How is it possible in practice to get a Series with a name of None? How can one exist without explicitly constructing it or assigning its name to be None?

@jorisvandenbossche
Copy link
Member

Indeed by just constructing it and not getting it as a column from a dataframe.
Such a dataframe as I show above is constructed in the ipc code to serialize Series objects:

def _serialize_pandas_series(obj):
return _serialize_pandas_dataframe(pd.DataFrame({obj.name: obj}))

It is also tested explicitly: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L458

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from the None column name comment, looks good to me! And resembles the discussion IMO.

Added some minor comments.

@@ -132,7 +132,7 @@ def get_extension_dtype_info(column):
return physical_dtype, metadata


def get_column_metadata(column, name, arrow_type):
def get_column_metadata(column, name, arrow_type, field_name):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe update the docstring here as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

zip(index_levels, index_types)
get_column_metadata(
level,
name=level.name,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

column names are automatically stringified, should the same be done for index columns? (so str(level.name))
(although I don't find it that important to support, as it is never roundtrippable)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is to preserve the None value when parsing null in JSON.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, of course. (So for columns it is converted to string when not None)

@@ -450,9 +459,31 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1,

block_table = table

index_columns_set = frozenset(index_columns)

# 1. 'field_name' is the user-facing name of the column in an arrow Table
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the comment below you already describe the logical name as the "user-facing" name ("There must be the same number of logical names (user-facing) and physical names (fields in the arrow Table)")
I would add here as well an explanation how a "logical name" should be interpreted for clarity

idx0_name, foo_name = js['index_columns']
assert idx0_name == '__index_level_0__'
assert idx0['field_name'] == idx0_name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also assert that idx0['name'] is None, or is that already tested elsewhere adequately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add that


assert foo_name == '__index_level_1__'
assert foo['name'] == 'foo'

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For completeness, I would assert that for the other columns, the name and field_name are equal (this is obvious from the current code, but not explicitly tested):

col1, col2, idx0, foo = js['columns']

assert col1['name'] == col1['field_name']
assert col2['name'] == col2['field_name']

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, will do


assert col1['name'] == col1['field_name']
assert col2['name'] is None
assert col2['field_name'] is None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So in principle, IMO this should be the string "None" to make field_name perfectly usable for schema lookup.
Otherwise you need to do schema.get_field_index(field_name or "None")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this works without that, let me see what's happening.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're actually explicitly handling this case later on in table_to_blockmanager so this is okay.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the current code works without (just as it did work before with a "name" of None) as it is handled by table_to_blockmanager.
But that means that you will always have to special case this option, and for me that should be the point of "field_name" that schema.get_field_index(field_name) is guaranteed to not error (and you thus don't have to special case None)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've implemented this. Pushing it up now.

@cpcloud cpcloud force-pushed the ARROW-1895 branch 3 times, most recently from d448fe0 to 672d011 Compare December 7, 2017 14:40
Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

Thanks for the updates, once this is merged, I will rebase my PR #1386 against this to use the new functionality.

[['c', 'b', 'a'], [3, 2, 1]],
names=[None, 'foo']
)
).rename(columns=dict(zip(range(3), ['a', None, '__index_level_0__'])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not that important, but doing columns= ['a', None, '__index_level_0__'] inside the DataFrame call is a bit simpler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, thank you. That is much better.

@cpcloud cpcloud changed the title ARROW-1895: [Python] Add field_name to pandas index metadata ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata Dec 7, 2017

md = column_indexes['metadata']
assert len(md) == 1
assert md['encoding'] == 'UTF-8'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests are failing on this one. Maybe this is only the case for unicode and not for bytes, and thus only for not PY2 ?

@wesm
Copy link
Member

wesm commented Dec 10, 2017

I can merge this tomorrow once the build is passing, will also take a brief look through

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, let me look into the test failure

Change-Id: I8f90992bdda8aa0852e0a96d5078c3fe6df61352
@wesm
Copy link
Member

wesm commented Dec 10, 2017

I'm going to be AFK for about 5 or 6 hours, @cpcloud or @xhochy please go and ahead and merge this once the Python builds run in Travis CI so @jorisvandenbossche can rebase. I'll be back on later today to work on some other patches for 0.8.0

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM

@xhochy xhochy closed this in 501d60e Dec 10, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants