ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata #1397

cpcloud · 2017-12-06T20:04:25Z

No description provided.

wesm · 2017-12-06T21:06:28Z

@jorisvandenbossche would you mind reviewing and making sure this jibes with the discussion so far?

jorisvandenbossche · 2017-12-06T22:15:19Z

One special case that I encountered in #1386 is a DataFrame with column name None (from ipc when serializing a Series without name).
This case is not yet handled here:

In [6]: pa.Table.from_pandas(pd.DataFrame({None: [1,2,3]}))
Out[6]: 
pyarrow.Table
None: int64
__index_level_0__: int64
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
            b'me": null, "pandas_type": "mixed", "numpy_type": "object", "meta'
            b'data": null}], "columns": [{"name": null, "field_name": null, "p'
            b'andas_type": "int64", "numpy_type": "int64", "metadata": null}, '
            b'{"name": null, "field_name": "__index_level_0__", "pandas_type":'
            b' "int64", "numpy_type": "int64", "metadata": null}], "pandas_ver'
            b'sion": "0.22.0.dev0+260.g5da3759"}'}

So for the column, "name" and "field_name" are both null, while "field_name" should be the string "None".

cpcloud · 2017-12-06T22:37:45Z

@jorisvandenbossche How is it possible in practice to get a Series with a name of None? How can one exist without explicitly constructing it or assigning its name to be None?

jorisvandenbossche · 2017-12-06T22:47:19Z

Indeed, only by constructing it directly rather than taking it as a column from a DataFrame.
A DataFrame like the one I show above is constructed in the ipc code to serialize Series objects:

arrow/python/pyarrow/serialization.py

Lines 194 to 195 in 1d519d8

    def _serialize_pandas_series(obj):
        return _serialize_pandas_dataframe(pd.DataFrame({obj.name: obj}))

It is also tested explicitly: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L458

jorisvandenbossche

Apart from the None column name comment, looks good to me! And it reflects the discussion, IMO.

Added some minor comments.

jorisvandenbossche · 2017-12-06T22:36:30Z

python/pyarrow/pandas_compat.py

@@ -132,7 +132,7 @@ def get_extension_dtype_info(column):
    return physical_dtype, metadata


-def get_column_metadata(column, name, arrow_type):
+def get_column_metadata(column, name, arrow_type, field_name):


maybe update the docstring here as well

Thanks, will do

jorisvandenbossche · 2017-12-06T22:51:26Z

python/pyarrow/pandas_compat.py

-                zip(index_levels, index_types)
+            get_column_metadata(
+                level,
+                name=level.name,


column names are automatically stringified, should the same be done for index columns? (so str(level.name))
(although I don't find it that important to support, as it is never roundtrippable)

This is to preserve the None value when parsing null in JSON.

Ah yes, of course. (So for columns it is converted to string when not None)
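The naming rule discussed in this thread can be sketched as follows. This is a minimal illustration using only the standard library, not the code in pandas_compat.py; `logical_name` is a hypothetical helper introduced here for illustration.

```python
import json


def logical_name(name):
    # Hypothetical helper (an assumption, not pyarrow code): stringify a
    # name unless it is None, so that None survives as JSON null.
    return None if name is None else str(name)


names = [0, "foo", None]

# Serialize the names the way the metadata stores them, then parse back.
encoded = json.dumps([logical_name(n) for n in names])
decoded = json.loads(encoded)

# Unconditional str() would turn None into the string "None", which can no
# longer be told apart from a level that is literally named "None".
always_stringified = [str(n) for n in names]
```

With this rule a missing name round-trips as None, while any other name comes back as its string form.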

jorisvandenbossche · 2017-12-06T22:56:40Z

python/pyarrow/pandas_compat.py

@@ -450,9 +459,31 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1,

    block_table = table

+    index_columns_set = frozenset(index_columns)
+
+    # 1. 'field_name' is the user-facing name of the column in an arrow Table


In the comment below you already describe the logical name as the "user-facing" name ("There must be the same number of logical names (user-facing) and physical names (fields in the arrow Table)").
For clarity, I would also add an explanation here of how a "logical name" should be interpreted.
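One way to picture the logical/physical distinction is as a lookup from Arrow field names to pandas names. The sketch below is an assumption about the shape of the metadata based on the test snippets in this review; the variable names are illustrative and no pyarrow API is used.

```python
# Metadata entries shaped like the "columns" list in the pandas metadata.
# "name" is the logical (pandas) name; "field_name" is the physical name
# of the field in the Arrow table.
columns_metadata = [
    {"name": "a", "field_name": "a"},
    {"name": None, "field_name": "__index_level_0__"},
    {"name": "foo", "field_name": "__index_level_1__"},
]

# Map each physical field name to its logical pandas name.
physical_to_logical = {c["field_name"]: c["name"] for c in columns_metadata}

# Field names as they would appear in the Arrow table, in order.
arrow_field_names = ["a", "__index_level_0__", "__index_level_1__"]
pandas_names = [physical_to_logical[f] for f in arrow_field_names]
```

Here the unnamed index level maps back to None and the named level maps back to 'foo', while the ordinary column keeps the same name on both sides.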

jorisvandenbossche · 2017-12-06T23:01:54Z

python/pyarrow/tests/test_convert_pandas.py

+        idx0_name, foo_name = js['index_columns']
+        assert idx0_name == '__index_level_0__'
+        assert idx0['field_name'] == idx0_name
+


Should we also assert that idx0['name'] is None, or is that already tested elsewhere adequately?

I'll add that

jorisvandenbossche · 2017-12-06T23:04:05Z

python/pyarrow/tests/test_convert_pandas.py

+
+        assert foo_name == '__index_level_1__'
+        assert foo['name'] == 'foo'
+


For completeness, I would assert that for the other columns, the name and field_name are equal (this is obvious from the current code, but not explicitly tested):

    col1, col2, idx0, foo = js['columns']
    assert col1['name'] == col1['field_name']
    assert col2['name'] == col2['field_name']

Yep, will do

jorisvandenbossche · 2017-12-06T23:46:11Z

python/pyarrow/tests/test_convert_pandas.py

+
+        assert col1['name'] == col1['field_name']
+        assert col2['name'] is None
+        assert col2['field_name'] is None


So in principle, IMO this should be the string "None" to make field_name perfectly usable for schema lookup.
Otherwise you need to do schema.get_field_index(field_name or "None")

Hm, this works without that, let me see what's happening.

We're actually explicitly handling this case later on in table_to_blockmanager so this is okay.

Yes, the current code works without it (just as it worked before with a "name" of None), because the case is handled by table_to_blockmanager.
But that means you will always have to special-case None. For me, the point of "field_name" should be that schema.get_field_index(field_name) is guaranteed not to error, so you don't have to special-case None.

Ok, I've implemented this. Pushing it up now.
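The difference between the two approaches can be sketched with a plain list standing in for the schema. This is only an illustration under that assumption: `lookup` is a hypothetical stand-in, not `schema.get_field_index`, and the field names are made up for the example.

```python
# Field names of a table whose pandas column was named None; the Arrow
# field itself is named with the string "None".
arrow_field_names = ["a", "None", "__index_level_0__"]


def lookup(field_names, field_name):
    # Hypothetical stand-in for a schema lookup; raises ValueError if the
    # name is absent.
    return field_names.index(field_name)


# Metadata with field_name left as null: every caller must special-case it.
null_field_name = {"name": None, "field_name": None}
idx_special_cased = lookup(arrow_field_names, null_field_name["field_name"] or "None")

# Metadata with field_name stored as the string "None": direct lookup works.
string_field_name = {"name": None, "field_name": "None"}
idx_direct = lookup(arrow_field_names, string_field_name["field_name"])
```

Both lookups find the same field, but only the second works without the caller knowing about the None case.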

jorisvandenbossche

Looks good!

Thanks for the updates, once this is merged, I will rebase my PR #1386 against this to use the new functionality.

jorisvandenbossche · 2017-12-07T15:59:21Z

python/pyarrow/tests/test_convert_pandas.py

+                [['c', 'b', 'a'], [3, 2, 1]],
+                names=[None, 'foo']
+            )
+        ).rename(columns=dict(zip(range(3), ['a', None, '__index_level_0__'])))


not that important, but doing columns= ['a', None, '__index_level_0__'] inside the DataFrame call is a bit simpler

Yep, thank you. That is much better.

jorisvandenbossche · 2017-12-08T12:41:56Z

python/pyarrow/tests/test_convert_pandas.py

+
+        md = column_indexes['metadata']
+        assert len(md) == 1
+        assert md['encoding'] == 'UTF-8'


Tests are failing on this one. Maybe this holds only for unicode and not for bytes, and thus only outside PY2?

wesm · 2017-12-10T04:58:02Z

I can merge this tomorrow once the build is passing, will also take a brief look through

wesm

+1, let me look into the test failure

wesm · 2017-12-10T15:40:45Z

I'm going to be AFK for about 5 or 6 hours. @cpcloud or @xhochy, please go ahead and merge this once the Python builds run in Travis CI so @jorisvandenbossche can rebase. I'll be back on later today to work on some other patches for 0.8.0.

xhochy

+1, LGTM

wesm mentioned this pull request Dec 6, 2017

DOC: Update parquet metadata format description around index levels pandas-dev/pandas#18201

Merged

jorisvandenbossche reviewed Dec 6, 2017

cpcloud force-pushed the ARROW-1895 branch 3 times, most recently from d448fe0 to 672d011 Compare December 7, 2017 14:40

jorisvandenbossche approved these changes Dec 7, 2017

cpcloud force-pushed the ARROW-1895 branch from a10cae1 to 72ddcc2 Compare December 7, 2017 16:10

cpcloud changed the title ~~ARROW-1895: [Python] Add field_name to pandas index metadata~~ ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata Dec 7, 2017

TomAugspurger mentioned this pull request Dec 8, 2017

Compat for pyarrow 0.8.0 dask/dask#2973

Merged

jorisvandenbossche reviewed Dec 8, 2017

cpcloud added 8 commits December 9, 2017 10:44

ARROW-1895: [Python] Add field_name to pandas index metadata

20bf15a

Use categorical codes instead of object

f570871

Fix string vs unicode in column_indexes and add field_name as well

37dca10

Use field_name to map arrow table field names to pandas names

cf52001

Cleaner construction

3c41905

Unicode bytes difference

891671b

Operator precedence

3bc30fd

Fix py2 test

3f7760f

cpcloud force-pushed the ARROW-1895 branch from e1917a7 to 3f7760f Compare December 9, 2017 15:49

wesm mentioned this pull request Dec 9, 2017

ARROW-1883: [Python] Fix handling of metadata in to_pandas when not all columns are present #1386

Closed

wesm approved these changes Dec 10, 2017

Extra metadata is None in py2

1293b24

Change-Id: I8f90992bdda8aa0852e0a96d5078c3fe6df61352

xhochy approved these changes Dec 10, 2017

xhochy closed this in 501d60e Dec 10, 2017

