-
Notifications
You must be signed in to change notification settings - Fork 3.7k
ARROW-1895/ARROW-1897: [Python] Add field_name to pandas index metadata #1397
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jorisvandenbossche would you mind reviewing and making sure this jives with the discussion so far? |
One special case that I encountered in #1386 is a DataFrame with column name
So for the column, |
@jorisvandenbossche How is it possible in practice to get a |
Indeed by just constructing it and not getting it as a column from a dataframe. arrow/python/pyarrow/serialization.py Lines 194 to 195 in 1d519d8
It is also tested explicitly: https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_ipc.py#L458 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from the None column name comment, looks good to me! And resembles the discussion IMO.
Added some minor comments.
@@ -132,7 +132,7 @@ def get_extension_dtype_info(column): | |||
return physical_dtype, metadata | |||
|
|||
|
|||
def get_column_metadata(column, name, arrow_type): | |||
def get_column_metadata(column, name, arrow_type, field_name): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe update the docstring here as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
zip(index_levels, index_types) | ||
get_column_metadata( | ||
level, | ||
name=level.name, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
column names are automatically stringified, should the same be done for index columns? (so str(level.name)
)
(although I don't find it that important to support, as it is never roundtrippable)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is to preserve the None
value when parsing null
in JSON.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes, of course. (So for columns it is converted to string when not None)
python/pyarrow/pandas_compat.py
Outdated
@@ -450,9 +459,31 @@ def table_to_blockmanager(options, table, memory_pool, nthreads=1, | |||
|
|||
block_table = table | |||
|
|||
index_columns_set = frozenset(index_columns) | |||
|
|||
# 1. 'field_name' is the user-facing name of the column in an arrow Table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the comment below you already describe the logical name as the "user-facing" name ("There must be the same number of logical names (user-facing) and physical names (fields in the arrow Table)")
I would add here as well an explanation how a "logical name" should be interpreted for clarity
idx0_name, foo_name = js['index_columns'] | ||
assert idx0_name == '__index_level_0__' | ||
assert idx0['field_name'] == idx0_name | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also assert that idx0['name']
is None, or is that already tested elsewhere adequately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll add that
|
||
assert foo_name == '__index_level_1__' | ||
assert foo['name'] == 'foo' | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For completeness, I would assert that for the other columns, the name and field_name are equal (this is obvious from the current code, but not explicitly tested):
col1, col2, idx0, foo = js['columns']
assert col1['name'] == col1['field_name']
assert col2['name'] == col2['field_name']
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, will do
|
||
assert col1['name'] == col1['field_name'] | ||
assert col2['name'] is None | ||
assert col2['field_name'] is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So in principle, IMO this should be the string "None"
to make field_name
perfectly usable for schema lookup.
Otherwise you need to do schema.get_field_index(field_name or "None")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this works without that, let me see what's happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We're actually explicitly handling this case later on in table_to_blockmanager
so this is okay.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the current code works without (just as it did work before with a "name" of None) as it is handled by table_to_blockmanager
.
But that means that you will always have to special case this option, and for me that should be the point of "field_name" that schema.get_field_index(field_name)
is guaranteed to not error (and you thus don't have to special case None)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've implemented this. Pushing it up now.
d448fe0
to
672d011
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good!
Thanks for the updates, once this is merged, I will rebase my PR #1386 against this to use the new functionality.
[['c', 'b', 'a'], [3, 2, 1]], | ||
names=[None, 'foo'] | ||
) | ||
).rename(columns=dict(zip(range(3), ['a', None, '__index_level_0__']))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not that important, but doing columns= ['a', None, '__index_level_0__']
inside the DataFrame
call is a bit simpler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, thank you. That is much better.
|
||
md = column_indexes['metadata'] | ||
assert len(md) == 1 | ||
assert md['encoding'] == 'UTF-8' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests are failing on this one. Maybe this is only the case for unicode and not for bytes, and thus only for not PY2
?
I can merge this tomorrow once the build is passing, will also take a brief look through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, let me look into the test failure
Change-Id: I8f90992bdda8aa0852e0a96d5078c3fe6df61352
I'm going to be AFK for about 5 or 6 hours, @cpcloud or @xhochy please go and ahead and merge this once the Python builds run in Travis CI so @jorisvandenbossche can rebase. I'll be back on later today to work on some other patches for 0.8.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, LGTM
No description provided.