Description
Code Sample, a copy-pastable example if possible
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: pd.__version__
Out[3]: u'0.19.0'
In [4]: mi = pd.MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))
In [5]: df = pd.DataFrame([[1, 2], [3, 4]], mi)
In [6]: df.sort_index(na_position="first")
Out[6]:
         0  1
A B C
1 1 NaN  3  4
    3    1  2
In [7]: df.sort_index(na_position="last")
Out[7]:
         0  1
A B C
1 1 NaN  3  4
    3    1  2
Problem description
The na_position argument has no effect in DataFrame.sort_index() or Series.sort_index() when the index is a MultiIndex, because of the way we sort the MultiIndex. Whenever we create a MultiIndex, we store the labels as integer positions into each level rather than as the values themselves. For instance, if we have the following MultiIndex:
MultiIndex.from_tuples([[1, 1, 3], [1, 1, np.nan]], names=list('ABC'))
the values get stored as
MultiIndex(levels=[[1], [1], [3]],
labels=[[0, 0], [0, 0], [0, -1]],
names=[u'A', u'B', u'C'])
with a NaN placeholder of -1.
These label values are what get passed to the sorting algorithm for both DataFrames and Series. Since the sorting only happens on the labels, it has no notion of the NaN.
This has been discussed in #14015 and #14672.
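To make the storage described above concrete, here is a minimal sketch that inspects the integer codes of the last level. It assumes that newer pandas releases expose the attribute as `codes` rather than `labels`, so it falls back between the two. The final argsort is only an illustration of what sorting on the codes alone implies, not the actual pandas code path.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1, 1, 3), (1, 1, np.nan)], names=list("ABC"))

# `labels` was renamed to `codes` in later pandas releases; use whichever exists.
codes = mi.codes if hasattr(mi, "codes") else mi.labels

# Integer positions for level "C": 0 points at the value 3, -1 stands in for NaN.
last_level_codes = [int(c) for c in codes[-1]]
print(last_level_codes)

# Illustration only: a plain sort on the codes puts the -1 entry first,
# and nothing in the codes themselves says that -1 means "missing".
order = np.argsort(last_level_codes).tolist()
print(order)
```

Because -1 is smaller than every valid code, the NaN row comes first under an ascending sort on the codes, which is consistent with the output shown for both values of na_position.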
My original naive solution was to change these lines from:
indexer = _lexsort_indexer(labels.labels, orders=ascending,
                           na_position=na_position)
to
index_values_list = np.dstack(labels.get_values())[0].tolist()
indexer = _lexsort_indexer(index_values_list, orders=ascending,
                           na_position=na_position)
This didn't break any tests, but it isn't necessarily the best approach.
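The idea of sorting on the index values instead of the labels can also be tried from user code without touching pandas internals. The sketch below is a workaround under stated assumptions, not the proposed patch: `sort_by_index_values` is a hypothetical helper that moves the index levels into columns, sorts them with DataFrame.sort_values (which accepts na_position), and rebuilds the index.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_tuples([(1, 1, 3), (1, 1, np.nan)], names=list("ABC"))
df = pd.DataFrame([[1, 2], [3, 4]], index=mi)


def sort_by_index_values(frame, na_position="last"):
    """Hypothetical helper: sort on index values rather than integer codes."""
    names = list(frame.index.names)
    # Turn the index levels into ordinary columns so NaN is visible as a value.
    flat = frame.reset_index()
    flat = flat.sort_values(by=names, na_position=na_position)
    # Restore the original MultiIndex layout.
    return flat.set_index(names)


nan_last = sort_by_index_values(df, na_position="last")
nan_first = sort_by_index_values(df, na_position="first")
print(nan_last)
print(nan_first)
```

This only works when every index level is named, and it pays for a round trip through reset_index and set_index, so it is a way to check the expected ordering rather than a replacement for a fix inside sort_index.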
Expected Output
In [7]: df.sort_index(na_position="last")
Out[7]:
         0  1
A B C
1 1 3    1  2
    NaN  3  4
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.12.final.0
python-bits: 64
OS: Linux
OS-release: 3.16.0-77-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: None.None
pandas: 0.19.0
nose: 1.3.4
pip: 9.0.0
setuptools: 27.2.0
Cython: 0.21
numpy: 1.11.2
scipy: 0.16.1
statsmodels: 0.6.1
xarray: None
IPython: 4.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.5.3
pytz: 2016.7
blosc: None
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.5.0
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.5
sqlalchemy: 0.9.7
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
jinja2: 2.7.3
boto: 2.32.1
pandas_datareader: None