Series.reindex() does nothing #17132

Open
dniku opened this issue Jul 31, 2017 · 20 comments
Labels
Enhancement Indexing Related to indexing on series/frames, not to indexes themselves Interval Interval data type

Comments

@dniku

dniku commented Jul 31, 2017

Code Sample, a copy-pastable example if possible

import numpy as np
import pandas as pd

np.random.seed(42)

nums = np.random.choice(range(0, 999), 1000)
cut = pd.cut(nums, 10)

discretization = cut.value_counts()
discretization.sort_index(inplace=True)

intervals = list(discretization.index)
mids = [i.mid for i in intervals]

print(discretization.reindex(index=mids))

Output:

(-0.998, 99.8]    100
(99.8, 199.6]     103
(199.6, 299.4]     97
(299.4, 399.2]     94
(399.2, 499.0]     97
(499.0, 598.8]     85
(598.8, 698.6]    118
(698.6, 798.4]    103
(798.4, 898.2]    108
(898.2, 998.0]     95
dtype: int64

Problem description

Series.reindex() returns what looks like the original Series, with its interval index unchanged, even though a different index (mids) is passed.

Expected Output

print(pd.Series(discretization.values, index=mids)) produces:

49.401     100
149.700    103
249.500     97
349.300     94
449.100     97
548.900     85
648.700    118
748.500    103
848.300    108
948.100     95
dtype: int64
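
For readers after the expected output above: it comes from relabeling, that is, keeping the values in order and attaching new labels, rather than from matching labels. A minimal sketch of that relabeling, assuming a small made-up Series (the `counts` name and the break points are hypothetical, not the data from the report):

```python
import pandas as pd

# Hypothetical Series with an interval-based index (not the data from the report).
index = pd.IntervalIndex.from_breaks([0, 10, 20, 30])
counts = pd.Series([5, 7, 3], index=index)

# Midpoint of each interval, taken from the IntervalIndex itself.
mids = index.mid

# Relabel: keep the values in their current order and attach the new labels.
relabeled = pd.Series(counts.to_numpy(), index=mids)
print(relabeled)
```

No label matching happens here, so the result depends only on the order of the values.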

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.9.39-1-MANJARO
machine: x86_64
processor:
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.1.3
pip: 9.0.1
setuptools: 36.2.2
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 5.3.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.4.0

@gfyoung gfyoung added the Indexing Related to indexing on series/frames, not to indexes themselves label Jul 31, 2017
@gfyoung
Member

gfyoung commented Jul 31, 2017

@Pastafarianist : Thanks for the report! Unfortunately, we can't run this code because discretization_classes is not defined. Could you provide a reproducible code example?

In addition, if you could copy + paste the output (both actual and expected), that would be useful for anyone who's reading these issues.

@dniku
Author

dniku commented Jul 31, 2017

Apologies; please set it to any integer like 10. I've updated the code in the issue description.

@gfyoung
Member

gfyoung commented Jul 31, 2017

I've updated the code in the issue description.

Could you move whichever is your expected output under your sentence regarding the "second print" ?

@dniku
Author

dniku commented Jul 31, 2017

Done.

@gfyoung
Member

gfyoung commented Jul 31, 2017

Did you reverse the two? The expected output should be at the bottom, but I think you put it second-to-last if I understand the issue properly.

@dniku
Author

dniku commented Jul 31, 2017

No, I did not. The issue description reads correctly right now.

@gfyoung
Member

gfyoung commented Jul 31, 2017

Ah, actually, yes I see that now. One other thing: you're missing a definition for intervals in your code. Could you add a definition for that?

@dniku
Author

dniku commented Jul 31, 2017

So sorry for that. I was copy-pasting from a Jupyter Notebook. I've updated the description and made sure that the code runs as-is.

@gfyoung
Member

gfyoung commented Jul 31, 2017

No worries! Can confirm the code is runnable standalone.

That's odd... not sure why you can't reindex when the index consists of Interval objects. This reindexing works if the index is a sequence of integers. Marking as a bug unless otherwise explained. PR is welcome!

@gfyoung gfyoung added the Bug label Jul 31, 2017
@jreback
Contributor

jreback commented Jul 31, 2017

@Pastafarianist not sure what you think this should do. .reindex matches up the existing labels with the new ones. Reindexing with floats is meaningless with an IntervalIndex (which is the index backing a CategoricalIndex).

do you simply want

In [3]: discretization.index = mids

In [4]: discretization
Out[4]: 
49.401     100
149.700    103
249.500     97
349.300     94
449.100     97
548.900     85
648.700    118
748.500    103
848.300    108
948.100     95
dtype: int64

This is reindexing.

In [7]: discretization.reindex(discretization.index[[0, 2, 5]])
Out[7]: 
(-0.998, 99.8]    100
(199.6, 299.4]     97
(499.0, 598.8]     85
dtype: int64
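
For comparison, the label matching described above can be sketched on an ordinary (non-interval) index. The Series and its labels are made up for illustration:

```python
import pandas as pd

# Hypothetical Series with plain string labels.
s = pd.Series([100, 97, 85], index=["a", "b", "c"])

# Labels present in the old index keep their values;
# "z" is not in the old index, so it is filled with NaN.
result = s.reindex(["c", "a", "z"])
print(result)
```

The result carries exactly the labels that were requested, in the requested order, and the dtype becomes float to accommodate the NaN.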

@jreback jreback closed this as completed Jul 31, 2017
@jreback jreback added this to the No action milestone Jul 31, 2017
@gfyoung
Member

gfyoung commented Jul 31, 2017

@jreback : Ah, I see. I also had the implementation mixed up in my mind, but given the behavior, why don't we just get a Series of NaN ?

@jorisvandenbossche
Member

IMO this is certainly a bug. It should never return the original series: it should either do an actual reindex (if we decide that e.g. 1 matches Interval(0, 2)) or, if no matches are found, return a Series of all NaN.
The reindex method should always return an object with the new index that you passed.

@jorisvandenbossche jorisvandenbossche modified the milestones: Next Major Release, No action Aug 1, 2017
@jorisvandenbossche
Member

Actually, on second look, it is reindexing; it just looks identical because you get the same values back (it is just doing the same as discretization.loc[mids]).
This is another example of the discussion we had yesterday about the differences between loc and reindex. IMO the expected output of @Pastafarianist is more logical (more in line with what reindex is supposed to do).

@jreback
Contributor

jreback commented Aug 1, 2017

IMO the expected output of @Pastafarianist is more logical (more in line of what reindex is supposed to do).

Not at all. This is a correct result. The points happen to act as a reindexer of the intervals. E.g.

Not very useful, but two intervals are picked out because they contain the points 20 and 200. The 2000 gets NaN because it's not found.

In [13]: discretization.reindex(index=[20, 200, 2000])
Out[13]: 
(-0.998, 99.8]    100.0
(199.6, 299.4]     97.0
2000                NaN
dtype: float64
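
The point-in-interval matching described in this comment can be looked at in isolation with IntervalIndex.get_indexer. This is only a sketch with made-up break points; whether Series.reindex goes through the same lookup for a given index type may differ between pandas versions:

```python
import pandas as pd

# Hypothetical intervals (0, 100], (100, 200], (200, 300]; closed on the right by default.
intervals = pd.IntervalIndex.from_breaks([0, 100, 200, 300])

# For each point: the position of the interval containing it, or -1 if none does.
positions = intervals.get_indexer([20, 250, 2000])
print(positions)
```

A position of -1 corresponds to the "not found" case, which is the entry that ends up as NaN in a reindexed result.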

@jreback jreback added the Interval Interval data type label Aug 1, 2017
@jreback
Contributor

jreback commented Aug 1, 2017

xref #16386

@jorisvandenbossche
Member

jorisvandenbossche commented Aug 1, 2017

@jreback can you then explain the docstring? (emphasis mine)

Conform Series to new index with optional filling logic, placing
NA/NaN in locations having no value in the previous index. A new object
is produced unless the new index is equivalent to the current one and copy=False

(the docstring could also have been incorrect for a long time of course)

I would argue that the IntervalIndex and the integers (the mids) are not equivalent in this case

@jreback
Contributor

jreback commented Aug 1, 2017

what part is not correct? seems ok to me.

I would argue that the IntervalIndex and the integers (the mids) are not equivalent in this case

this is correct based on what Interval selection does. Selecting a point that falls within an interval selects that interval. Hence you get back the original index.

@gfyoung
Member

gfyoung commented Aug 1, 2017

But they are not identical though, which is @jorisvandenbossche's point, and if your explanation is the expected behavior, I don't see that in the docs.

@ceciliassis

ceciliassis commented Oct 10, 2018

Sorry for the late reply, but I seem to have the same problem with integer indexes.
The code:

import math
from IPython import display
from matplotlib import cm
from matplotlib import gridspec
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import metrics
import tensorflow as tf
from tensorflow.python.data import Dataset

california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
print(california_housing_dataframe.head())

california_housing_dataframe = california_housing_dataframe.reindex(np.random.permutation(california_housing_dataframe.index).tolist())
print(california_housing_dataframe.head())

Whenever I use the numpy permutation the result goes wrong. I've tried shell, jupyter notebook and script.

Here's the output
screenshot from 2018-10-10 11-15-59

Configs

Python version: 2.7 and 3.6
Pandas version: 0.22.0 and 0.23.6
Conda: Anaconda3
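
As a point of reference for the snippet in this comment, here is a minimal sketch of what reindexing with a permuted integer index does, using a small made-up DataFrame in place of the CSV (the `rooms` column and the seed are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the CSV loaded in the comment above.
df = pd.DataFrame({"rooms": [10, 20, 30, 40]})

# Seeded generator so the shuffle is repeatable.
rng = np.random.RandomState(0)
shuffled = df.reindex(rng.permutation(df.index))

# Rows are reordered, but each row keeps its original label and value.
print(shuffled)
```

Under these assumptions the rows come back in a different order with their original labels attached, so a head() that shows non-consecutive index labels is what a shuffle is expected to look like.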

@jorisvandenbossche
Member

@ceciliassis It's not immediately clear what is wrong in your output, or how it is related to this issue. If you think there is a bug, please open a new issue.

@boldt boldt mentioned this issue Oct 12, 2018
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
6 participants