Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
class MySeq(pd.Series):
    _metadata = ['property']

    @property
    def _constructor(self):
        return MySeq

seq = MySeq([*'abc'], name='data')
assert seq.name == 'data'
assert seq[1:2].name == 'data'
assert seq[[1, 2]].name is None
assert seq.drop_duplicates().name is None
Issue Description
pandas 2.2.3
Let's consider two variants of defining a custom subtype of pandas.Series. In the first one, no custom properties are added, while in the second one, custom metadata is included:
import pandas as pd
class MySeries(pd.Series):
    @property
    def _constructor(self):
        return MySeries

seq = MySeries([*'abc'], name='data')
print(f'''Case without _metadata:
{isinstance(seq[0:1], MySeries) = }
{isinstance(seq[[0, 1]], MySeries) = }
{seq[0:1].name = }
{seq[[0, 1]].name = }
''')
class MySeries(pd.Series):
    _metadata = ['property']

    @property
    def _constructor(self):
        return MySeries

seq = MySeries([*'abc'], name='data')
seq.property = 'MyProperty'
print(f'''Case with defined _metadata:
{isinstance(seq[0:1], MySeries) = }
{isinstance(seq[[0, 1]], MySeries) = }
{seq[0:1].name = }
{seq[[0, 1]].name = }
{getattr(seq[0:1], 'property', 'NA') = }
{getattr(seq[[0, 1]], 'property', 'NA') = }
''')
The output of the code above will be:
Case without _metadata:
isinstance(seq[0:1], MySeries) = True
isinstance(seq[[0, 1]], MySeries) = True
seq[0:1].name = 'data'
seq[[0, 1]].name = 'data'
Case with defined _metadata:
isinstance(seq[0:1], MySeries) = True
isinstance(seq[[0, 1]], MySeries) = True
seq[0:1].name = 'data'
seq[[0, 1]].name = None <<< Problematic result of indexing
getattr(seq[0:1], 'property', 'NA') = 'MyProperty'
getattr(seq[[0, 1]], 'property', 'NA') = 'MyProperty'
So, if _metadata is defined, the series name is preserved when slicing but lost when indexing with a list, whereas without _metadata the name is preserved in both cases.
As a workaround, we can add 'name' to _metadata:
class MySeries(pd.Series):
    _metadata = ['property', 'name']

    @property
    def _constructor(self):
        return MySeries

seq = MySeries([*'abc'], name='data')
assert seq[0:1].name == 'data'
assert seq[[0, 1]].name == 'data'
However, I'm not sure whether treating name as a metadata attribute causes other issues further down the line.
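A possible alternative to listing 'name' by hand is to extend the inherited _metadata list instead of replacing it. The sketch below assumes that pd.Series._metadata already contains whatever entry pandas uses to carry the name; that assumption should be checked on the pandas version in use, so the code prints what it finds rather than asserting the outcome.

```python
import pandas as pd


class MySeries(pd.Series):
    # Extend the inherited list instead of replacing it, so any entries
    # pandas registers on Series itself stay in place. Assumption: those
    # inherited entries are what carries the name through indexing.
    _metadata = [*pd.Series._metadata, 'property']

    @property
    def _constructor(self):
        return MySeries


seq = MySeries([*'abc'], name='data')
seq.property = 'MyProperty'

# Inspect the lists and the resulting names on the installed version.
print(f'{pd.Series._metadata = }')
print(f'{MySeries._metadata = }')
print(f'{seq[0:1].name = }')
print(f'{seq[[0, 1]].name = }')
print(f'{seq.drop_duplicates().name = }')
```

If the inherited list turns out to be empty on a given version, this variant behaves like the original definition with only 'property' and does not help.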
The problem arose when applying PyJanitor methods to user-defined DataFrames with _metadata. Specifically, drop_duplicates was applied to a separate column, followed by an attempt to access its name in order to combine the result into a new DataFrame.
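Until the behavior changes, downstream code that needs the name can restore it defensively. The following is a minimal sketch of that idea using a plain Series subclass rather than the actual PyJanitor code path; the helper with_name_of is hypothetical and not part of pandas or PyJanitor.

```python
import pandas as pd


class MySeries(pd.Series):
    _metadata = ['property']

    @property
    def _constructor(self):
        return MySeries


def with_name_of(result, source):
    """Return result, renamed to source.name if it came back unnamed.

    Hypothetical helper for illustration; not a pandas API.
    """
    if result.name is None and source.name is not None:
        return result.rename(source.name)
    return result


column = MySeries([*'abca'], name='data')
unique = with_name_of(column.drop_duplicates(), column)

# to_frame() uses the series name as the column label, so the name
# has to be present before the result is combined into a DataFrame.
frame = unique.to_frame()
print(list(frame.columns))
```

The helper leaves the result untouched when the name survived, so it should be harmless on versions where the issue does not occur.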
Expected Behavior
import pandas as pd
class MySeq(pd.Series):
    _metadata = ['property']

    @property
    def _constructor(self):
        return MySeq

seq = MySeq([*'abc'], name='data')
assert seq[[1, 2]].name == 'data'
assert seq.drop_duplicates().name == 'data'
Installed Versions
INSTALLED VERSIONS
commit : cfe54bd
python : 3.13.2
python-bits : 64
OS : Linux
OS-release : 4.15.0-213-generic
Version : #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 3.0.0.dev0+2124.gcfe54bd5da
numpy : 2.3.0.dev0+git20250304.6611d55
dateutil : 2.9.0.post0
pip : 24.3.1
Cython : None
sphinx : None
IPython : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
fastparquet : None
fsspec : None
html5lib : None
hypothesis : None
gcsfs : None
jinja2 : None
lxml.etree : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
psycopg2 : None
pymysql : None
pyarrow : None
pyiceberg : None
pyreadstat : None
pytest : None
python-calamine : None
pytz : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlsxwriter : None
zstandard : None
tzdata : 2025.2
qtpy : None
pyqt5 : None