Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea: Make pd.Index.values not writable #33001

Open
ivirshup opened this issue Mar 25, 2020 · 13 comments
Open

Idea: Make pd.Index.values not writable #33001

ivirshup opened this issue Mar 25, 2020 · 13 comments
Labels
Enhancement Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@ivirshup
Copy link
Contributor

ivirshup commented Mar 25, 2020

Code Sample, a copy-pastable example if possible

import pandas as pd

a = pd.DataFrame({"col1": [1, 2, 3]}, index=["a", "b", "c"])
b = a.copy()

b.index.values[0] = "newval"
assert a.index[0] == "newval"

Problem description

Index objects are assumed to be immutable, but wrap a mutable object. In my experience, it's pretty common to just grab the .values from pandas object. This has caused surprising behaviour in my own code, and has been discussed here before #19862.

I would like to make sure that the values numpy array for index objects has it'sWRITABLE flag set to false, to make Index object more immutable.

Does this sound reasonable? If so, I can give a PR a shot.

Expected Output

# This should have thrown an error
b.index.values[0] = "newval"

Output of pd.show_versions()

[paste the output of pd.show_versions() here below this line]

INSTALLED VERSIONS

commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.3.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 1.0.3
numpy : 1.18.2
pytz : 2018.9
dateutil : 2.7.5
pip : 20.0.2
setuptools : 46.0.0
Cython : 0.29.14
pytest : 5.3.5
hypothesis : 5.6.0
sphinx : 2.4.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.4.1
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.8.2
bottleneck : None
fastparquet : 0.3.3
gcsfs : None
lxml.etree : 4.4.1
matplotlib : 3.2.0
numexpr : 2.7.1
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : 0.16.0
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : 0.8.3
xarray : 0.15.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : 0.48.0

@ivirshup
Copy link
Contributor Author

ivirshup commented Mar 25, 2020

As is want to happen, this looks more complicated than I expected. After changing the values property to return a read only view, 222 tests fail. I've investigated a bit. These fail with one of two error messages:

ValueError: buffer source array is read-only (101 failures)

This looks like a cython issue, but I'm not that familiar with cython. I couldn't quite follow the history, but I think either cython memory view arrays don't work on read only memory, or they require a different signature to work. Not quite sure how to work around this one, but suspect there's a templating solution.

ValueError: assignment destination is read-only (120 failures)

Intervals

It looks like interval arrays are constructed out of two indexes. But interval arrays can be mutated. This is currently done by copying the Index objects, then modifying the copies inplace. This seems very fixable.

These tests include: Most of the TestSetitem classes tests, TestSeriesConvertDtypes.test_convert_dtypes

Update: This was very easy to fix.

A number of tests rely on inplace modification of index objects

For example: test_unique_null, test_fillna_null, test_unique_null, test_value_counts_null, TestFloat64Index.test_fillna, TestFloat64Index.test_hasnans_isnans

I'm thinking these should just not do that.

@jbrockmendel
Copy link
Member

Do the tracebacks show what cython functions are raising? usually that means we need to change a type annotation in that function

@ivirshup
Copy link
Contributor Author

Yes, it's a bunch of functions. I could try and collect those if that would be useful.

Here's one example:

pandas/tests/indexes/datetimes/test_join.py:42: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
pandas/core/indexes/base.py:3441: in join
    self, how=how, level=level, return_indexers=return_indexers
pandas/core/indexes/datetimelike.py:856: in join
    sort=sort,
pandas/core/indexes/base.py:3451: in join
    return this.join(other, how=how, return_indexers=return_indexers)
pandas/core/indexes/base.py:3471: in join
    other, how=how, return_indexers=return_indexers
pandas/core/indexes/base.py:3778: in _join_monotonic
    lidx = self._left_indexer_unique(ov, sv)
pandas/core/indexes/base.py:239: in _left_indexer_unique
    return libjoin.left_join_indexer_unique(left, right)
pandas/_libs/join.pyx:270: in pandas._libs.join.left_join_indexer_unique
    def left_join_indexer_unique(join_t[:] left, join_t[:] right):
stringsource:658: in View.MemoryView.memoryview_cwrapper
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

>   ???
E   ValueError: buffer source array is read-only

stringsource:349: ValueError

So I thought this could be fixable by changing the signature from

left_join_indexer_unique(join_t[:] left, join_t[:] right)

to

left_join_indexer_unique(const join_t[:] left, const join_t[:] right)

But it turns out doing that raises this fun exception during compilation:

Exception: This may never happen, please report a bug
Full compilation traceback
Compiling pandas/_libs/join.pyx because it changed.
[1/1] Cythonizing pandas/_libs/join.pyx

Error compiling Cython file:
------------------------------------------------------------
...

# right might contain non-unique values

@cython.wraparound(False)
@cython.boundscheck(False)
def left_join_indexer_unique(const join_t[:] left, const join_t[:] right):
                                         ^
------------------------------------------------------------

pandas/_libs/join.pyx:280:42: Compiler crash in AnalyseDeclarationsTransform

File 'ModuleNode.py', line 124, in analyse_declarations: ModuleNode(join.pyx:1:0,
    full_module_name = 'pandas._libs.join')
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:1:0)
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:280:0)
File 'Nodes.py', line 375, in analyse_declarations: CompilerDirectivesNode(join.pyx:280:0)
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:280:0)
File 'Nodes.py', line 2887, in analyse_declarations: DefNode(join.pyx:280:0,
    modifiers = [...]/0,
    name = 'left_join_indexer_unique',
    num_required_args = 2,
    outer_attrs = [...]/2,
    py_wrapper_required = True,
    reqd_kw_flags_cname = '0')
File 'Nodes.py', line 2925, in analyse_argument_types: DefNode(join.pyx:280:0,
    modifiers = [...]/0,
    name = 'left_join_indexer_unique',
    num_required_args = 2,
    outer_attrs = [...]/2,
    py_wrapper_required = True,
    reqd_kw_flags_cname = '0')
File 'Nodes.py', line 1093, in analyse: MemoryViewSliceTypeNode(join.pyx:280:42,
    name = 'memoryview')

Compiler crash traceback from this point on:
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/Nodes.py", line 1093, in analyse
    self.type = PyrexTypes.MemoryViewSliceType(base_type, axes_specs)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 625, in __init__
    self.dtype_name = Buffer.mangle_dtype_name(self.dtype)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/Buffer.py", line 631, in mangle_dtype_name
    return prefix + dtype.specialization_name()
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 57, in specialization_name
    common_subs = (self.empty_declaration_code()
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 51, in empty_declaration_code
    self._empty_declaration = self.declaration_code('')
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 1582, in declaration_code
    return self.const_base_type.declaration_code("const %s" % entity_code, for_display, dll_linkage, pyrex)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 1649, in declaration_code
    raise Exception("This may never happen, please report a bug")
Exception: This may never happen, please report a bug
Traceback (most recent call last):
  File "setup.py", line 791, in <module>
    setup_package()
  File "setup.py", line 761, in setup_package
    ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
  File "setup.py", line 534, in maybe_cythonize
    return cythonize(extensions, *args, **kwargs)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 1102, in cythonize
    cythonize_one(*args)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 1225, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: pandas/_libs/join.pyx
(pandas-dev) isaac@Mimir:~/github/pandas ‹master›
$ python3 setup.py build_ext --inplace -j 12
Compiling pandas/_libs/join.pyx because it changed.
[1/1] Cythonizing pandas/_libs/join.pyx

Error compiling Cython file:
------------------------------------------------------------
...

# right might contain non-unique values

@cython.wraparound(False)
@cython.boundscheck(False)
def left_join_indexer_unique(const join_t[:] left, const join_t[:] right):
                                         ^
------------------------------------------------------------

pandas/_libs/join.pyx:270:42: Compiler crash in AnalyseDeclarationsTransform

File 'ModuleNode.py', line 124, in analyse_declarations: ModuleNode(join.pyx:1:0,
    full_module_name = 'pandas._libs.join')
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:1:0)
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:270:0)
File 'Nodes.py', line 375, in analyse_declarations: CompilerDirectivesNode(join.pyx:270:0)
File 'Nodes.py', line 431, in analyse_declarations: StatListNode(join.pyx:270:0)
File 'Nodes.py', line 2887, in analyse_declarations: DefNode(join.pyx:270:0,
    modifiers = [...]/0,
    name = 'left_join_indexer_unique',
    num_required_args = 2,
    outer_attrs = [...]/2,
    py_wrapper_required = True,
    reqd_kw_flags_cname = '0')
File 'Nodes.py', line 2925, in analyse_argument_types: DefNode(join.pyx:270:0,
    modifiers = [...]/0,
    name = 'left_join_indexer_unique',
    num_required_args = 2,
    outer_attrs = [...]/2,
    py_wrapper_required = True,
    reqd_kw_flags_cname = '0')
File 'Nodes.py', line 1093, in analyse: MemoryViewSliceTypeNode(join.pyx:270:42,
    name = 'memoryview')

Compiler crash traceback from this point on:
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/Nodes.py", line 1093, in analyse
    self.type = PyrexTypes.MemoryViewSliceType(base_type, axes_specs)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 625, in __init__
    self.dtype_name = Buffer.mangle_dtype_name(self.dtype)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/Buffer.py", line 631, in mangle_dtype_name
    return prefix + dtype.specialization_name()
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 57, in specialization_name
    common_subs = (self.empty_declaration_code()
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 51, in empty_declaration_code
    self._empty_declaration = self.declaration_code('')
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 1582, in declaration_code
    return self.const_base_type.declaration_code("const %s" % entity_code, for_display, dll_linkage, pyrex)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Compiler/PyrexTypes.py", line 1649, in declaration_code
    raise Exception("This may never happen, please report a bug")
Exception: This may never happen, please report a bug
Traceback (most recent call last):
  File "setup.py", line 791, in <module>
    setup_package()
  File "setup.py", line 761, in setup_package
    ext_modules=maybe_cythonize(extensions, compiler_directives=directives),
  File "setup.py", line 534, in maybe_cythonize
    return cythonize(extensions, *args, **kwargs)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 1102, in cythonize
    cythonize_one(*args)
  File "/Users/isaac/miniconda3/envs/pandas-dev/lib/python3.7/site-packages/Cython/Build/Dependencies.py", line 1225, in cythonize_one
    raise CompileError(None, pyx_file)
Cython.Compiler.Errors.CompileError: pandas/_libs/join.pyx

@shoyer
Copy link
Member

shoyer commented Mar 26, 2020

There was a long-standing issue in Cython where it didn't used to support read-only memory views, but that was fixed a few years ago now:
cython/cython#1605
https://cython.readthedocs.io/en/latest/src/userguide/memoryviews.html#read-only-views

I don't know what minimum Cython version is required these days to build pandas from source, but it seems like it should probably be OK to use read-only views in pandas now?

@ivirshup Were you using at least Cython version 0.28 when you encountered the error?

@ivirshup
Copy link
Contributor Author

ivirshup commented Mar 26, 2020

@shoyer yes, this was using cython v0.29.16 (I'm using the pandas/environment.yaml conda env).

Those are the docs that made me think I could just chuck a const in front of the signature.

It looks like @jbrockmendel may have run into this issue before cython/cython#3222.

@jbrockmendel
Copy link
Member

For fused types like join_t cython doesnt support const memoryviews until 0.30, so we're eagerly awaiting that

@ivirshup
Copy link
Contributor Author

I see that this is fixed in: cython/cython#3118. It looks like this was not considered a bug fix.

Would pandas put a hard restriction on cython >=0.30.0 (it looks like this might be >=3) once it releases, or do you aim to support older cython versions as well?

@jbrockmendel
Copy link
Member

or do you aim to support older cython versions as well?

We distribute wheels, so don't have cython as an install requirement. we are pretty aggressive in bumping it as a dev requirement.

I'm also looking forward to conditional nogil in 0.30

@TomAugspurger
Copy link
Contributor

Regardless of the Cython issues, is this something we actually want to do? What are the implications of this? When we create a Series from an Index (say via a .reset_index()) would the Series data be writable?

Keep in mind that we'd like to support indexes backed by non-ndarrays some day, which won't necessarily have the same API / capability for marking data as non-writable.

@ivirshup
Copy link
Contributor Author

ivirshup commented Mar 27, 2020

If you make a series from an index, you should be making a copy of the underlying data, so yes it would be writable.

I think the current sharing of references, with the false assumption of immutability, can lead to some very bad behavior. I think if pd.Index objects are going to be treated as immutable, their public methods and attributes should enforce that.

One solution for array classes which do not support immutability would be to return an instance of an immutable view wrapper class. Another would be to return copies of the data from .values and .array.

Update: I've opened a PR for this, mainly so it's easy to share what the barriers are.

@shoyer
Copy link
Member

shoyer commented Mar 27, 2020

I think the current sharing of references, with the false assumption of immutability, can lead to some very bad behavior. I think if pd.Index objects are going to be treated as immutable, their public methods and attributes should enforce that.

Strong 👍 from me. "False immutability" has resulted in a number of subtle downstream bugs in xarray, for example.

I think it would be better to make an extra array copy when converting from an Index to a Series than to allow modifying the series to break the values. Ideally we would have copy-on-write, but that's a bigger challenge.

@jbrockmendel jbrockmendel added the Index Related to the Index class or subclasses label Apr 1, 2020
@ivirshup
Copy link
Contributor Author

@TomAugspurger, did you have more thoughts on the feasibility/ desirability of immutability enforcement? I wasn't sure if I should be putting more effort into this yet.

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Apr 16, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Enhancement Index Related to the Index class or subclasses Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants