BUG: 1.2.5 -> 1.3.x breaks yaml.dump(DataFrameObject) #42748

Closed
2 of 3 tasks
gunthergl opened this issue Jul 27, 2021 · 13 comments · Fixed by #44137
Labels
Bug Compat pandas objects compatability with Numpy or Python functions
Comments

@gunthergl

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
import yaml

tmp = pd.DataFrame({'col_name': [1, 2, 3, 4]})

dumped = yaml.dump(tmp)  # Fails starting with pandas==1.3.0
print(dumped)

Problem description

Another package, pytorch-lightning, has the option to save all given hyperparameters. It serializes every given parameter to YAML, which for DataFrames produced the rather ugly output shown under Expected Output. Starting with pandas==1.3.0 this breaks, as the MWE above shows.

If this is expected behaviour from pandas, I apologize; in that case the issue should probably go to pytorch-lightning.

Expected Output

    !!python/object:pandas.core.frame.DataFrame
    _flags:
      allows_duplicate_labels: true
    _metadata: []
    _mgr: !!python/object/new:pandas.core.internals.managers.BlockManager
      state: !!python/tuple
      - &id004
        - !!python/object/apply:pandas.core.indexes.base._new_Index
          - &id002 !!python/name:pandas.core.indexes.base.Index ''
          - data: !!python/object/apply:numpy.core.multiarray._reconstruct
              args:
              - &id001 !!python/name:numpy.ndarray ''
              - !!python/tuple
                - 0
              - !!binary |
                Yg==
              state: !!python/tuple
              - 1
              - !!python/tuple
                - 1
              - &id003 !!python/object/apply:numpy.dtype
                args:
                - O8
                - false
                - true
                state: !!python/tuple
                - 3
                - '|'
                - null
                - null
                - null
                - -1
                - -1
                - 63
              - false
              - - col_name
            name: null
        - !!python/object/apply:pandas.core.indexes.base._new_Index
          - !!python/name:pandas.core.indexes.range.RangeIndex ''
          - name: null
            start: 0
            step: 1
            stop: 4
      - - &id005 !!python/object/apply:numpy.core.multiarray._reconstruct
          args:
          - *id001
          - !!python/tuple
            - 0
          - !!binary |
            Yg==
          state: !!python/tuple
          - 1
          - !!python/tuple
            - 1
            - 4
          - !!python/object/apply:numpy.dtype
            args:
            - i8
            - false
            - true
            state: !!python/tuple
            - 3
            - <
            - null
            - null
            - null
            - -1
            - -1
            - 0
          - false
          - !!binary |
            AQAAAAAAAAACAAAAAAAAAAMAAAAAAAAABAAAAAAAAAA=
      - - !!python/object/apply:pandas.core.indexes.base._new_Index
          - *id002
          - data: !!python/object/apply:numpy.core.multiarray._reconstruct
              args:
              - *id001
              - !!python/tuple
                - 0
              - !!binary |
                Yg==
              state: !!python/tuple
              - 1
              - !!python/tuple
                - 1
              - *id003
              - false
              - - col_name
            name: null
      - 0.14.1:
          axes: *id004
          blocks:
          - mgr_locs: !!python/object/apply:builtins.slice
            - 0
            - 1
            - 1
            values: *id005
    _typ: dataframe
    attrs: {}

Output of pd.show_versions()

pandas 1.2.5:

INSTALLED VERSIONS

commit : 7c48ff4
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 1.2.5
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 3.5.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

pandas 1.3.1:

INSTALLED VERSIONS

commit : c7f7443
python : 3.8.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.19042
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : de_DE.cp1252

pandas : 1.3.1
numpy : 1.21.1
pytz : 2021.1
dateutil : 2.8.2
pip : 21.1.3
setuptools : 52.0.0.post20210125
Cython : None
pytest : 6.2.4
hypothesis : None
sphinx : 3.5.4
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.6.3
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.0.1
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 2021.07.0
fastparquet : None
gcsfs : None
matplotlib : 3.4.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : 1.6.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

@gunthergl gunthergl added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 27, 2021
@MarcoGorelli
Member

Thanks @gunthergl for the report - I just left git bisect running and #40842 was found to be the first bad commit (cc @jbrockmendel). Haven't looked into it further though.

@jreback
Contributor

jreback commented Jul 27, 2021

if yaml is looking at the internal structures then they need to update the package - these are by definition private and can / do break

@jbrockmendel
Member

xref #40226

@simonjayhawkins
Member

close as won't fix? or add 1.3.2 milestone to track discussion?

@simonjayhawkins simonjayhawkins added Closing Candidate May be closeable, needs more eyeballs Compat pandas objects compatability with Numpy or Python functions Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jul 28, 2021
@jreback
Contributor

jreback commented Jul 28, 2021

close as won't fix as no details on what is broken

@simonjayhawkins
Member

Thanks @gunthergl for the report. have closed as upstream issue.

@simonjayhawkins simonjayhawkins added Upstream issue Issue related to pandas dependency and removed Closing Candidate May be closeable, needs more eyeballs Needs Discussion Requires discussion from core team before further action Upstream issue Issue related to pandas dependency labels Jul 28, 2021
@simonjayhawkins
Member

Thanks @gunthergl for the report. have closed as upstream issue.

sorry. downstream issue.

@nitzmahone

nitzmahone commented Jul 29, 2021

The underlying problem here that's making pyyaml's default representer blow up is that the __reduce__ impl of one of these objects changed recently to return a partial for the factory function; I assume that's related to the Cythonizing of one of those types I saw referenced above. Basically if reduce doesn't return a function whose name we can import and call, pyyaml won't be able to serialize those objects anymore without a custom dumper (and I'd guess this is a problem for other things as well).

yaml/pyyaml#541 (comment)
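
To illustrate the failure mode described above, here is a minimal stdlib-only sketch. It does not reproduce PyYAML's representer: `importable_name`, `make_point`, `ViaFunction`, and `ViaPartial` are hypothetical names invented for this illustration. The only assumption is that a name-based serializer needs both `__module__` and `__name__` on the callable returned by `__reduce__`, and a `functools.partial` object has no `__name__`.

```python
import functools


def make_point(x, y, scale=1):
    """Hypothetical module-level factory used to rebuild an object."""
    return (x * scale, y * scale)


def importable_name(func):
    """Build the 'module.name' string a name-based serializer would store."""
    return f"{func.__module__}.{func.__name__}"


class ViaFunction:
    def __reduce__(self):
        # Plain function: it has a __name__, so it can be referenced by name.
        return make_point, (1, 2)


class ViaPartial:
    def __reduce__(self):
        # functools.partial instance: it carries no __name__ attribute.
        return functools.partial(make_point, scale=2), (1, 2)


plain_factory = ViaFunction().__reduce__()[0]
partial_factory = ViaPartial().__reduce__()[0]

name_ok = importable_name(plain_factory)

try:
    importable_name(partial_factory)
    partial_failed = False
except AttributeError:
    # The partial cannot be turned into an importable dotted name.
    partial_failed = True

print(name_ok, partial_failed)
```

Note that `pickle` can still handle the partial case because it pickles the partial object itself; the limitation only affects serializers that must write out the factory as an importable name.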

@jbrockmendel
Member

fair enough. PR to avoid the partial would be welcome
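
One shape such a change could take, as a rough sketch and not the actual pandas patch: pass the extra argument positionally to a module-level factory instead of binding it with `functools.partial`. The names `Block`, `_rebuild_block`, and the `ndim` argument below are made up for the illustration and are not claimed to match pandas internals.

```python
def _rebuild_block(cls, values, ndim):
    """Module-level factory: it has a __name__, unlike a functools.partial."""
    obj = cls.__new__(cls)
    obj.values = values
    obj.ndim = ndim
    return obj


class Block:
    def __init__(self, values, ndim):
        self.values = values
        self.ndim = ndim

    def __reduce__(self):
        # Before (problematic for name-based serializers):
        #   return functools.partial(_rebuild_block, ndim=self.ndim), (type(self), self.values)
        # After: the extra argument travels in the args tuple instead.
        return _rebuild_block, (type(self), self.values, self.ndim)


# Rebuild the object by hand from what __reduce__ returns.
factory, args = Block([1, 2, 3, 4], ndim=1).__reduce__()
restored = factory(*args)
print(factory.__name__, restored.values, restored.ndim)
```

Because the factory is now an ordinary function, a serializer can record it by its dotted name and call it again with the stored arguments.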

@jbrockmendel jbrockmendel reopened this Jul 29, 2021
@lithomas1 lithomas1 modified the milestones: No action, Contributions Welcome Jul 29, 2021
@simonjayhawkins simonjayhawkins modified the milestones: Contributions Welcome, 1.3.2 Jul 30, 2021
@biiiipy

biiiipy commented Aug 1, 2021

Having the same issue

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.2, 1.3.3 Aug 15, 2021
@simonjayhawkins simonjayhawkins modified the milestones: 1.3.3, 1.3.4 Sep 11, 2021
@simonjayhawkins
Member

changing milestone to 1.3.5

@simonjayhawkins simonjayhawkins modified the milestones: 1.3.4, 1.3.5 Oct 16, 2021
@jbrockmendel
Member

@gunthergl I'm working on a fix for this and could use help writing a test. Starting with the dumped from the OP, I'd like to reconstruct from dumped and then assert that the result matches the original. How do I do that re-loading? yaml.load(dumped) doesn't do it.

@gunthergl
Author

Hi @jbrockmendel, pytorch-lightning uses yaml.UnsafeLoader - I read about the reason somewhere but no longer remember it exactly. However, the following should work:

import pandas as pd
import yaml

tmp = pd.DataFrame({'col_name': [1, 2, 3, 4]})

dumped = yaml.dump(tmp)  # Fails starting with pandas==1.3.0
print(dumped)

# Both loaders can construct arbitrary Python objects, so use them only on trusted input.
loaded_UnsafeLoader = yaml.load(dumped, Loader=yaml.UnsafeLoader)
print(loaded_UnsafeLoader)
loaded_Loader = yaml.load(dumped, Loader=yaml.Loader)
print(loaded_Loader)

# The round trip should give back a frame equal to the original.
assert tmp.equals(loaded_UnsafeLoader)
assert tmp.equals(loaded_Loader)


8 participants