Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix pandas compatibility with Python installations lacking bzip2 headers #53858

Merged
merged 10 commits into from
Jul 7, 2023

Conversation

MilesCranmer
Copy link
Contributor

@MilesCranmer MilesCranmer commented Jun 26, 2023

This simple change makes the bz2 import optional. Some Python installations do not contain bzip2 headers, so unless you are working with bzip compression, there is no reason for pandas to eagerly load (and require) a bz2 import.

@MilesCranmer MilesCranmer changed the title BUG: Make pandas compatible with Python installations lacking bzip2 headers BUG: Fix pandas compatibility with Python installations lacking bzip2 headers Jun 26, 2023
@mroeschke mroeschke added the IO Data IO issues that don't fit into a more specific label label Jun 26, 2023
@lithomas1
Copy link
Member

Thanks for the PR.

Can you add a test?

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented Jun 26, 2023

In principle, yes, but because bz2 is a Python standard library, adding a test for this would require setting up a new GitHub action so that we could install a particular lightweight Python that lacks the libbzip2 headers. The existing bz2 tests will already hit the behavior when bz2 is compiled; what we can't test easily is the inverse.

The current macOS-latest image on GitHub actions with a Python 3.7 install lacks the libbzip2 headers. However, apparently this is not intended behavior and will be patched soon: actions/setup-python#682. I noticed this bz2 requirement only because of that bug in the GitHub action, but I thought I might as well fix it anyways as it seems like libbzip2 isn't always installed.

@lithomas1
Copy link
Member

Maybe we can just check that bz2 is not in sys.modules?

@MilesCranmer
Copy link
Contributor Author

Do you mean rather than catching the ImportError? I guess we could but it would be less readable. I didn't even know what sys.modules was until you mentioned it and I looked it up.

@lithomas1
Copy link
Member

I meant for the test. Everything else looks good.

I was thinking something like this.

>>> import sys
>>> sys.modules.pop('bz2', None) # Remove 'bz2' from sys.modules if it exists
>>> import pandas as pd
>>> sys.modules.pop('bz2', None) # Check for 'bz2' in sys.modules
<module 'bz2' from '/Users/thomasli/opt/miniconda3/lib/python3.9/bz2.py'>

(except for the last line, you would use an assert I think).

Does this work?

@MilesCranmer
Copy link
Contributor Author

Ah, I understand. That sounds like it should work. Where should I put that test so as to not interfere with those that require bz2?

@lithomas1
Copy link
Member

Ah, I understand. That sounds like it should work. Where should I put that test so as to not interfere with those that require bz2?

I think any one of the test files that doesn't already explictly import bz2 with (import bz2) should be fine. Maybe pandas/tests/io/test_common.py?

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented Jun 26, 2023

It looks like that one has a compression test:

    @pytest.mark.parametrize("encoding", ["utf-16", "utf-32"])
    @pytest.mark.parametrize("compression_", ["bz2", "xz"])
    def test_warning_missing_utf_bom(self, encoding, compression_):
        """
        bz2 and xz do not write the byte order mark (BOM) for utf-16/32.

        https://stackoverflow.com/questions/55171439

        GH 35681
        """

So I guess it would break that test?

I'll do the following for a test FYI:

...
sys.modules.pop("bz2", None)  # Remove 'bz2' from available modules for testing
import pandas as pd
...
from pandas.compat import get_bz2_file

...

def test_bz2_nonimport():
    assert "bz2" not in sys.modules
    msg = "bz2 module not available."
    with pytest.raises(RuntimeError, match=msg):
        get_bz2_file()

@lithomas1
Copy link
Member

You can try pandas/tests/test_common.py as well.

That should probably be safer.

pandas/tests/test_common.py Outdated Show resolved Hide resolved
@lithomas1
Copy link
Member

I think you're missing the first test now, which is to check that bz2 is not in sys.modules after importing pandas.

Can you add that test (similar to how you're doing the other one now)?

@MilesCranmer
Copy link
Contributor Author

Sorry I'm not sure I understand. Wouldn't bz2 always be in sys.modules in this test? We are just setting it to None. Why would importing pandas remove bz2 from sys.modules?

@MilesCranmer
Copy link
Contributor Author

Popping bz2 from the modules does not trigger an import error, if that is the point of confusion:

[ins] In [1]: import sys

[ins] In [2]: sys.modules.pop('bz2', None)
Out[2]: <module 'bz2' from '/Users/mcranmer/mambaforge/envs/pandas-dev/lib/python3.10/bz2.py'>

[ins] In [3]: sys.modules['bz2']
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[3], line 1
----> 1 sys.modules['bz2']

KeyError: 'bz2'

[ins] In [4]: import pandas as pd

[nav] In [5]: sys.modules['bz2']
Out[5]: <module 'bz2' from '/Users/mcranmer/mambaforge/envs/pandas-dev/lib/python3.10/bz2.py'>

Only sys.modules['bz2'] = None could trigger the import error.

@lithomas1
Copy link
Member

I think you're missing the first test now, which is to check that bz2 is not in sys.modules after importing pandas.

Can you add that test (similar to how you're doing the other one now)?

To clarify, I meant to check that bz2 is not in sys.modules after importing pandas. If you popped bz2 out of sys.modules than it should not appear again after importing pandas.

To avoid messing with the other tests, you should launch a new Python process like the other test.

The way the import error test is done now is fine.

@MilesCranmer
Copy link
Contributor Author

To clarify, I meant to check that bz2 is not in sys.modules after importing pandas. If you popped bz2 out of sys.modules than it should not appear again after importing pandas.

This is what I showed in the example above. Even after popping it, it reappears in sys.module.

@rhshadrach

This comment was marked as resolved.

@MilesCranmer
Copy link
Contributor Author

MilesCranmer commented Jun 28, 2023

I thought this PR was meant to make it so bz2 is not imported when you do import pandas as pd but only when needed. Doesn't this mean importing pandas on a setup that doesn't have bz2 will fail?

No, bz2 is still imported when loading pandas, but now it only happens when bz2 is installed. I did it this way to be consistent with the other optional libraries in pandas (e.g., lzma), but I could change it; let me know.

What happens at runtime now is that it checks whether bz2 was loaded or not before importing the BZ2File class.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@mroeschke mroeschke added this to the 2.1 milestone Jul 7, 2023
@mroeschke mroeschke merged commit 4576909 into pandas-dev:main Jul 7, 2023
32 checks passed
@mroeschke
Copy link
Member

Thanks @MilesCranmer

Daquisu pushed a commit to Daquisu/pandas that referenced this pull request Jul 8, 2023
… headers (pandas-dev#53858)

* BUG: Make bz2 import optional

* CLN: Create `get_bz2_file` to match `get_lzma_file`

* DOC: Add bz2 bugfix to changelog

* TST: Test bz2 non-import works

* TST: Test bz2 non-import from subprocess

* TST: Fix bz2 non-import test

* TST: Fix indentation issues in bz2 import test

* MAINT: Clean up merge commit

* Mark bz2 missing test with `single_cpu`
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas incompatible with Python installations lacking bzip2 headers
5 participants