Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Suggestion: remove tests from the distribution #30741

Open
vfilimonov opened this issue Jan 6, 2020 · 30 comments · May be fixed by #53007
Open

Suggestion: remove tests from the distribution #30741

vfilimonov opened this issue Jan 6, 2020 · 30 comments · May be fixed by #53007
Assignees
Labels
Build Library building on various platforms Testing pandas testing functions or related to the test suite

Comments

@vfilimonov
Copy link
Contributor

Would it make sense to remove tests folder from the pandas distribution? It takes roughly 33% of the whole package weight.

It is especially important when using pandas inside the AWS Lambdas, where the deployment package size is limited to 50 MB zipped and 5 MB might really make a difference.

# Uncompressed
du -h -s pandas*
 46.5M	pandas
 30.9M	pandas_no_tests

# Compressed
du -h -s pandas*
 14.7M	pandas.zip
 10.1M	pandas_no_tests.zip
@TomAugspurger
Copy link
Contributor

I think we've talked about that in the past. We do have pandas.test() as part of the public API however, so we'd need to consider that.

A couple options:

  1. Provide a separate distribution like pandas-slim or something that excludes these files (and docs)
  2. Have pandas.test() fetch the source files on demand. That seems a bit messy though.

Just as a note: we do exclude the test data files that are present in the git repository. So we're only talking about source files.

@vfilimonov
Copy link
Contributor Author

Hello @TomAugspurger

pandas-slim sounds like an good workaround.

It looks like docs are not a part of the distribution.
And right it's tests code only without the data files - in terms of size they are second to _libs and almost equal to the rest of the code.

8.0K	./arrays
 16K	./errors
 24K	./api
 40K	./__pycache__
 76K	./compat
 88K	./_config
248K	./tseries
308K	./util
440K	./plotting
2.3M	./io
6.7M	./core
 17M	./tests
 20M	./_libs

And what is the reason of having tests as a part of an API? I see that numpy, scipy, matplotlib etc are doing the same (while many other libs, especially web-oriented like flask, requests, jinja don't)?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Jan 6, 2020 via email

@stonecharioteer
Copy link

I particularly used to run the tests for numpy, scikitlearn and matplotlib after installing, since at times I'd have them fail on Windows. However this was quite some time ago, 4 years ago perhaps. Perhaps other users were doing the same?

@jonringer
Copy link

jonringer commented Jan 9, 2020

I'm a package manager for nixpkgs, and I'm against removing tests from the sdist package, however, removing from the wheel would make sense from a packaging standpoint. It's considered best practice in FOSS that if you distribute source, you also distribute tests along side it.

EDIT:
tests are a nice guarantee that the package is working as intended.

We could also checkout the github repo for tests. However, quickly looking at the setup.py, pandas was meant to have the CI set correct metadata such as version. So the version-controlled source can't be directly be used to package pandas.

@joaoe
Copy link

joaoe commented Jan 26, 2020

Hi. i reported this elsewhere, so I'm pasting my comment here.

My use case
After a pip install pandas the lib/site-packages/pandas/tests/ includes a lot of testing code which is definitely not relevant for me and many other end users of pandas.
This bloats the installation and makes installation slower.
I'm working on packaging a python environment to distribute with a preinstalled set of modules and application and there are too many popular 3rd-party modules which include unneeded test code, like numpy, IPython, jupyterlab, etc, which needs to be striped to keep the package size down. I'll be reporting issues to these projects as well.

Suggestion
Therefore, my suggestion is to keep the pandas module streamlined, and move the tests out. Perhaps create a pandas-unittests module if people are interested in it, or just expect users to checkout the code. Another possibility would be to skip packaging the tests folder and conftest.py when creating packages to upload to pypi.org.

Regarding pandas-slim, everyone and their mothers have a dependency on pandas which would pull the whole code again with tests.

It's considered best practice in FOSS that if you distribute source, you also distribute tests along side it.

That perfectly fine. The discussion is whether tests are bundled with the pandas module or not.

Since you are now almost releasing 1.0 it might be a bit short notice to include this is such a big release. But for the next major release, it could work.

Thank you very much for your attention.

@mroeschke mroeschke added Build Library building on various platforms Testing pandas testing functions or related to the test suite labels Jan 27, 2020
@vfilimonov
Copy link
Contributor Author

A small remark: as a part of recent commit to pyarrow @wesm removed pyarrow.tests from the wheel which to my understanding contributed 2.3 MB of ~60 MB installed size.

In case of pandas tests folder contributed (as of version 1.0.3) tests folder contributed 17.9 MB out of 49 MB installed size.

So I'd like to bring the question back to the discussion and perhaps, @wesm could comment on that?

@wesm
Copy link
Member

wesm commented Jun 3, 2020

I think it would be a good idea to not ship the tests in wheels. If you want users to be able to run the tests against their production installs perhaps the tests can be packaged as a separate source wheel. Install size is becoming a problem because of size constraints in things like AWS Lambda.

@TomAugspurger
Copy link
Contributor

https://uwekorn.com/2020/10/28/trimming-down-pyarrow-conda-2-of-x.html has some information.

I think I've come around to the idea that we can just not ship the test files in the main pandas distributions. We can have a separate pandas-tests so that pip install pandas-tests package that's just

  1. The test files
  2. A small __init__.py file ties things together

We could even update pandas.test() to check for the presence of the pandas-tests package.

@dazza-codes
Copy link

dazza-codes commented Oct 30, 2020

In rasterio/rasterio-wheels#59, I'm requesting some kind of shared-libs solution that might help the python package ecosystem for binary wheels and a similar request could be made for shared testing libs. In the context of testing, it might/could help to have separate numpy.testing and pandas.testing projects that might be integrated into a parent sci-testing or similar uber-testing package with shared dependencies (it likely impacts other projects that are also using a pattern of including some kind of {package}.test utilities by default (vs. optional extra package). It's likely that whatever pandas needs for testing might also depend on numpy.testing and if that package is split out of the upstream numpy (similar to numpy/numpy#17620 ??), then it would have implications for work on this and similar issues. (Might be useful to check a git grep 'numpy.testing' or similar grep on the pandas repo.)

@jreback
Copy link
Contributor

jreback commented Oct 30, 2020

pandas does not use numpy.testing a all

@virtuald
Copy link

I don't use pandas at all, but I saw the link from the numpy issue.

It's worth noting that with implicit namespace packages you could technically ship pandas.tests in a separate wheel and wouldn't break backwards compatibility (other than the new requirement). I've found the primary downside of namespace packages is that editable installations have annoying caveats, but in an installed system it seems to be supported just fine.

@jorisvandenbossche
Copy link
Member

For the conda side, @xhochy did this for pyarrow here conda-forge/arrow-cpp-feedstock#186, so that could be inspiration for doing the same for the pandas conda package.

@jameslamb
Copy link
Contributor

jameslamb commented Dec 30, 2020

It sounds like maintainers here are generally supportive of cutting tests and docs out of the package. I recently went through this exercise in LightGBM (microsoft/LightGBM#3639, microsoft/LightGBM#3685), and I'd be happy to do the work here to reduce the package size. Would the maintainers here be open to PRs if I started working on this?

Details & Background

I think Wes's comment about AWS Lambda is an important one: #30741 (comment). A few months ago, I worked on trying to use pandas + scikit-learn + lightgbm together in Lambda Layers, and had to do some weird stuff to make it work while staying under the 250 MB limit for layers, like this:

TEMP_LAYER_DIR=$(pwd)/temp-layer-dir
mkdir -p ${TEMP_LAYER_DIR}

echo "Creating 'pandas-layer'"
docker run \
    --rm \
    -v ${TEMP_LAYER_DIR}:/var/task \
    lambci/lambda:build-python3.7 \
        pip install --no-deps -t python/lib/python3.7/site-packages pandas==0.24.1 pytz

pushd ${TEMP_LAYER_DIR}
zip \
    --exclude \*/tests/\* \*dist-info\* \*/__pycache__\* \
    -r pandas-layer.zip \
    python
mv *.zip ${UPLOAD_DIR}/
rm -r python

I think a first step that wouldn't need to involve changes to pandas.test() could be removing documentation from the source distribution. I can see most of https://github.com/pandas-dev/pandas/tree/master/doc in the sdist package, even the cheatsheet powerpoints / PDFs.

wget https://files.pythonhosted.org/packages/75/bc/abf2e8cc6a9d918008774b958613cfdbd3a8c135cffb0757f78fabd8268f/pandas-1.2.0.tar.gz -O pandas.tar.gz
tar -xvf pandas.tar.gz
cat pandas-1.2.0/pandas.egg-info/SOURCES.txt | grep -E "^doc"

Thanks for your time and consideration!

@jreback
Copy link
Contributor

jreback commented Dec 30, 2020

yup would take PRs here

@jameslamb
Copy link
Contributor

Thanks! I've opened #38846 with a proposal to remove docs. It reduces the size of the sdist package by about 1MB.

I think that removing tests needs to be done by maintainers, though, if it will involve distributing a new pandas-tests package (#30741 (comment)).

I really hope that is picked up, as it would have a significant impact on the size of the package.

master #38846 #38846 + removing tests
sdist (compressed) 5.7M 4.5M 2.3M
sdist (uncompressed) 25M 19M 7.7M
python setup.py sdist > /dev/null
du -a -h dist/

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 5, 2021

Thanks! I've opened #38846 with a proposal to remove docs. It reduces the size of the sdist package by about 1MB.

I think that removing tests needs to be done by maintainers, though, if it will involve distributing a new pandas-tests package (#30741 (comment)).

I really hope that is picked up, as it would have a significant impact on the size of the package.

master #38846 #38846 + removing tests
sdist (compressed) 5.7M 4.5M 2.3M
sdist (uncompressed) 25M 19M 7.7M

python setup.py sdist > /dev/null
du -a -h dist/

@jameslamb Can you check your calcs on the difference if tests are removed? I think you included the folder tests/io/sas/data which takes up 12MB, but it is not in the distribution - just in the source repo.

@jameslamb
Copy link
Contributor

jameslamb commented Jan 5, 2021

I just tried removing that directory completely before running the size checks. The results are the same

# check size
rm -rf pandas.egg-info
rm -rf dist
python setup.py sdist
du -a -h dist/

# check size without tests/io/sas/data/
rm -rf pandas/tests/io/sas/data/
rm -rf pandas.egg-info
rm -rf dist
python setup.py sdist
du -a -h dist/

I also tried this with prune pandas/tests added to MANIFEST.in (which is how I originally calculated the "without tests" numbers).

with pandas/tests/io/sas/data in the repo

with pandas/tests/io/sas/data removed from the repo

When I check pandas.egg-info/SOURCES.txt in all cases, I don't see tests/io/sas/data anywhere.

This is the expected behavior. I think all the data files you're referring to will be excluded by these rules.

It is interesting that the new compressed size I'm seeing is different from when I ran these checks a week ago. But this is a very active project so I'm assuming that that's related to some new PRs that have been merged.

@Dr-Irv
Copy link
Contributor

Dr-Irv commented Jan 6, 2021

@jameslamb I was surprised that in your original numbers, with the uncompressed version, the result lowered from 19M to 7.7M, so that's why I thought the SAS data was the culprit.

If I look at my installation of pandas 1.1.3, the uncompressed installed tests directory contains just over 9MB. Still justifies figuring out if we can leave it out of the dist.

@jameslamb
Copy link
Contributor

Providing an update here, it seems that distributions of pandas have grown by maybe ~15MB (uncompressed) since I last calculated it in December (#30741 (comment)).

On my laptop (running Python 3.7 and Ubuntu 18.04), running the following

# install pandas 1.2.3
pip install --upgrade pandas

# check sizes
du -a ${HOME}/miniconda3/lib/python3.7/site-packages/pandas \
   | sort -n -r \
   | head -n 10

results in

68548   pandas
35944   pandas/tests
17188   pandas/_libs
10000   pandas/core
5028    pandas/tests/indexes
4996    pandas/_libs/tslibs
4700    pandas/tests/frame
4620    pandas/tests/io
3376    pandas/tests/series
3244    pandas/io

The tests directory now looks to be about 35MB uncompressed.

I would help with this, but I think that removing tests from pandas distributions is a fairly invasive change (see @TomAugspurger 's comment from October: #30741 (comment)) that has to be done by maintainers. I'm personally +1 on doing that, for the reasons mentioned in this thread and in #38846 (comment).

@TomAugspurger
Copy link
Contributor

I'm also +1 on removing tests from the distribution (and optionally pulling them down on demand for pandas.test()), but won't have time to work on it.

@jreback
Copy link
Contributor

jreback commented Mar 29, 2021

agree we should try to package separately but this would take some dedicated work

@jayqi
Copy link

jayqi commented Apr 9, 2021

Hi, I was following along with PR #38846 which was recently merged, and I wanted to discuss the idea that I repeatedly saw @WillAyd bring up: specifically the idea of implementing a pandas[no-doc] PEP-508-compliant dependency specification as a way to give users an option to opt out of documentation in their packages. Unfortunately, I think this idea is technically infeasible given the mechanics of PEP 508 dependencies. (Would be very happy to be proven wrong because this has been an annoying limitation in other projects.)

Basically, the problem is that dependency specifications only allow adding dependencies, not subtracting them. This is reflected in the actual PEP 508 language (emphasis mine):

A dependency specification always specifies a distribution name. It may include extras, which expand the dependencies of the named distribution to enable optional features.

An extra is an optional part of a distribution. Distributions can specify as many extras as they wish, and each extra results in the declaration of additional dependencies of the distribution when the extra is used in a dependency specification. For instance:

There an ongoing discussion on the Python Software Foundation discourse about how to implement dependency specifications that can be opted out of.

In the context of solutions for pandas, this means that the only way to give users an option of not having a dependency is to leave them out of the base distribution, and then having them be opt-in.

I'd also like to note that my understanding is that PEP 508 dependency specifications are for dependencies on other Python packages, and usually not about including additional package data. So it's not even clear to me how one would implement a pandas[docs] opt-in solution, though perhaps there is a creative way to do so.

@mroeschke
Copy link
Member

One potential difficulty before implementing this:

We note for EA authors that there is a set of test that can be run to check EA compatibility: https://pandas.pydata.org/docs/development/extending.html#testing-extension-arrays

So if tests were removed from the distribution, a subset of pandas/tests/extensions/ probably should not be removed

@viccsjain
Copy link

Splitting the pandas library and tests would be really useful. We are using this library in our serverless deployment. and there is size restriction to upload the package into AWS lambda of 250 MB. Removing tests file will reduce the size of our package.

@thesamesam
Copy link
Contributor

See also the discussion in #54907.

@jbsilva
Copy link

jbsilva commented Sep 25, 2023

Making docs and tests optional would be great.
In my cloud deployments I repackage it without the tests; 15 MB do make a difference for me.
I've seem many other packages including tests, but never that big.

@dolfinus
Copy link

dolfinus commented May 30, 2024

I was checking the size of one of my docker images, and found that tests are about 50% of the size of installed package:
изображение

Completely waste of space for me.

@jonas-w
Copy link

jonas-w commented Aug 3, 2024

According to https://pypistats.org/packages/pandas the package has 240 million downloads per month.

Now if the 32MB tests folder from the package would be removed, the package size would be halved. Currently the wheels are roughly 13MB large, so let's say the wheels would be 7MB after removing the tests, then pypi would save ~1.7 Petabyte of Bandwidth per Month, and could have saved roughly ~90 Petabytes of traffic since this issue was opened...

@takluyver
Copy link
Contributor

I just noticed that on one of our filesystems, which is not set up for lots of small files, pandas' tests end up taking 300 MB of installed size.

Would it work as an initial step to make a script which splits the tests out of the wheels to be uploaded, and makes separate pandas-test wheels which can be uploaded separately? This is obviously not the most elegant way to do it, but I think I can see more or less how to make that work, whereas I'm not sure I can commit the time to figuring out how to rework pandas' build scripts and CI config to produce & use two separate wheels.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Build Library building on various platforms Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging a pull request may close this issue.