-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Suggestion: remove tests from the distribution #30741
Comments
I think we've talked about that in the past. We do have A couple options:
Just as a note: we do exclude the test data files that are present in the git repository. So we're only talking about source files. |
Hello @TomAugspurger
It looks like docs are not a part of the distribution.
And what is the reason of having tests as a part of an API? I see that numpy, scipy, matplotlib etc are doing the same (while many other libs, especially web-oriented like flask, requests, jinja don't)? |
Thanks for correcting me on the docs stuff.
I'm not sure why all these projects tend to include tests. As a guess, I
suspect that we have more issues related to how the packages are built
(compilers, flags, libraries we link to) than they do.
…On Mon, Jan 6, 2020 at 1:56 PM Vladimir Filimonov ***@***.***> wrote:
Hello @TomAugspurger <https://github.com/TomAugspurger>
pandas-slim sounds like an good workaround.
It looks like docs are not a part of the distribution.
And right it's tests code only without the data files - in terms of size
they are second to _libs and almost equal to the rest of the code.
8.0K ./arrays
16K ./errors
24K ./api
40K ./__pycache__
76K ./compat
88K ./_config
248K ./tseries
308K ./util
440K ./plotting
2.3M ./io
6.7M ./core
17M ./tests
20M ./_libs
And what is the reason of having tests as a part of an API? I see that
numpy, scipy, matplotlib etc are doing the same (while many other libs,
especially web-oriented like flask, requests, jinja don't)?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#30741?email_source=notifications&email_token=AAKAOIUZLK4U3RLNROXYKADQ4OEHPA5CNFSM4KDF2MLKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEIGS7KI#issuecomment-571289513>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAKAOIR5SCPXJ5N7HFOBDRDQ4OEHPANCNFSM4KDF2MLA>
.
|
I particularly used to run the tests for numpy, scikitlearn and matplotlib after installing, since at times I'd have them fail on Windows. However this was quite some time ago, 4 years ago perhaps. Perhaps other users were doing the same? |
I'm a package manager for nixpkgs, and I'm against removing tests from the sdist package, however, removing from the wheel would make sense from a packaging standpoint. It's considered best practice in FOSS that if you distribute source, you also distribute tests along side it. EDIT: We could also checkout the github repo for tests. However, quickly looking at the setup.py, pandas was meant to have the CI set correct metadata such as version. So the version-controlled source can't be directly be used to package pandas. |
Hi. i reported this elsewhere, so I'm pasting my comment here. My use case Suggestion Regarding
That perfectly fine. The discussion is whether tests are bundled with the Since you are now almost releasing 1.0 it might be a bit short notice to include this is such a big release. But for the next major release, it could work. Thank you very much for your attention. |
A small remark: as a part of recent commit to pyarrow @wesm removed In case of pandas tests folder contributed (as of version 1.0.3) tests folder contributed 17.9 MB out of 49 MB installed size. So I'd like to bring the question back to the discussion and perhaps, @wesm could comment on that? |
I think it would be a good idea to not ship the tests in wheels. If you want users to be able to run the tests against their production installs perhaps the tests can be packaged as a separate source wheel. Install size is becoming a problem because of size constraints in things like AWS Lambda. |
https://uwekorn.com/2020/10/28/trimming-down-pyarrow-conda-2-of-x.html has some information. I think I've come around to the idea that we can just not ship the test files in the main
We could even update |
In rasterio/rasterio-wheels#59, I'm requesting some kind of shared-libs solution that might help the python package ecosystem for binary wheels and a similar request could be made for shared testing libs. In the context of testing, it might/could help to have separate |
pandas does not use numpy.testing a all |
I don't use pandas at all, but I saw the link from the numpy issue. It's worth noting that with implicit namespace packages you could technically ship |
For the conda side, @xhochy did this for pyarrow here conda-forge/arrow-cpp-feedstock#186, so that could be inspiration for doing the same for the pandas conda package. |
It sounds like maintainers here are generally supportive of cutting tests and docs out of the package. I recently went through this exercise in LightGBM (microsoft/LightGBM#3639, microsoft/LightGBM#3685), and I'd be happy to do the work here to reduce the package size. Would the maintainers here be open to PRs if I started working on this? Details & BackgroundI think Wes's comment about AWS Lambda is an important one: #30741 (comment). A few months ago, I worked on trying to use TEMP_LAYER_DIR=$(pwd)/temp-layer-dir
mkdir -p ${TEMP_LAYER_DIR}
echo "Creating 'pandas-layer'"
docker run \
--rm \
-v ${TEMP_LAYER_DIR}:/var/task \
lambci/lambda:build-python3.7 \
pip install --no-deps -t python/lib/python3.7/site-packages pandas==0.24.1 pytz
pushd ${TEMP_LAYER_DIR}
zip \
--exclude \*/tests/\* \*dist-info\* \*/__pycache__\* \
-r pandas-layer.zip \
python
mv *.zip ${UPLOAD_DIR}/
rm -r python I think a first step that wouldn't need to involve changes to wget https://files.pythonhosted.org/packages/75/bc/abf2e8cc6a9d918008774b958613cfdbd3a8c135cffb0757f78fabd8268f/pandas-1.2.0.tar.gz -O pandas.tar.gz
tar -xvf pandas.tar.gz
cat pandas-1.2.0/pandas.egg-info/SOURCES.txt | grep -E "^doc" Thanks for your time and consideration! |
yup would take PRs here |
Thanks! I've opened #38846 with a proposal to remove docs. It reduces the size of the sdist package by about 1MB. I think that removing tests needs to be done by maintainers, though, if it will involve distributing a new I really hope that is picked up, as it would have a significant impact on the size of the package.
python setup.py sdist > /dev/null
du -a -h dist/ |
@jameslamb Can you check your calcs on the difference if tests are removed? I think you included the folder |
I just tried removing that directory completely before running the size checks. The results are the same # check size
rm -rf pandas.egg-info
rm -rf dist
python setup.py sdist
du -a -h dist/
# check size without tests/io/sas/data/
rm -rf pandas/tests/io/sas/data/
rm -rf pandas.egg-info
rm -rf dist
python setup.py sdist
du -a -h dist/ I also tried this with with
with
When I check This is the expected behavior. I think all the data files you're referring to will be excluded by these rules.
It is interesting that the new compressed size I'm seeing is different from when I ran these checks a week ago. But this is a very active project so I'm assuming that that's related to some new PRs that have been merged. |
@jameslamb I was surprised that in your original numbers, with the uncompressed version, the result lowered from 19M to 7.7M, so that's why I thought the SAS data was the culprit. If I look at my installation of pandas 1.1.3, the uncompressed installed tests directory contains just over 9MB. Still justifies figuring out if we can leave it out of the dist. |
Providing an update here, it seems that distributions of On my laptop (running Python 3.7 and Ubuntu 18.04), running the following # install pandas 1.2.3
pip install --upgrade pandas
# check sizes
du -a ${HOME}/miniconda3/lib/python3.7/site-packages/pandas \
| sort -n -r \
| head -n 10 results in
The tests directory now looks to be about 35MB uncompressed. I would help with this, but I think that removing tests from |
I'm also +1 on removing tests from the distribution (and optionally pulling them down on demand for |
agree we should try to package separately but this would take some dedicated work |
Hi, I was following along with PR #38846 which was recently merged, and I wanted to discuss the idea that I repeatedly saw @WillAyd bring up: specifically the idea of implementing a Basically, the problem is that dependency specifications only allow adding dependencies, not subtracting them. This is reflected in the actual PEP 508 language (emphasis mine):
There an ongoing discussion on the Python Software Foundation discourse about how to implement dependency specifications that can be opted out of. In the context of solutions for pandas, this means that the only way to give users an option of not having a dependency is to leave them out of the base distribution, and then having them be opt-in. I'd also like to note that my understanding is that PEP 508 dependency specifications are for dependencies on other Python packages, and usually not about including additional package data. So it's not even clear to me how one would implement a |
One potential difficulty before implementing this: We note for EA authors that there is a set of test that can be run to check EA compatibility: https://pandas.pydata.org/docs/development/extending.html#testing-extension-arrays So if tests were removed from the distribution, a subset of |
Splitting the pandas library and tests would be really useful. We are using this library in our serverless deployment. and there is size restriction to upload the package into AWS lambda of 250 MB. Removing tests file will reduce the size of our package. |
See also the discussion in #54907. |
Making docs and tests optional would be great. |
According to https://pypistats.org/packages/pandas the package has 240 million downloads per month. Now if the 32MB tests folder from the package would be removed, the package size would be halved. Currently the wheels are roughly 13MB large, so let's say the wheels would be 7MB after removing the tests, then pypi would save ~1.7 Petabyte of Bandwidth per Month, and could have saved roughly ~90 Petabytes of traffic since this issue was opened... |
I just noticed that on one of our filesystems, which is not set up for lots of small files, pandas' tests end up taking 300 MB of installed size. Would it work as an initial step to make a script which splits the tests out of the wheels to be uploaded, and makes separate pandas-test wheels which can be uploaded separately? This is obviously not the most elegant way to do it, but I think I can see more or less how to make that work, whereas I'm not sure I can commit the time to figuring out how to rework pandas' build scripts and CI config to produce & use two separate wheels. |
Would it make sense to remove
tests
folder from the pandas distribution? It takes roughly 33% of the whole package weight.It is especially important when using pandas inside the AWS Lambdas, where the deployment package size is limited to 50 MB zipped and 5 MB might really make a difference.
The text was updated successfully, but these errors were encountered: