Python Popularity Dashboard Updates! #3571

Open · 2 tasks
balajialg opened this issue Aug 12, 2022 · 9 comments
Labels: enhancement (Issues around improving existing functionality)

balajialg (Contributor) commented Aug 12, 2022

Summary

Following our last sprint planning meeting, I spent some time figuring out how to use the Python Popularity Dashboard to make recommendations for Docker image updates. I accessed the dashboard and filtered the data for the last 6 months, with the intention of identifying the packages least used across all the hubs during the Spring and Summer semesters.

I realized that we have amazing package installation data for the following hubs: i) Datahub, ii) Data 8, iii) Data 100, iv) Data 102, v) Biology, vi) Julia, vii) D-Lab, and viii) Prob 140. However, the following hubs do not have their package installation data displayed in the dashboard: i) Astro, ii) EECS, iii) High School, iv) ISchool, v) Stat 159, and vi) Stat 20. One recommendation is to fix the dashboard so it reflects the data for these hubs.

I wanted to generate a list of packages that had fewer than 5 installations during the past 6 months, meaning they were rarely used as part of any assignment. Below are the least used packages across the hubs for which I had access to the data.

Edit:
The focus will be on identifying packages that are not listed in the Python Popularity Dashboard at all, as those would have had 0 installs!
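
For reference, a minimal sketch of the filtering step, assuming the relevant dashboard panel can be exported to CSV; the file name popularity_export.csv and the columns package and imports_6mo are hypothetical placeholders for whatever the real export contains:

```python
# A sketch only: list rarely used packages from a hypothetical CSV export
# of the popularity dashboard (file name and column names are assumptions).
import pandas as pd

usage = pd.read_csv("popularity_export.csv")     # hypothetical export file
rarely_used = usage[usage["imports_6mo"] < 5]    # the <5 threshold used above
print(rarely_used.sort_values("imports_6mo")["package"].tolist())

# Packages with 0 recorded imports never appear in the dashboard at all,
# so they have to be found by diffing against the image's package list instead.
```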

Biology Hub:

| Package installed | Times installed in the past 6 months |
| --- | --- |
| requests | 2 |
| protobuf | 2 |
| opt-einsum | 2 |
| keras | 2 |
| httplib2 | 2 |
| h5py | 2 |
| google-auth-httplib2 | 2 |
| google-auth | 2 |
| google-api-python-client | 2 |
| google-api-core | 2 |
| gast | 2 |
| flatbuffers | 2 |
| charset-normalizer | 2 |
| astunparse | 2 |
| absl-py | 2 |
| PySocks | 2 |
| Keras-Preprocessing | 2 |
| screed | 2 |
| bz2file | 2 |
| cryptography | 2 |
| certifi | 2 |
| cachetools | 2 |

Data 100 Hub

| Package installed | Times installed in the past 6 months |
| --- | --- |
| Pint | 2 |
| Babel | 2 |
| lyricsgenius | 2 |
| opencv-python | 2 |
| opencv-contrib-python | 2 |
| conda | 2 |
| prob140 | 2 |
| XlsxWriter | 2 |
| imbalanced-learn | 1 |
| featurewiz | 1 |
| category-encoders | 1 |
| HeapDict | 1 |
| pyarrow | 1 |
| uncertainties | 1 |
| Pint | 1 |
| Babel | 1 |
| lyricsgenius | 1 |
| dm-tree | 1 |
| langcodes | 1 |
| zict | 1 |
| lightgbm | 1 |

Data 102 Hub

| Package installed | Times installed in the past 6 months |
| --- | --- |
| chart-studio | 1 |
| retrying | 1 |
| tika | 1 |
| pdfplumber | 1 |
| wordcloud | 1 |
| psycopg2 | 1 |
| psycopg2-binary | 1 |
| timer | 1 |
| causalgraphicalmodels | 1 |
| plot-utils | 1 |
| huggingface-hub | 1 |
| tokenizers | 1 |
| transformers | 1 |
| ImageHash | 1 |
| htmlmin | 1 |
| missingno | 1 |
| multimethod | 1 |
| pandas-profiling | 1 |
| phik | 1 |

Data 8 Hub

| Package installed | Times installed in the past 6 months |
| --- | --- |
| networkx | 1 |
| PyYAML | 1 |
| TPOT | 1 |
| bokeh | 1 |
| deap | 1 |
| scikit-optimize | 1 |
| stopit | 1 |
| update-checker | 2 |
| wget | 2 |
| conda | 2 |
| gdflib | 2 |
| treelib | 2 |
| websockets | 2 |
| selenium | 2 |
| monty | 3 |
| pydantic | 3 |
| pymatgen | 3 |
| spglib | 3 |
| uncertainties | 3 |

Datahub

| Package installed | Times installed in the past 6 months |
| --- | --- |
| natsort | 1 |
| pandana | 1 |
| control | 1 |
| RateMyProfessorAPI | 1 |
| rectpack | 1 |
| pyxdf | 1 |
| jupyterthemes | 1 |
| lesscpy | 1 |
| en-core-web-lg | 1 |
| layoutparser | 1 |
| aspose-words | 1 |
| cpi | 1 |
| forestci | 1 |
| pyspark | 1 |
| databricks-connect | 1 |
| casadi | 1 |
| pg-ethics | 1 |
| googlemaps | 1 |
| rpy2 | 1 |

Tasks to complete

  • Fix the dashboard to display package installation data for all the hubs.
  • Remove the above-mentioned packages from the Docker image one hub at a time (unless there is strong opposition to the proposed path forward).
@balajialg balajialg self-assigned this Aug 12, 2022
@balajialg balajialg added the enhancement Issues around improving existing functionality label Aug 12, 2022
@balajialg balajialg assigned felder and unassigned balajialg Aug 12, 2022

yuvipanda (Contributor) commented Aug 22, 2022

Thanks for working on this, @balajialg!

I think these lists actually illustrate why using this data to prune libraries is possible but is a bunch of work that needs to be carefully considered: transitive dependencies. For example, if we take the Biology hub and remove requests, a lot of other packages will simply stop working, because they use it transitively for making HTTP requests. If you uninstall requests, okpy stops working and grading that relies on it comes to a halt. I only know that okpy uses requests because I was trying to fix something related (okpy/ok-client#473), but I'm sure there are plenty of other packages we don't know about. The same applies to at least the cryptography, certifi, and cachetools packages, and I'm sure more.

I think the easy use of the dashboard is to answer 'can I remove this one specific package that I want to remove for other reasons?'. It can definitely be used to prune the list of installed packages, but it will take more work to identify which packages can be removed and which can't.
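
One way to sanity-check a removal candidate for reverse dependencies is to scan the metadata of every installed distribution. This is only a sketch of a generic technique, not how the dashboard works; it sees declared dependencies only, so undeclared or optional runtime imports will not show up. It assumes the packaging library is available, which it normally is wherever pip is installed:

```python
# Which installed distributions declare a dependency on a given package?
from importlib.metadata import distributions
from packaging.requirements import Requirement

def reverse_dependencies(target):
    target = target.lower()
    dependents = set()
    for dist in distributions():
        for req in dist.requires or []:           # declared requirements, if any
            if Requirement(req).name.lower() == target:
                dependents.add(dist.metadata["Name"])
                break
    return sorted(dependents)

# Run inside a hub image to see which packages would break if "requests" went away.
print(reverse_dependencies("requests"))
```

A tool such as pipdeptree can produce a similar reverse view, again limited to declared dependencies.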

felder (Contributor) commented Aug 22, 2022

@balajialg What yuvi is expressing was my concern as well. Just because a package isn't popular doesn't mean it isn't required by something else that is. For instance, I see conda listed for Data 100, and I'd question removing that one as well!

ryanlovett (Collaborator) commented

@yuvipanda Does popcon track just the explicit imports in user code, or does it also track the imports that get pulled in under the hood? I thought it hooked into the import mechanism at a low level and so was catching everything, but given that okpy uses requests, it sounds like I've got that wrong.

balajialg (Contributor, Author) commented Aug 23, 2022

@yuvipanda I did not realize that transitive dependencies were going to be such a big challenge for this effort. Thanks for pointing it out! With the context you shared, a few questions I have are:

  1. Is there a way to track the packages "installed under the hood" (thanks @rylo for coining this term) as part of the Python Popularity Dashboard? If not, what other ways exist to find transitive dependencies? Is it trial and error, or is there a process to the madness?
  2. Considering the trade-off involved in accounting for transitive dependencies, should we prune the image based on our exploration, or just let the Docker image grow large (to the point where we realize that pruning is the only way forward, or have we already reached that stage)?

yuvipanda (Contributor) commented

@ryanlovett it pulls in all imports, transitive or otherwise.
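
A quick way to see this effect in a notebook is to diff sys.modules around a single explicit import; everything the package pulls in transitively shows up as well. This is just an illustration of why transitive imports get counted, not the dashboard's actual collection code:

```python
import sys

before = set(sys.modules)
import requests                      # one explicit import in "user code"
after = set(sys.modules)

# requests' transitive imports (urllib3, idna, certifi, ...) are now loaded too,
# so any counter that looks at loaded modules records them alongside requests.
print(sorted(name for name in after - before if "." not in name))
```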

@balajialg Are you talking about 'installation', which happens only once when we build the image, or 'use', which is what this dashboard is tracking?

I think the way to use this dashboard is:

  1. Look at a package we are explicitly installing in our image, and decide we want to remove it for some reason (I removed allensdk in Bump some base packages #3608, for example, because it required a very specific, fairly old version of pandas).
  2. Use the dashboard to determine whether anyone is actually using it, or whether it can be removed.

So the dashboard can be used to determine whether a package we want to remove can in fact be removed, but it cannot be used to 'generate a list of packages to be removed'.

yuvipanda (Contributor) commented

A pruning process would look like this (a rough sketch of steps 1, 3, and 4 follows the list):

  1. Look at https://github.com/berkeley-dsep-infra/datahub/blob/staging/deployments/datahub/images/default/requirements.txt
  2. Consider bunches of packages installed for specific classes.
  3. Investigate whether they have been used at all; a non-zero use count already complicates things.
  4. But if the use count is 0, the package can be removed.
  5. Repeat and see how much smaller our image gets!
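
A minimal sketch of steps 1, 3, and 4, assuming the dashboard's package list can be exported to a plain text file (dashboard_packages.txt is a hypothetical name, and real data would likely need extra normalization, since import names do not always match distribution names):

```python
# Compare the image's pinned requirements against packages seen in the dashboard.
import re

def requirement_names(path):
    names = set()
    for line in open(path):
        line = line.split("#")[0].strip()        # drop comments
        if not line or line.startswith("-"):     # skip pip options like -r / -e
            continue
        # keep only the distribution name; drop extras and version pins
        names.add(re.split(r"[\[<>=!~; ]", line, maxsplit=1)[0].lower())
    return names

pinned = requirement_names("deployments/datahub/images/default/requirements.txt")
used = {line.strip().lower() for line in open("dashboard_packages.txt") if line.strip()}

# Step 4: pinned packages with zero recorded use are removal candidates,
# still subject to the transitive-dependency caveats discussed above.
print(sorted(pinned - used))
```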

balajialg (Contributor, Author) commented Aug 23, 2022

@yuvipanda Thanks for the detailed pruning process! By "use" I was referring to how many times Python libraries were imported in notebooks on a given hub. Given that almost all the installed packages were used at least once in the past 6 months across all the hubs, what specific non-zero usage number can we safely assume (safely, i.e. without breaking transitive dependencies) as the threshold for removal? If we strongly believe it should be 0, I am not sure whether this process will have any impact on the Docker image size.

yuvipanda (Contributor) commented

> Given the context that almost all the installed packages were used at least once in the past 6 months across all the hubs

I don't think this is true - I found no uses of allensdk before I removed it, for example. Packages that aren't used just don't show up in the popularity dashboard.

balajialg (Contributor, Author) commented Aug 24, 2022

@yuvipanda Oh wow, that changes our approach drastically. So this list should be compiled based on the difference between what is in our Docker image and what appears in the popularity dashboard, right?
