Python Popularity Dashboard Updates! #3571
Comments
Thanks for working on this, @balajialg! I think these lists actually illustrate why using this data to prune libraries is possible but takes a bunch of work and needs to be carefully considered - transitive dependencies. For example, if we take the biology hub and remove I think the easy use of the dashboard is to ask 'can I remove this one specific package that I want to remove for other reasons?'. It can definitely be used to prune the list of installed packages, but it will need more work to identify which can be removed and which can't be. |
@balajialg What yuvi is expressing was my concern as well. Just because a package isn't popular doesn't mean it isn't required by something else that is. Like I see conda for data100, I'd question removing that one as well! |
@yuvipanda Does the popcon support track just the explicit imports by user code, or does it also track the imports that get pulled in under the hood? I thought it was hooking into the import mechanism at a low level so was catching everything, but given that okpy uses requests, it sounds like I've got that wrong. |
@yuvipanda I did not realize that transitive dependencies were going to be a huge challenge for this effort. Thanks for pointing it out! With the context you shared, a few questions I have are,
|
@ryanlovett it pulls in all imports, transitive or otherwise. @balajialg Are you talking about 'installation', which happens only once in the image when we build it, or 'use', which is what this dashboard is tracking? I think the way to use this dashboard is:
So the dashboard can be used to determine if a package we want to remove can be removed, but cannot be used to 'generate a list of packages to be removed'. |
A pruning process would look like:
|
@yuvipanda Thanks for the detailed pruning process! I referred to the term "use" - how many times Python libraries were imported in a notebook specific to a hub. Given the context that almost all the installed packages were used at least once in the past 6 months across all the hubs, what is that specific non-zero number for usage that we can safely assume (safely - devoid of transitive dependencies) as the threshold for removal? If we strongly believe that it should be 0, I am not sure whether this process will have any impact on the docker image size. |
I don't think this is true - I found no uses of |
@yuvipanda oh wow, that changes our approach drastically. Now, this list should be compiled based on the differences between our docker image and the popularity dashboard, right? |
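As a rough sketch of that difference: given the set of packages installed in an image and the set of names that appear in the dashboard, the candidates are the installed packages that never show up. The function name `never_imported` and the sample data below are hypothetical, and the sketch assumes both sets use the same naming. In practice a distribution name can differ from its import name (for example `scikit-learn` is imported as `sklearn`), so a mapping step would be needed first.

```python
def never_imported(installed, observed):
    """Return installed package names that never appear in the observed imports.

    Both arguments are iterables of names; comparison is case-insensitive.
    Assumes install names and import names match, which is not always true.
    """
    observed_lower = {name.lower() for name in observed}
    return sorted(pkg for pkg in installed if pkg.lower() not in observed_lower)


# Hypothetical data, not taken from the real dashboard or any hub image.
installed_in_image = {"numpy", "pandas", "conda", "requests"}
seen_on_dashboard = {"NumPy", "pandas"}

candidates = never_imported(installed_in_image, seen_on_dashboard)
print(candidates)  # ['conda', 'requests']
```

Each candidate from a list like this would still need a check for other installed packages that depend on it before it is dropped from the image.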
Summary
Following our last sprint planning meeting, I spent some time figuring out how to use the Python popularity dashboard to make recommendations for Docker image updates. I filtered the dashboard data for the last 6 months with the intention of identifying the packages that were least used across all the hubs during the Spring and Summer semesters.
I realized that we have amazing package installation data for the following hubs - i) Datahub, ii) Data 8, iii) Data 100, iv) Data 102, v) Biology, vi) Julia, vii) D-Lab, and viii) Prob 140 hubs. However, the following hubs do not have their package installation data displayed in the dashboard - i) Astro, ii) EECS, iii) High School, iv) ISchool, v) Stat 159, and vi) Stat 20 hubs. One recommendation is to fix the dashboard to reflect the data associated with these hubs.
I wanted to generate a list of packages that had fewer than 5 installations during the past 6 months, meaning they were rarely used as part of any assignment. Below are the least used packages for the hubs whose data I could access.
Edit:
Focus will be on identifying packages that are not listed as part of the Python Popularity Dashboard as they would have had 0 installs!
Biology Hub:
Data 100 Hub
Data 102 Hub
Data 8 Hub
Datahub
Tasks to complete