Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding the option to delete persisted user data as well. #8

Open
1kastner opened this issue Jun 15, 2020 · 22 comments
Open

Adding the option to delete persisted user data as well. #8

1kastner opened this issue Jun 15, 2020 · 22 comments
Labels
enhancement New feature or request

Comments

@1kastner
Copy link
Contributor

Proposed change

At https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/ recently it was discussed that in some settings the persisted data of a user should also eventually be removed. This could be integrated into this service. I am really not sure whether it should because it is spawner-specific or even configuration-specific how to delete the persisted user data.

Alternative options

We could say that these concerns should be addressed separately and a second service could be created. The chances are pretty high that there would be more code duplication though to identify user accounts that exceed a certain age and haven't been used for a while.

Who would use this feature?

This is reasonable for settings with temporary users, e.g. mybinder or weekend seminars. You are sure you want to delete their data after some point.

@1kastner 1kastner added the enhancement New feature or request label Jun 15, 2020
@manics
Copy link
Member

manics commented Jun 15, 2020

Together with #4 it almost sounds like we want a jupyter-admin cron utility. Perhaps with plugins for each function?

@1kastner
Copy link
Contributor Author

For me that sounds like a reasonable approach. General plugins could be maintained in separate repositories to keep this clean. I hope from such a plugin I could still somehow reach the JupyterHub configuration so that the plugin can look up the details, such as which user "owns" which docker volume that should be deleted.

Regarding #4, in some cases it could make sense to link the plugin with the data source of the authenticator since that is a place e.g. email addresses are also stored. But again this might be very configration-specific. An alternative would be to have a separate plugin configuration file. From what I have seen until now I guess it violates the design principles of one centralized JupyterHub configuration though? Here I have a lack of experience with the JuypterHub design philosophy.

@meeseeksmachine
Copy link

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/10

@1kastner
Copy link
Contributor Author

1kastner commented Jul 1, 2020

At https://jupyterhub.readthedocs.io/en/stable/reference/services.html I checked that a service can not access the loaded JupyterHub configuration. I see three options out there and I don't like any of them:

  1. Re-load the JupyterHub configuration inside this service to get the information regarding user-specific persisted data
  2. Maintain a separate cleaning scripts which leads to duplicated configuration (what is the user's prefix? ...)
  3. Refactor JupyterHub so that they can access the configuration and remove the previously constructed isolation

Any ideas on this?

@manics
Copy link
Member

manics commented Jul 1, 2020

The JupyterHub API lets you obtain state: object. Arbitrary internal state from this server's spawner. Only available on the hub's users list or get-user-by-name method, and only if a hub admin. None otherwise. for each of a user's servers:
https://jupyterhub.readthedocs.io/en/stable/_static/rest-api/index.html#/definitions/Server

There's also a discussion about making auth_state from the Authenticator accessible too:
jupyterhub/jupyterhub#1704

Do you think the combination of these would allow a service to request the necesary information?

@1kastner
Copy link
Contributor Author

1kastner commented Jul 1, 2020

In my example configuration here the important part regarding the Docker volumes is listed as such:

notebook_dir = '/home/jovyan/'
c.DockerSpawner.notebook_dir = notebook_dir
c.DockerSpawner.volumes = { 'jupyterhub-user-{username}': notebook_dir }

The data stored in c.DockerSpawner.volumes would be sufficient for my purpose of deleting docker volumes now but I can't speak for k8s configurations. For that I would need a partner on that side.

@1kastner
Copy link
Contributor Author

1kastner commented Jul 1, 2020

So let's see if I got you right here: You suggest the DockerSpawner tells via the server state which docker volume belongs to it (in my case jupyterhub-user-{username}) and this (already serliaized) information is added to the auth_state but is only visible if you are admin (which the JupyterHub service is).

Once the JupyterHub service has the information jupyterhub-user-user0, it can also run docker volume rm jupyterhub-user-user0 and everything is fine and solved! So yes, that is a viable solution.

@manics
Copy link
Member

manics commented Jul 1, 2020

Something like that! But this is outside my knowledge of JupyterHub, @minrk will know better.

@rkdarst
Copy link
Contributor

rkdarst commented Jul 1, 2020 via email

@1kastner
Copy link
Contributor Author

1kastner commented Jul 1, 2020

@rkdarst thank you so much for sharing that!

@meeseeksmachine
Copy link

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/13

@1kastner
Copy link
Contributor Author

1kastner commented Jul 2, 2020

Actually the server state is alreay published, see this code - thanks to all participants of this conversation to arrive there! Since the generic spawner api does not prescribe any spawner-specific content and the docker spawner does only add a little information, this only needs some additional information regarding the name of the docker volumes as presented before (see top).

Due to this input and also due to the discussion in the forum, I would suggest that I first create my personal variation of the dockerspawner that shares the information I need (i.e. the docker volume name) and second create a copy of this service using a different name. That service would only remove docker volumes if they are expired for a long time (we might want different times for culling idle Jupyter Notebooks and deleting whatever data the user believed we would persist for them).

For the long term, a plugin architecture as mentioned by manics sounds great to me too!

@rkdarst
Copy link
Contributor

rkdarst commented Jul 15, 2020

A plugin system for this somehow seems like a lot of work, since to me JupyterHub can already take plugins. But the difficulty of making a new service is too much, there is a lot of boilerplate. Imagine if there was...

  • A JupyterHub service library, that had the core event loop and periodic polling
  • Could be subclassed
  • Override methods to determine what happens on each poll. A method gets user and server (which includes auth_state, state, etc).
  • Then, a new service can be a single file that imports this library and does what it needs to. The service is started like any other service - python my_service.py - without needing to figure out an extra layer of a separate cron utility ("plugin" makes me think entrypoints, which is a whole other layer to think about, and then does it have to be separately installed?).

Of course I am biased since I have a working system... and don't have time implement what I am suggesting. But, I will work to use and debug whatever is implemented, if it doesn't add too many extra layers.

@1kastner
Copy link
Contributor Author

That sounds fine to me as long as I get the mentioned work done!

@manics
Copy link
Member

manics commented Jul 16, 2020

I think a service library also works, and even if we moved to a plugin architecture I expect the plugins would want this library anyway. What does everyone think about developing the library in this repo, then perhaps moving it to its own repo or JupyterHub core after it's had some production use?

@1kastner
Copy link
Contributor Author

From your quick explanation I have many detail-related conceptual questions, e.g. some visualizations might help etc.

Regarding your development plan I am not sure. Why do you think it should move back to the core? I did not mean to split the community - I would rather like to see that the code from this repository evolves step-by-step including backwards-compability. Do you think that is too difficult either in technical or project-administrative terms?

@manics
Copy link
Member

manics commented Jul 16, 2020

Ignore me! I misunderstood "A JupyterHub service library, that had the core event loop and periodic polling" as wanting a library in core JupyterHub. I'm completely happy for it to remain separate 🙂

@manics
Copy link
Member

manics commented Jul 16, 2020

Also would it be OK to keep the design discussions on one issue? Either this or #9, we can rename the issue title as necessary to make it clearer.

@1kastner
Copy link
Contributor Author

I am fine with that at #9 we can discuss some general design issues and here we discuss how this can be used for the purpose I have mentioned in the beginning of this issue, likewise #4 can pick up the results from #9 when implementing it. Therefore, i guess this issue might get less attention for a while until a common conceptual approach is found..

@AtulSinghBankoti
Copy link

Hi,

Is their any fix update for this issue?

We are also facing similar issue discussed here.
We have jupyterhub deployed in k8s cluster & we are using EFS(Elastic File System) as PV(Persistent Volume).
When we delete a jupyterhub user using admin panel. User is deleted but the PVC(Persistent Volume Claims) associated with that
user are not getting deleted. If we create a new user with same name, the PVC are getting attached to new user.

@meeseeksmachine
Copy link

This issue has been mentioned on Jupyter Community Forum. There might be relevant details there:

https://discourse.jupyter.org/t/a-cull-idle-user-service-that-deletes-pvs/4742/16

@ruchirkakkad
Copy link

Hi,

I do not want to remove admin user in Jupyterhub by cull. I have implemented Jupyterhub using helm chart.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants