This repo contains the Dockerfile used to build the notebook image as well as the notebooks used in the MSD-LIVE deployment. The image is rebuilt whenever changes are pushed to the main or dev branch.

The data folder is too big to check into GitHub, so you will have to pull it from S3 (instructions below) if you want to test locally.
- Create a new git repo:
  - The repo must be in the MSD-LIVE git org.
  - Select `template-project-jupyter-notebook` as the repository template.
  - The repo name must start with `jupyter-notebook-`. The domain of this notebook when running on MSD-LIVE's Jupyter services will be whatever comes after that prefix (e.g. the cerf repo is named `jupyter-notebook-cerf`, and the URL to its notebooks hosted by MSD-LIVE is https://cerf.msdlive.org).
  - The repo must be public.
- Find/replace `jupyter-notebook-tethys` in docker-compose and `jupyter-notebook-tethys` in this readme with your repo name.
- Set the `PROJECT` environment variable in GitHub:
  - After the repo has been created, go to Settings in the GitHub UI, click Secrets and variables in the left sidebar, and select Actions.
  - Click the Variables tab, then click the green New repository variable button.
  - For Name enter `PROJECT`; the value should be a project in MSD-LIVE such as IM3 or GCIMS (the notebook will fail to launch from MSD-LIVE's services if this is not set).
- You may need to modify the `.gitignore` if your notebooks include config files or images.
- Your Dockerfile needs to:
  - Extend one of our base images:

    ```dockerfile
    FROM ghcr.io/msd-live/jupyter/python-notebook:latest
    FROM ghcr.io/msd-live/jupyter/r-notebook:latest
    FROM ghcr.io/msd-live/jupyter/julia-notebook:latest
    FROM ghcr.io/msd-live/jupyter/base-panel-jupyter-notebook:latest
    ```

  - Copy in the notebooks and any other files needed in order to run. When the container starts, everything in the `/home/jovyan` folder will be copied to the current user's home folder:

    ```dockerfile
    COPY notebooks /home/jovyan/notebooks
    ```
- Containers extending one of these base images will have a `DATA_DIR` environment variable set; its value will be the path to the read-only staged input data, i.e. `/data`. A symbolic link named `data` pointing to `/data` will also be created in the user's home folder when the container starts.
- Notebook implementations should look for the `DATA_DIR` environment variable and, if set, use that path as the input data location instead of downloading the data. For an example of this see this example.
- Some notebook libraries expect data to be located within the package. For this, feel free to add a symbolic link from `/data` to the package via the Dockerfile. Here is an example of doing that:

  ```dockerfile
  RUN rm -rf /opt/conda/lib/python3.11/site-packages/cerf/data
  RUN ln -s /data /opt/conda/lib/python3.11/site-packages/cerf/data
  ```
- Your repo's dev branch builds the image and tags it with 'dev'; the main branch tags the image with 'latest'.
- After the initial build, go to MSD-LIVE's packages in GitHub, click on your package, click on settings to the right, scroll to the bottom of the settings page, and make sure 'package visibility' is set to public (the notebook will fail to launch from MSD-LIVE's services if it is not).
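The `DATA_DIR` convention described above can be handled with a small guard at the top of a notebook. This is only a sketch: the function name and the `./data` fallback are hypothetical, and the download step would be whatever your notebook already does.

```python
import os
from pathlib import Path


def resolve_input_dir(fallback: str = "./data") -> Path:
    """Prefer MSD-LIVE's staged read-only data when it is available."""
    staged = os.environ.get("DATA_DIR")
    if staged:
        # Running on MSD-LIVE: use the pre-staged /data mount.
        return Path(staged)
    # Running elsewhere (e.g. locally): the notebook is responsible for
    # downloading its input data into this folder first.
    return Path(fallback)
```

A notebook would then read all inputs from `resolve_input_dir()` and only trigger its download logic when `DATA_DIR` is unset.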
Here are some ways to add specific behaviors for notebook containers. Note these are advanced use cases and not necessary for most deployments.
- Project notebook deployments can include a plugin to implement custom behaviors, such as copying the input folder to the user's home folder when it cannot be read-only. Here is an example of this behavior, but it essentially boils down to these steps:
  - The Dockerfile needs to copy in and install the extension:

    ```dockerfile
    COPY msdlive_hooks /srv/jupyter/extensions/msdlive_hooks
    RUN pip install /srv/jupyter/extensions/msdlive_hooks
    ```

  - The extension's setup.py uses entry_points so the plugin is discoverable by MSD-LIVE.
  - The implementation removes the 'data' symlink from the user's home folder and copies the data in from /data instead.
- Deployments can include a service to run within the notebook container. See this example of how a database (BaseX) is started via the container's entry point.
- Deployments can include a service proxied by Jupyter in order for it to have authenticated web access. See the proxy docs here and MSD-LIVE notes about its use here.
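The "copy instead of symlink" hook described above could look roughly like the sketch below. The function name and signature are hypothetical; the real plugin is wired in through the entry points registered in its setup.py.

```python
import shutil
from pathlib import Path


def stage_writable_data(home: Path, staged: Path) -> None:
    """Replace the read-only 'data' symlink with a writable copy of /data."""
    target = home / "data"
    if target.is_symlink():
        # Drop the symlink the base image created at container start.
        target.unlink()
    if not target.exists():
        # Copy the staged input data so the notebook can modify it.
        shutil.copytree(staged, target)
```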
- Get the data (requires `.aws/credentials` to be set or use of AWS access tokens; see the next section on how to get and use them):

  ```shell
  # make sure you are in the jupyter-notebook-tethys folder
  mkdir data
  cd data
  aws s3 cp s3://jupyter-notebook-tethys-notebook-bucket/data . --recursive
  ```
- Start the notebook via docker compose:

  ```shell
  # make sure you are in the jupyter-notebook-tethys folder
  cd ..
  docker compose up
  ```
- An MSD-LIVE developer will have to follow the steps here to add this as a new project notebook deployment (optionally to dev) in the prod config file.
- Once added, there will be an s3 bucket that this notebook's input data will need to be uploaded to. The folder uploaded to the bucket must be named 'data'.
- Data in the s3 bucket gets populated in one of these ways:
- Send the data or a link to it to an MSD-LIVE developer, who can use the AWS S3 console to upload it to the bucket.
- An MSD-LIVE developer can create AWS access tokens for the IAM user created when this project notebook deployment was added and securely send those tokens to the data owner to use to upload to the bucket. Links to AWS's CLI documentation that will be helpful:
- How to use the access key: Authenticating using IAM user credentials for the AWS CLI - AWS Command Line Interface https://docs.aws.amazon.com/cli/latest/userguide/cli-authentication-user.html#cli-authentication-user-configure.title (enter us-west-2 for the default region name)
- How to upload files: Using high-level (s3) commands in the AWS CLI - AWS Command Line Interface https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html#using-s3-commands-managing-objects-sync
- How to delete files (or use the sync command with --delete as shown in previous link): Using high-level (s3) commands in the AWS CLI - AWS Command Line Interface https://docs.aws.amazon.com/cli/latest/userguide/cli-services-s3-commands.html#using-s3-commands-delete-objects
- Note: it may take up to one hour for the data to be available to the notebook. Optionally, an MSD-LIVE developer can manually trigger the project deployment's DataSync task to run right away.
- Dev project notebook deployments are only available internally on the PNNL domain. If not on site at PNNL, you must be on the PNNL / Legacy PNNL VPN.
- When logging in to the notebook, you must use the credentials of a user registered on the DEV MSD-LIVE site (msdlive.dev.org).
- Deployment to dev follows the same steps as above, but changes are made to the dev config file and files are uploaded to the dev bucket.