Course materials for Data Science at Scale
Sign up for a new GitHub account
Fork the dsc402 repository into your new account. Note: this creates a copy of the course repo in your own account that you can add to and work on.
Go to https://github.com/lpalum/dscc202-402-spring2022 and click the Fork button while you are logged in to your GitHub account. Then clone your fork to your machine:
git clone https://github.com/[your account name]/dscc202-402-spring2022.git
Note: you may want to clone this repo into a directory on your machine that you use to organize code, e.g. /home//code/github
Note: on Mac OS X, use /Users/[your account name] in place of /home/[your account name], since that is where home directories live on that system.
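The clone step and the two path notes above can be sketched as a small shell helper. This is only an illustration: the names CODE_DIR, fork_url, and clone_fork are made up here and are not part of the course repo. It assumes that $HOME points at your home directory, which is normally /home/[your account name] on Linux and /Users/[your account name] on a Mac, so the same lines work on both.

```shell
# Illustrative sketch; CODE_DIR, fork_url and clone_fork are hypothetical names.
# $HOME is normally /home/<user> on Linux and /Users/<user> on macOS.
CODE_DIR="${CODE_DIR:-$HOME/code/github}"

# Build the clone URL of your fork from a GitHub account name.
fork_url() {
  printf 'https://github.com/%s/dscc202-402-spring2022.git\n' "$1"
}

# Create the code directory if needed, then clone the fork inside it.
clone_fork() {
  mkdir -p "$CODE_DIR" && (cd "$CODE_DIR" && git clone "$(fork_url "$1")")
}
```

Nothing is cloned until you call the function yourself, for example clone_fork followed by your GitHub account name.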
Databricks Community Edition FAQ
Note: you will also receive an email invitation to the shared class Databricks workspace, which is where you will do your group final project.
Here is some helpful information about importing archives into the Databricks environment: https://docs.databricks.com/notebooks/notebooks-manage.html#import-a-notebook
Import the DBC archive from the Learning Spark v2 GitHub repository into your account. DBC Archive
Install Docker on your computer
Pull the all-spark-notebook image from Docker Hub:
https://hub.docker.com/r/lpalum/dsc402
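Before pulling, it can help to check whether Docker is installed and whether the image is already on your machine. The sketch below is one possible way to do that. The image name lpalum/dsc402 is read off the Docker Hub URL above, and image_status is a hypothetical helper name. The script only reports a state and prints the pull command; it does not download anything itself.

```shell
# Image name taken from the Docker Hub URL above.
IMAGE="lpalum/dsc402"

# Report one of three states without changing anything on the machine.
# image_status is an illustrative helper name, not part of the course repo.
image_status() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker-missing"
  elif docker image inspect "$IMAGE" >/dev/null 2>&1; then
    echo "image-present"
  else
    echo "image-missing"
  fi
}

case "$(image_status)" in
  docker-missing) echo "Docker is not on your PATH; install it first." ;;
  image-present)  echo "$IMAGE is already available locally." ;;
  image-missing)  echo "Image not found locally (or the Docker daemon is not running). Next step: docker pull $IMAGE" ;;
esac
```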
Launch a container from the Docker image to open a Jupyter Lab instance in your local browser:
docker run -it --rm -p 8888:8888 --name all-spark --volume /home/[your account name]/code/github:/home/jovyan/work lpalum/dsc402 start.sh jupyter lab
This starts a Jupyter Lab instance on your machine that you can access on port 8888 in your browser, and it mounts the directory holding the GitHub repo that you previously cloned into the container's working directory (/home/jovyan/work).
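The launch command above can be wrapped so that the host path does not have to be edited by hand. This is a sketch, not the course's official script. HOST_DIR and launch_lab are hypothetical names, and it assumes you cloned the repo under $HOME/code/github as suggested earlier. The docker run arguments are the same ones shown in the command above.

```shell
# Host directory that holds the cloned repo (hypothetical variable name).
# $HOME is normally /home/<user> on Linux and /Users/<user> on macOS.
HOST_DIR="${HOST_DIR:-$HOME/code/github}"

# Start the container with the same options as the command above.
# launch_lab is an illustrative name; the function runs only when you call it.
launch_lab() {
  docker run -it --rm -p 8888:8888 --name all-spark \
    --volume "${HOST_DIR}:/home/jovyan/work" \
    lpalum/dsc402 start.sh jupyter lab
}

echo "Will mount ${HOST_DIR} at /home/jovyan/work"
```

Call launch_lab to start the container, then open the address that Jupyter prints in the terminal. Because of the --rm flag, the container is removed when you stop it, but files under the mounted directory stay on your machine.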