It can be very useful to outsource processing to the cloud as it allows for easy horizontal and vertical scaling. Google Cloud Platform has all the necessary infrastructure to run Jupyter Notebooks in the cloud, from creating a clusterized server configuration of notebooks, to reading and writing data to a database based on Cloud SQL.
The following are instructions to run a (Python 3, Anaconda3) Jupyter Notebook Server using Google Cloud Platform's Dataproc (for clusterized processing) or Compute Engine (for normal processing), as well as Cloud SQL for storing data. A script is provided for uploading CSV data to Cloud SQL, as well as an example of how to query Cloud SQL into a Pandas DataFrame.
IMPORTANT NOTE: As an addendum to these instructions, the typical workflow sometimes involves large datasets that are not modified constantly. For this use case, Google BigQuery is a much faster alternative to Cloud SQL. A new section has been added with instructions to set up a BigQuery dataset and how to query it from Jupyter.
The steps are as follows:
- Setup project
- Setup Cloud SQL
- Setup Jupyter Notebook Server
- Setup SSH tunnel to instance
- Connect to the Jupyter Notebook Server
Download and install Google Cloud SDK. Once installed, the gcloud command should be usable from any command prompt with an up-to-date PATH variable.
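If you have not authenticated the SDK yet, gcloud init walks you through login and lets you pick a default project and zone (you can skip this if your SDK is already configured):
gcloud init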
Open a command prompt (or the included Google Cloud SDK Shell) and create a project (if you have not created one yet):
gcloud projects create [PROJECT_ID] --name [PROJECT_NAME]
Set this project as the current, working project:
gcloud config set project [PROJECT_ID]
NOTE: This will remove the need to specify the project on every gcloud call.
Open a command prompt (or the included Google Cloud SDK Shell) and create a Cloud SQL PostgreSQL instance using the following command (where CORES refers to the number of virtual CPUs and MEMORY to the allocated virtual memory, in MB):
gcloud beta sql instances create [INSTANCE_NAME] --tier db-custom-[CORES]-[MEMORY] --database-version POSTGRES_9_6
NOTE: The number of cores must be even, memory must be between 1024 MB and 6656 MB per core, and the total memory must be a multiple of 256 MB.
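For example, a hypothetical instance with 2 cores and 7680 MB of memory (which satisfies all three constraints) would be created with:
gcloud beta sql instances create my-instance --tier db-custom-2-7680 --database-version POSTGRES_9_6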
The console output will show a field called ADDRESS. This is the instance's external IP. Take note of this value.
Go to the Cloud SQL instance's page, select your instance, and go to Databases > Create database and create one.
In order to connect to the instance locally, it is necessary to whitelist your local IP. The easiest way to get your local, external IP is to search for it on Google. Then, go to Authorization > Add network and enter this IP.
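If you prefer the command line over the console for these two steps, the equivalent gcloud commands should look roughly like the following (keep in mind that --authorized-networks replaces the existing list of authorized networks):
gcloud sql databases create [DATABASE_NAME] --instance [INSTANCE_NAME]
gcloud sql instances patch [INSTANCE_NAME] --authorized-networks [YOUR_LOCAL_IP]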
By default, the postgres user will have all required privileges, but you might want to create separate users. To add a user, use the following command (where HOST refers to the instance's IP):
gcloud sql users create [USER_NAME] [HOST] --instance [INSTANCE_NAME] --password [PASSWORD]
The newly created user will have limited privileges. To grant additional privileges, go to Overview > Connect to this instance > Connect using Cloud Shell. The Cloud Shell will open with a default connection command available. Press Enter to run it and access the psql client.
Once inside, switch to the database you want:
\c [DATABASE_NAME]
Grant the privileges you want to the desired user. For example, all privileges on all tables:
GRANT ALL PRIVILEGES ON ALL TABLES IN SCHEMA public TO [USER_NAME];
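NOTE: GRANT ... ON ALL TABLES only affects tables that already exist. If the user should also receive privileges on tables created later (for example by the CSV upload script below), you can additionally set default privileges; this applies to tables subsequently created by the role that runs the statement:
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT ALL PRIVILEGES ON TABLES TO [USER_NAME];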
The easiest way to upload a CSV is to use Python. Connect to the Cloud SQL instance using SQLAlchemy, load the CSV using Pandas, and insert it into a table.
Import both Pandas and SQLAlchemy:
import pandas as pd
from sqlalchemy import create_engine
Connect to the Cloud SQL instance using SQLAlchemy:
settings = {
'user': '[USER_NAME]',
'pass': '[PASSWORD]',
'host': '[HOST]',
'db': '[DATABASE_NAME]'
}
url = 'postgresql+psycopg2://{user}:{pass}@{host}:5432/{db}'.format(**settings) # 5432 is the default port
engine = create_engine(url, client_encoding='utf8')
Load the CSV using Pandas:
df = pd.read_csv('[CSV_FILE_NAME].csv') # this returns a DataFrame
NOTE: Make sure that the column names are valid (no spaces, no illegal characters, etc.). A useful trick is to replace any character other than a letter, number, or underscore with an underscore using regular expressions:
from re import sub
old_columns = list(df.columns) # list of the current column names
new_columns = [sub('[^A-Za-z0-9_]+', '_', col) for col in old_columns] # replace invalid characters with underscores
df = df.rename(columns=dict(zip(old_columns, new_columns))) # map {old_column: new_column}
NOTE: Avoid using DataFrame.drop(inplace=True), since it is buggy in some versions of Pandas.
Finally, insert the DataFrame into a table:
df.to_sql('[TABLE_NAME]', con=engine, if_exists='replace')
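For larger CSVs, writing in chunks keeps memory usage in check; a variation of the call above (the chunk size is arbitrary):
df.to_sql('[TABLE_NAME]', con=engine, if_exists='replace', index=False, chunksize=10000)  # insert 10,000 rows at a time, without writing the DataFrame index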
There are two approaches: clusterized processing with Dataproc (more complex) and normal processing on a single Compute Engine instance (simpler). Each has its pros and cons depending on the use case: as a rule of thumb, use clusterized processing when the task is complex and resource-intensive, and normal processing when the task is simpler.
For this guide, you must choose one or the other.
The Dataproc service allows you to create a set of Compute Engine instances with master-slave relationships (in terms of processing). This is an efficient way to distribute the workload of Jupyter Notebooks.
For this guide, we will use 1 master instance and 2 workers (slaves):
Open a command prompt (or the included Google Cloud SDK Shell) and create the Dataproc cluster. Here we have specified the machine types (n1-standard-2 for both master and workers), but you can use different ones:
gcloud dataproc clusters create [CLUSTER_NAME] --master-machine-type n1-standard-2 --worker-machine-type n1-standard-2 --initialization-actions gs://srcd-dataproc/jupyter.sh
Check the newly created cluster:
gcloud dataproc clusters list
Check the newly created instances:
gcloud compute instances list
You will see a master instance ([CLUSTER_NAME]-m) and 2 workers ([CLUSTER_NAME]-w-0 and [CLUSTER_NAME]-w-1).
A different, simpler, quicker, and cheaper approach is to use a single Compute Engine instance. To do this, open a command prompt (or the included Google Cloud SDK Shell) and create an instance (e.g. Debian 9):
gcloud compute instances create [INSTANCE_NAME] --image-family debian-9 --image-project debian-cloud
Check the newly created instance:
gcloud compute instances list
Go to the Compute Engine instance's page, select your instance, scroll down to the Firewalls section and tick both Allow HTTP traffic and Allow HTTPS traffic.
NOTE: If you plan on using Anaconda3, you may skip this next part, as Jupyter already comes pre-installed with Anaconda3.
Go to the Remote access section and click on SSH. Once the console loads up, install Jupyter:
pip install jupyter
Open a command prompt (or the included Google Cloud SDK Shell) and connect to the instance using SSH. If you used Dataproc, [INSTANCE_NAME] refers to the master instance ([CLUSTER_NAME]-m):
gcloud compute ssh [INSTANCE_NAME] --zone [ZONE]
NOTE: Upon first connection, a message about the host's key might show up. Click on Yes to add the key to the local registry when prompted.
Once authenticated, proceed to download Anaconda3:
sudo wget https://repo.continuum.io/archive/Anaconda3-5.0.0.1-Linux-x86_64.sh
NOTE: You can always visit the Anaconda archive to get any version's URL.
Proceed to install Anaconda3 (install bzip2 first so that the installer can decompress some Anaconda3 installation files):
sudo apt-get install bzip2
bash Anaconda3-5.0.0.1-Linux-x86_64.sh
NOTE: Do not run the installer with sudo bash, as Anaconda3 would then be installed in the wrong location (under the root user's home).
NOTE: During the installation, you will be asked if you want to add Anaconda3 to the PATH variable (albeit in very quirky wording). Type yes at this step when prompted.
Check that the installation succeeded by running the ls command and verifying that the anaconda3 folder is present. Then, check that conda was registered to the PATH by running the conda command. If it is not recognized, reload your shell configuration:
source ~/.bashrc
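If conda is still not recognized after that, you can add it to the PATH manually (a sketch assuming the default installation path under your home directory):
export PATH="$HOME/anaconda3/bin:$PATH"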
The tunnel is what will allow you to run Jupyter Notebooks on the cloud, from your computer.
Go to the External IP Addresses list page and make the Compute Engine instance's IP static. If you used Dataproc, the IP you want to make static is the master instance's ([CLUSTER_NAME]-m). Take note of the instance's new static IP (external address).
Open a command prompt (or the included Google Cloud SDK Shell) and connect to the instance using SSH. If you used Dataproc, [INSTANCE_NAME] refers to the master instance ([CLUSTER_NAME]-m):
gcloud compute ssh [INSTANCE_NAME] --zone [ZONE]
NOTE: Upon first connection, a message about the host's key might show up. Click on Yes to add the key to the local registry when prompted.
Once authenticated, run the Jupyter Notebook Server, exposing port 8888:
jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser
Once the server is running, it will generate an output similar to the following:
[I NotebookApp] The Jupyter Notebook is running at: http://0.0.0.0:8888/?token=6650c754c8cddf2dd8cee7923a116ad021dfec8fe085c99a
Take note of the value of the token parameter (e.g. 6650c754c8cddf2dd8cee7923a116ad021dfec8fe085c99a).
Thus far, you have Jupyter running on a Google Cloud instance, on port 8888. Now we need to tunnel this port to a local one; we will use local port 2222 to interface with remote port 8888.
Open a new command prompt (or the included Google Cloud SDK Shell) and connect to the instance again using SSH, this time passing flags to create the tunnel:
gcloud compute ssh [INSTANCE_NAME] --zone [ZONE] --ssh-flag="-L" --ssh-flag="2222:localhost:8888"
Now that there is an open connection tunneling local port 2222 to remote port 8888, where the Jupyter Notebook Server is running, you can simply open a browser (e.g. Google Chrome) and visit localhost:2222.
Once inside, it will ask for the token. Provide the token that was shown in the output when running the server, and that's it.
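NOTE: If you prefer plain OpenSSH over the gcloud wrapper, an equivalent tunnel should look roughly like the following, assuming the key that gcloud compute ssh generated at ~/.ssh/google_compute_engine, your username on the instance, and the static external IP from the previous section:
ssh -i ~/.ssh/google_compute_engine -L 2222:localhost:8888 [USER_NAME]@[STATIC_EXTERNAL_IP]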
If you want to query your Cloud SQL database, you need to whitelist the IP from which the query originates. We had previously whitelisted our local IP, but now we need to whitelist the Jupyter instance's IP so that Jupyter can query Cloud SQL.
To do this, simply go to the Cloud SQL instance's page, select your instance, go to Authorization > Add network, and enter the Compute Engine instance's static external IP.
Once the instance has permission, inside a Jupyter Notebook, import both Pandas and SQLAlchemy:
import pandas as pd
from sqlalchemy import create_engine
Install psycopg2 from within Jupyter (a line that starts with ! in Jupyter is executed as a shell command). This is necessary to query the database from Pandas:
!pip install psycopg2
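NOTE: If installing psycopg2 fails because the build requires the PostgreSQL client headers, the prebuilt psycopg2-binary package is a drop-in alternative:
!pip install psycopg2-binary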
Connect to the Cloud SQL instance using SQLAlchemy:
settings = {
'user': '[USER_NAME]',
'pass': '[PASSWORD]',
'host': '[HOST]',
'db': '[DATABASE_NAME]'
}
url = 'postgresql+psycopg2://{user}:{pass}@{host}:5432/{db}'.format(**settings) # 5432 is the default port
engine = create_engine(url, client_encoding='utf8')
Create a query string, for example:
query = """
SELECT *
FROM [TABLE_NAME]
"""
Perform query and store results in DataFrame:
df = pd.read_sql(query, con=engine)
See example/query-sql-pandas.py
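If the table is large, pd.read_sql can also return the results in chunks instead of a single DataFrame. A minimal sketch (the chunk size is arbitrary):
for chunk in pd.read_sql(query, con=engine, chunksize=50000):  # each chunk is a DataFrame of up to 50,000 rows
    print(len(chunk))  # replace with your own processing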
Setting up BigQuery is much easier than Cloud SQL because it does not require creating an instance. Simply open a command prompt (or the included Google Cloud SDK Shell) and create a dataset:
bq mk [DATASET_NAME]
To upload a CSV, simply call the load command from the folder that contains the CSV file and specify the insertion details:
bq load --autodetect [DATASET_NAME].[TABLE_NAME] [CSV_FILE_NAME]
NOTE: If the CSV is too large and the connection must be kept alive for too long, it's better to use Google Cloud Storage for staging and then load the CSV from there. See the next section for details.
Open a new command prompt (or the included Google Cloud SDK Shell) and create the bucket that will store the CSV file(s):
gsutil mb gs://[BUCKET_NAME]/
Navigate to the folder that contains the CSV file, and execute the following command (where BUCKET_NAME refers to the name of the bucket which will receive the file):
gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp [CSV_FILE_NAME] gs://[BUCKET_NAME]/
Run the following command to load the CSV from the bucket to BigQuery:
bq load --autodetect [DATASET_NAME].[TABLE_NAME] gs://[BUCKET_NAME]/[CSV_FILE_NAME]
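To verify the load, bq show prints the detected schema and the number of rows of the new table:
bq show [DATASET_NAME].[TABLE_NAME]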
To query BigQuery from Jupyter, you can use Pandas. However, an extra package, pandas-gbq, needs to be installed for Pandas to interface with BigQuery. Install this first:
pip install pandas-gbq
Then, import Pandas:
import pandas as pd
Create a query string (using BigQuery's legacy SQL FROM syntax, with the table reference surrounded by brackets []), for example:
query = """
SELECT *
FROM [[PROJECT_ID]:[DATASET_NAME].[TABLE_NAME]]
"""
NOTE: The project ID specified in the FROM refers to the project which contains the dataset.
Then, simply run the query using the SDK authentication and an accessible project ID (the first time, the cell output will show an authentication link; open it, sign in, and copy the verification code back into the notebook):
df = pd.read_gbq(query, project_id='PROJECT_ID')
NOTE: The project ID specified in the project_id parameter refers to a project over which the authenticated user has permissions.
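NOTE: The bracketed FROM above uses BigQuery's legacy SQL dialect, which older versions of pandas-gbq use by default. If you prefer standard SQL, a roughly equivalent query (assuming a version of pandas-gbq that supports the dialect parameter) uses backticks instead of brackets and a dot between project and dataset:
query = """
SELECT *
FROM `[PROJECT_ID].[DATASET_NAME].[TABLE_NAME]`
"""
df = pd.read_gbq(query, project_id='PROJECT_ID', dialect='standard')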