diff --git a/tables/automl/notebooks/census_income_prediction/getting_started_notebook.ipynb b/tables/automl/notebooks/census_income_prediction/getting_started_notebook.ipynb index a536535bf2d6..5838669517f4 100644 --- a/tables/automl/notebooks/census_income_prediction/getting_started_notebook.ipynb +++ b/tables/automl/notebooks/census_income_prediction/getting_started_notebook.ipynb @@ -3,7 +3,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "f5r7tJESsB65" + }, "outputs": [], "source": [ "# Copyright 2019 Google LLC\n", @@ -23,24 +27,21 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "SwKhClWLsSou" + }, "source": [ - "# Getting started with AutoML Tables\n", + "# **Getting Started with AutoML Tables**\n", "\n", "\n", " \n", - " \n", "
\n", - " \n", - " \"Google Read on cloud.google.com\n", - " \n", - " \n", - " \n", + " \n", " \"Colab Run in Colab\n", " \n", " \n", - " \n", + " \n", " \"GitHub\n", " View on GitHub\n", " \n", @@ -52,139 +53,260 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "b--5FDDwCG9C" + "id": "SATX51N8tFga" }, "source": [ - "## Overview\n", - "\n", - "[Google’s AutoML](https://cloud.google.com/automl-tables/) provides the ability for software engineers to build high quality models without the need to know how to build, train models, or deploy/serve models on the cloud. Instead, one only needs to know about dataset curation, evaluating results, and the how-to steps.\n", - "\n", - "\"AutoML\n", + "## **Overview**\n", + "[Google’s AutoML](https://cloud.google.com/automl-tables/) provides the ability for software engineers to build high quality models without the need to know how to build, train models, or deploy/serve models on the cloud. Instead, one only needs to know about dataset curation, evaluating results, and the how-to steps." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0h9L9fpts327" + }, + "source": [ + "![alt text](https://camo.githubusercontent.com/8d5e7fe8fadc1883bf55b4d561d9b68fced463bf/68747470733a2f2f636c6f75642e676f6f676c652e636f6d2f696d616765732f6175746f6d6c2d7461626c65732f6175746f6d6c2d7461626c652e737667)\n", "\n", - "AutoML Tables is a supervised learning service. This means that you train a machine learning model with example data. AutoML Tables uses tabular (structured) data to train a machine learning model to make predictions on new data. One column from your dataset, called the target, is what your model will learn to predict. Some number of the other data columns are inputs (called features) that the model will learn patterns from. \n", + "AutoML Tables is a supervised learning service. This means that you train a machine learning model with example data. AutoML Tables uses tabular (structured) data to train a machine learning model to make predictions on new data. One column from your dataset, called the target, is what your model will learn to predict. Some number of the other data columns are inputs (called features) that the model will learn patterns from.\n", "\n", "In this notebook, we will use the [Google Cloud SDK AutoML Python API](https://cloud.google.com/automl-tables/docs/client-libraries) to create a binary classification model using a real dataset from the [Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income).\n", "\n", - "We will provide the training and evaluation dataset, once dataset is created we will use AutoML API to create the model and then perform predictions to predict if a given individual has an income above or below 50k, given information like the person's age, education level, marital-status, occupation etc... 
\n", + "We will provide the training and evaluation dataset, once dataset is created we will use AutoML API to create the model and then perform predictions to predict if a given individual has an income above or below 50k, given information like the person's age, education level, marital-status, occupation etc...\n", "\n", - "For setting up a Google Cloud Platform (GCP) account for using AutoML, please see the online documentation for [Getting Started](https://cloud.google.com/automl-tables/docs/quickstart).\n" + "For setting up a Google Cloud Platform (GCP) account for using AutoML, please see the online documentation for [Getting Started](https://cloud.google.com/automl-tables/docs/quickstart)." ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "gKoQObu-s1gT" + }, "source": [ - "### Dataset\n", - "\n", - "This tutorial uses the [United States Census Income\n", - "Dataset](https://archive.ics.uci.edu/ml/datasets/census+income) provided by the\n", - "[UC Irvine Machine Learning\n", - "Repository](https://archive.ics.uci.edu/ml/index.php)containing information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year. The dataset consists of over 30k rows, where each row corresponds to a different person. For a given row, there are 14 features that the model conditions on to predict the income of the person. A few of the features are named above, and the exhaustive list can be found both in the dataset link above." + "### **Dataset**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "AZs0ICgy4jkQ" + "id": "Ab_lae0MURrk" }, "source": [ - "## Before you begin\n", - "\n", - "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:\n", - "* Create a Google Cloud Platform (GCP) project.\n", - "* Enable billing.\n", - "* Apply to whitelist your project.\n", - "* Enable AutoML API.\n", - "\n", - "You also need to upload your data into [Google Cloud Storage](https://cloud.google.com/storage/) (GCS) or [BigQuery](https://cloud.google.com/bigquery/). \n", - "For example, to use GCS as your data source:\n", - "\n", - "* [Create a GCS bucket](https://cloud.google.com/storage/docs/creating-buckets).\n", - "* Upload the training and batch prediction files." + "This tutorial uses the [United States Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/census+income) provided by the [UC Irvine Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php) containing information about people from a 1994 Census database, including age, education, marital status, occupation, and whether they make more than $50,000 a year. The dataset consists of over 30k rows, where each row corresponds to a different person. For a given row, there are 14 features that the model conditions on to predict the income of the person. A few of the features are named above, and the exhaustive list can be found both in the dataset link above." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "xZECt1oL429r" + "id": "c4Tj5-KesxSs" }, "source": [ - "\n", - "\n", - "---\n", - "\n" + "### **Costs**" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "eX3UU0KDP0Un" + }, "source": [ - "## Instructions\n", "\n", - "You must do several things before you can train and deploy a model in\n", - "AutoML:\n", + "This tutorial uses billable components of Google Cloud Platform (GCP):\n", "\n", + "* Cloud AI Platform\n", + "* Cloud Storage\n", + "* AutoML Tables\n", "\n", - " * Set up your local development environment (optional)\n", - " * Set Project ID and Compute Region\n", - " * Authenticate your GCP account\n", - " * Import Python API SDK and create a Client instance,\n", - " * Create a dataset instance and import the data.\n", - " * Create a model instance and train the model.\n", - " * Evaluating the trained model.\n", - " * Deploy the model on the cloud for online predictions.\n", - " * Make online predictions.\n", - " * Undeploy the model\n" + "Learn about [Cloud AI Platform pricing](https://cloud.google.com/ml-engine/docs/pricing),\n", + "[Cloud Storage pricing](https://cloud.google.com/storage/pricing),\n", + "[AutoML Tables pricing](https://cloud.google.com/automl-tables/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "wLEZGISqBshz" + }, "source": [ - "### Set up your local development environment\n", + "## **Set up your local development environment**\n", "\n", "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", - "all the requirements to run this notebook. You can skip this step." + "all the requirements to run this notebook. If you are using **AI Platform Notebook**, make sure the machine configuration type is **1 vCPU, 3.75 GB RAM** or above. You can skip this step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "fowx0guYB-mX" + }, + "source": [ + "**Otherwise**, make sure your environment meets this notebook's requirements.\n", + "You need the following:\n", + "\n", + "* The Google Cloud SDK\n", + "* Git\n", + "* Python 3\n", + "* virtualenv\n", + "* Jupyter notebook running in a virtual environment with Python 3\n", + "\n", + "The Google Cloud guide to [Setting up a Python development\n", + "environment](https://cloud.google.com/python/setup) and the [Jupyter\n", + "installation guide](https://jupyter.org/install) provide detailed instructions\n", + "for meeting these requirements. The following steps provide a condensed set of\n", + "instructions:\n", + "\n", + "1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)\n", + "\n", + "2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)\n", + "\n", + "3. [Install\n", + " virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)\n", + " and create a virtual environment that uses Python 3.\n", + "\n", + "4. Activate that environment and run `pip install jupyter` in a shell to install\n", + " Jupyter.\n", + "\n", + "5. Run `jupyter notebook` in a shell to launch Jupyter.\n", + "\n", + "6. Open this notebook in the Jupyter Notebook Dashboard." 
]
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "colab_type": "text",
+    "id": "hdzNyF4iCdNI"
+   },
   "source": [
-    "### Set up your GCP project\n",
+    "## **Set up your GCP project**\n",
     "\n",
     "**The following steps are required, regardless of your notebook environment.**\n",
     "\n",
-    "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager)\n",
+    "1. [Select or create a GCP project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n",
     "\n",
     "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n",
     "\n",
-    "3. [Enable the AutoML API (\"AutoML API\")](https://console.cloud.google.com/flows/enableapi?apiid=automl.googleapis.com)\n",
+    "3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n",
     "\n",
-    "4. Enter your project ID in the cell below. Then run the cell to make sure the\n",
-    "Cloud SDK uses the right project for all the commands in this notebook.\n",
+    "4. [Enable AutoML API.](https://console.cloud.google.com/apis/library/automl.googleapis.com?q=automl)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "Ac3NIGCMVF9x"
+   },
+   "source": [
+    "### **PIP Install Packages and dependencies**\n",
     "\n",
-    "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands."
+    "Install additional dependencies not installed in the notebook environment.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "3jK7RbsFVHBg"
   },
   "outputs": [],
   "source": [
    "# Use the latest major GA version of the framework.\n",
    "! pip install --upgrade --quiet --user google-cloud-automl"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "kK5JATKPNf3I"
   },
   "source": [
    "**Note:** Try installing using `sudo` if the above command throws any permission errors."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "A37ofoNkR-L7"
   },
   "source": [
    "Restart the kernel to allow `automl_v1beta1` to be imported for Jupyter Notebooks."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vAxYYE3bTr1A"
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<script>Jupyter.notebook.kernel.restart()</script>"
      ],
      "text/plain": [
       "<IPython.core.display.HTML object>"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from IPython.core.display import HTML\n",
    "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "3Snl2Ja75qMM"
   },
   "source": [
    "## **Set up your GCP Project Id**\n",
    "\n",
    "Enter your `Project Id` in the cell below. Then run the cell to make sure the\n",
    "Cloud SDK uses the right project for all the commands in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "dkz6SRiMCfSX"
   },
   "outputs": [],
   "source": [
-    "PROJECT_ID = \"\" # @param {type:\"string\"}\n",
-    "COMPUTE_REGION = \"us-central1\" # Currently only supported region.\n",
-    "! 
gcloud config set project $PROJECT_ID" + "PROJECT_ID = \"[your-project-id]\" #@param {type:\"string\"}\n", + "COMPUTE_REGION = \"us-central1\" # Currently only supported region." ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "dr--iN2kAylZ" + }, "source": [ - "### Authenticate your GCP account\n", + "## **Authenticate your GCP account**\n", "\n", "**If you are using AI Platform Notebooks**, your environment is already\n", "authenticated. Skip this step." @@ -192,12 +314,12 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "3yyVCJHFSEKG" + }, "source": [ - "**If you are using Colab**, run the cell below and follow the instructions\n", - "when prompted to authenticate your account via oAuth.\n", - "\n", - "**Otherwise**, follow these steps:\n", + "Otherwise, follow these steps:\n", "\n", "1. In the GCP Console, go to the [**Create service account key**\n", " page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n", @@ -211,46 +333,48 @@ " **Storage > Storage Object Admin**.\n", "\n", "5. Click *Create*. A JSON file that contains your key downloads to your\n", - "local environment.\n", - "\n", - "6. Enter the path to your service account key as the\n", - "`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell." + "local environment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Yt6PhVG0UdF1" + }, + "source": [ + "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "q5TeVHKDMOJF" + }, "outputs": [], "source": [ + "# Upload the downloaded JSON file that contains your key.\n", "import sys\n", "\n", - "# If you are running this notebook in Colab, run this cell and follow the\n", - "# instructions to authenticate your GCP account. This provides access to your\n", - "# Cloud Storage bucket and lets you submit training jobs and prediction\n", - "# requests.\n", - "\n", "if 'google.colab' in sys.modules: \n", " from google.colab import files\n", " keyfile_upload = files.upload()\n", " keyfile = list(keyfile_upload.keys())[0]\n", " %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n", - "# If you are running this notebook locally, replace the string below with the\n", - "# path to your service account key and run this cell to authenticate your GCP\n", - "# account.\n", - "else:\n", - " %env GOOGLE_APPLICATION_CREDENTIALS /path/to/service_account.json" + " ! gcloud auth activate-service-account --key-file $keyfile" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "BR0POq2UzE7e" + "id": "d1bnPeDVMR5Q" }, "source": [ - "### Install the client library\n", - "Run the following cell." 
+ "***If you are running the notebook locally***, enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell" ] }, { @@ -259,184 +383,311 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "43aXKjDRt_qZ" + "id": "fsVNKXESYoeQ" }, "outputs": [], "source": [ - "%pip install google-cloud-automl" + "# If you are running this notebook locally, replace the string below with the\n", + "# path to your service account key and run this cell to authenticate your GCP\n", + "# account.\n", + "\n", + "%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account\n", + "! gcloud auth activate-service-account --key-file '/path/to/service/account'" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "zgPO1eR3CYjk" + }, "source": [ - "### Import libraries and define constants\n", + "## **Create a Cloud Storage bucket**\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "When you submit a training job using the Cloud SDK, you upload a Python package\n", + "containing your training code to a Cloud Storage bucket. AI Platform runs\n", + "the code from this package. In this tutorial, AI Platform also saves the\n", + "trained model that results from your job in the same bucket. You can then\n", + "create an AI Platform model version based on this output in order to serve\n", + "online predictions.\n", "\n", - "First, import Python libraries required for training,\n", - "The code example below demonstrates importing the AutoML Python API module into a python script. " + "Set the name of your Cloud Storage bucket below. It must be unique across all\n", + "Cloud Storage buckets. \n", + "\n", + "You may also change the `REGION` variable, which is used for operations\n", + "throughout the rest of this notebook. Make sure to [choose a region where Cloud\n", + "AI Platform services are\n", + "available](https://cloud.google.com/ml-engine/docs/tensorflow/regions). You may\n", + "not use a Multi-Regional Storage bucket for training with AI Platform." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "cellView": "both", + "colab": {}, + "colab_type": "code", + "id": "MzGDU7TWdts_" + }, "outputs": [], "source": [ - "# AutoML library\n", - "from google.cloud import automl_v1beta1 as automl\n", - "\n", - "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", - "import matplotlib.pyplot as plt" + "BUCKET_NAME = \"[your-bucket-name]\" #@param {type:\"string\"}" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "-EcIXiGsCePi" + }, "source": [ - "## Quickstart for AutoML tables\n", - "\n", - "This section of the tutorial walks you through creating an AutoML Tables client." + "**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket. Make sure Storage > Storage Admin role is enabled" ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "NIq7R4HZCfIc" + }, + "outputs": [], "source": [ - "Additionally, one will want to create an instance to the TablesClient. \n", - "This client instance is the HTTP request/response interface between the python script and the GCP AutoML service." + "! 
gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "ucvCsknMCims" + }, "source": [ - "### Create API Client to AutoML Service*\n", - "\n", - "**If you are using AI Platform Notebooks**, or *Colab* environment is already\n", - "authenticated using GOOGLE_APPLICATION_CREDENTIALS. Run this step." + "Finally, validate access to your Cloud Storage bucket by examining its contents:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "vhOb7YnwClBb" + }, "outputs": [], "source": [ - "client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)" + "! gsutil ls -al gs://$BUCKET_NAME" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "rUlBcZ3OfWcJ" + "id": "SDhrFSBHWkgl" }, "source": [ - "**If you are using Colab or Jupyter**, and you have defined a service account\n", - "follow the following steps to create the AutoML client\n", - "\n", - "You can see a different way to create the API Clients using service account." + "## **Import libraries and define constants**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "10-QDqYIWw6w" + }, + "source": [ + "Import relevant packages." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "hztZMQ-1WlQE" + }, "outputs": [], "source": [ - "# from google.oauth2 import service_account\n", - "# credentials = service_account.Credentials.from_service_account_file('/path/to/service_account.json')\n", - "# client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION, credentials=credentials)" + "from __future__ import absolute_import\n", + "from __future__ import division\n", + "from __future__ import print_function" ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Uw6hy3ufXFaE" + }, + "outputs": [], "source": [ - "---" + "# AutoML library.\n", + "from google.cloud import automl_v1beta1 as automl\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "znEditA8uMgi" + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "from ipywidgets import interact\n", + "import ipywidgets as widgets" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "WSSiBpttCCrZ" + }, "source": [ - "List datasets in your project:" + "Populate the following cell with the necessary constants and run it to initialize constants." 
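A note on the bucket cells above: the same create-and-verify check can be done without shelling out to `gsutil` — a minimal sketch, assuming the `google-cloud-storage` package is installed (a recent version is assumed for the `location` keyword) and the `PROJECT_ID`, `BUCKET_NAME`, and `COMPUTE_REGION` constants from this notebook are defined:

```python
# Hedged alternative to the gsutil bucket cells above.
# Assumes google-cloud-storage is installed (recent version for `location=`).
from google.cloud import storage

storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.lookup_bucket(BUCKET_NAME)  # returns None if missing
if bucket is None:
    bucket = storage_client.create_bucket(BUCKET_NAME, location=COMPUTE_REGION)
print('Using bucket {} in {}'.format(bucket.name, bucket.location))
```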
]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
-    "cellView": "both",
    "colab": {},
    "colab_type": "code",
-    "id": "sf32nKXIqYje"
+    "id": "V41T2eEVBCbh"
   },
   "outputs": [],
   "source": [
-    "# List datasets in Project\n",
-    "list_datasets = client.list_datasets()\n",
-    "datasets = { dataset.display_name: dataset.name for dataset in list_datasets }\n",
-    "datasets"
+    "#@title Constants { vertical-output: true }\n",
+    "\n",
+    "# A name for the AutoML tables Dataset to create.\n",
+    "DATASET_DISPLAY_NAME = 'census' #@param {type: 'string'}\n",
+    "# The GCS data to import data from (it doesn't need to exist yet).\n",
+    "INPUT_CSV_NAME = 'census_income' #@param {type: 'string'}\n",
+    "# A name for the AutoML tables model to create.\n",
+    "MODEL_DISPLAY_NAME = 'census_income_model' #@param {type: 'string'}\n",
+    "\n",
+    "assert all([\n",
+    "    PROJECT_ID,\n",
+    "    COMPUTE_REGION,\n",
+    "    DATASET_DISPLAY_NAME,\n",
+    "    INPUT_CSV_NAME,\n",
+    "    MODEL_DISPLAY_NAME,\n",
+    "])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "t9uE8MvMkOPd"
+    "id": "YaErGUWMCA26"
   },
   "source": [
-    "You can also print the list of your models by running the following cell."
+    "Initialize the clients for AutoML and AutoML Tables."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
-    "cellView": "both",
    "colab": {},
    "colab_type": "code",
-    "id": "j4-bYRSWj7xk"
+    "id": "h34EOO9QC6-D"
   },
   "outputs": [],
   "source": [
-    "list_models = client.list_models()\n",
-    "models = { model.display_name: model.name for model in list_models }\n",
-    "models"
+    "# Initialize the clients.\n",
+    "automl_client = automl.AutoMlClient()\n",
+    "tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "qozQWMnOu48y"
+    "id": "NB4GVL3hbHZV"
   },
   "source": [
+    "## **Test the set up**\n",
    "\n",
+    "To test whether your project setup and authentication steps were successful, run the following cell to list your datasets in this project.\n",
    "\n",
-    "---\n",
-    "\n"
+    "If no dataset has previously been imported into AutoML Tables, you should expect an empty result."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "hNh4IfWVbJZl"
   },
   "outputs": [],
   "source": [
    "# List the datasets.\n",
    "list_datasets = tables_client.list_datasets()\n",
    "datasets = { dataset.display_name: dataset.name for dataset in list_datasets }\n",
    "datasets"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "XwjZc9Q62Fm5"
+    "id": "I7nyfefWba32"
   },
   "source": [
-    "### Create a dataset"
+    "You can also print the list of your models by running the following cell.\n",
+    "\n",
+    "If no model has previously been trained using AutoML Tables, you should expect an empty result."
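Since the display names above are reused across sessions, it can also be useful to pick up an existing dataset instead of recreating it later in the notebook — a hedged sketch using the same `TablesClient` surface; the `google.api_core.exceptions.NotFound` error type is an assumption about what the client raises when nothing matches:

```python
# Hedged sketch: resume a previous session by display name instead of
# recreating the dataset. NotFound as the raised type is an assumption.
from google.api_core import exceptions

try:
    dataset = tables_client.get_dataset(
        dataset_display_name=DATASET_DISPLAY_NAME)
    print('Reusing existing dataset:', dataset.name)
except exceptions.NotFound:
    print('No dataset named {} yet.'.format(DATASET_DISPLAY_NAME))
```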
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "aVOPbJaZbc5o" + }, + "outputs": [], + "source": [ + "# List the models.\n", + "list_models = tables_client.list_models()\n", + "models = { model.display_name: model.name for model in list_models }\n", + "models" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "WvYxMMLVdYmd" + }, + "source": [ + "## **Import training data**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "_JfZFGSceyE_" + "id": "3toFtz7xb5Ws" }, "source": [ - "Now we are ready to create a dataset instance (on GCP) using the client method create_dataset(). This method has one required parameter, the human readable display name `dataset_display_name`.\n", + "### **Create dataset**\n", + "Now we are ready to create a dataset instance (on GCP) using the client method `create_dataset()`. This method has one required parameter, the human readable display name `DATASET_DISPLAY_NAME`.\n", "\n", "Select a dataset display name and pass your table source information to create a new dataset." ] @@ -447,14 +698,14 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "Z_JErW3cw-0J" + "id": "4UYVvFn9b1uQ" }, "outputs": [], "source": [ - "# Create dataset\n", - "\n", - "dataset_display_name = 'census' \n", - "dataset = client.create_dataset(dataset_display_name)\n", + "# Create dataset.\n", + "dataset = tables_client.create_dataset(\n", + " dataset_display_name=DATASET_DISPLAY_NAME)\n", + "dataset_name = dataset.name\n", "dataset" ] }, @@ -462,40 +713,45 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "35YZ9dy34VqJ" + "id": "3CLctOh7dzcp" }, "source": [ - "### Import data" + "### **Import data**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "3c0o15gVREAw" + "id": "Y7HLiinLd8mE" }, "source": [ - "You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the [census_income dataset](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) \n", - "as your training data. We provide code below to copy the data into a bucket you own automatically. You are free to adjust the value of `GCS_STORAGE_BUCKET` as needed." + "You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the [census_income dataset](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) as your training data. We provide code below to copy the data into a bucket you own automatically. You are free to adjust the value of `BUCKET_NAME` as needed." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ECklv3hAd0F6" + }, "outputs": [], "source": [ - "GCS_STORAGE_BUCKET = 'gs://{}-codelab-data-storage'.format(PROJECT_ID)\n", - "GCS_DATASET_URI = '{}/census_income.csv'.format(GCS_STORAGE_BUCKET)\n", - "! gsutil ls $GCS_STORAGE_BUCKET || gsutil mb -l $COMPUTE_REGION $GCS_STORAGE_BUCKET\n", + "GCS_DATASET_URI = 'gs://{}/{}.csv'.format(BUCKET_NAME, INPUT_CSV_NAME)\n", + "! gsutil ls gs://$BUCKET_NAME || gsutil mb -l $COMPUTE_REGION gs://$BUCKET_NAME\n", "! 
gsutil cp gs://cloud-ml-data-tables/notebooks/census_income.csv $GCS_DATASET_URI"
   ]
  },
  {
   "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "colab_type": "text",
+    "id": "MvMwm0W_hX9I"
+   },
   "source": [
-    "Import data into the dataset, this process may take a while, depending on your data, once completed, you can verify the status by printing the dataset object. This time pay attention to the example_count field with 32561 records."
+    "Import data into the dataset. This process may take a while, depending on your data. Once completed, you can verify the status by printing the dataset object; this time, pay attention to the `example_count` field, which should show 32561 records."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
-    "id": "FNVYfpoXJsNB"
+    "id": "cLCqyBHLhagU"
   },
   "outputs": [],
   "source": [
-    "import_data_operation = client.import_data(\n",
+    "# Read the data source from GCS. \n",
+    "import_data_response = tables_client.import_data(\n",
    "    dataset=dataset,\n",
    "    gcs_input_uris=GCS_DATASET_URI\n",
    ")\n",
-    "print('Dataset import operation: {}'.format(import_data_operation))\n",
+    "print('Dataset import operation: {}'.format(import_data_response.operation))\n",
    "\n",
    "# Synchronous check of operation status. Wait until import is done.\n",
-    "import_data_operation.result()\n",
-    "dataset = client.get_dataset(dataset_name=dataset.name)\n",
+    "print('Dataset import response: {}'.format(import_data_response.result()))\n",
+    "\n",
+    "# Verify the status by checking the example_count field.\n",
+    "dataset = tables_client.get_dataset(dataset_name=dataset_name)\n",
    "dataset"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "QdxBI4s44ZRI"
+    "id": "Hm-nfjv8htja"
   },
   "source": [
-    "### Review the data specs"
+    "## **Review the specs**\n",
+    "\n",
+    "Run the following command to see table specs such as row count."
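The markdown mentions row count, but the spec-listing cell below only prints column names and types; the table size can be read directly off the table spec — a hedged sketch (the `row_count`, `valid_row_count`, and `column_count` field names follow the v1beta1 `TableSpec` message):

```python
# Hedged sketch: overall size of the imported table, from its (single)
# table spec. Field names per the v1beta1 TableSpec proto.
table_spec = next(iter(tables_client.list_table_specs(dataset=dataset)))
print('Rows: {}, valid rows: {}, columns: {}'.format(
    table_spec.row_count, table_spec.valid_row_count,
    table_spec.column_count))
```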
] }, { @@ -536,21 +797,21 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "v2Vzq_gwXxo-" + "id": "nlpqBWuQhwnm" }, "outputs": [], "source": [ - "# List table specs\n", - "list_table_specs_response = client.list_table_specs(dataset=dataset)\n", + "# List table specs.\n", + "list_table_specs_response = tables_client.list_table_specs(dataset=dataset)\n", "table_specs = [s for s in list_table_specs_response]\n", "\n", - "# List column specs\n", - "list_column_specs_response = client.list_column_specs(dataset=dataset)\n", + "# List column specs.\n", + "list_column_specs_response = tables_client.list_column_specs(dataset=dataset)\n", "column_specs = {s.display_name: s for s in list_column_specs_response}\n", "\n", - "# Print Features and data_type:\n", - "\n", - "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) for key, value in column_specs.items()]\n", + "# Print Features and data_type.\n", + "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) \n", + " for key, value in column_specs.items()]\n", "print('Feature list:\\n')\n", "for feature in features:\n", " print(feature[0],':', feature[1])" @@ -559,61 +820,54 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Table schema pie chart.\n", - "\n", - "type_counts = {}\n", - "for column_spec in column_specs.values():\n", - " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", - " type_counts[type_name] = type_counts.get(type_name, 0) + 1\n", - " \n", - "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n", - "plt.axis('equal')\n", - "plt.show()" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "FNykW_YOYt6d" - }, - "source": [ - "___" - ] - }, - { - "cell_type": "markdown", "metadata": { - "colab_type": "text", - "id": "kNRVJqVOL8h3" + "colab": {}, + "colab_type": "code", + "id": "XeSCFNxiiZqI" }, + "outputs": [], "source": [ - "### Update dataset: assign a label column and enable nullable columns" + "# Table schema pie chart.\n", + "type_counts = {}\n", + "for column_spec in column_specs.values():\n", + " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", + " type_counts[type_name] = type_counts.get(type_name, 0) + 1\n", + " \n", + "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n", + "plt.axis('equal')\n", + "plt.show()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "-57gehId9PQ5" + "id": "8M4t4kAjjC4h" }, "source": [ + "## **Update dataset: assign a label column and enable nullable columns**\n", "This section is important, as it is where you specify which column (meaning which feature) you will use as your label. This label feature will then be predicted using all other features in the row.\n", "\n", - "AutoML Tables automatically detects your data column type. For example, for the ([census_income](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv)) it detects `income_bracket` to be categorical (as it is just either over or under 50k) and `age` to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." + "AutoML Tables automatically detects your data column type. 
For example, for the [census_income](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) dataset, it detects `income_bracket` to be categorical (as it is just either over or under 50k) and `age` to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "iRqdQ7Xiq04x"
+    "id": "HCHwwp6w0V0g"
   },
   "source": [
-    "#### Update a column: Set to nullable"
+    "### **Update a column: Set nullable parameter**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
-    "id": "OCEUIPKegWrf"
+    "id": "l_02sFLqkAVN"
   },
   "outputs": [],
   "source": [
-    "update_column_response = client.update_column_spec(\n",
+    "column_spec_display_name = 'income' #@param {type:'string'}\n",
+    "type_code = 'CATEGORY' #@param {type:'string'}\n",
+    "\n",
+    "update_column_response = tables_client.update_column_spec(\n",
    "    dataset=dataset,\n",
-    "    column_spec_display_name='income',\n",
-    "    type_code='CATEGORY',\n",
+    "    column_spec_display_name=column_spec_display_name,\n",
+    "    type_code=type_code,\n",
    "    nullable=False,\n",
    ")\n",
    "update_column_response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "GUqKi3tkqrgW"
+    "id": "hzwIuNlJkKUI"
   },
   "source": [
-    "**Tip:** You can use `'type_code': 'CATEGORY'` in the preceding `update_column_spec_dict` to convert the column data type from `FLOAT64` `to `CATEGORY`."
+    "**Tip:** You can use `'type_code': 'CATEGORY'` in the preceding `update_column_spec` call to convert the column data type from `FLOAT64` to `CATEGORY`."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "nDMH_chybe4w"
+    "id": "T6eM-Vnr0eIf"
   },
   "source": [
-    "#### Update dataset: Assign a label"
+    "### **Update dataset: Assign a label**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
-    "id": "hVIruWg0u33t"
+    "id": "oUoisUpSkXbz"
   },
   "outputs": [],
   "source": [
-    "update_dataset_response = client.set_target_column(\n",
+    "column_spec_display_name = 'income' #@param {type:'string'}\n",
+    "\n",
+    "update_dataset_response = tables_client.set_target_column(\n",
    "    dataset=dataset,\n",
-    "    column_spec_display_name='income',\n",
+    "    column_spec_display_name=column_spec_display_name,\n",
    ")\n",
    "update_dataset_response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "z23NITLrcxmi"
-   },
-   "source": [
-    "___"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "colab_type": "text",
-    "id": "FcKgvj1-Tbgj"
+    "id": "5jC3dgRwfNfA"
   },
   "source": [
-    "### Creating a model"
+    "## **Creating a model**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "Pnlk8vdQlO_k"
+    "id": "SRqzumCpmN6l"
   },
   "source": [
+    "### **Train a Model**\n",
+    "\n",
    "Once we have defined our datasets and features, we will create a model.\n",
    "\n",
-    "Specify the duration of the training. For example, `train_budget_milli_node_hours=1000` runs the training for one hour. \n",
+    "Specify the duration of the training. For example, `'train_budget_milli_node_hours': 1000` runs the training for one hour. You can increase that number up to a maximum of 72 hours (`'train_budget_milli_node_hours': 72000`) for the best model performance.\n",
+    "\n",
+    "Even with a budget of 1 node hour (the minimum possible budget), training a model can take more than the specified node hours.\n",
    "\n",
-    "If your Colab times out, use `client.list_models()` to check whether your model has been created. Then use model name to continue to the next steps. Run the following command to retrieve your model.\n",
+    "If your Colab times out, use `tables_client.list_models()` to check whether your model has been created. Then use the model name to continue to the next steps. Run the following command to retrieve your model.\n",
    "\n",
-    "```python\n",
-    " model = client.get_model(model_display_name=model_display_name) \n",
-    "```"
+    "    model = tables_client.get_model(model_display_name=MODEL_DISPLAY_NAME)\n",
+    "\n",
+    "You can also select the objective to optimize your model training by setting `optimization_objective`. This tutorial optimizes the model using the default optimization objective. Refer to [the documentation](https://cloud.google.com/automl-tables/docs/train#opt-obj) for more details. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
-    "id": "11izNd6Fu37N"
+    "id": "ps7B_UuMmW4Y"
   },
   "outputs": [],
   "source": [
-    "model_display_name = 'census_income_model'\n",
+    "# The number of hours to train the model.\n",
+    "model_train_hours = 1 #@param {type:'integer'}\n",
    "\n",
-    "create_model_response = client.create_model(\n",
-    "    model_display_name,\n",
+    "create_model_response = tables_client.create_model(\n",
+    "    model_display_name=MODEL_DISPLAY_NAME,\n",
    "    dataset=dataset,\n",
-    "    train_budget_milli_node_hours=1000,\n",
+    "    train_budget_milli_node_hours=model_train_hours * 1000,\n",
+    "    exclude_column_spec_names=['fnlwgt', 'income'],\n",
    ")\n",
-    "print('Create model operation: {}'.format(create_model_response.operation))\n",
-    "# Wait until model training is done.\n",
-    "model = create_model_response.result()\n",
-    "model"
+    "\n",
+    "operation_id = create_model_response.operation.name\n",
+    "\n",
+    "print('Create model operation: {}'.format(create_model_response.operation))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "JoBJO8wtIaio"
   },
   "outputs": [],
   "source": [
    "# Wait until model training is done.\n",
    "model = create_model_response.result()\n",
    "model_name = model.name\n",
    "model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "1wS1is9IY5nK"
-   },
-   "source": [
-    "___"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Model deployment"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "colab_type": "text",
-    "id": "n0lFAIkISf4K"
+    "id": "pU9iZ3MDnfjw"
   },
   "source": [
-    "**Important** : Deploy the model, then wait until the model FINISHES deployment.\n",
+    "## **Model deployment**\n",
+    "\n",
+    "**Important:** Deploy the model, then wait until the model FINISHES deployment.\n",
    "\n",
-    "The model takes a while to deploy online. When the deployment code response = client.deploy_model(model_name=model.name) finishes, you will be able to see this on the UI. Check the [UI](https://console.cloud.google.com/automl-tables?_ga=2.255483016.-1079099924.1550856636) and navigate to the predict tab of your model, and then to the online prediction portion, to see when it finishes online deployment before running the prediction cell.You should see "online prediction" text near the top, click on it, and it will take you to a view of your online prediction interface. You should see "model deployed" on the far right of the screen if the model is deployed, or a "deploying model" message if it is still deploying. 
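Between training and the deployment cell below, the notebook never surfaces the model's quality metrics; a hedged sketch for inspecting them before deploying (assuming `list_model_evaluations` on `TablesClient` and the v1beta1 `ModelEvaluation` field names):

```python
# Hedged sketch: print the trained model's aggregate evaluation metrics.
# The aggregate entry is assumed to be the one without an annotation_spec_id;
# metric field names follow the v1beta1 ModelEvaluation proto.
for evaluation in tables_client.list_model_evaluations(model=model):
    if not evaluation.annotation_spec_id:  # aggregate, not per-label, metrics
        metrics = evaluation.classification_evaluation_metrics
        print('AUC ROC: {:.4f}, AUC PR: {:.4f}, log loss: {:.4f}'.format(
            metrics.au_roc, metrics.au_prc, metrics.log_loss))
```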
" + "The model takes a while to deploy online. When the deployment code `response = client.deploy_model(model_name=model.name)` finishes, you will be able to see this on the UI. Check the [UI](https://console.cloud.google.com/automl-tables?_ga=2.255483016.-1079099924.1550856636) and navigate to the predict tab of your model, and then to the online prediction portion, to see when it finishes online deployment before running the prediction cell.You should see \"online prediction\" text near the top, click on it, and it will take you to a view of your online prediction interface. You should see \"model deployed\" on the far right of the screen if the model is deployed, or a \"deploying model\" message if it is still deploying. " ] }, { @@ -768,27 +1024,34 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "kRoHFbVnSk05" + "id": "t3b-fUQMnXI0" }, "outputs": [], "source": [ - "client.deploy_model(model=model).result()" + "tables_client.deploy_model(model=model).result()" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "jDxcqNl8oLuo" + }, "source": [ - "Verify if model has been deployed, check `deployment_state` field, it should show: `DEPLOYED`" + "Verify if model has been deployed, check deployment_state field, it should show: DEPLOYED" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Ln-ptzgWoMbF" + }, "outputs": [], "source": [ - "model = client.get_model(model_name=model.name)\n", + "model = tables_client.get_model(model_name=model_name)\n", "model" ] }, @@ -796,7 +1059,7 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "0tymBrhLSnDX" + "id": "ZzVMSahBoVKb" }, "source": [ "Run the prediction, only after the model finishes deployment" @@ -806,23 +1069,43 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "LMYmHSiCE8om" + "id": "oHBcyEbZoj98" }, "source": [ - "### Make an Online prediction" + "## **Make an Online prediction**\n", + "\n", + "You can toggle exactly which values you want for all of the numeric features, and choose from the drop down windows which values you want for the categorical features.\n", + "\n", + "Note: If the model has not finished deployment, the prediction will NOT work. The following cells show you how to make an online prediction." ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": { - "colab_type": "text", - "id": "G2WVbMFll96k" + "colab": {}, + "colab_type": "code", + "id": "8tnRnMuHoWtl" }, + "outputs": [], "source": [ - "You can toggle exactly which values you want for all of the numeric features, and choose from the drop down windows which values you want for the categorical features.\n", - "\n", - "Note: If the model has not finished deployment, the prediction will NOT work.\n", - "The following cells show you how to make an online prediction. 
" + "workclass_ids = ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov',\n", + " 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']\n", + "education_ids = ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school',\n", + " 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters',\n", + " '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']\n", + "marital_status_ids = ['Married-civ-spouse', 'Divorced', 'Never-married',\n", + " 'Separated', 'Widowed', 'Married-spouse-absent', \n", + " 'Married-AF-spouse']\n", + "occupation_ids = ['Tech-support', 'Craft-repair', 'Other-service', 'Sales', \n", + " 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', \n", + " 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', \n", + " 'Transport-moving', 'Priv-house-serv', 'Protective-serv', \n", + " 'Armed-Forces']\n", + "relationship_ids = ['Wife', 'Own-child', 'Husband', 'Not-in-family', \n", + " 'Other-relative', 'Unmarried']\n", + "race_ids = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other',\n", + " 'Black']" ] }, { @@ -831,78 +1114,184 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "yt-KXEEQu3-U" + "id": "0UwPipiHKM5X" }, "outputs": [], "source": [ - "#@title Make an online prediction: set the categorical variables{ vertical-output: true }\n", - "from ipywidgets import interact\n", - "import ipywidgets as widgets\n", - "\n", - "workclass_ids = ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']\n", - "education_ids = ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']\n", - "marital_status_ids = ['Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse']\n", - "occupation_ids = ['Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces']\n", - "relationship_ids = ['Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried']\n", - "race_ids = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']\n", "sex_ids = ['Female', 'Male']\n", - "native_country_ids = ['United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands']\n", - "\n", + "native_country_ids = ['United-States', 'Cambodia', 'England', 'Puerto-Rico', \n", + " 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', \n", + " 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', \n", + " 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', \n", + " 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland',\n", + " 'France', 'Dominican-Republic', 'Laos', 'Ecuador',\n", + " 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', \n", + " 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', \n", + " 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', \n", + " 'Holand-Netherlands']\n" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Qe9pjRVfphNR" + }, + "outputs": [], + "source": [ + "# Create dropdown for workclass.\n", "workclass = widgets.Dropdown(\n", " options=workclass_ids, \n", " value=workclass_ids[0],\n", " description='workclass:'\n", - ")\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "-PVX23I0ppJ3" + }, + "outputs": [], + "source": [ + "# Create dropdown for education.\n", "education = widgets.Dropdown(\n", " options=education_ids, \n", " value=education_ids[0],\n", " description='education:', \n", " width='500px'\n", - ")\n", - " \n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "vf1NZ-cLptHJ" + }, + "outputs": [], + "source": [ + "# Create dropdown for marital status.\n", "marital_status = widgets.Dropdown(\n", " options=marital_status_ids, \n", " value=marital_status_ids[0],\n", " description='marital status:', \n", " width='500px'\n", - ")\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "TTMobEncpxK7" + }, + "outputs": [], + "source": [ + "# Create dropdown for occupation.\n", "occupation = widgets.Dropdown(\n", " options=occupation_ids, \n", " value=occupation_ids[0],\n", " description='occupation:', \n", " width='500px'\n", - ")\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ijUXgjmxp2eb" + }, + "outputs": [], + "source": [ + "# Create dropdown for relationship.\n", "relationship = widgets.Dropdown(\n", " options=relationship_ids, \n", " value=relationship_ids[0],\n", " description='relationship:', \n", " width='500px'\n", - ")\n", - "\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "uDg6wIr4p6n3" + }, + "outputs": [], + "source": [ + "# Create dropdown for race.\n", "race = widgets.Dropdown(\n", " options=race_ids, \n", " value=race_ids[0], \n", " description='race:', \n", " width='500px'\n", - ")\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "1iXcIMSsp-E6" + }, + "outputs": [], + "source": [ + "# Create dropdown for sex.\n", "sex = widgets.Dropdown(\n", " options=sex_ids, \n", " value=sex_ids[0],\n", " description='sex:', \n", " width='500px'\n", - ")\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "v6yQnqh3qC5N" + }, + "outputs": [], + "source": [ + "# Create dropdown for native country.\n", "native_country = widgets.Dropdown(\n", " options=native_country_ids, \n", " value=native_country_ids[0],\n", " description='native_country:', \n", " width='500px'\n", - ")\n", - "\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "wCKiMoVRqF4x" + }, + "outputs": [], + "source": [ "display(workclass)\n", "display(education)\n", "display(marital_status)\n", @@ -917,7 +1306,7 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "xGVGwgwXSZe_" + "id": "WyVybQzWqQvZ" }, "source": [ "Adjust the slides on the right to the desired test values for your 
online prediction." @@ -929,15 +1318,15 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "bDzd5GYQSdpa" + "id": "UXvN6l4bqLUu" }, "outputs": [], "source": [ "#@title Make an online prediction: set the numeric variables{ vertical-output: true }\n", "\n", - "age = 34 #@param {type:'slider', min:1, max:100, step:1}\n", + "age = 36 #@param {type:'slider', min:1, max:100, step:1}\n", "capital_gain = 40000 #@param {type:'slider', min:0, max:100000, step:10000}\n", - "capital_loss = 3.8 #@param {type:'slider', min:0, max:4000, step:0.1}\n", + "capital_loss = 559.5 #@param {type:'slider', min:0, max:4000, step:0.1}\n", "fnlwgt = 150000 #@param {type:'slider', min:0, max:1000000, step:50000}\n", "education_num = 9 #@param {type:'slider', min:1, max:16, step:1}\n", "hours_per_week = 40 #@param {type:'slider', min:1, max:100, step:1}" @@ -947,7 +1336,7 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "ZAGi8Co-SU-b" + "id": "wAuMAc-cqdKZ" }, "source": [ "Run the following cell, and then choose the desired test values for your online prediction." @@ -959,7 +1348,7 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "Kc4SKLLPSoKz" + "id": "GJxIJ4KUqeWV" }, "outputs": [], "source": [ @@ -978,40 +1367,33 @@ " 'capital_loss': capital_loss,\n", " 'hours_per_week': hours_per_week,\n", " 'native_country': native_country.value,\n", - "}\n", - "\n", - "prediction_result = client.predict(model=model, inputs=inputs)\n", - "prediction_result" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Get Prediction \n", - "\n", - "We extract the `google.cloud.automl_v1beta1.types.PredictResponse` object `prediction_result` and iterate to create a list of tuples with score and label, then we sort based on highest score and display it." + "}" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "rnbCFWCUqsLO" + }, "outputs": [], "source": [ - "predictions = [(prediction.tables.score, prediction.tables.value.string_value) for prediction in prediction_result.payload]\n", - "predictions = sorted(predictions, key=lambda tup: (tup[0],tup[1]), reverse=True)\n", - "print('Prediction is: ', predictions[0])" + "prediction_result = tables_client.predict(model=model, inputs=inputs)\n", + "prediction_result" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "O9CRdIfrS1nb" + "id": "XoL8HCRFq9D3" }, "source": [ - "Undeploy the model" + "**Get Prediction**\n", + "\n", + "We extract the `google.cloud.automl_v1beta1.types.PredictResponse` object `prediction_result` and iterate to create a list of tuples with score and label, then we sort based on highest score and display it." 
]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
-    "id": "DWa1idtOS0GE"
+    "id": "1A8Z5Rf6rGSn"
   },
   "outputs": [],
   "source": [
-    "undeploy_model_response = client.undeploy_model(model=model)"
+    "predictions = [(prediction.tables.score, prediction.tables.value.string_value) \n",
+    "               for prediction in prediction_result.payload]\n",
+    "predictions = sorted(\n",
+    "    predictions, key=lambda tup: (tup[0],tup[1]), reverse=True)\n",
+    "print('Prediction is: ', predictions[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "TarOq84-GXch"
+    "id": "zn6QGHIcrehh"
   },
   "source": [
-    "### Batch prediction"
+    "Undeploy the model"
   ]
  },
  {
-   "cell_type": "markdown",
+   "cell_type": "code",
+   "execution_count": null,
   "metadata": {
-    "colab_type": "text",
-    "id": "Soy5OB8Wbp_R"
+    "colab": {},
+    "colab_type": "code",
+    "id": "yWLMYtBzrf1S"
   },
+   "outputs": [],
   "source": [
-    "#### Initialize prediction"
+    "undeploy_model_response = tables_client.undeploy_model(model=model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
-    "id": "39bIGjIlau5a"
+    "id": "pKTxwtiZsL2G"
   },
   "source": [
-    "Your data source for batch prediction can be GCS or BigQuery. \n",
+    "## **Batch prediction**\n",
+    "\n",
+    "**Initialize prediction**\n",
+    "\n",
+    "Your data source for batch prediction can be GCS or BigQuery.\n",
    "\n",
-    "For this tutorial, you can use: \n",
+    "For this tutorial, you can use:\n",
    "\n",
-    "- [census_income_batch_prediction_input.csv](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv) as input source. \n",
+    "* [census_income_batch_prediction_input.csv](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv) as input source. \n",
    "\n",
-    "Create a GCS bucket and upload the file into your bucket. \n",
    "\n",
-    "Some of the lines in the batch prediction input file are intentionally left missing some values. \n",
-    "The AutoML Tables logs the errors in the `errors.csv` file.\n",
-    "Also, enter the UI and create the bucket into which you will load your predictions. \n",
+    "The cell below copies this file into your bucket automatically.\n",
+    "\n",
+    "Some of the lines in the batch prediction input file are intentionally missing some values. AutoML Tables logs the errors in the `errors.csv` file.\n",
    "\n",
-    "The bucket's default name here is `automl-tables-pred` to be replaced with your own.\n",
+    "The predictions are written under the bucket you created earlier (`BUCKET_NAME`), in the output folder set below.\n",
    "\n",
    "**NOTE:** The client library has a bug. If the following cell returns a:\n",
    "\n",
    "`TypeError: Could not convert Any to BatchPredictResult` error, ignore it.\n",
    "\n",
    "The batch prediction output file(s) will be uploaded to the GCS bucket that you set in the preceding cells."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "mSKRq1XFs-gb"
   },
   "outputs": [],
   "source": [
-    "GCS_BATCH_PREDICT_URI = '{}/census_income_batch_prediction_input.csv'.format(GCS_STORAGE_BUCKET)\n",
-    "GCS_BATCH_PREDICT_OUTPUT = '{}/census_income_predictions/'.format(GCS_STORAGE_BUCKET)\n",
-    "! 
{ "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "bF3NTUzjvrxU" + }, "source": [ "Launch the batch prediction" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "MtgmVjuovsoW" + }, "outputs": [], "source": [ - "batch_predict_response = client.batch_predict(\n", + "batch_predict_response = tables_client.batch_predict(\n", " model=model, \n", " gcs_input_uris=GCS_BATCH_PREDICT_URI,\n", " gcs_output_uri_prefix=GCS_BATCH_PREDICT_OUTPUT,\n", ")\n", - "print('Batch prediction operation: {}'.format(batch_predict_response.operation))\n", + "print('Batch prediction operation: {}'.format(\n", + " batch_predict_response.operation))\n", + "\n", "# Wait until batch prediction is done.\n", "batch_predict_result = batch_predict_response.result()\n", "batch_predict_response.metadata\n",
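+ "\n", + "# Quick check (sketch): list the files the batch prediction job wrote\n", + "# under the output prefix, including any errors.csv.\n", + "! gsutil ls -r $GCS_BATCH_PREDICT_OUTPUT"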
@@ -1112,21 +1521,58 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "pyvmqopCwMD3" + }, + "source": [ + "## **Cleaning up**\n", + "\n", + "To clean up all GCP resources used in this project, you can [delete the GCP\n", + "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "AiimlaBwwNCt" + }, + "outputs": [], + "source": [ + "# Delete model resource.\n", + "tables_client.delete_model(model_name=model_name)\n", + "\n", + "# Delete dataset resource.\n", + "tables_client.delete_dataset(dataset_name=dataset_name)\n", + "\n", + "# Delete Cloud Storage objects that were created.\n", + "! gsutil -m rm -r gs://$BUCKET_NAME\n", + " \n", + "# If the training operation is still running, cancel it.\n", + "automl_client.transport._operations_client.cancel_operation(operation_id) " ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "AzR0uVbY2BmQ" + }, "source": [ - "### Next steps \n", "\n", - "Please follow latest updates on AutoML [here](https://cloud.google.com/automl/docs/)\n", - "if you have any questions contact us at [cloud-automl-tables-discuss](https://groups.google.com/forum/#!forum/cloud-automl-tables-discuss)" + "## **Next steps**\n", + "Please follow the latest updates on AutoML [here](https://cloud.google.com/automl/docs/)." ] } ], "metadata": { "colab": { "collapsed_sections": [], - "name": "census_income_prediction.ipynb", - "provenance": [], - "version": "0.3.2" + "name": "getting_started_notebook.ipynb", + "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.5.3" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/tables/automl/notebooks/energy_price_forecasting/README.md b/tables/automl/notebooks/energy_price_forecasting/README.md deleted file mode 100644 index f9612854db2f..000000000000 --- a/tables/automl/notebooks/energy_price_forecasting/README.md +++ /dev/null @@ -1,112 +0,0 @@ ---------------------------------------- - -Copyright 2018 Google LLC - -Licensed under the Apache License, Version 2.0 (the "License"); -you may not use this file except in compliance with the License. -You may obtain a copy of the License at - -[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0) - -Unless required by applicable law or agreed to in writing, software -distributed under the License is distributed on an "AS IS" BASIS, -WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -See the License for the specific language governing permissions and limitations under the License. - ---------------------------------------- - -# 1. Introduction - -This guide provides a high-level overview of an energy price forecasting solution, reviewing the significance of the solution and which audiences and use cases it applies to. In this section, we outline the business case for this solution, the problem, the solution, and results. In section 2, we provide the code setup instructions. - -Solution description: Model to forecast hourly energy prices for the next 5 days. - -Significance: This is a good complement to standard demand forecasting models that typically predict N periods in the future. This model does a rolling forecast that is vital for operational decisions. It also takes into consideration historical trends, seasonal patterns, and external factors (like weather) to make more accurate forecasts. - - -## 1.1 Solution scenario - -### Challenge - -Many companies use forecasting models to predict prices, demand, sales, etc. Many of these forecasting problems have similar characteristics that can be leveraged to produce accurate predictions, like historical trends, seasonal patterns, and external factors. - -For example, think about an energy company that needs to accurately forecast the country’s hourly energy prices for the next 5 days (120 predictions) for optimal energy trading. - -At forecast time, they have access to historical energy prices as well as weather forecasts for the time period in question. - -In this particular scenario, an energy company actually hosted a competition ([http://complatt.smartwatt.net/](http://complatt.smartwatt.net/)) for developers to use the data sets to create a more accurate prediction model. - -### Solution - -We solved the energy pricing challenge by preparing a training dataset that encodes historical price trends, seasonal price patterns, and weather forecasts in a single table. We then used that table to train a deep neural network that can make accurate hourly predictions for the next 5 days.
- -## 1.2 Similar applications - -The solution we created for the competition can also be applied to many other forecasting problems. - -This type of solution includes any demand forecasting model that predicts N periods in the future and takes advantage of seasonal patterns, historical trends, and external datasets to produce accurate forecasts. - -Here are some additional demand forecasting examples: - -* Sales forecasting - -* Product or service usage forecasting - -* Traffic forecasting - - -# 2. Setting up the solution in a Google Cloud Platform project - -## 2.1 Create GCP project and download raw data - -Learn how to create a GCP project and prepare it for running the solution following these steps: - -1. Create a project in GCP ([article](https://cloud.google.com/resource-manager/docs/creating-managing-projects) on how to create and manage GCP projects). - -2. Raw data for this problem: - ->[MarketPricePT](http://complatt.smartwatt.net/assets/files/historicalRealData/RealMarketPriceDataPT.csv) - Historical hourly energy prices. ->![alt text](https://storage.googleapis.com/images_public/price_schema.png) ->![alt text](https://storage.googleapis.com/images_public/price_data.png) - ->[historical_weather](http://complatt.smartwatt.net/assets/files/weatherHistoricalData/WeatherHistoricalData.zip) - Historical hourly weather forecasts. ->![alt text](https://storage.googleapis.com/images_public/weather_schema.png) ->![alt text](https://storage.googleapis.com/images_public/loc_portugal.png) ->![alt text](https://storage.googleapis.com/images_public/weather_data.png) - -*Disclaimer: The data for both tables comes from [http://complatt.smartwatt.net/](http://complatt.smartwatt.net/). This website hosts a closed competition meant to solve the energy price forecasting problem. The data was not collected or vetted by Google LLC and hence we cannot guarantee its veracity or quality.* - - -## 2.2 Execute script for data preparation - -Prepare the data that is going to be used by the forecaster model by following these instructions: - -1. Clone the solution code from here: [https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-energy-price-forecasting](https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-energy-price-forecasting). In the solution code, navigate to the "data_preparation" folder. - -2. Run the script "data_preparation.data_prep" to generate training, validation, and testing data as well as the constant files needed for normalization. - -3. Export the training, validation, and testing tables as CSVs (into the Google Cloud Storage bucket gs://energyforecast/data/csv). - -4. Read the "README.md" file for more information. - -5. Understand which parameters can be passed to the script (to override defaults). - -Training data schema: -![alt text](https://storage.googleapis.com/images_public/training_schema.png) - -## 2.3 Execute notebook in this folder - -Train the forecasting model in AutoML Tables by running all cells in the notebook in this folder! - -## 2.4 AutoML Tables Results - -The following results are from our solution to this problem.
- -* MAE (Mean Absolute Error) = 0.0416 -* RMSE (Root Mean Squared Error) = 0.0524 - -![alt text](https://storage.googleapis.com/images_public/automl_test.png) - -Feature importance: -![alt text](https://storage.googleapis.com/images_public/feature_importance.png) - diff --git a/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting.ipynb b/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting.ipynb deleted file mode 100644 index 288daabc72fc..000000000000 --- a/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting.ipynb +++ /dev/null @@ -1,702 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "KOAz-lD1P7Kx" - }, - "source": [ - "----------------------------------------\n", - "\n", - "Copyright 2018 Google LLC \n", - "\n", - "Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "you may not use this file except in compliance with the License.\n", - "You may obtain a copy of the License at\n", - "\n", - "[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)\n", - "\n", - "Unless required by applicable law or agreed to in writing, software\n", - "distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "See the License for the specific language governing permissions and limitations under the License.\n", - "\n", - "----------------------------------------" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "m26YhtBMvVWA" - }, - "source": [ - "# Energy Forecasting with AutoML Tables\n", - "\n", - "To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.research.google.com/notebooks/welcome.ipynb) for more information on Colab.\n", - "\n", - "You can run a Colab notebook on a hosted runtime in the Cloud. The hosted VM times out after 90 minutes of inactivity and you will lose all the data stored in the memory including your authentication data. If your session gets disconnected (for example, because you closed your laptop) for less than the 90 minute inactivity timeout limit, press 'RECONNECT' on the top right corner of your notebook and resume the session. After Colab timeout, you'll need to\n", - "\n", - "1. Re-run the initialization and authentication.\n", - "2. Continue from where you left off. You may need to copy-paste the value of some variables such as the `dataset_name` from the printed output of the previous cells.\n", - "\n", - "Alternatively you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html)." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "b--5FDDwCG9C" - }, - "source": [ - "## 1. 
Project set up\n", - "\n", - "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "AZs0ICgy4jkQ" - }, - "source": [ - "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to\n", - "* Create a Google Cloud Platform (GCP) project.\n", - "* Enable billing.\n", - "* Apply to whitelist your project.\n", - "* Enable AutoML API.\n", - "* Enable AutoML Tables API.\n", - "* Create a service account, grant required permissions, and download the service account private key.\n", - "\n", - "You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source:\n", - "* Create a GCS bucket.\n", - "* Upload the training and batch prediction files.\n", - "\n", - "\n", - "**Warning:** Private keys must be kept secret. If you expose your private key, it is recommended to revoke it immediately from the Google Cloud Console." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "xZECt1oL429r" - }, - "source": [ - "\n", - "\n", - "---\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "rstRPH9SyZj_" - }, - "source": [ - "## 2. Initialize and authenticate\n", - "This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "BR0POq2UzE7e" - }, - "source": [ - "### Install the client library\n", - "Run the following cell to install the client library using `pip`." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 602, - "resources": { - "http://localhost:8080/nbextensions/google.colab/files.js": { - "data": 
"Ly8gQ29weXJpZ2h0IDIwMTcgR29vZ2xlIExMQwovLwovLyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKLy8geW91IG1heSBub3QgdXNlIHRoaXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNlLgovLyBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQKLy8KLy8gICAgICBodHRwOi8vd3d3LmFwYWNoZS5vcmcvbGljZW5zZXMvTElDRU5TRS0yLjAKLy8KLy8gVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQovLyBkaXN0cmlidXRlZCB1bmRlciB0aGUgTGljZW5zZSBpcyBkaXN0cmlidXRlZCBvbiBhbiAiQVMgSVMiIEJBU0lTLAovLyBXSVRIT1VUIFdBUlJBTlRJRVMgT1IgQ09ORElUSU9OUyBPRiBBTlkgS0lORCwgZWl0aGVyIGV4cHJlc3Mgb3IgaW1wbGllZC4KLy8gU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAovLyBsaW1pdGF0aW9ucyB1bmRlciB0aGUgTGljZW5zZS4KCi8qKgogKiBAZmlsZW92ZXJ2aWV3IEhlbHBlcnMgZm9yIGdvb2dsZS5jb2xhYiBQeXRob24gbW9kdWxlLgogKi8KKGZ1bmN0aW9uKHNjb3BlKSB7CmZ1bmN0aW9uIHNwYW4odGV4dCwgc3R5bGVBdHRyaWJ1dGVzID0ge30pIHsKICBjb25zdCBlbGVtZW50ID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnc3BhbicpOwogIGVsZW1lbnQudGV4dENvbnRlbnQgPSB0ZXh0OwogIGZvciAoY29uc3Qga2V5IG9mIE9iamVjdC5rZXlzKHN0eWxlQXR0cmlidXRlcykpIHsKICAgIGVsZW1lbnQuc3R5bGVba2V5XSA9IHN0eWxlQXR0cmlidXRlc1trZXldOwogIH0KICByZXR1cm4gZWxlbWVudDsKfQoKLy8gTWF4IG51bWJlciBvZiBieXRlcyB3aGljaCB3aWxsIGJlIHVwbG9hZGVkIGF0IGEgdGltZS4KY29uc3QgTUFYX1BBWUxPQURfU0laRSA9IDEwMCAqIDEwMjQ7Ci8vIE1heCBhbW91bnQgb2YgdGltZSB0byBibG9jayB3YWl0aW5nIGZvciB0aGUgdXNlci4KY29uc3QgRklMRV9DSEFOR0VfVElNRU9VVF9NUyA9IDMwICogMTAwMDsKCmZ1bmN0aW9uIF91cGxvYWRGaWxlcyhpbnB1dElkLCBvdXRwdXRJZCkgewogIGNvbnN0IHN0ZXBzID0gdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKTsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIC8vIENhY2hlIHN0ZXBzIG9uIHRoZSBvdXRwdXRFbGVtZW50IHRvIG1ha2UgaXQgYXZhaWxhYmxlIGZvciB0aGUgbmV4dCBjYWxsCiAgLy8gdG8gdXBsb2FkRmlsZXNDb250aW51ZSBmcm9tIFB5dGhvbi4KICBvdXRwdXRFbGVtZW50LnN0ZXBzID0gc3RlcHM7CgogIHJldHVybiBfdXBsb2FkRmlsZXNDb250aW51ZShvdXRwdXRJZCk7Cn0KCi8vIFRoaXMgaXMgcm91Z2hseSBhbiBhc3luYyBnZW5lcmF0b3IgKG5vdCBzdXBwb3J0ZWQgaW4gdGhlIGJyb3dzZXIgeWV0KSwKLy8gd2hlcmUgdGhlcmUgYXJlIG11bHRpcGxlIGFzeW5jaHJvbm91cyBzdGVwcyBhbmQgdGhlIFB5dGhvbiBzaWRlIGlzIGdvaW5nCi8vIHRvIHBvbGwgZm9yIGNvbXBsZXRpb24gb2YgZWFjaCBzdGVwLgovLyBUaGlzIHVzZXMgYSBQcm9taXNlIHRvIGJsb2NrIHRoZSBweXRob24gc2lkZSBvbiBjb21wbGV0aW9uIG9mIGVhY2ggc3RlcCwKLy8gdGhlbiBwYXNzZXMgdGhlIHJlc3VsdCBvZiB0aGUgcHJldmlvdXMgc3RlcCBhcyB0aGUgaW5wdXQgdG8gdGhlIG5leHQgc3RlcC4KZnVuY3Rpb24gX3VwbG9hZEZpbGVzQ29udGludWUob3V0cHV0SWQpIHsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIGNvbnN0IHN0ZXBzID0gb3V0cHV0RWxlbWVudC5zdGVwczsKCiAgY29uc3QgbmV4dCA9IHN0ZXBzLm5leHQob3V0cHV0RWxlbWVudC5sYXN0UHJvbWlzZVZhbHVlKTsKICByZXR1cm4gUHJvbWlzZS5yZXNvbHZlKG5leHQudmFsdWUucHJvbWlzZSkudGhlbigodmFsdWUpID0+IHsKICAgIC8vIENhY2hlIHRoZSBsYXN0IHByb21pc2UgdmFsdWUgdG8gbWFrZSBpdCBhdmFpbGFibGUgdG8gdGhlIG5leHQKICAgIC8vIHN0ZXAgb2YgdGhlIGdlbmVyYXRvci4KICAgIG91dHB1dEVsZW1lbnQubGFzdFByb21pc2VWYWx1ZSA9IHZhbHVlOwogICAgcmV0dXJuIG5leHQudmFsdWUucmVzcG9uc2U7CiAgfSk7Cn0KCi8qKgogKiBHZW5lcmF0b3IgZnVuY3Rpb24gd2hpY2ggaXMgY2FsbGVkIGJldHdlZW4gZWFjaCBhc3luYyBzdGVwIG9mIHRoZSB1cGxvYWQKICogcHJvY2Vzcy4KICogQHBhcmFtIHtzdHJpbmd9IGlucHV0SWQgRWxlbWVudCBJRCBvZiB0aGUgaW5wdXQgZmlsZSBwaWNrZXIgZWxlbWVudC4KICogQHBhcmFtIHtzdHJpbmd9IG91dHB1dElkIEVsZW1lbnQgSUQgb2YgdGhlIG91dHB1dCBkaXNwbGF5LgogKiBAcmV0dXJuIHshSXRlcmFibGU8IU9iamVjdD59IEl0ZXJhYmxlIG9mIG5leHQgc3RlcHMuCiAqLwpmdW5jdGlvbiogdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKSB7CiAgY29uc3QgaW5wdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoaW5wdXRJZCk7CiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gZm
Fsc2U7CgogIGNvbnN0IG91dHB1dEVsZW1lbnQgPSBkb2N1bWVudC5nZXRFbGVtZW50QnlJZChvdXRwdXRJZCk7CiAgb3V0cHV0RWxlbWVudC5pbm5lckhUTUwgPSAnJzsKCiAgY29uc3QgcGlja2VkUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBpbnB1dEVsZW1lbnQuYWRkRXZlbnRMaXN0ZW5lcignY2hhbmdlJywgKGUpID0+IHsKICAgICAgcmVzb2x2ZShlLnRhcmdldC5maWxlcyk7CiAgICB9KTsKICB9KTsKCiAgY29uc3QgY2FuY2VsID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnYnV0dG9uJyk7CiAgaW5wdXRFbGVtZW50LnBhcmVudEVsZW1lbnQuYXBwZW5kQ2hpbGQoY2FuY2VsKTsKICBjYW5jZWwudGV4dENvbnRlbnQgPSAnQ2FuY2VsIHVwbG9hZCc7CiAgY29uc3QgY2FuY2VsUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBjYW5jZWwub25jbGljayA9ICgpID0+IHsKICAgICAgcmVzb2x2ZShudWxsKTsKICAgIH07CiAgfSk7CgogIC8vIENhbmNlbCB1cGxvYWQgaWYgdXNlciBoYXNuJ3QgcGlja2VkIGFueXRoaW5nIGluIHRpbWVvdXQuCiAgY29uc3QgdGltZW91dFByb21pc2UgPSBuZXcgUHJvbWlzZSgocmVzb2x2ZSkgPT4gewogICAgc2V0VGltZW91dCgoKSA9PiB7CiAgICAgIHJlc29sdmUobnVsbCk7CiAgICB9LCBGSUxFX0NIQU5HRV9USU1FT1VUX01TKTsKICB9KTsKCiAgLy8gV2FpdCBmb3IgdGhlIHVzZXIgdG8gcGljayB0aGUgZmlsZXMuCiAgY29uc3QgZmlsZXMgPSB5aWVsZCB7CiAgICBwcm9taXNlOiBQcm9taXNlLnJhY2UoW3BpY2tlZFByb21pc2UsIHRpbWVvdXRQcm9taXNlLCBjYW5jZWxQcm9taXNlXSksCiAgICByZXNwb25zZTogewogICAgICBhY3Rpb246ICdzdGFydGluZycsCiAgICB9CiAgfTsKCiAgaWYgKCFmaWxlcykgewogICAgcmV0dXJuIHsKICAgICAgcmVzcG9uc2U6IHsKICAgICAgICBhY3Rpb246ICdjb21wbGV0ZScsCiAgICAgIH0KICAgIH07CiAgfQoKICBjYW5jZWwucmVtb3ZlKCk7CgogIC8vIERpc2FibGUgdGhlIGlucHV0IGVsZW1lbnQgc2luY2UgZnVydGhlciBwaWNrcyBhcmUgbm90IGFsbG93ZWQuCiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gdHJ1ZTsKCiAgZm9yIChjb25zdCBmaWxlIG9mIGZpbGVzKSB7CiAgICBjb25zdCBsaSA9IGRvY3VtZW50LmNyZWF0ZUVsZW1lbnQoJ2xpJyk7CiAgICBsaS5hcHBlbmQoc3BhbihmaWxlLm5hbWUsIHtmb250V2VpZ2h0OiAnYm9sZCd9KSk7CiAgICBsaS5hcHBlbmQoc3BhbigKICAgICAgICBgKCR7ZmlsZS50eXBlIHx8ICduL2EnfSkgLSAke2ZpbGUuc2l6ZX0gYnl0ZXMsIGAgKwogICAgICAgIGBsYXN0IG1vZGlmaWVkOiAkewogICAgICAgICAgICBmaWxlLmxhc3RNb2RpZmllZERhdGUgPyBmaWxlLmxhc3RNb2RpZmllZERhdGUudG9Mb2NhbGVEYXRlU3RyaW5nKCkgOgogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAnbi9hJ30gLSBgKSk7CiAgICBjb25zdCBwZXJjZW50ID0gc3BhbignMCUgZG9uZScpOwogICAgbGkuYXBwZW5kQ2hpbGQocGVyY2VudCk7CgogICAgb3V0cHV0RWxlbWVudC5hcHBlbmRDaGlsZChsaSk7CgogICAgY29uc3QgZmlsZURhdGFQcm9taXNlID0gbmV3IFByb21pc2UoKHJlc29sdmUpID0+IHsKICAgICAgY29uc3QgcmVhZGVyID0gbmV3IEZpbGVSZWFkZXIoKTsKICAgICAgcmVhZGVyLm9ubG9hZCA9IChlKSA9PiB7CiAgICAgICAgcmVzb2x2ZShlLnRhcmdldC5yZXN1bHQpOwogICAgICB9OwogICAgICByZWFkZXIucmVhZEFzQXJyYXlCdWZmZXIoZmlsZSk7CiAgICB9KTsKICAgIC8vIFdhaXQgZm9yIHRoZSBkYXRhIHRvIGJlIHJlYWR5LgogICAgbGV0IGZpbGVEYXRhID0geWllbGQgewogICAgICBwcm9taXNlOiBmaWxlRGF0YVByb21pc2UsCiAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgYWN0aW9uOiAnY29udGludWUnLAogICAgICB9CiAgICB9OwoKICAgIC8vIFVzZSBhIGNodW5rZWQgc2VuZGluZyB0byBhdm9pZCBtZXNzYWdlIHNpemUgbGltaXRzLiBTZWUgYi82MjExNTY2MC4KICAgIGxldCBwb3NpdGlvbiA9IDA7CiAgICB3aGlsZSAocG9zaXRpb24gPCBmaWxlRGF0YS5ieXRlTGVuZ3RoKSB7CiAgICAgIGNvbnN0IGxlbmd0aCA9IE1hdGgubWluKGZpbGVEYXRhLmJ5dGVMZW5ndGggLSBwb3NpdGlvbiwgTUFYX1BBWUxPQURfU0laRSk7CiAgICAgIGNvbnN0IGNodW5rID0gbmV3IFVpbnQ4QXJyYXkoZmlsZURhdGEsIHBvc2l0aW9uLCBsZW5ndGgpOwogICAgICBwb3NpdGlvbiArPSBsZW5ndGg7CgogICAgICBjb25zdCBiYXNlNjQgPSBidG9hKFN0cmluZy5mcm9tQ2hhckNvZGUuYXBwbHkobnVsbCwgY2h1bmspKTsKICAgICAgeWllbGQgewogICAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgICBhY3Rpb246ICdhcHBlbmQnLAogICAgICAgICAgZmlsZTogZmlsZS5uYW1lLAogICAgICAgICAgZGF0YTogYmFzZTY0LAogICAgICAgIH0sCiAgICAgIH07CiAgICAgIHBlcmNlbnQudGV4dENvbnRlbnQgPQogICAgICAgICAgYCR7TWF0aC5yb3VuZCgocG9zaXRpb24gLyBmaWxlRGF0YS5ieXRlTGVuZ3RoKSAqIDEwMCl9JSBkb25lYDsKICAgIH0KICB9CgogIC8vIEFsbCBkb25lLgogIHlpZWxkIHsKICAgIHJlc3BvbnNlOiB7CiAgICAgIGFjdGlvbjogJ2NvbXBsZXRlJywKICAgIH0KICB9Owp9CgpzY29wZ
S5nb29nbGUgPSBzY29wZS5nb29nbGUgfHwge307CnNjb3BlLmdvb2dsZS5jb2xhYiA9IHNjb3BlLmdvb2dsZS5jb2xhYiB8fCB7fTsKc2NvcGUuZ29vZ2xlLmNvbGFiLl9maWxlcyA9IHsKICBfdXBsb2FkRmlsZXMsCiAgX3VwbG9hZEZpbGVzQ29udGludWUsCn07Cn0pKHNlbGYpOwo=", - "headers": [ - [ - "content-type", - "application/javascript" - ] - ], - "ok": true, - "status": 200, - "status_text": "" - } - } - }, - "colab_type": "code", - "id": "43aXKjDRt_qZ", - "outputId": "4d3628f9-e5be-4145-f550-8eaffca97d37" - }, - "outputs": [], - "source": [ - "#@title Install AutoML Tables client library { vertical-output: true }\n", - "\n", - "!pip install google-cloud-automl" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "s3F2xbEJdDvN" - }, - "source": [ - "### Set Project and Location" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "0uX4aJYUiXh5" - }, - "source": [ - "Enter your GCP project ID." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 34 - }, - "colab_type": "code", - "id": "6R4h5HF1Dtds", - "outputId": "1e049b34-4683-4755-ab08-aec08de2bc66" - }, - "outputs": [], - "source": [ - "#@title GCP project ID and location\n", - "\n", - "project_id = 'energy-forecasting' #@param {type:'string'}\n", - "location = 'us-central1' #@param {type:'string'}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "eVFsPPEociwF" - }, - "source": [ - "### Authenticate using service account key\n", - "Run the following cell. Click on the 'Choose Files' button and select the service account private key file. If your Service Account key file or folder is hidden, you can reveal it in a Mac by pressing the Command + Shift + . combo." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 71, - "resources": { - "http://localhost:8080/nbextensions/google.colab/files.js": { - "data": "Ly8gQ29weXJpZ2h0IDIwMTcgR29vZ2xlIExMQwovLwovLyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKLy8geW91IG1heSBub3QgdXNlIHRoaXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNlLgovLyBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQKLy8KLy8gICAgICBodHRwOi8vd3d3LmFwYWNoZS5vcmcvbGljZW5zZXMvTElDRU5TRS0yLjAKLy8KLy8gVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQovLyBkaXN0cmlidXRlZCB1bmRlciB0aGUgTGljZW5zZSBpcyBkaXN0cmlidXRlZCBvbiBhbiAiQVMgSVMiIEJBU0lTLAovLyBXSVRIT1VUIFdBUlJBTlRJRVMgT1IgQ09ORElUSU9OUyBPRiBBTlkgS0lORCwgZWl0aGVyIGV4cHJlc3Mgb3IgaW1wbGllZC4KLy8gU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAovLyBsaW1pdGF0aW9ucyB1bmRlciB0aGUgTGljZW5zZS4KCi8qKgogKiBAZmlsZW92ZXJ2aWV3IEhlbHBlcnMgZm9yIGdvb2dsZS5jb2xhYiBQeXRob24gbW9kdWxlLgogKi8KKGZ1bmN0aW9uKHNjb3BlKSB7CmZ1bmN0aW9uIHNwYW4odGV4dCwgc3R5bGVBdHRyaWJ1dGVzID0ge30pIHsKICBjb25zdCBlbGVtZW50ID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnc3BhbicpOwogIGVsZW1lbnQudGV4dENvbnRlbnQgPSB0ZXh0OwogIGZvciAoY29uc3Qga2V5IG9mIE9iamVjdC5rZXlzKHN0eWxlQXR0cmlidXRlcykpIHsKICAgIGVsZW1lbnQuc3R5bGVba2V5XSA9IHN0eWxlQXR0cmlidXRlc1trZXldOwogIH0KICByZXR1cm4gZWxlbWVudDsKfQoKLy8gTWF4IG51bWJlciBvZiBieXRlcyB3aGljaCB3aWxsIGJlIHVwbG9hZGVkIGF0IGEgdGltZS4KY29uc3QgTUFYX1BBWUxPQURfU0laRSA9IDEwMCAqIDEwMjQ7Ci8vIE1heCBhbW91bnQgb2YgdGltZSB0byBibG9jayB3YWl0aW5nIGZvciB0aGUgdXNlci4KY29uc3QgRklMRV9DSEFOR0VfVElNRU9VVF9NUyA9IDMwICogMTAwMDsKCmZ1bmN0aW9uIF91cGxvYWRGaWxlcyhpbnB1dElkLCBvdXRwdXRJZCkgewogIGNvbnN0IHN0ZXBzID0gdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKTsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIC8vIENhY2hlIHN0ZXBzIG9uIHRoZSBvdXRwdXRFbGVtZW50IHRvIG1ha2UgaXQgYXZhaWxhYmxlIGZvciB0aGUgbmV4dCBjYWxsCiAgLy8gdG8gdXBsb2FkRmlsZXNDb250aW51ZSBmcm9tIFB5dGhvbi4KICBvdXRwdXRFbGVtZW50LnN0ZXBzID0gc3RlcHM7CgogIHJldHVybiBfdXBsb2FkRmlsZXNDb250aW51ZShvdXRwdXRJZCk7Cn0KCi8vIFRoaXMgaXMgcm91Z2hseSBhbiBhc3luYyBnZW5lcmF0b3IgKG5vdCBzdXBwb3J0ZWQgaW4gdGhlIGJyb3dzZXIgeWV0KSwKLy8gd2hlcmUgdGhlcmUgYXJlIG11bHRpcGxlIGFzeW5jaHJvbm91cyBzdGVwcyBhbmQgdGhlIFB5dGhvbiBzaWRlIGlzIGdvaW5nCi8vIHRvIHBvbGwgZm9yIGNvbXBsZXRpb24gb2YgZWFjaCBzdGVwLgovLyBUaGlzIHVzZXMgYSBQcm9taXNlIHRvIGJsb2NrIHRoZSBweXRob24gc2lkZSBvbiBjb21wbGV0aW9uIG9mIGVhY2ggc3RlcCwKLy8gdGhlbiBwYXNzZXMgdGhlIHJlc3VsdCBvZiB0aGUgcHJldmlvdXMgc3RlcCBhcyB0aGUgaW5wdXQgdG8gdGhlIG5leHQgc3RlcC4KZnVuY3Rpb24gX3VwbG9hZEZpbGVzQ29udGludWUob3V0cHV0SWQpIHsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIGNvbnN0IHN0ZXBzID0gb3V0cHV0RWxlbWVudC5zdGVwczsKCiAgY29uc3QgbmV4dCA9IHN0ZXBzLm5leHQob3V0cHV0RWxlbWVudC5sYXN0UHJvbWlzZVZhbHVlKTsKICByZXR1cm4gUHJvbWlzZS5yZXNvbHZlKG5leHQudmFsdWUucHJvbWlzZSkudGhlbigodmFsdWUpID0+IHsKICAgIC8vIENhY2hlIHRoZSBsYXN0IHByb21pc2UgdmFsdWUgdG8gbWFrZSBpdCBhdmFpbGFibGUgdG8gdGhlIG5leHQKICAgIC8vIHN0ZXAgb2YgdGhlIGdlbmVyYXRvci4KICAgIG91dHB1dEVsZW1lbnQubGFzdFByb21pc2VWYWx1ZSA9IHZhbHVlOwogICAgcmV0dXJuIG5leHQudmFsdWUucmVzcG9uc2U7CiAgfSk7Cn0KCi8qKgogKiBHZW5lcmF0b3IgZnVuY3Rpb24gd2hpY2ggaXMgY2FsbGVkIGJldHdlZW4gZWFjaCBhc3luYyBzdGVwIG9mIHRoZSB1cGxvYWQKICogcHJvY2Vzcy4KICogQHBhcmFtIHtzdHJpbmd9IGlucHV0SWQgRWxlbWVudCBJRCBvZiB0aGUgaW5wdXQgZmlsZSBwaWNrZXIgZWxlbWVudC4KICogQHBhcmFtIHtzdHJpbmd9IG91dHB1dElkIEVsZW1lbnQgSUQgb2YgdGhlIG91dHB1dCBkaXNwbGF5LgogKiBAcmV0dXJuIH
shSXRlcmFibGU8IU9iamVjdD59IEl0ZXJhYmxlIG9mIG5leHQgc3RlcHMuCiAqLwpmdW5jdGlvbiogdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKSB7CiAgY29uc3QgaW5wdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoaW5wdXRJZCk7CiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gZmFsc2U7CgogIGNvbnN0IG91dHB1dEVsZW1lbnQgPSBkb2N1bWVudC5nZXRFbGVtZW50QnlJZChvdXRwdXRJZCk7CiAgb3V0cHV0RWxlbWVudC5pbm5lckhUTUwgPSAnJzsKCiAgY29uc3QgcGlja2VkUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBpbnB1dEVsZW1lbnQuYWRkRXZlbnRMaXN0ZW5lcignY2hhbmdlJywgKGUpID0+IHsKICAgICAgcmVzb2x2ZShlLnRhcmdldC5maWxlcyk7CiAgICB9KTsKICB9KTsKCiAgY29uc3QgY2FuY2VsID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnYnV0dG9uJyk7CiAgaW5wdXRFbGVtZW50LnBhcmVudEVsZW1lbnQuYXBwZW5kQ2hpbGQoY2FuY2VsKTsKICBjYW5jZWwudGV4dENvbnRlbnQgPSAnQ2FuY2VsIHVwbG9hZCc7CiAgY29uc3QgY2FuY2VsUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBjYW5jZWwub25jbGljayA9ICgpID0+IHsKICAgICAgcmVzb2x2ZShudWxsKTsKICAgIH07CiAgfSk7CgogIC8vIENhbmNlbCB1cGxvYWQgaWYgdXNlciBoYXNuJ3QgcGlja2VkIGFueXRoaW5nIGluIHRpbWVvdXQuCiAgY29uc3QgdGltZW91dFByb21pc2UgPSBuZXcgUHJvbWlzZSgocmVzb2x2ZSkgPT4gewogICAgc2V0VGltZW91dCgoKSA9PiB7CiAgICAgIHJlc29sdmUobnVsbCk7CiAgICB9LCBGSUxFX0NIQU5HRV9USU1FT1VUX01TKTsKICB9KTsKCiAgLy8gV2FpdCBmb3IgdGhlIHVzZXIgdG8gcGljayB0aGUgZmlsZXMuCiAgY29uc3QgZmlsZXMgPSB5aWVsZCB7CiAgICBwcm9taXNlOiBQcm9taXNlLnJhY2UoW3BpY2tlZFByb21pc2UsIHRpbWVvdXRQcm9taXNlLCBjYW5jZWxQcm9taXNlXSksCiAgICByZXNwb25zZTogewogICAgICBhY3Rpb246ICdzdGFydGluZycsCiAgICB9CiAgfTsKCiAgaWYgKCFmaWxlcykgewogICAgcmV0dXJuIHsKICAgICAgcmVzcG9uc2U6IHsKICAgICAgICBhY3Rpb246ICdjb21wbGV0ZScsCiAgICAgIH0KICAgIH07CiAgfQoKICBjYW5jZWwucmVtb3ZlKCk7CgogIC8vIERpc2FibGUgdGhlIGlucHV0IGVsZW1lbnQgc2luY2UgZnVydGhlciBwaWNrcyBhcmUgbm90IGFsbG93ZWQuCiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gdHJ1ZTsKCiAgZm9yIChjb25zdCBmaWxlIG9mIGZpbGVzKSB7CiAgICBjb25zdCBsaSA9IGRvY3VtZW50LmNyZWF0ZUVsZW1lbnQoJ2xpJyk7CiAgICBsaS5hcHBlbmQoc3BhbihmaWxlLm5hbWUsIHtmb250V2VpZ2h0OiAnYm9sZCd9KSk7CiAgICBsaS5hcHBlbmQoc3BhbigKICAgICAgICBgKCR7ZmlsZS50eXBlIHx8ICduL2EnfSkgLSAke2ZpbGUuc2l6ZX0gYnl0ZXMsIGAgKwogICAgICAgIGBsYXN0IG1vZGlmaWVkOiAkewogICAgICAgICAgICBmaWxlLmxhc3RNb2RpZmllZERhdGUgPyBmaWxlLmxhc3RNb2RpZmllZERhdGUudG9Mb2NhbGVEYXRlU3RyaW5nKCkgOgogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAnbi9hJ30gLSBgKSk7CiAgICBjb25zdCBwZXJjZW50ID0gc3BhbignMCUgZG9uZScpOwogICAgbGkuYXBwZW5kQ2hpbGQocGVyY2VudCk7CgogICAgb3V0cHV0RWxlbWVudC5hcHBlbmRDaGlsZChsaSk7CgogICAgY29uc3QgZmlsZURhdGFQcm9taXNlID0gbmV3IFByb21pc2UoKHJlc29sdmUpID0+IHsKICAgICAgY29uc3QgcmVhZGVyID0gbmV3IEZpbGVSZWFkZXIoKTsKICAgICAgcmVhZGVyLm9ubG9hZCA9IChlKSA9PiB7CiAgICAgICAgcmVzb2x2ZShlLnRhcmdldC5yZXN1bHQpOwogICAgICB9OwogICAgICByZWFkZXIucmVhZEFzQXJyYXlCdWZmZXIoZmlsZSk7CiAgICB9KTsKICAgIC8vIFdhaXQgZm9yIHRoZSBkYXRhIHRvIGJlIHJlYWR5LgogICAgbGV0IGZpbGVEYXRhID0geWllbGQgewogICAgICBwcm9taXNlOiBmaWxlRGF0YVByb21pc2UsCiAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgYWN0aW9uOiAnY29udGludWUnLAogICAgICB9CiAgICB9OwoKICAgIC8vIFVzZSBhIGNodW5rZWQgc2VuZGluZyB0byBhdm9pZCBtZXNzYWdlIHNpemUgbGltaXRzLiBTZWUgYi82MjExNTY2MC4KICAgIGxldCBwb3NpdGlvbiA9IDA7CiAgICB3aGlsZSAocG9zaXRpb24gPCBmaWxlRGF0YS5ieXRlTGVuZ3RoKSB7CiAgICAgIGNvbnN0IGxlbmd0aCA9IE1hdGgubWluKGZpbGVEYXRhLmJ5dGVMZW5ndGggLSBwb3NpdGlvbiwgTUFYX1BBWUxPQURfU0laRSk7CiAgICAgIGNvbnN0IGNodW5rID0gbmV3IFVpbnQ4QXJyYXkoZmlsZURhdGEsIHBvc2l0aW9uLCBsZW5ndGgpOwogICAgICBwb3NpdGlvbiArPSBsZW5ndGg7CgogICAgICBjb25zdCBiYXNlNjQgPSBidG9hKFN0cmluZy5mcm9tQ2hhckNvZGUuYXBwbHkobnVsbCwgY2h1bmspKTsKICAgICAgeWllbGQgewogICAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgICBhY3Rpb246ICdhcHBlbmQnLAogICAgICAgICAgZmlsZTogZmlsZS5uYW1lLAogICAgICAgICAgZGF0YTogYmFzZTY0LAogICAgICAgIH0sCiAgICAgIH07CiAgICAgIHBlcmNlbnQudGV4d
ENvbnRlbnQgPQogICAgICAgICAgYCR7TWF0aC5yb3VuZCgocG9zaXRpb24gLyBmaWxlRGF0YS5ieXRlTGVuZ3RoKSAqIDEwMCl9JSBkb25lYDsKICAgIH0KICB9CgogIC8vIEFsbCBkb25lLgogIHlpZWxkIHsKICAgIHJlc3BvbnNlOiB7CiAgICAgIGFjdGlvbjogJ2NvbXBsZXRlJywKICAgIH0KICB9Owp9CgpzY29wZS5nb29nbGUgPSBzY29wZS5nb29nbGUgfHwge307CnNjb3BlLmdvb2dsZS5jb2xhYiA9IHNjb3BlLmdvb2dsZS5jb2xhYiB8fCB7fTsKc2NvcGUuZ29vZ2xlLmNvbGFiLl9maWxlcyA9IHsKICBfdXBsb2FkRmlsZXMsCiAgX3VwbG9hZEZpbGVzQ29udGludWUsCn07Cn0pKHNlbGYpOwo=", - "headers": [ - [ - "content-type", - "application/javascript" - ] - ], - "ok": true, - "status": 200, - "status_text": "" - } - } - }, - "colab_type": "code", - "id": "u-kCqysAuaJk", - "outputId": "06154a63-f410-435f-b565-cd1599243b88" - }, - "outputs": [], - "source": [ - "#@title Authenticate using service account key and create a client. { vertical-output: true }\n", - "\n", - "from google.oauth2 import service_account\n", - "from google.colab import files\n", - "from google.cloud import automl_v1beta1\n", - "\n", - "# Upload service account key\n", - "keyfile_upload = files.upload()\n", - "keyfile_name = list(keyfile_upload.keys())[0]\n", - "# Authenticate and create an AutoML client.\n", - "credentials = service_account.Credentials.from_service_account_file(keyfile_name)\n", - "client = automl.TablesClient(project=project_id, region=location, credentials=credentials)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "qozQWMnOu48y" - }, - "source": [ - "\n", - "\n", - "---\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "ODt86YuVDZzm" - }, - "source": [ - "## 3. Import training data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "XwjZc9Q62Fm5" - }, - "source": [ - "### Create dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "_JfZFGSceyE_" - }, - "source": [ - "Select a dataset display name and pass your table source information to create a new dataset." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 224 - }, - "colab_type": "code", - "id": "Z_JErW3cw-0J", - "outputId": "7fe366df-73ae-4ab1-ceaa-fd6ced4ccdd9" - }, - "outputs": [], - "source": [ - "#@title Create dataset { vertical-output: true, output-height: 200 }\n", - "\n", - "dataset_display_name = 'energy_forcasting_solution' \n", - "dataset = client.create_dataset(dataset_display_name)\n", - "print(dataset.name) # unique to this dataset\n", - "dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "35YZ9dy34VqJ" - }, - "source": [ - "### Import data" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "3c0o15gVREAw" - }, - "source": [ - "You can import your data to AutoML Tables from GCS or BigQuery. You can create a GCS bucket and upload the data into your bucket. The URI for your file is `gs://BUCKET_NAME/FOLDER_NAME1/FOLDER_NAME2/.../FILE_NAME`. Alternatively you can create a BigQuery table and upload the data into the table. The URI for your table is `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", - "\n", - "Importing data may take a few minutes or hours depending on the size of your data. __If your Colab times out__, run the following command to retrieve your dataset. 
Replace `dataset_name` with its actual value obtained in the preceding cells.\n", - "\n", - "```python\n", - " # This will work if your display name ('energy_forecasting_solution') is unique to your project.\n", - " dataset = client.get_dataset(dataset_display_name=dataset_display_name)\n", - " # OR, if you have multiple datasets with the same display name ('energy_forecasting_solution'), use the\n", - " # unique identifier acquired from the above cell ( print(dataset.name) ).\n", - " dataset = client.get_dataset(dataset_name=dataset_name)\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 139 - }, - "colab_type": "code", - "id": "FNVYfpoXJsNB", - "outputId": "0ecc8d11-5bf1-4c2e-f688-b6d9be934e3c" - }, - "outputs": [], - "source": [ - "#@title Import data { vertical-output: true }\n", - "\n", - "dataset_bq_input_uri = 'bq://energy-forecasting.Energy.automldata' #@param {type: 'string'}\n", - "\n", - "import_data_response = client.import_data(\n", - " dataset=dataset,\n", - " bigquery_input_uri=dataset_bq_input_uri\n", - ")\n", - "\n", - "print('Dataset import operation: {}'.format(import_data_response.operation))\n", - "\n", - "# Wait until import is done.\n", - "import_data_result = import_data_response.result()\n", - "import_data_result" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "QdxBI4s44ZRI" - }, - "source": [ - "### Review the specs" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "RC0PWKqH4jwr" - }, - "source": [ - "Run the following command to see table specs such as row count." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 3247 - }, - "colab_type": "code", - "id": "v2Vzq_gwXxo-", - "outputId": "c89cd7b1-4344-46d9-c4a3-1b012b5b720d" - }, - "outputs": [], - "source": [ - "#@title Table schema { vertical-output: true }\n", - "\n", - "# List table specs\n", - "list_table_specs_response = client.list_table_specs(dataset=dataset)\n", - "table_specs = [s for s in list_table_specs_response]\n", - "\n", - "# List column specs\n", - "list_column_specs_response = client.list_column_specs(dataset=dataset)\n", - "column_specs = {s.display_name: s for s in list_column_specs_response}\n", - "\n", - "# Print Features and data_type:\n", - "\n", - "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) for key, value in column_specs.items()]\n", - "print('Feature list:\\n')\n", - "for feature in features:\n", - " print(feature[0],':', feature[1])" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "FNykW_YOYt6d" - }, - "source": [ - "___" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "kNRVJqVOL8h3" - }, - "source": [ - "## 4. Update dataset: assign a label column and enable nullable columns" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "-57gehId9PQ5" - }, - "source": [ - "AutoML Tables automatically detects your data column type. For example, for the [Iris dataset](https://storage.cloud.google.com/rostam-193618-tutorial/automl-tables-v1beta1/iris.csv) it detects `species` to be categorical and `petal_length`, `petal_width`, `sepal_length`, and `sepal_width` to be numerical. 
Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "iRqdQ7Xiq04x" - }, - "source": [ - "### Update a column: set as categorical" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 34 - }, - "colab_type": "code", - "id": "OCEUIPKegWrf", - "outputId": "44370b2c-f3dc-46bc-cefd-8a6f29f9cabe" - }, - "outputs": [], - "source": [ - "#@title Update dataset { vertical-output: true }\n", - "\n", - "column_to_category = 'hour' #@param {type: 'string'}\n", - "\n", - "update_column_response = client.update_column_spec(\n", - " dataset=dataset,\n", - " column_spec_display_name=column_to_category,\n", - " type_code='CATEGORY'\n", - ")\n", - "\n", - "update_column_response.display_name, update_column_response.data_type " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "nDMH_chybe4w" - }, - "source": [ - "### Update dataset: assign a target and split column" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 360 - }, - "colab_type": "code", - "id": "hVIruWg0u33t", - "outputId": "eeb5f733-16ec-4191-ea59-c2fab30c8442" - }, - "outputs": [], - "source": [ - "#@title Update dataset { vertical-output: true }\n", - "\n", - "target_column_name = 'price' #@param {type: 'string'}\n", - "split_column_name = 'split' #@param {type: 'string'}\n", - "\n", - "client.set_target_column(\n", - " dataset=dataset,\n", - " column_spec_display_name=target_column_name,\n", - ")\n", - "\n", - "client.set_test_train_column(\n", - " dataset=dataset,\n", - " column_spec_display_name=split_column_name,\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "z23NITLrcxmi" - }, - "source": [ - "___" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "FcKgvj1-Tbgj" - }, - "source": [ - "## 5. Creating a model" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "Pnlk8vdQlO_k" - }, - "source": [ - "### Train a model\n", - "\n", - "Specify the duration of the training. For example, `train_budget_milli_node_hours=1000` runs the training for one hour. \n", - "\n", - "If your Colab times out, use `client.list_models()` to check whether your model has been created. Then use model name to continue to the next steps. 
Run the following command to retrieve your model.\n", - "\n", - "```python\n", - " model = client.get_model(model_display_name=model_display_name) \n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 139 - }, - "colab_type": "code", - "id": "11izNd6Fu37N", - "outputId": "1bca25aa-eb19-4b27-a3fa-7ef137aaf4e2" - }, - "outputs": [], - "source": [ - "#@title Create model { vertical-output: true }\n", - "\n", - "model_display_name = 'energy_model' #@param {type:'string'}\n", - "model_train_hours = 12 #@param {type:'integer'}\n", - "model_optimization_objective = 'MINIMIZE_MAE' #@param {type:'string'}\n", - "\n", - "create_model_response = client.create_model(\n", - " model_display_name,\n", - " dataset=dataset,\n", - " optimization_objective=model_optimization_objective,\n", - " train_budget_milli_node_hours=model_train_hours * 1000,\n", - ")\n", - "\n", - "print('Create model operation: {}'.format(create_model_response.operation))\n", - "# Wait until model training is done.\n", - "model = create_model_response.result()\n", - "model_name = model.name\n", - "model" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 85 - }, - "colab_type": "code", - "id": "puVew1GgPfQa", - "outputId": "42b9296c-d231-4787-f7fb-4aa1a6ff9bd9" - }, - "outputs": [], - "source": [ - "#@title Model Metrics {vertical-output: true }\n", - "\n", - "metrics = [x for x in client.list_model_evaluations(model=model)][-1]\n", - "metrics.regression_evaluation_metrics" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "YQnfEwyrSt2T" - }, - "source": [ - "![alt text](https://storage.googleapis.com/images_public/automl_test.png)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 272 - }, - "colab_type": "code", - "id": "Vyc8ckbpRMHp", - "outputId": "931d4921-2144-4092-dab6-165c1b1c2a88" - }, - "outputs": [], - "source": [ - "#@title Feature Importance {vertical-output: true }\n", - "\n", - "feat_list = [(x.feature_importance, x.column_display_name) for x in model.tables_model_metadata.tables_model_column_info]\n", - "feat_list.sort(reverse=True)\n", - "feat_list[:15]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "__2gDQ5I5gcj" - }, - "source": [ - "![alt text](https://storage.googleapis.com/images_public/feature_importance.png)\n", - "![alt text](https://storage.googleapis.com/images_public/loc_portugal.png)\n", - "![alt text](https://storage.googleapis.com/images_public/weather_schema.png)\n", - "![alt text](https://storage.googleapis.com/images_public/training_schema.png)" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "1wS1is9IY5nK" - }, - "source": [ - "___" - ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "name": "Energy_Price_Forecasting.ipynb", - "provenance": [], - "version": "0.3.2" - }, - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.7" - } - }, - "nbformat": 4, - "nbformat_minor": 1 -} diff --git 
a/tables/automl/notebooks/music_recommendation/music_recommendation.ipynb b/tables/automl/notebooks/music_recommendation/music_recommendation.ipynb index f70b33301a73..f28d352794e7 100644 --- a/tables/automl/notebooks/music_recommendation/music_recommendation.ipynb +++ b/tables/automl/notebooks/music_recommendation/music_recommendation.ipynb @@ -2,8 +2,12 @@ "cells": [ { "cell_type": "code", - "execution_count": 2, - "metadata": {}, + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "f1DklZE5h0CE" + }, "outputs": [], "source": [ "# Copyright 2019 Google LLC\n", @@ -23,186 +27,598 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "wbMLRkw5jn8U" + }, "source": [ - "# Music Recommendation using AutoML Tables\n", + "# **Music Recommendation using AutoML Tables**\n", "\n", - "## Overview\n", - "In this notebook we will see how [AutoML Tables](https://cloud.google.com/automl-tables/) can be used to make music recommendations to users. AutoML Tables is a supervised learning service for structured data that can vastly simplify the model building process.\n", + "\n", + " \n", + " \n", + "
\n", + " \n", + " \"Colab Run in Colab\n", + " \n", + " \n", + " \n", + " \"GitHub\n", + " View on GitHub\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "m05x0iy4jqDY" + }, + "source": [ + "## **Overview**\n", "\n", - "### Dataset\n", + "In this notebook we will see how [AutoML Tables](https://cloud.google.com/automl-tables/) can be used to make music recommendations to users. AutoML Tables is a supervised learning service for structured data that can vastly simplify the model building process.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "y7P5m2M-A1yJ" + }, + "source": [ + "### **Dataset**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "5SYExJ4XAsRA" + }, + "source": [ "AutoML Tables allows data to be imported from either GCS or BigQuery. This tutorial uses the [ListenBrainz](https://console.cloud.google.com/marketplace/details/metabrainz/listenbrainz) dataset from [Cloud Marketplace](https://console.cloud.google.com/marketplace), hosted in BigQuery.\n", "\n", "The ListenBrainz dataset is a log of songs played by users, some notable pieces of the schema include:\n", - " - **user_name:** a user id.\n", - " - **track_name:** a song id.\n", - " - **artist_name:** the artist of the song.\n", - " - **release_name:** the album of the song.\n", - " - **tags:** the genres of the song.\n", "\n", - "### Objective\n", - "The goal of this notebook is to demonstrate how to create a lookup table in BigQuery of songs to recommend to users using a log of user-song listens and AutoML Tables. This will be done by training a binary classification model to predict whether or not a `user` will like a given `song`. In the training data, liking a song was defined as having listened to a song more than twice. **Using the predictions for every `(user, song)` pair to generate a ranking of the most similar songs for each user.**\n", + "##### **Data Schema**\n", "\n", - "As the number of `(user, song)` pairs grows exponentially with the number of unique users and songs, this approach may not be optimal for extremely large datasets. One workaround would be to train a model that learns to embed users and songs in the same embedding space, and use a nearest-neighbors algorithm to get recommendations for users. Unfortunately, AutoML Tables does not expose any feature for training and using embeddings, so a [custom ML model](https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-collaborative-filtering) would need to be used instead.\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "onpOnZOjBBIT" + }, + "source": [ + "### **Objective**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "NdRE10JUAjvX" + }, + "source": [ + "The goal of this notebook is to demonstrate how to create a lookup table in BigQuery of songs to recommend to users using a log of user-song listens and AutoML Tables. This will be done by training a binary classification model to predict whether or not a user will like a given song. In the training data, liking a song was defined as having listened to a song more than twice. 
**The predictions for every `(user, song)` pair are then used to generate a ranking of the most similar songs for each user.**\n", "\n", - "As the number of `(user, song)` pairs grows exponentially with the number of unique users and songs, this approach may not be optimal for extremely large datasets. One workaround would be to train a model that learns to embed users and songs in the same embedding space, and use a nearest-neighbors algorithm to get recommendations for users. Unfortunately, AutoML Tables does not expose any feature for training and using embeddings, so a [custom ML model](https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-collaborative-filtering) would need to be used instead.\n", "\n", - "Another recommendation approach that is worth mentioning is [using extreme multiclass classification](https://ai.google/research/pubs/pub45530), as that also circumvents storing every possible pair of users and songs. Unfortunately, AutoML Tables does not support the multiclass classification of more than [100 classes](https://cloud.google.com/automl-tables/docs/prepare#target-requirements).\n", "\n", - "### Costs\n", + "As the number of `(user, song)` pairs grows with the product of the number of unique users and songs, this approach may not be optimal for extremely large datasets. 
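For instance, 10,000 unique users and 100,000 unique songs would already yield one billion candidate pairs to score (illustrative numbers). 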
One workaround would be to train a model that learns to embed users and songs in the same embedding space, and use a nearest-neighbors algorithm to get recommendations for users. Unfortunately, AutoML Tables does not expose any feature for training and using embeddings, so a [custom ML model](https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-collaborative-filtering) would need to be used instead.\n", + "\n", + "Another recommendation approach that is worth mentioning is [using extreme multiclass classification](https://ai.google/research/pubs/pub45530), as that also circumvents storing every possible pair of users and songs. Unfortunately, AutoML Tables does not support the multiclass classification of more than [100 classes](https://cloud.google.com/automl-tables/docs/prepare#target-requirements)." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "w4YELJp6O_xw" }, "source": [ "### **Costs**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "74OP8KFwO_gs" }, "source": [ "This tutorial uses billable components of Google Cloud Platform (GCP):\n", - "- Cloud AutoML Tables\n", "\n", - "Learn about [AutoML Tables pricing](https://cloud.google.com/automl-tables/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." + "* Cloud AI Platform\n", + "* BigQuery\n", + "* AutoML Tables\n", + "\n", + "Learn about [Cloud AI Platform pricing](https://cloud.google.com/ml-engine/docs/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [AutoML Tables pricing](https://cloud.google.com/automl-tables/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "9kJu-Wz6OI2W" }, "source": [ "## **Set up your local development environment**\n", "\n", "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", "all the requirements to run this notebook. If you are using **AI Platform Notebooks**, make sure the machine configuration type is **1 vCPU, 3.75 GB RAM** or above. You can skip this step." ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "FdpoSWy_OMm-" + }, "source": [ - "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to\n", - "* [Enable billing](https://cloud.google.com/billing/docs/how-to/modify-project).\n", - "* [Enable AutoML API](https://console.cloud.google.com/apis/library/automl.googleapis.com?q=automl)" + "**Otherwise**, make sure your environment meets this notebook's requirements.\n", + "You need the following:\n", + "\n", + "* The Google Cloud SDK\n", + "* Git\n", + "* Python 3\n", + "* virtualenv\n", + "* Jupyter notebook running in a virtual environment with Python 3\n", + "\n", + "The Google Cloud guide to [Setting up a Python development\n", + "environment](https://cloud.google.com/python/setup) and the [Jupyter\n", + "installation guide](https://jupyter.org/install) provide detailed instructions\n", + "for meeting these requirements. The following steps provide a condensed set of\n", + "instructions:\n", + "\n", + "1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)\n", + "\n", + "2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)\n", + "\n", + "3. [Install\n", + " virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)\n", + " and create a virtual environment that uses Python 3.\n", + "\n", + "4. Activate that environment and run `pip install jupyter` in a shell to install\n", + " Jupyter.\n", + "\n", + "5. Run `jupyter notebook` in a shell to launch Jupyter.\n", + "\n", + "6. Open this notebook in the Jupyter Notebook Dashboard." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "dgpdHTag9aUC" }, "source": [ "## **Set up your GCP project**\n", "\n", "**The following steps are required, regardless of your notebook environment.**\n", "\n", "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", "\n", "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", "\n", "3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n", "\n", "4. [Enable AutoML API.](https://console.cloud.google.com/apis/library/automl.googleapis.com?q=automl)\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "Plf6qTgXyYSx" }, "source": [ - "### 1.1 PIP Install Packages and dependencies\n", + "## **PIP Install Packages and dependencies**\n", + "\n", "Install additional dependencies not installed in the notebook environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "gt5peqa-h9MO" }, "outputs": [], "source": [ + "# Use the latest major GA version of the framework.\n", + "! pip install --upgrade --quiet --user google-cloud-automl \n", + "! pip install --upgrade --quiet --user google-cloud-bigquery\n", + "! pip install --upgrade --quiet --user seaborn" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "kK5JATKPNf3I" }, "source": [ "**Note:** Try installing with `sudo` if the above commands throw any permission errors.",
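+ "\n", + "For example (a sketch; adjust the package list as needed):\n", + "\n", + "```\n", + "! sudo pip install --upgrade --quiet google-cloud-automl google-cloud-bigquery seaborn\n", + "```"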
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "YeQtfJyL-fKp" + }, + "source": [ + "`Restart` the kernel so that `automl_v1beta1` can be imported; in Jupyter, running the following cell restarts the kernel for you." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Ip6IboKF-rQd" + }, + "outputs": [], + "source": [ + "from IPython.core.display import HTML\n", + "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "vLNVYkQF9QLy" + }, + "source": [ + "## **Set up your GCP project ID**\n", + "\n", + "Enter your project ID in the cell below. Then run the cell to make sure the\n", + "Cloud SDK uses the right project for all the commands in this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "7rG4S9q1Pjfg" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"[your-project-id]\" #@param {type:\"string\"}\n", + "COMPUTE_REGION = \"us-central1\" # Currently only supported region." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "dr--iN2kAylZ" + }, + "source": [ + "## **Authenticate your GCP account**\n", + "\n", + "**If you are using AI Platform Notebooks**, your environment is already\n", + "authenticated. Skip this step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "3yyVCJHFSEKG" + }, + "source": [ + "Otherwise, follow these steps:\n", + "\n", + "1. In the GCP Console, go to the [**Create service account key**\n", + " page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n", + "\n", + "2. From the **Service account** drop-down list, select **New service account**.\n", + "\n", + "3. In the **Service account name** field, enter a name.\n", + "\n", + "4. From the **Role** drop-down list, select\n", + " **AutoML > AutoML Admin** and **BigQuery > BigQuery Admin**.\n", + "\n", + "5. Click *Create*. A JSON file that contains your key downloads to your\n", + "local environment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Yt6PhVG0UdF1" + }, + "source": [ + "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "q5TeVHKDMOJF" + }, "outputs": [], "source": [ - "! pip3 install --upgrade --quiet google-cloud-automl google-cloud-bigquery" + "# Upload the downloaded JSON file that contains your key.\n", + "import sys\n", + "\n", + "if 'google.colab' in sys.modules: \n", + " from google.colab import files\n", + " keyfile_upload = files.upload()\n", + " keyfile = list(keyfile_upload.keys())[0]\n", + " %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n", + " ! gcloud auth activate-service-account --key-file $keyfile" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "d1bnPeDVMR5Q" + }, "source": [ - "__Restart the kernel__ to allow `automl_v1beta1` to be imported. 
The following cell should succeed after a kernel restart:" + "***If you are running the notebook locally***, enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "fsVNKXESYoeQ" + }, "outputs": [], "source": [ - "from google.cloud import automl_v1beta1" + "# If you are running this notebook locally, replace the string below with the\n", + "# path to your service account key and run this cell to authenticate your GCP\n", + "# account.\n", + "\n", + "%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account\n", + "! gcloud auth activate-service-account --key-file '/path/to/service/account'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ztLNd4NM1i7C" + }, + "source": [ + "## **Import libraries and define constants**" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "LuTnv2o-2oU9" + }, + "source": [ + "Import relevant packages.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "vRK0FR332vhR" + }, + "outputs": [], + "source": [ + "from __future__ import absolute_import\n", + "from __future__ import division\n", + "from __future__ import print_function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "V9GQ6Flrn0-B" + }, + "outputs": [], "source": [ - "### 1.2 Import libraries and define constants" + "from google.cloud import automl_v1beta1 as automl\n", + "from google.cloud import bigquery\n", + "from google.cloud import exceptions\n", + "import seaborn as sns\n", + "\n", + "%matplotlib inline" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "BbfLaWRs2TR7" + }, "source": [ - "Populate the following cell with the necessary constants and run it to initialize constants and create clients for BigQuery and AutoML Tables." + "Populate the following cell with the necessary constants and run it to initialize constants."
] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "cVfhqUK_h0CO" + }, "outputs": [], "source": [ - "# The GCP project id.\n", - "PROJECT_ID = \"\"\n", - "# The region to use for compute resources (AutoML isn't supported in some regions).\n", - "LOCATION = \"us-central1\"\n", + "#@title Constants { vertical-output: true }\n", + "\n", "# A name for the AutoML tables Dataset to create.\n", - "DATASET_DISPLAY_NAME = \"\"\n", + "DATASET_DISPLAY_NAME = \"music_rec\" #@param {type: 'string'}\n", "# The BigQuery dataset to import data from (doesn't need to exist).\n", - "INPUT_BQ_DATASET = \"\"\n", + "BQ_DATASET_NAME = \"music_rec_dataset\" #@param {type: 'string'}\n", "# The BigQuery table to import data from (doesn't need to exist).\n", - "INPUT_BQ_TABLE = \"\"\n", + "BQ_TABLE_NAME = \"music_rec_table\" #@param {type: 'string'}\n", "# A name for the AutoML tables model to create.\n", - "MODEL_DISPLAY_NAME = \"\"\n", - "# The number of hours to train the model.\n", - "MODEL_TRAIN_HOURS = 0\n", + "MODEL_DISPLAY_NAME = \"music_rec_model\" #@param {type: 'string'}\n", "\n", "assert all([\n", "  PROJECT_ID,\n", -  "  LOCATION,\n", +  "  COMPUTE_REGION,\n", "  DATASET_DISPLAY_NAME,\n", -  "  INPUT_BQ_DATASET,\n", -  "  INPUT_BQ_TABLE,\n", +  "  BQ_DATASET_NAME,\n", +  "  BQ_TABLE_NAME,\n", "  MODEL_DISPLAY_NAME,\n", -  "  MODEL_TRAIN_HOURS,\n", "])" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "NI_N8n1PC_5l" + }, "source": [ - "Import relevant packages and initialize clients for BigQuery and AutoML Tables." + "Initialize the clients for AutoML, AutoML Tables and BigQuery." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "vwYslwXfDBLy" + }, "outputs": [], "source": [ - "from __future__ import absolute_import\n", - "from __future__ import division\n", - "from __future__ import print_function\n", - "\n", - "from google.cloud import automl_v1beta1\n", - "from google.cloud import bigquery\n", - "from google.cloud import exceptions\n", - "import seaborn as sns\n", - "\n", - "%matplotlib inline\n", - "\n", - "\n", - "tables_client = automl_v1beta1.TablesClient(project=PROJECT_ID, region=LOCATION)\n", + "# Initialize the clients.\n", + "automl_client = automl.AutoMlClient()\n", + "tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)\n", + "bq_client = bigquery.Client()" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "eTKoTeWFxycI" + }, "source": [ - "## 2. Create a Dataset" + "## **Test the setup**\n", + "\n", + "To test whether your project setup and authentication steps were successful, run the following cell to list your datasets in this project.\n", + "\n", + "If no dataset has previously been imported into AutoML Tables, you should expect an empty result."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Q9CAYGSNx47m" + }, + "outputs": [], + "source": [ + "# List the datasets.\n", + "list_datasets = tables_client.list_datasets()\n", + "datasets = { dataset.display_name: dataset.name for dataset in list_datasets }\n", + "datasets" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "0kg5GCSYx0ez" + }, + "source": [ + "You can also print the list of your models by running the following cell.\n", + "\n", + "If no model has previously been trained using AutoML Tables, you should expect an empty result.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Jx9Ywkc8x7tK" + }, + "outputs": [], "source": [ - "In order to train a model, a structured dataset must be injested into AutoML tables from either BigQuery or Google Cloud Storage. Once injested, the user will be able to cherry pick columns to use as features, labels, or weights and configure the loss function." + "# List the models.\n", + "list_models = tables_client.list_models()\n", + "models = { model.display_name: model.name for model in list_models }\n", + "models" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "PlE7vCis70xW" + }, "source": [ - "### 2.1 Create BigQuery table" + "## **Import training data**" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "H8oza7faPfEp" + }, "source": [ + "### **Create dataset**\n", + "\n", + "In order to train a model, a structured dataset must be ingested into AutoML Tables from either BigQuery or Google Cloud Storage. Once ingested, the user will be able to cherry-pick columns to use as features, labels, or weights, and configure the loss function.\n", + "\n", + "#### **Create BigQuery table**\n", + "First, do some feature engineering on the original ListenBrainz dataset to turn it into a dataset for training and export it into a separate BigQuery table:\n", "\n", "  1. 
Make each sample a unique `(user, song)` pair.\n", @@ -214,7 +630,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "snMU9Vd_h0CW" + }, "outputs": [], "source": [ "query = \"\"\"\n", @@ -297,27 +717,10 @@ " WHERE track_name != \"\"\n", " GROUP BY song\n", " )\n", - " SELECT user, song, artist, tags, albums,\n", - " user_tags0,\n", - " user_tags1,\n", - " user_tags2,\n", - " user_tags3,\n", - " user_tags4,\n", - " user_tags5,\n", - " user_tags6,\n", - " user_tags7,\n", - " user_tags8,\n", - " user_tags9,\n", - " user_tags10,\n", - " user_tags11,\n", - " user_tags12,\n", - " user_tags13,\n", - " user_tags14,\n", - " user_tags15,\n", - " user_tags16,\n", - " user_tags17,\n", - " user_tags18,\n", - " user_tags19,\n", + " SELECT user, song, artist, tags, albums, user_tags0, user_tags1, user_tags2, \n", + " user_tags3, user_tags4, user_tags5, user_tags6, user_tags7, user_tags8, \n", + " user_tags9, user_tags10, user_tags11, user_tags12, user_tags13, user_tags14, \n", + " user_tags15, user_tags16, user_tags17, user_tags18, user_tags19,\n", " IF(user_song_listens > 2, \n", " SQRT(user_song_listens/user_max_listen), \n", " .5/user_song_listens) AS weight,\n", @@ -331,7 +734,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "aPIpyqFwh0CY" + }, "outputs": [], "source": [ "def create_table_from_query(query, table):\n", @@ -342,7 +749,8 @@ " table: a name to give the new table.\n", " \"\"\"\n", " job_config = bigquery.QueryJobConfig()\n", - " bq_dataset = bigquery.Dataset(\"{0}.{1}\".format(PROJECT_ID, INPUT_BQ_DATASET))\n", + " bq_dataset = bigquery.Dataset(\"{0}.{1}\".format(\n", + " PROJECT_ID, BQ_DATASET_NAME))\n", " bq_dataset.location = \"US\"\n", "\n", " try:\n", @@ -350,7 +758,7 @@ " except exceptions.Conflict:\n", " pass\n", "\n", - " table_ref = bq_client.dataset(INPUT_BQ_DATASET).table(table)\n", + " table_ref = bq_client.dataset(BQ_DATASET_NAME).table(table)\n", " job_config.destination = table_ref\n", "\n", " query_job = bq_client.query(query,\n", @@ -364,46 +772,88 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Ac0TZQxah0Cb" + }, "outputs": [], "source": [ - "create_table_from_query(query, INPUT_BQ_TABLE)" + "create_table_from_query(query, BQ_TABLE_NAME)" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "tee_qs5xBYQP" + }, "source": [ - "### 2.2 Create AutoML Dataset" + "### **Create AutoML Dataset**\n", + "\n", + "Create a Dataset by importing the BigQuery table that was just created. Importing data may take a few minutes or hours depending on the size of your data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Pah1WO43h0Cd" + }, + "outputs": [], + "source": [ + "# Create dataset.\n", + "dataset = tables_client.create_dataset(\n", + " dataset_display_name=DATASET_DISPLAY_NAME)\n", + "dataset_name = dataset.name\n", + "dataset" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "-6ujokeldxof" + }, "source": [ - "Create a Dataset by importing the BigQuery table that was just created. Importing data may take a few minutes or hours depending on the size of your data." 
+ "#### **Import Data**" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "NL65mUYtkIgF" + }, "outputs": [], "source": [ - "dataset = tables_client.create_dataset(\n", - " dataset_display_name=DATASET_DISPLAY_NAME)\n", - "\n", + "# Read the data source from BigQuery. \n", "dataset_bq_input_uri = 'bq://{0}.{1}.{2}'.format(\n", - " PROJECT_ID, INPUT_BQ_DATASET, INPUT_BQ_TABLE)\n", + " PROJECT_ID, BQ_DATASET_NAME, BQ_TABLE_NAME)\n", + "\n", "import_data_response = tables_client.import_data(\n", " dataset=dataset, bigquery_input_uri=dataset_bq_input_uri)\n", - "import_data_result = import_data_response.result()\n", - "import_data_result" + "\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))\n", + "\n", + "# Synchronous check of operation status. Wait until import is done.\n", + "print('Dataset import response: {}'.format(import_data_response.result()))\n", + "\n", + "# Verify the status by checking the example_count field.\n", + "dataset = tables_client.get_dataset(dataset_name=dataset_name)\n", + "dataset" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "wej9Lput2k-l" + }, "source": [ "Inspect the datatypes assigned to each column. In this case, the `song` and `artist` should be categorical, not textual." ] @@ -411,7 +861,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "HSUZgHDZh0Cg" + }, "outputs": [], "source": [ "list_column_specs_response = tables_client.list_column_specs(\n", @@ -420,7 +874,7 @@ "\n", "def print_column_specs(column_specs):\n", " \"\"\"Parses the given specs and prints each column and column type.\"\"\"\n", - " data_types = automl_v1beta1.proto.data_types_pb2\n", + " data_types = automl.proto.data_types_pb2\n", " return [(x, data_types.TypeCode.Name(\n", " column_specs[x].data_type.type_code)) for x in column_specs.keys()]\n", "\n", @@ -429,15 +883,13 @@ }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 2.3 Update Dataset params" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "yD4AwSPGC_PR" + }, "source": [ + "## **Update Dataset params**\n", + "\n", "Sometimes, the types AutoML Tables automatically assigns each column will be off from that they were intended to be. When that happens, we need to update Tables with different types for certain columns.\n", "\n", "In this case, set the `song` and `artist` column types to `CATEGORY`." 
@@ -446,13 +898,19 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "RH7sIHK-h0Ci" + }, "outputs": [], "source": [ + "type_code='CATEGORY' #@param {type:'string'}\n", + "\n", "for col in [\"song\", \"artist\"]:\n", "  tables_client.update_column_spec(dataset_display_name=DATASET_DISPLAY_NAME,\n", "                                   column_spec_display_name=col,\n", - "                                   type_code=\"CATEGORY\")\n", + "                                   type_code=type_code)\n", "\n", "list_column_specs_response = tables_client.list_column_specs(\n", "    dataset_display_name=DATASET_DISPLAY_NAME)\n", @@ -462,7 +920,10 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "fbaQF2iUbbYf" + }, "source": [ "Not all columns are feature columns. In order to train a model, we need to tell Tables which column should be used as the target variable and, optionally, which column should be used as sample weights." ] }, @@ -470,7 +931,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "6p4nCgIXh0Cl" + }, "outputs": [], "source": [ "tables_client.set_target_column(dataset_display_name=DATASET_DISPLAY_NAME,\n", @@ -482,70 +947,92 @@ }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. Create a Model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "oM1ssFQrDKEt" + }, "source": [ - "Once the Dataset has been configured correctly, we can tell AutoML Tables to train a new model. The amount of resources spent to train this model can be adjusted using a parameter called `train_budget_milli_node_hours`. As the name implies, this puts a maximum budget on how many resources a training job can use up before exporting a servable model.\n", + "## **Create a Model**\n", + "\n", + "Once the Dataset has been configured correctly, we can tell AutoML Tables to train a new model. The amount of resources spent to train this model can be adjusted using a parameter called `train_budget_milli_node_hours`. As the name implies, this puts a maximum budget on how many resources a training job can use up before exporting a servable model.\n", "\n", - "Even with a budget of 1 node hour (the minimum possible budget), training a model can take several hours." + "For demonstration purposes, the following command sets the budget to 1 node hour (`train_budget_milli_node_hours=1000`). You can increase that number up to a maximum of 72 hours (`train_budget_milli_node_hours=72000`) for the best model performance.\n", + "\n", + "Even with a budget of 1 node hour (the minimum possible budget), training a model can take longer than the specified number of node hours.\n", + "\n", + "You can also select the objective to optimize your model training for by setting `optimization_objective`; this notebook uses the default optimization objective. See [the documentation](https://cloud.google.com/automl-tables/docs/train#opt-obj) for more details, and the sketch below for one way the parameter might be passed."
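A minimal sketch of such a call, assuming `create_model` accepts `optimization_objective` alongside the training budget, and using `MAXIMIZE_AU_PRC` purely as an illustrative objective name from the linked documentation:

```python
# Hypothetical variant of the training call with an explicit objective.
create_model_response = tables_client.create_model(
    model_display_name=MODEL_DISPLAY_NAME,
    dataset_display_name=DATASET_DISPLAY_NAME,
    train_budget_milli_node_hours=1000,        # 1 node hour.
    optimization_objective='MAXIMIZE_AU_PRC',  # illustrative objective.
)
```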
] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "rL6De6ZOh0Co" + }, "outputs": [], "source": [ - "tables_client.create_model(\n", + "# The number of hours to train the model.\n", + "model_train_hours = 1 #@param {type:'integer'}\n", + "\n", + "create_model_response = tables_client.create_model(\n", "    model_display_name=MODEL_DISPLAY_NAME,\n", "    dataset_display_name=DATASET_DISPLAY_NAME,\n", - "    train_budget_milli_node_hours= MODEL_TRAIN_HOURS * 1000).result()" + "    train_budget_milli_node_hours=model_train_hours*1000)\n", + "\n", + "operation_id = create_model_response.operation.name\n", + "\n", + "print('Create model operation: {}'.format(create_model_response.operation))" ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "4AB-X7q-DKuC" + }, + "outputs": [], "source": [ - "## 4. Model Evaluation" + "# Wait until model training is done.\n", + "model = create_model_response.result()\n", + "model_name = model.name\n", + "model" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "L-dkCJ0mDUb9" + }, "source": [ + "## **Model Evaluation**\n", + "\n", "Because we are optimizing a surrogate problem (predicting the similarity between `(user, song)` pairs) in order to achieve our final objective of producing a list of recommended songs for a user, it's difficult to tell how well the model performs by looking only at the final loss function. Instead, an evaluation metric we can use for our model is `recall@n` for the top `m` most listened to songs for each user. This metric will give us the probability that one of a user's top `m` most listened to songs will appear in the top `n` recommendations we make.\n", "\n", - "In order to get the top recommendations for each user, we need to create a batch job to predict similarity scores between each user and item pair. These similarity scores would then be sorted per user to produce an ordered list of recommended songs." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4.1 Create an evaluation table" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "In order to get the top recommendations for each user, we need to create a batch job to predict similarity scores between each user and item pair. These similarity scores would then be sorted per user to produce an ordered list of recommended songs.\n", + "\n", + "### **Create an evaluation table**\n", + "\n", + "Instead of creating a lookup table for all users, let's just focus on the performance for a few users for this demo. We will focus especially on recommendations for the user `rob`, and demonstrate how the others can be included in an overall evaluation metric for the model. We start by creating a dataset for prediction to feed into the trained model; this is a table of every possible `(user, song)` pair containing the users and corresponding features."
] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "i8Q-71udh0Cs" + }, "outputs": [], "source": [ "users = [\"rob\", \"fiveofoh\", \"Aerion\"]\n", - "training_table = \"{}.{}.{}\".format(PROJECT_ID, INPUT_BQ_DATASET, INPUT_BQ_TABLE)\n", + "training_table = \"{}.{}.{}\".format(\n", + " PROJECT_ID, BQ_DATASET_NAME, BQ_TABLE_NAME)\n", "query = \"\"\"\n", " WITH user as (\n", " SELECT user, \n", @@ -566,53 +1053,70 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "695ngNjxh0Cw" + }, "outputs": [], "source": [ - "eval_table = \"{}_example\".format(INPUT_BQ_TABLE)\n", + "eval_table = \"{}_example\".format(BQ_TABLE_NAME)\n", "create_table_from_query(query, eval_table)" ] }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4.2 Make predictions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "vB_AMuVuDzVP" + }, "source": [ + "## **Make predictions**\n", + "\n", "Once the prediction table is created, start a batch prediction job. This may take a few minutes." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "PmLTJeRwh0Cz" + }, "outputs": [], "source": [ - "preds_bq_input_uri = \"bq://{}.{}.{}\".format(PROJECT_ID, INPUT_BQ_DATASET, eval_table)\n", + "preds_bq_input_uri = \"bq://{}.{}.{}\".format(\n", + " PROJECT_ID, BQ_DATASET_NAME, eval_table)\n", "preds_bq_output_uri = \"bq://{}\".format(PROJECT_ID)\n", "response = tables_client.batch_predict(model_display_name=MODEL_DISPLAY_NAME,\n", " bigquery_input_uri=preds_bq_input_uri,\n", " bigquery_output_uri=preds_bq_output_uri)\n", - "response.result()\n", - "output_uri = response.metadata.batch_predict_details.output_info.bigquery_output_dataset" + "\n", + "print('Prediction response: {}'.format(response.result()))\n", + "output_uri = response.metadata.batch_predict_details\\\n", + " .output_info.bigquery_output_dataset\n", + "print('Output URI: {}'.format(output_uri))" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "bupHymOIEHUn" + }, "source": [ - "With the similarity predictions for `rob`, we can order by the predictions to get a ranked list of songs to recommend to `rob`." + "With the similarity predictions for rob, we can order by the predictions to get a ranked list of songs to recommend to `rob`." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "1xI63tbYh0C2" + }, "outputs": [], "source": [ "n = 10\n", @@ -634,24 +1138,26 @@ }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 4.3 Evaluate predictions" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "8-NfN_GvERhw" + }, "source": [ - "#### Precision@k and Recall@k\n", + "## **Evaluate predictions**\n", + "\n", + "**Precision@k and Recall@k**\n", "\n", - "To evaluate the recommendations, we can look at the precision@k and recall@k of our predictions for `rob`. Run the cells below to load the recommendations into a pandas dataframe and plot the precisions and recalls at various top-k recommendations. " + "To evaluate the recommendations, we can look at the precision@k and recall@k of our predictions for `rob`. 
Run the cells below to load the recommendations into a pandas dataframe and plot the precisions and recalls at various top-k recommendations." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "CZJMbvp8h0C4" + }, "outputs": [], "source": [ "query = \"\"\"\n", @@ -676,7 +1182,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "7s0ugRWeh0C8" + }, "outputs": [], "source": [ "precision_at_k = {}\n", @@ -691,7 +1201,7 @@ "    precision_at_k[user].append(precision)\n", "    recall_at_k[user].append(recall)\n", "\n", - "# plot the precision-recall curve\n", + "# plot the precision-recall curve.\n", "ax = sns.lineplot(recall_at_k[users[0]], precision_at_k[users[0]])\n", "ax.set_title(\"precision-recall curve for varying k\")\n", "ax.set_xlabel(\"recall@k\")\n", @@ -700,24 +1210,26 @@ }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "Achieving a high precision@k means a large proportion of top-k recommended items are relevant to the user. Recall@k shows what proportion of all relevant items appeared in the top-k recommendations." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "GakhtEfVEim5" + }, "source": [ - "#### Mean Average Precision (MAP)\n", + "Achieving a high precision@k means a large proportion of top-k recommended items are relevant to the user. Recall@k shows what proportion of all relevant items appeared in the top-k recommendations.\n", + "\n", + "**Mean Average Precision (MAP)**\n", "\n", - "Precision@k is a good metric for understanding how many relevant recommendations we might make at each top-k. However, we would prefer relevant items to be recommended first when possible and should encode that into our evaluation metric. __Average Precision (AP)__ is a running average of precision@k, rewarding recommendations where the revelant items are seen earlier rather than later. When the averaged across all users for some k, the AP metric is called MAP." + "Precision@k is a good metric for understanding how many relevant recommendations we might make at each top-k. However, we would prefer relevant items to be recommended first when possible and should encode that into our evaluation metric. **Average Precision (AP)** is a running average of precision@k, rewarding recommendations where the relevant items are seen earlier rather than later. When averaged across all users for some k, the AP metric is called MAP." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "VUTNcwBzh0DA" + }, "outputs": [], "source": [ "def calculate_ap(precision):\n", @@ -735,7 +1247,7 @@ "           for k in range(num_k)]\n", "print(\"MAP@50: {}\".format(map_at_k[49]))\n", "\n", - "# plot average precision\n", + "# plot average precision.\n", "ax = sns.lineplot(range(num_k), map_at_k)\n", "ax.set_title(\"MAP@k for varying k\")\n", "ax.set_xlabel(\"k\")\n", @@ -744,92 +1256,58 @@ }, { "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 5. Cleanup" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The following cells clean up the BigQuery tables and AutoML Table Datasets that were created with this notebook to avoid additional charges for storage."
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5.1 Delete the Model and Dataset" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "metadata": { + "colab_type": "text", + "id": "1w6CT9kREu_Z" + }, "source": [ - "tables_client.delete_model(model_display_name=MODEL_DISPLAY_NAME)\n", + "## **Cleaning up**\n", "\n", - "tables_client.delete_dataset(dataset_display_name=DATASET_DISPLAY_NAME)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### 5.2 Delete BigQuery datasets" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In order to delete BigQuery tables, make sure the service account linked to this notebook has a role with the `bigquery.tables.delete` permission such as `Big Query Data Owner`. The following command displays the current service account.\n", + "To clean up all GCP resources used in this project, you can [delete the GCP\n", + "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.\n", "\n", - "IAM permissions can be adjusted [here](https://console.cloud.google.com/iam-admin/iam)." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!gcloud config list account --format \"value(core.account)\"" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Clean up the BigQuery tables created by this notebook." + "**Delete BigQuery datasets**\n", + "\n", + "In order to delete BigQuery tables, make sure the service account linked to this notebook has a role with the `bigquery.tables.delete` permission such as `Big Query Data Owner`. The following command displays the current service account.\n", + "\n", + "IAM permissions can be adjusted [here](https://console.cloud.google.com/iam-admin/iam)."
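For reference, the command can be run in its own cell (it is the same `gcloud` invocation used elsewhere in this notebook's setup):

```python
# Display the account the Cloud SDK is currently using.
! gcloud config list account --format "value(core.account)"
```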
] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Pry8_3bxh0DM" + }, "outputs": [], "source": [ + "# Delete model resource.\n", + "tables_client.delete_model(model_name=model_name)\n", + "\n", + "# Delete dataset resource.\n", + "tables_client.delete_dataset(dataset_name=dataset_name)\n", + "\n", "# Delete the prediction dataset.\n", "dataset_id = str(output_uri[5:].replace(\":\", \".\"))\n", "bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)\n", "\n", "# Delete the training dataset.\n", - "dataset_id = \"{0}.{1}\".format(PROJECT_ID, INPUT_BQ_DATASET)\n", - "bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)" + "dataset_id = \"{0}.{1}\".format(PROJECT_ID, BQ_DATASET_NAME)\n", + "bq_client.delete_dataset(dataset_id, delete_contents=True, not_found_ok=True)\n", + "\n", + "# If training model is still running, cancel it.\n", + "automl_client.transport._operations_client.cancel_operation(operation_id)" ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] } ], "metadata": { + "colab": { + "collapsed_sections": [], + "name": "music_recommendation.ipynb", + "provenance": [] + }, "kernelspec": { "display_name": "Python 3", "language": "python", @@ -845,9 +1323,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.5.3" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb b/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb index 39a4cb8bc2de..44c1befcae7c 100644 --- a/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb +++ b/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb @@ -3,7 +3,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ur8xi4C7S06n" + }, "outputs": [], "source": [ "# Copyright 2019 Google LLC\n", @@ -23,24 +27,21 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "DHxMX0JAMELh" + }, "source": [ - "# Purchase Prediction with AutoML Tables\n", + "# **Purchase Prediction with AutoML Tables**\n", "\n", "
<table>\n", "  <tr><th>Field name</th><th>Description</th></tr>\n", "  <tr><td>user_name</td><td>a user id</td></tr>\n", "  <tr><td>track_name</td><td>a song id</td></tr>\n", "  <tr><td>release_name</td><td>the album of the song</td></tr>\n", "  <tr><td>artist_name</td><td>the artist of the song</td></tr>\n", "  <tr><td>tags</td><td>the genres of the song</td></tr>\n", "</table>
\n", " \n", - " \n", "
\n", - " \n", - " \"Google Read on cloud.google.com\n", - " \n", - " \n", - " \n", + " \n", " \"Colab Run in Colab\n", " \n", " \n", - " \n", + " \n", " \"GitHub\n", " View on GitHub\n", " \n", @@ -52,66 +53,150 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "OFJAWue1ss3C" + "id": "tvgnzT1CKxrO" }, "source": [ - "## Overview\n", + "## **Overview**\n", "\n", "One of the most common use cases in Marketing is to predict the likelihood of conversion. Conversion could be defined by the marketer as taking a certain action like making a purchase, signing up for a free trial, subscribing to a newsletter, etc. Knowing the likelihood that a marketing lead or prospect will ‘convert’ can enable the marketer to target the lead with the right marketing campaign. This could take the form of remarketing, targeted email campaigns, online offers or other treatments.\n", "\n", - "Here we demonstrate how you can use Bigquery and AutoML Tables to build a supervised binary classification model for purchase prediction.\n", - "\n", - "### Dataset\n", - "\n", - "The model uses a real dataset from the [Google Merchandise store](https://www.googlemerchandisestore.com) consisting of Google Analytics web sessions.\n", + "Here we demonstrate how you can use BigQuery and AutoML Tables to build a supervised binary classification model for purchase prediction." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "sukxx8RLSjRr" + }, + "source": [ + "### **Dataset**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "mmn5rn7kScSt" + }, + "source": [ + "The model uses a real dataset from the [Google Merchandise store](https://www.googlemerchandisestore.com/) consisting of Google Analytics web sessions.\n", "\n", "The goal here is to predict the likelihood of a web visitor visiting the online Google Merchandise Store making a purchase on the website during that Google Analytics session. Past web interactions of the user on the store website in addition to information like browser details and geography are used to make this prediction.\n", "\n", "This is framed as a binary classification model, to label a user during a session as either true (makes a purchase) or false (does not make a purchase). Dataset Details The dataset consists of a set of tables corresponding to Google Analytics sessions being tracked on the Google Merchandise Store. Each table is a single day of GA sessions. More details around the schema can be seen here.\n", "\n", - "You can access the data on BigQuery [here](https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089).\n", + "You can access the data on BigQuery [here](https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "SLq3FfRa8E8X" + }, + "source": [ + "### **Costs**\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "DzxIfOrB71wl" + }, + "source": [ + "This tutorial uses billable components of Google Cloud Platform (GCP):\n", "\n", - "## Instructions\n", + "* Cloud AI Platform\n", + "* Cloud Storage\n", + "* BigQuery\n", + "* AutoML Tables\n", "\n", - "Here is a list of things to do with AutoML Tables:\n", + "Learn about [Cloud AI Platform pricing](https://cloud.google.com/ml-engine/docs/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and [AutoML Tables pricing](https://cloud.google.com/automl-tables/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ze4-nDLfK4pw" + }, + "source": [ + "## Set up your local development environment\n", + "\n", + "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", + "all the requirements to run this notebook. If you are using **AI Platform Notebook**, make sure the machine configuration type is **4 vCPU, 15 GB RAM** or above. You can skip this step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "gCuSR8GkAgzl" + }, + "source": [ + "**Otherwise**, make sure your environment meets this notebook's requirements.\n", + "You need the following:\n", "\n", + "* The Google Cloud SDK\n", + "* Git\n", + "* Python 3\n", + "* virtualenv\n", + "* Jupyter notebook running in a virtual environment with Python 3\n", "\n", - " * Set up your local development environment (optional)\n", - " * Set Project ID and Compute Region\n", - " * Authenticate your GCP account\n", - " * Import Python API SDK and create a Client instance,\n", - " * Create a dataset instance and import the data.\n", - " * Create a model instance and train the model.\n", - " * Evaluating the trained model.\n", - " * Deploy the model on the cloud for online predictions.\n", - " * Make online predictions.\n", - " * Undeploy the model\n", - "\n" + "The Google Cloud guide to [Setting up a Python development\n", + "environment](https://cloud.google.com/python/setup) and the [Jupyter\n", + "installation guide](https://jupyter.org/install) provide detailed instructions\n", + "for meeting these requirements. The following steps provide a condensed set of\n", + "instructions:\n", + "\n", + "1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)\n", + "\n", + "2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)\n", + "\n", + "3. [Install\n", + " virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)\n", + " and create a virtual environment that uses Python 3.\n", + "\n", + "4. Activate that environment and run `pip install jupyter` in a shell to install\n", + " Jupyter.\n", + "\n", + "5. Run `jupyter notebook` in a shell to launch Jupyter.\n", + "\n", + "6. Open this notebook in the Jupyter Notebook Dashboard." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "dMoTkf3BVD39" + "id": "BF1j6f9HApxa" }, "source": [ - "# 1. 
Before you begin\n", "\n", - "## Project setup\n", "\n", - "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:\n", - "* Create a Google Cloud Platform (GCP) project, replace \"your-project\" with your GCP project ID and set local development environment.\n", - "* Enable billing.\n", - "* Enable AutoML API.\n", - "* Enter your project ID in the cell below. Then run the cell to make sure the\n", + "## **Set up your GCP project**\n", "\n", - "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", - "all the requirements to run this notebook. You can skip this step from the AutoML Tables documentation\n", + "**The following steps are required, regardless of your notebook environment.**\n", "\n", - "Cloud SDK uses the right project for all the commands in this notebook.\n", + "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", "\n", - "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands" + "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", + "\n", + "3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n", + "\n", + "4. [Enable AutoML API.](https://console.cloud.google.com/apis/library/automl.googleapis.com?q=automl)\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "i7EUnXsZhAGF" }, "source": [ "## **PIP Install Packages and dependencies**\n", "\n", "Install additional dependencies not installed in the notebook environment." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "yXZlxqICsMg2" + "id": "n2kLhBBRvdog" }, "outputs": [], "source": [ - "PROJECT_ID = \"\" # @param {type:\"string\"}\n", - "COMPUTE_REGION = \"us-central1\" # Currently only supported region.\n", - "! gcloud config set project $PROJECT_ID" + "! pip install --upgrade --quiet --user google-cloud-automl\n", + "! pip install --upgrade --quiet --user google-cloud-bigquery\n", + "! pip install --upgrade --quiet --user google-cloud-storage\n", + "! pip install --upgrade --quiet --user matplotlib\n", + "! pip install --upgrade --quiet --user pandas \n", + "! pip install --upgrade --quiet --user pandas-gbq \n", + "! pip install --upgrade --quiet --user gcsfs" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "kK5JATKPNf3I" + }, "source": [ - "\n", - "\n", - "---\n", - "\n" + "**Note:** Try installing with `sudo` if the above commands throw any permission errors." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "f-YlNVLTYXXN" }, "source": [ "**Restart** the kernel to allow `automl_v1beta1` to be imported into the Jupyter notebook.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "C16j_LPrYbZa" }, "outputs": [], "source": [ "from IPython.core.display import HTML\n", "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "KAg-2-BQ4un6" + "id": "tPXmVHerC58T" }, "source": [ - "This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections." + "## **Set up your GCP Project Id**\n", + "\n", + "Enter your `Project Id` in the cell below. 
Then run the cell to make sure the\n", + "Cloud SDK uses the right project for all the commands in this notebook." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "2hI1ChtyvXa4" }, + "outputs": [], "source": [ - "This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections." + "PROJECT_ID = \"[your-project-id]\" # @param {type:\"string\"}\n", + "COMPUTE_REGION = \"us-central1\" # Currently only supported region." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "gGuRq4DI47hj" + "id": "dr--iN2kAylZ" }, "source": [ - "## Authenticate your GCP account\n", + "## **Authenticate your GCP account**\n", "\n", "**If you are using AI Platform Notebooks**, your environment is already\n", "authenticated. Skip this step." @@ -164,12 +294,12 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "3yyVCJHFSEKG" + }, "source": [ - "**If you are using Colab**, run the cell below and follow the instructions\n", - "when prompted to authenticate your account via oAuth.\n", - "\n", - "**Otherwise**, follow these steps:\n", + "Otherwise, follow these steps:\n", "\n", "1. In the GCP Console, go to the [**Create service account key**\n", " page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n", @@ -179,14 +309,21 @@ "3. In the **Service account name** field, enter a name.\n", "\n", "4. From the **Role** drop-down list, select\n", - " **AutoML > AutoML Admin** and\n", - " **Storage > Storage Object Admin**.\n", + " **AutoML > AutoML Admin**,\n", + " **Storage > Storage Admin** and **BigQuery > BigQuery Admin**.\n", "\n", "5. Click *Create*. A JSON file that contains your key downloads to your\n", - "local environment.\n", - "\n", - "6. Enter the path to your service account key as the\n", - "`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell." + "local environment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Yt6PhVG0UdF1" + }, + "source": [ + "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." ] }, { @@ -195,195 +332,257 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "m3j1Kl4osNaJ" + "id": "q5TeVHKDMOJF" }, "outputs": [], "source": [ "import sys\n", "\n", - "# If you are running this notebook in Colab, run this cell and follow the\n", - "# instructions to authenticate your GCP account. This provides access to your\n", - "# Cloud Storage bucket and lets you submit training jobs and prediction\n", - "# requests.\n", - "\n", + "# Upload the downloaded JSON file that contains your key.\n", "if 'google.colab' in sys.modules: \n", " from google.colab import files\n", " keyfile_upload = files.upload()\n", " keyfile = list(keyfile_upload.keys())[0]\n", " %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n", - "# If you are running this notebook locally, replace the string below with the\n", - "# path to your service account key and run this cell to authenticate your GCP\n", - "# account.\n", - "else:\n", - " %env GOOGLE_APPLICATION_CREDENTIALS /path/to/service_account.json" + " ! 
gcloud auth activate-service-account --key-file $keyfile" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "9zuplbargStJ" + "id": "d1bnPeDVMR5Q" }, "source": [ - "## Install the client library\n", - "Run the following cell." + "***If you are running the notebook locally***, enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell." ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 34 - }, + "colab": {}, "colab_type": "code", - "id": "KIdmobtSsPj8", - "outputId": "14c234ca-5070-4301-a48c-c69d16ae4c31" + "id": "fsVNKXESYoeQ" }, "outputs": [], "source": [ - "%pip install --quiet google-cloud-automl" + "# If you are running this notebook locally, replace the string below with the\n", + "# path to your service account key and run this cell to authenticate your GCP\n", + "# account.\n", + "\n", + "%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account\n", + "! gcloud auth activate-service-account --key-file '/path/to/service/account'" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "zgPO1eR3CYjk" + }, "source": [ - "### Import libraries and define constants\n", + "## **Create a Cloud Storage bucket**\n", "\n", - "First, import Python libraries required for training,\n", - "The code example below demonstrates importing the AutoML Python API module into a python script. " + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "When you submit a training job using the Cloud SDK, you upload a Python package\n", + "containing your training code to a Cloud Storage bucket. AI Platform runs\n", + "the code from this package. In this tutorial, AI Platform also saves the\n", + "trained model that results from your job in the same bucket. You can then\n", + "create an AI Platform model version based on this output in order to serve\n", + "online predictions.\n", + "\n", + "Set the name of your Cloud Storage bucket below. It must be unique across all\n", + "Cloud Storage buckets. \n", + "\n", + "You may also change the `COMPUTE_REGION` variable, which is used for operations\n", + "throughout the rest of this notebook. Make sure to [choose a region where Cloud\n", + "AI Platform services are\n", + "available](https://cloud.google.com/ml-engine/docs/tensorflow/regions). You may\n", + "not use a Multi-Regional Storage bucket for training with AI Platform." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "cellView": "both", "colab": {}, "colab_type": "code", "id": "MzGDU7TWdts_" }, "outputs": [], "source": [ "BUCKET_NAME = \"[your-bucket-name]\" #@param {type:\"string\"}" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "-EcIXiGsCePi" }, "source": [ "**Only if your bucket doesn't exist**: Run the following cell to create your Cloud Storage bucket. 
Make sure Storage > Storage Admin role is enabled" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "NIq7R4HZCfIc" + }, "outputs": [], "source": [ - "from IPython.core.display import HTML\n", - "HTML(\"\")" + "! gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "ucvCsknMCims" + }, "source": [ - "### Create API Client to AutoML Service*\n", - "\n", - "**If you are using AI Platform Notebooks**, or *Colab* environment is already\n", - "authenticated using GOOGLE_APPLICATION_CREDENTIALS. Run this step." + "Finally, validate access to your Cloud Storage bucket by examining its contents:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "vhOb7YnwClBb" + }, "outputs": [], "source": [ - "client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)" + "! gsutil ls -al gs://$BUCKET_NAME" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "XoEqT2Y4DJmf" + }, "source": [ - "**If you are using Colab or Jupyter**, and you have defined a service account\n", - "follow the following steps to create the AutoML client\n", - "\n", - "You can see a different way to create the API Clients using service account." + "## **Import libraries and define constants**" ] }, { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "wkJe8sD-EoTE" + }, "source": [ - "# from google.oauth2 import service_account\n", - "# credentials = service_account.Credentials.from_service_account_file('/path/to/service_account.json')\n", - "# client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION, credentials=credentials)" + "Import relevant packages." ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Cj-pbWdxEtZM" + }, + "outputs": [], "source": [ - "---" + "from __future__ import absolute_import\n", + "from __future__ import division\n", + "from __future__ import print_function" ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "6HT8yR2Cvd0a" + }, + "outputs": [], "source": [ - "## Storage setup\n", - "\n", - "You also need to upload your data into [Google Cloud Storage](https://cloud.google.com/storage/) (GCS) or [BigQuery](https://cloud.google.com/bigquery/). \n", - "For example, to use GCS as your data source:\n", - "\n", - "* [Create a GCS bucket](https://cloud.google.com/storage/docs/creating-buckets).\n", - "* Upload the training and batch prediction files." 
+ "# AutoML library.\n", + "from google.cloud import automl_v1beta1 as automl\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "from google.cloud import bigquery\n", + "from google.cloud import storage" ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "YPTWUWT0E32J" + }, + "outputs": [], "source": [ - "\n", - "\n", - "---\n", - "\n" + "import matplotlib.pyplot as plt\n", + "import datetime\n", + "import pandas as pd\n", + "import numpy as np\n", + "from sklearn import metrics" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "e1fYDBjDgYEB" + "id": "MEqIjz0PFCVO" }, "source": [ - "# 3. Import, clean, transform and perform feature engineering on the training Data" + "Populate the following cell with the necessary constants and run it to initialize constants." ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": { - "colab_type": "text", - "id": "dYoCTvaAgZK2" + "colab": {}, + "colab_type": "code", + "id": "iXC9vCBrGTKE" }, + "outputs": [], "source": [ - "### Create dataset in AutoML Tables\n" + "#@title Constants { vertical-output: true }\n", + "\n", + "# A name for the AutoML tables Dataset to create.\n", + "DATASET_DISPLAY_NAME = 'purchase_prediction' #@param {type: 'string'}\n", + "# A name for the file to hold the nested data.\n", + "NESTED_CSV_NAME = 'FULL.csv' #@param {type: 'string'}\n", + "# A name for the file to hold the unnested data.\n", + "UNNESTED_CSV_NAME = 'FULL_unnested.csv' #@param {type: 'string'}\n", + "# A name for the input train data.\n", + "TRAINING_CSV = 'training_unnested_balanced_FULL' #@param {type: 'string'}\n", + "# A name for the input validation data.\n", + "VALIDATION_CSV = 'validation_unnested_FULL' #@param {type: 'string'}\n", + "# A name for the AutoML tables model to create.\n", + "MODEL_DISPLAY_NAME = 'model_1' #@param {type:'string'}\n", + "\n", + "assert all([\n", + " PROJECT_ID,\n", + " COMPUTE_REGION,\n", + " DATASET_DISPLAY_NAME,\n", + " MODEL_DISPLAY_NAME,\n", + "])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "uPRPqyw2gebp" + "id": "X6xxcNmOGjtY" }, "source": [ - "Select a dataset display name and pass your table source information to create a new dataset.\n" + "Initialize client for AutoML, AutoML Tables, BigQuery and Storage." 
] }, { @@ -392,26 +591,29 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "Iu3KNlcwsRhN" + "id": "0y3EourAGWmf" }, "outputs": [], "source": [ - "#@title Create dataset { vertical-output: true, output-height: 200 }\n", - "\n", - "dataset_display_name = 'colab_trial1' #@param {type: 'string'}\n", - "\n", - "dataset = client.create_dataset(dataset_display_name)\n", - "dataset" + "# Initialize the clients.\n", + "automl_client = automl.AutoMlClient()\n", + "tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)\n", + "bq_client = bigquery.Client()\n", + "storage_client = storage.Client()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "iTT5N97D0YPo" + "id": "xdJykMXDozoP" }, "source": [ - "Create a bucket to store the training data in" + "## **Test the set up**\n", + "\n", + "To test whether your project set up and authentication steps were successful, run the following cell to list your datasets in this project.\n", + "\n", + "If no dataset has previously imported into AutoML Tables, you shall expect an empty return." ] }, { @@ -420,23 +622,26 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "RQuGIbyGgud9" + "id": "_dKylOQTpF58" }, "outputs": [], "source": [ - "#@title Create bucket to store data in { vertical-output: true, output-height: 200 }\n", - "\n", - "bucket_name = '' #@param {type: 'string'}\n" + "# List the datasets.\n", + "list_datasets = tables_client.list_datasets()\n", + "datasets = { dataset.display_name: dataset.name for dataset in list_datasets }\n", + "datasets" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "IQJuy1-PpF3b" + "id": "dleTdOMaplSM" }, "source": [ - "### Import Dependencies\n" + "You can also print the list of your models by running the following cell.\n", + "\n", + "If no model has previously trained using AutoML Tables, you shall expect an empty return.\n" ] }, { @@ -445,30 +650,24 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "zzCeDmnnQRNy" + "id": "tMXP6no1pn9p" }, "outputs": [], "source": [ - "!sudo pip install google-cloud-bigquery google-cloud-storage pandas pandas-gbq gcsfs oauth2client\n", - "\n", - "import datetime\n", - "import pandas as pd\n", - "\n", - "import gcsfs\n", - "from google.cloud import bigquery\n", - "from google.cloud import storage\n", - "\n", - "client_bq = bigquery.Client(location='US', project=PROJECT_ID)" + "# List the models.\n", + "list_models = tables_client.list_models()\n", + "models = { model.display_name: model.name for model in list_models }\n", + "models" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "UR5n1crIpQuX" + "id": "Z0g-D23HYX9A" }, "source": [ - "### Transformation and Feature Engineering Functions\n", + "##**Transformation and Feature Engineering Functions**\n", "\n", "The data cleaning and transformation step was by far the most involved. It includes a few sections that create an AutoML tables dataset, pull the Google merchandise store data from BigQuery, transform the data, and save it multiple times to csv files in google cloud storage.\n", "\n", @@ -483,35 +682,23 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "5lqd8kOlYeYx" + }, "source": [ - "#### Feature Engineering\n", + "**Feature Engineering**\n", "\n", - "The dataset had rich information on customer location and behavior; however, it can be improved by performing feature engineering. Moreover, there was a concern about data leakage. 
The decision to do feature engineering, therefore, had two contributing motivations: remove data leakage without too much loss of useful data, and to improve the signal in our data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Weekdays\n", "\n", - "The date seemed like a useful piece of information to include, as it could capture seasonal effects. Unfortunately, we only had one year of data, so seasonality on an annual scale would be difficult (read impossible) to incorporate. Fortunately, we could try and detect seasonal effects on a micro, with perhaps equally informative results. We ended up creating a new column of weekdays out of dates, to denote which day of the week the session was held on. This new feature turned out to have some useful predictive power, when added as a variable into our model." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Data Leakage\n", "\n", - "The marginal gain from adding a weekday feature, was overshadowed by the concern of data leakage in our training data. In the initial naive models we trained, we got outstanding results. So outstanding that we knew that something must be going on. As it turned out, quite a few features functioned as proxies for the feature we were trying to predict: meaning some of the features we conditioned on to build the model had an almost 1:1 correlation with the target feature. Intuitively, this made sense." + "The dataset had rich information on customer location and behavior; however, it can be improved by performing feature engineering. Moreover, there was a concern about data leakage. The decision to do feature engineering, therefore, had two contributing motivations: to remove data leakage without too much loss of useful data, and to improve the signal in our data.\n", + "\n", + "**Weekdays**\n", + "\n", + "The date seemed like a useful piece of information to include, as it could capture seasonal effects. Unfortunately, we only had one year of data, so seasonality on an annual scale would be difficult (read impossible) to incorporate. Fortunately, we could try and detect seasonal effects on a micro scale, with perhaps equally informative results. We ended up creating a new column of weekdays out of dates, to denote which day of the week the session was held on. This new feature turned out to have some useful predictive power when added as a variable into our model.\n", + "\n", + "**Data Leakage**\n", + "\n", + "The marginal gain from adding a weekday feature was overshadowed by the concern of data leakage in our training data. In the initial naive models we trained, we got outstanding results. So outstanding that we knew that something must be going on. As it turned out, quite a few features functioned as proxies for the feature we were trying to predict: meaning some of the features we conditioned on to build the model had an almost 1:1 correlation with the target feature. Intuitively, this made sense.\n", "\n", "One feature that exhibited this behavior was the number of page views a customer made during a session. By conditioning on page views in a session, we could very reliably predict which customer sessions a purchase would be made in. At first this seems like the golden ticket: we can reliably predict whether or not a purchase is made! The catch: the full page view information can only be collected at the end of the session, by which point we would also have whether or not a transaction was made. 
Seen from this perspective, collecting page views at the same time as collecting the transaction information would make it pointless to predict the transaction information using the page views information, as we would already have both. One solution was to drop page views as a feature entirely. This would safely stop the data leakage, but we would lose some critically useful information. Another solution, (the one we ended up going with), was to track the page view information of all previous sessions for a given customer, and use it to inform the current session. This way, we could use the page view information, but only the information that we would have before the session even began. So we created a new column called previous_views, and populated it with the total count of all previous page views made by the customer in all previous sessions. We then deleted the page views feature, to stop the data leakage.\n", "\n", "Our rationale for this change can be boiled down to the concise heuristic: only use the information that is available to us on the first click of the session. Applying this reasoning, we performed similar data engineering on other features which we found to be proxies for the label feature. We also refined our objective in the process: For a visit to the Google Merchandise store, what is the probability that a customer will make a purchase, and can we calculate this probability the moment the customer arrives? By clarifying the question, we both made the result more powerful/useful, and eliminated the data leakage that threatened to make the predictive power trivial." @@ -523,129 +710,159 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "RODZJaq4o9b5" + "id": "BVIYkceJUjCz" }, "outputs": [], "source": [ "def balanceTable(table):\n", - " #class count\n", - " count_class_false, count_class_true = table.totalTransactionRevenue.value_counts()\n", + " # class count.\n", + " count_class_false, count_class_true = table.totalTransactionRevenue\\\n", + " .value_counts()\n", "\n", - " #divide by class\n", - " table_class_false = table[table[\"totalTransactionRevenue\"] == False]\n", - " table_class_true = table[table[\"totalTransactionRevenue\"] == True]\n", + " # divide by class.\n", + " table_class_false = table[table[\"totalTransactionRevenue\"]==False]\n", + " table_class_true = table[table[\"totalTransactionRevenue\"]==True]\n", "\n", - " #random over-sampling\n", - " table_class_true_over = table_class_true.sample(count_class_false, replace = True)\n", + " # random over-sampling.\n", + " table_class_true_over = table_class_true.sample(\n", + " count_class_false, replace=True)\n", " table_test_over = pd.concat([table_class_false, table_class_true_over])\n", - " return table_test_over\n", - "\n", - "\n", + " return table_test_over" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "pBMg-NHTUnMU" + }, + "outputs": [], + "source": [ "def partitionTable(table, dt=20170500):\n", - " #the automl tables model could be training on future data and implicitly learning about past data in the testing\n", - " #dataset, this would cause data leakage. 
To prevent this, we are training only with the first 9 months of data (table1)\n", - " #and doing validation with the last three months of data (table2).\n", - " table1 = table[table[\"date\"] <= dt]\n", - " table2 = table[table[\"date\"] > dt]\n", - " return table1, table2\n", - "\n", + " # The automl tables model could be training on future data and implicitly learning about past data in the testing\n", + " # dataset, this would cause data leakage. To prevent this, we are training only with the first 9 months of data (table1)\n", + " # and doing validation with the last three months of data (table2).\n", + " table1 = table[table[\"date\"]<=dt].copy(deep=False)\n", + " table2 = table[table[\"date\"]>dt].copy(deep=False)\n", + " return table1, table2" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "smziJuelUqbC" + }, + "outputs": [], + "source": [ "def N_updatePrevCount(table, new_column, old_column):\n", " table = table.fillna(0)\n", " table[new_column] = 1\n", " table.sort_values(by=['fullVisitorId','date'])\n", - " table[new_column] = table.groupby(['fullVisitorId'])[old_column].apply(lambda x: x.cumsum())\n", - " table.drop([old_column], axis = 1, inplace = True)\n", - " return table\n", - "\n", - "\n", - "def N_updateDate(table):\n", - " table['weekday'] = 1\n", - " table['date'] = pd.to_datetime(table['date'].astype(str), format = '%Y%m%d')\n", - " table['weekday'] = table['date'].dt.dayofweek\n", - " return table\n", - "\n", - "\n", - "def change_transaction_values(table):\n", - " table['totalTransactionRevenue'] = table['totalTransactionRevenue'].fillna(0)\n", - " table['totalTransactionRevenue'] = table['totalTransactionRevenue'].apply(lambda x: x!=0)\n", - " return table\n", - "\n", - "def saveTable(table, csv_file_name, bucket_name):\n", - " table.to_csv(csv_file_name, index = False)\n", - " storage_client = storage.Client()\n", - " bucket = storage_client.get_bucket(bucket_name)\n", - " blob = bucket.blob(csv_file_name)\n", - " blob.upload_from_filename(filename = csv_file_name)" + " table[new_column] = table.groupby(['fullVisitorId'])[old_column].apply(\n", + " lambda x: x.cumsum())\n", + " table.drop([old_column], axis=1, inplace=True)\n", + " return table" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": { - "colab_type": "text", - "id": "2eGAIUmRqjqX" + "colab": {}, + "colab_type": "code", + "id": "vQ4Hlhg2Uu49" }, + "outputs": [], "source": [ - "### Import training data" + "def N_updateDate(table):\n", + " table['weekday'] = 1\n", + " table['date'] = pd.to_datetime(table['date'].astype(str), format='%Y%m%d')\n", + " table['weekday'] = table['date'].dt.dayofweek\n", + " return table" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": { - "colab_type": "text", - "id": "XTmXPMUsTgEs" + "colab": {}, + "colab_type": "code", + "id": "anX4rrFSUxlF" }, + "outputs": [], "source": [ - "You also have the option of just downloading the file, FULL.csv, [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/trial_for_c4m/FULL.csv), instead of running the code below. Just be sure to move the file into the google cloud storage bucket you specified above." 
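To make the leakage-removal rationale concrete, here is a minimal, self-contained sketch (toy data, not part of the notebook) of the idea behind a previous-sessions counter such as `previous_views`. Note that a plain grouped cumulative sum also counts the current session; subtracting the current row keeps only the information available before the session begins, which matches the first-click heuristic described above.

```python
import pandas as pd

# Toy data: one row per session for two visitors.
toy = pd.DataFrame({
    'fullVisitorId': ['a', 'a', 'a', 'b', 'b'],
    'date': [20170101, 20170102, 20170103, 20170101, 20170102],
    'pageviews': [3, 5, 2, 7, 1],
})

# Sort so the cumulative sum runs in session order within each visitor.
toy = toy.sort_values(['fullVisitorId', 'date'])

# Exclusive cumulative sum: page views from *previous* sessions only.
toy['previous_views'] = (
    toy.groupby('fullVisitorId')['pageviews'].cumsum() - toy['pageviews'])

print(toy)
# previous_views is 0, 3, 8 for visitor 'a' and 0, 7 for visitor 'b'.
```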
+ "def change_transaction_values(table):\n", + " table['totalTransactionRevenue'] = table['totalTransactionRevenue'].fillna(0)\n", + " table['totalTransactionRevenue'] = table['totalTransactionRevenue'].apply(\n", + " lambda x: x!=0)\n", + " return table" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "cellView": "both", "colab": {}, "colab_type": "code", - "id": "Bl9-DSjIqj7c" + "id": "RRLNtUbfv3pj" }, "outputs": [], "source": [ - "#@title Input name of file to save data to { vertical-output: true, output-height: 200 }\n", - "query = \"\"\"\n", - "SELECT\n", - " date, \n", - " device, \n", - " geoNetwork, \n", - " totals, \n", - " trafficSource, \n", - " fullVisitorId \n", - "FROM \n", - " `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n", - "WHERE\n", - " _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 366 DAY)) AND\n", - " FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 1 DAY))\n", - "\"\"\"\n", - "df = client_bq.query(query).to_dataframe()\n", - "print(df.iloc[:3])\n", - "path_to_data_pre_transformation = \"FULL.csv\" #@param {type: 'string'}\n", - "saveTable(df, path_to_data_pre_transformation, bucket_name)" + "def saveTable(table, csv_file_name, bucket_name):\n", + " table.to_csv(csv_file_name, index=False)\n", + " bucket = storage_client.get_bucket(bucket_name)\n", + " blob = bucket.blob(csv_file_name)\n", + " blob.upload_from_filename(filename=csv_file_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "T1I1dkSAU73g" + }, + "source": [ + "##**Getting training data**\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "-qfwBGWIB5Nm" + }, + "source": [ + "\n", + "If you are using **Colab** the memory may not be sufficient enough to generate Nested and Unnested data using the queries. In this case, you can directly download the unnested data **FULL_unnested.csv** from [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/trial_for_c4m/FULL_unnested.csv) and upload the file manually to GCS bucket that was created in the previous steps `(BUCKET_NAME)`." 
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "V5WK71tiq-2b" + "id": "swgcbjAGLgsl" }, "source": [ - "### Unnest the Data" + "If you are using an **AI Platform Notebook or local environment**, run the following code." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "5CDSXB-Fv3jb" + }, "outputs": [], "source": [ + "# Save table.\n", "query = \"\"\"\n", "SELECT\n", " date, \n", @@ -654,37 +871,15 @@ " totals, \n", " trafficSource, \n", " fullVisitorId \n", - "FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n", + "FROM \n", + " `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n", "WHERE\n", - "_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 366 DAY)) AND\n", - "FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 1 DAY))\n", + " _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 366 DAY)) AND\n", + " FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 1 DAY))\n", "\"\"\"\n", - "df = client_bq.query(query).to_dataframe()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "colab": {}, - "colab_type": "code", - "id": "RFpgLfeNqUBk" - }, - "outputs": [], - "source": [ - "#some transformations on the basic dataset\n", - "#@title Input the name of file to hold the unnested data to { vertical-output: true, output-height: 200 }\n", - "unnested_file_name = \"FULL_unnested.csv\" #@param {type: 'string'}" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "2dyJlNAVqXUn" - }, - "source": [ - "You also have the option of just downloading the file, FULL_unnested.csv, [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/trial_for_c4m/FULL_unnested.csv), instead of running the code below. Just be sure to move the file into the google cloud storage bucket you specified above."
+ "df = bq_client.query(query).to_dataframe()\n", + "print(df.iloc[:3])\n", + "saveTable(df, NESTED_CSV_NAME, BUCKET_NAME)" ] }, { @@ -693,12 +888,13 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "tLPHeF2Y2l5l" + "id": "pTHwOgw8ArcA" }, "outputs": [], "source": [ - "\n", - "table = pd.read_csv(\"gs://\"+bucket_name+\"/\"+unnested_file_name, low_memory=False)\n", + "# Unnest the Data.\n", + "nested_gcs_uri = 'gs://{}/{}'.format(BUCKET_NAME, NESTED_CSV_NAME)\n", + "table = pd.read_csv(nested_gcs_uri, low_memory=False)\n", "\n", "column_names = ['device', 'geoNetwork','totals', 'trafficSource']\n", "\n", @@ -708,64 +904,97 @@ " temp = table[name].apply(pd.Series)\n", " table = pd.concat([table, temp], axis=1).drop(name, axis=1)\n", "\n", - "#need to drop a column\n", - "table.drop(['adwordsClickInfo'], axis = 1, inplace = True)\n", - "saveTable(table, unnested_file_name, bucket_name)" + "# need to drop a column.\n", + "table.drop(['adwordsClickInfo'], axis=1, inplace=True)\n", + "saveTable(table, UNNESTED_CSV_NAME, BUCKET_NAME)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "9_WC-AJLsdqo" + "id": "1UL8YqzdWXeu" }, "source": [ - "### Run the Transformations" + "### **Run the Transformations**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 272 - }, + "colab": {}, "colab_type": "code", - "id": "YWQ4462vnpOg", - "outputId": "5ca7e95a-e0f2-48c2-9b59-8f043d233bd2" + "id": "JJ84Zs68wN3X" }, "outputs": [], "source": [ - "table = pd.read_csv(\"gs://\"+bucket_name+\"/\"+unnested_file_name, low_memory=False)\n", + "# Run the transformations.\n", + "unnested_gcs_uri = 'gs://{}/{}'.format(BUCKET_NAME, UNNESTED_CSV_NAME)\n", + "table = pd.read_csv(unnested_gcs_uri, low_memory=False)\n", "\n", - "consts = ['transactionRevenue', 'transactions', 'adContent', 'browserSize', 'campaignCode', \n", - "'cityId', 'flashVersion', 'javaEnabled', 'language', 'latitude', 'longitude', 'mobileDeviceBranding', \n", - "'mobileDeviceInfo', 'mobileDeviceMarketingName','mobileDeviceModel','mobileInputSelector', 'networkLocation', \n", - "'operatingSystemVersion', 'screenColors', 'screenResolution', 'screenviews', 'sessionQualityDim', 'timeOnScreen',\n", - "'visits', 'uniqueScreenviews', 'browserVersion','referralPath','fullVisitorId', 'date']\n", + "consts = ['transactionRevenue', 'transactions', 'adContent', 'browserSize', \n", + " 'campaignCode', 'cityId', 'flashVersion', 'javaEnabled', 'language', \n", + " 'latitude', 'longitude', 'mobileDeviceBranding', 'mobileDeviceInfo', \n", + " 'mobileDeviceMarketingName','mobileDeviceModel','mobileInputSelector',\n", + " 'networkLocation', 'operatingSystemVersion', 'screenColors', \n", + " 'screenResolution', 'screenviews', 'sessionQualityDim', \n", + " 'timeOnScreen', 'visits', 'uniqueScreenviews', 'browserVersion', \n", + " 'referralPath','fullVisitorId', 'date']\n", "\n", "table = N_updatePrevCount(table, 'previous_views', 'pageviews')\n", "table = N_updatePrevCount(table, 'previous_hits', 'hits')\n", "table = N_updatePrevCount(table, 'previous_timeOnSite', 'timeOnSite')\n", "table = N_updatePrevCount(table, 'previous_Bounces', 'bounces')\n", "\n", - "table = change_transaction_values(table)\n", - "\n", + "table = change_transaction_values(table)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "mTdp0V1wnPer" + }, + "outputs": [], + "source": [ "table1, table2 = 
partitionTable(table)\n", "table1 = N_updateDate(table1)\n", "table2 = N_updateDate(table2)\n", - "#validation_unnested_FULL.csv = the last 3 months of data\n", "\n", - "table1.drop(consts, axis = 1, inplace = True)\n", - "table2.drop(consts, axis = 1, inplace = True)\n", + "table1.drop(consts, axis=1, inplace=True)\n", + "table2.drop(consts, axis=1, inplace=True)\n", "\n", - "saveTable(table2,'validation_unnested_FULL.csv', bucket_name)\n", + "saveTable(table2,'{}.csv'.format(VALIDATION_CSV), BUCKET_NAME)\n", "\n", "table1 = balanceTable(table1)\n", "\n", - "#training_unnested_FULL.csv = the first 9 months of data\n", - "saveTable(table1, 'training_unnested_balanced_FULL.csv', bucket_name)\n" + "# training_unnested_FULL.csv = the first 9 months of data.\n", + "saveTable(table1, '{}.csv'.format(TRAINING_CSV), BUCKET_NAME)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "8ZpdDzvPP3Gr" + }, + "source": [ + "## **Import Training Data**\n", + "\n", + "Select a dataset display name and pass your table source information to create a new dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "SZy-Idpsdn2_" + }, + "source": [ + "#### **Create Dataset**" ] }, { @@ -774,86 +1003,103 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "LqmARBnRHWh8" + "id": "ZaKxxQTevuV7" }, "outputs": [], "source": [ - "#@title ... take the data source from GCS { vertical-output: true } \n", - "\n", - "dataset_gcs_input_uris = ['gs://{}/training_unnested_balanced_FULL.csv'.format(bucket_name),] #@param\n", - "import_data_operation = client.import_data(\n", - " dataset=dataset,\n", - " gcs_input_uris=dataset_gcs_input_uris\n", - ")\n", - "print('Dataset import operation: {}'.format(import_data_operation))\n", - "\n", - "# Synchronous check of operation status. Wait until import is done.\n", - "import_data_operation.result()\n", - "dataset = client.get_dataset(dataset_name=dataset.name)\n", + "# Create dataset.\n", + "dataset = tables_client.create_dataset(\n", + " dataset_display_name=DATASET_DISPLAY_NAME)\n", + "dataset_name = dataset.name\n", "dataset" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "-6ujokeldxof" + }, "source": [ - "\n", - "\n", - "---\n", - "\n" + "#### **Import Data**" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": { - "colab_type": "text", - "id": "W3SiSLS4tml9" + "colab": {}, + "colab_type": "code", + "id": "VDcwd-tswNxn" }, + "outputs": [], "source": [ - "# 4. Update dataset: assign a label column and enable nullable columns" + "# Read the data source from GCS. \n", + "dataset_gcs_input_uris = ['gs://{}/{}.csv'.format(BUCKET_NAME, TRAINING_CSV)]\n", + "\n", + "import_data_response = tables_client.import_data(\n", + " dataset=dataset,\n", + " gcs_input_uris=dataset_gcs_input_uris\n", + ")\n", + "\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))\n", + "\n", + "# Synchronous check of operation status. Wait until import is done.\n", + "print('Dataset import response: {}'.format(import_data_response.result()))\n", + "\n", + "# Verify the status by checking the example_count field.\n", + "dataset = tables_client.get_dataset(dataset_name=dataset_name)\n", + "dataset" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "jVo8Z8PGtpB7" + "id": "uXpSJ3T-S1xx" }, "source": [ - "AutoML Tables automatically detects your data column type. 
Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." + "## **Review the specs**\n", + "Run the following command to see table specs such as row count." ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "colab": { - "base_uri": "https://localhost:8080/", - "height": 255 - }, + "colab": {}, "colab_type": "code", - "id": "dMdOoFsXxyxj", - "outputId": "e6fab957-2316-48c0-be66-1bff9dc5c23c" + "id": "XQHzt60WwNhI" }, "outputs": [], "source": [ - "# List table specs\n", - "list_table_specs_response = client.list_table_specs(dataset=dataset)\n", + "# List table specs.\n", + "list_table_specs_response = tables_client.list_table_specs(dataset=dataset)\n", "table_specs = [s for s in list_table_specs_response]\n", "\n", - "# List column specs\n", - "list_column_specs_response = client.list_column_specs(dataset=dataset)\n", + "# List column specs.\n", + "list_column_specs_response = tables_client.list_column_specs(dataset=dataset)\n", "column_specs = {s.display_name: s for s in list_column_specs_response}\n", "\n", - "# Print Features and data_type:\n", - "\n", - "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) for key, value in column_specs.items()]\n", + "# Print Features and data_type.\n", + "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) \n", + " for key, value in column_specs.items()]\n", "print('Feature list:\\n')\n", "for feature in features:\n", - " print(feature[0],':', feature[1])\n", - " \n", + " print(feature[0],':', feature[1])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "_9AIZL9xTIPV" + }, + "outputs": [], + "source": [ "# Table schema pie chart.\n", - "\n", "type_counts = {}\n", "for column_spec in column_specs.values():\n", " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", @@ -861,7 +1107,28 @@ " \n", "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n", "plt.axis('equal')\n", - "plt.show()\n" + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "gOeAP21SWrl1" + }, + "source": [ + "##**Update dataset: assign a label column and enable nullable columns**\n", + "AutoML Tables automatically detects your data column type. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." 
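As a concrete illustration of the schema update described above: if the label column held numeric category codes, the conversion could look like the sketch below, using the same `tables_client` and `dataset` objects as the surrounding cells (the next cell performs the separate nullability update the notebook actually needs).

```python
# Sketch: convert a numeric label column to CATEGORY so AutoML Tables
# runs a classification model rather than a regression model.
update_column_response = tables_client.update_column_spec(
    dataset=dataset,
    column_spec_display_name='totalTransactionRevenue',
    type_code='CATEGORY',
)
update_column_response
```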
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "8g5I3Ua-Sheq" + }, + "source": [ + "### **Update a column: set to not nullable**\n" ] }, { @@ -870,15 +1137,15 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "AfT4upKysamH" + "id": "pZzF09ogwiu_" }, "outputs": [], "source": [ - "#@title Update a column: set to not nullable { vertical-output: true }\n", - "\n", - "update_column_response = client.update_column_spec(\n", + "# Update column.\n", + "column_spec_display_name = 'totalTransactionRevenue' #@param {type: 'string'}\n", + "update_column_response = tables_client.update_column_spec(\n", " dataset=dataset,\n", - " column_spec_display_name='totalTransactionRevenue',\n", + " column_spec_display_name=column_spec_display_name,\n", " nullable=False,\n", ")\n", "update_column_response" @@ -888,20 +1155,20 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "3O9cFko3t3ai" + "id": "KZQftXACy21j" }, "source": [ - "**Tip:** You can use kwarg `type_code='CATEGORY'` in the preceding `update_column_spec(..)` call to convert the column data type from `FLOAT64` `to `CATEGORY`." + "**Tip:** You can use kwarg `type_code='CATEGORY'` in the preceding `update_column_spec(..)` call to convert the column data type from `FLOAT64` to `CATEGORY`." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "rR2RaPP7t6y8" + "id": "y1NpM6k7XEDm" }, "source": [ - "### Update dataset: assign a target column" + "###**Update dataset: assign a target column**" ] }, { @@ -910,15 +1177,15 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "aTt2mIzbsduV" + "id": "714Fydm8winh" }, "outputs": [], "source": [ - "#@title Update dataset { vertical-output: true }\n", - "\n", - "update_dataset_response = client.set_target_column(\n", + "# Assign target column.\n", + "column_spec_display_name = 'totalTransactionRevenue' #@param {type: 'string'}\n", + "update_dataset_response = tables_client.set_target_column(\n", " dataset=dataset,\n", - " column_spec_display_name='totalTransactionRevenue',\n", + " column_spec_display_name=column_spec_display_name,\n", ")\n", "update_dataset_response" ] @@ -927,20 +1194,20 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "xajewSavt9K1" + "id": "9jzfkZGVeZUA" }, "source": [ - "# 5. Creating a model" + "##**Creating a model**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "dA-FE6iWt-A_" + "id": "Cb7KjMuzXRNq" }, "source": [ - "### Train a model\n", + "####**Train a model**\n", "\n", "To create the datasets for training, testing and validation, we first had to consider what kind of data we were dealing with. The data we had keeps track of all customer sessions with the Google Merchandise store over a year. AutoML tables does its own training and testing, and delivers a quite nice UI to view the results in. For the training and testing dataset then, we simply used the over sampled, balanced dataset created by the transformations described above. But we first partitioned the dataset to include the first 9 months in one table and the last 3 in another. This allowed us to train and test with an entirely different dataset that what we used to validate.\n", "\n", @@ -951,8 +1218,14 @@ "Training the model may take one hour or more. The following cell keeps running until the training is done. If your Colab times out, use `client.list_models()` to check whether your model has been created. Then use model name to continue to the next steps. 
Run the following command to retrieve your model. Replace `model_name` with its actual value.\n", "\n", " model = client.get_model(model_name=model_name)\n", - " \n", - "Note that we trained on the first 9 months of data and we validate using the last 3." + "\n", + "Note that we trained on the first 9 months of data and validated using the last 3.\n", + "\n", + "For demonstration purposes, the following command sets the budget to 1 node hour `('train_budget_milli_node_hours': 1000)`. You can increase that number up to a maximum of 72 hours `('train_budget_milli_node_hours': 72000)` for the best model performance.\n", + "\n", + "Even with a budget of 1 node hour (the minimum possible budget), training a model can take more than the specified node hours.\n", + "\n", + "You can also select the objective used to optimize your model training by setting `optimization_objective`. This solution uses the default optimization objective. Refer to [the documentation](https://cloud.google.com/automl-tables/docs/train#opt-obj) for more details." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "Kp0gGkp8H3zj" + "id": "HB3ZX_BMwiep" }, "outputs": [], "source": [ - "#@title Create model { vertical-output: true }\n", - "#this will create a model that can be access through the auto ml tables colab\n", - "model_display_name = 'trial_1' #@param {type:'string'}\n", "\n", - "create_model_response = client.create_model(\n", - " model_display_name,\n", + "# The number of hours to train the model.\n", + "model_train_hours = 1 #@param {type:'integer'}\n", "\n", + "create_model_response = tables_client.create_model(\n", + " MODEL_DISPLAY_NAME,\n", " dataset=dataset,\n", - " train_budget_milli_node_hours=1000,\n", + " train_budget_milli_node_hours=model_train_hours*1000,\n", ")\n", - "print('Create model operation: {}'.format(create_model_response.operation))\n", - "# Wait until model training is done.\n", - "model = create_model_response.result()\n", - "model" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "\n", + "operation_id = create_model_response.operation.name\n", "\n", - "---\n", - "\n" + "print('Create model operation: {}'.format(create_model_response.operation))" ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": { - "colab_type": "text", - "id": "tCIk1e4UuDxZ" + "colab": {}, + "colab_type": "code", + "id": "y3J0reWbTsrW" }, + "outputs": [], "source": [ - "# 6. Make a prediction" + "# Wait until model training is done.\n", + "model = create_model_response.result()\n", + "model_name = model.name\n", + "model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "H7Fi5f9zuG5f" + "id": "s9rUSDDQXse3" }, "source": [ - "In this section, we take our validation data prediction results and plot the Precision Recall Curve and the ROC curve of both the false and true predictions.\n", + "##**Make a prediction**\n", + "In this section, we take our validation data prediction results and plot the Precision Recall curve and the ROC curve of both the false and true predictions.\n", "\n", - "There are two different prediction modes: online and batch. The following cell shows you how to make a batch prediction. Please replace the 'your_test_bucket' of the gcs_destination with your own bucket where the predictions results will be stored by AutoML Tables." + "There are two different prediction modes: online and batch. The following cell shows you how to make a batch prediction.
" ] }, { @@ -1019,20 +1288,22 @@ "cellView": "both", "colab": {}, "colab_type": "code", - "id": "AZ_CPff77m4e" + "id": "OJ3DPwzkwiOe" }, "outputs": [], "source": [ - "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n", + "#@title Start batch prediction { vertical-output: true }\n", "\n", - "batch_predict_gcs_input_uris = ['gs://cloud-ml-data-tables/notebooks/validation_unnested_FULL.csv',] #@param\n", - "batch_predict_gcs_output_uri_prefix = 'gs://{}'.format(bucket_name) #@param {type:'string'}\n", - "batch_predict_response = client.batch_predict(\n", + "batch_predict_gcs_input_uris = ['gs://{}/{}.csv'.format(BUCKET_NAME, VALIDATION_CSV)] #@param {type:'string'}\n", + "batch_predict_gcs_output_uri_prefix = 'gs://{}'.format(BUCKET_NAME) #@param {type:'string'}\n", + "\n", + "batch_predict_response = tables_client.batch_predict(\n", " model=model, \n", " gcs_input_uris=batch_predict_gcs_input_uris,\n", " gcs_output_uri_prefix=batch_predict_gcs_output_uri_prefix,\n", ")\n", "print('Batch prediction operation: {}'.format(batch_predict_response.operation))\n", + "\n", "# Wait until batch prediction is done.\n", "batch_predict_result = batch_predict_response.result()\n", "batch_predict_response.metadata" @@ -1042,21 +1313,28 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "utGPmXI-uKNr" + "id": "S4aNtFCPX9Ew" }, "source": [ - "# 7. Evaluate your prediction" + "##**Evaluate your prediction**\n", + "The follow cell creates a Precision Recall curve and a ROC curve for both the true and false classifications." ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": { - "colab_type": "text", - "id": "GsOdhJeauTC3" + "colab": {}, + "colab_type": "code", + "id": "IOeudrAvdreq" }, + "outputs": [], "source": [ - "The follow cell creates a Precision Recall Curve and a ROC curve for both the true and false classifications.\n", - "Fill in the batch_predict_results_location with the location of the results.csv file created in the previous \"Make a prediction\" step\n" + "def invert(x):\n", + " return 1-x\n", + "\n", + "def switch_label(x):\n", + " return(not x)" ] }, { @@ -1065,35 +1343,36 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "orejkh0CH4mu" + "id": "OdtcQU5kVkem" }, "outputs": [], "source": [ - "\n", - "import numpy as np\n", - "from sklearn import metrics\n", - "import matplotlib.pyplot as plt\n", - "\n", - "def invert(x):\n", - " return 1-x\n", - "\n", - "def switch_label(x):\n", - " return(not x)\n", - "batch_predict_results_location = 'gs:///' #@param {type:'string'}\n", - "\n", - "table = pd.read_csv(batch_predict_results_location +'//tables_1.csv')\n", + "batch_predict_results_location = batch_predict_response.metadata\\\n", + " .batch_predict_details.output_info\\\n", + " .gcs_output_directory\n", + "table = pd.read_csv('{}/tables_1.csv'.format(batch_predict_results_location))\n", "y = table[\"totalTransactionRevenue\"]\n", - "scores = table[\"totalTransactionRevenue_1.0_score\"]\n", - "scores_invert = table['totalTransactionRevenue_0.0_score']\n", - "\n", - "#code for ROC curve, for true values\n", + "scores = table[\"totalTransactionRevenue_True_score\"]\n", + "scores_invert = table['totalTransactionRevenue_False_score']" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "_tYEgv_IeL3T" + }, + "outputs": [], + "source": [ + "# code for ROC curve, for true values.\n", "fpr, tpr, thresholds = metrics.roc_curve(y, 
scores)\n", "roc_auc = metrics.auc(fpr, tpr)\n", - "\n", "plt.figure()\n", "lw = 2\n", "plt.plot(fpr, tpr, color='darkorange',\n", - " lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n", + " lw=lw, label='ROC curve (area=%0.2f)' % roc_auc)\n", "plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n", "plt.xlim([0.0, 1.0])\n", "plt.ylim([0.0, 1.05])\n", @@ -1101,16 +1380,26 @@ "plt.ylabel('True Positive Rate')\n", "plt.title('Receiver operating characteristic for True')\n", "plt.legend(loc=\"lower right\")\n", - "plt.show()\n", - "\n", - "\n", - "#code for ROC curve, for false values\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "RAWpzQjReQxk" + }, + "outputs": [], + "source": [ + "# code for ROC curve, for false values.\n", "plt.figure()\n", "lw = 2\n", "label_invert = y.apply(switch_label)\n", "fpr, tpr, thresholds = metrics.roc_curve(label_invert, scores_invert)\n", "plt.plot(fpr, tpr, color='darkorange',\n", - " lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n", + " lw=lw, label='ROC curve (area=%0.2f)' % roc_auc)\n", "plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n", "plt.xlim([0.0, 1.0])\n", "plt.ylim([0.0, 1.05])\n", @@ -1118,14 +1407,21 @@ "plt.ylabel('True Positive Rate')\n", "plt.title('Receiver operating characteristic for False')\n", "plt.legend(loc=\"lower right\")\n", - "plt.show()\n", - "\n", - "\n", - "#code for PR curve, for true values\n", - "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "dcoUEakxeXKe" + }, + "outputs": [], + "source": [ + "# code for PR curve, for true values.\n", "precision, recall, thresholds = metrics.precision_recall_curve(y, scores)\n", - "\n", - "\n", "plt.figure()\n", "lw = 2\n", "plt.plot( recall, precision, color='darkorange',\n", @@ -1136,11 +1432,23 @@ "plt.ylabel('Precision')\n", "plt.title('Precision Recall Curve for True')\n", "plt.legend(loc=\"lower right\")\n", - "plt.show()\n", - "\n", - "#code for PR curve, for false values\n", - "\n", - "precision, recall, thresholds = metrics.precision_recall_curve(label_invert, scores_invert)\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "both", + "colab": {}, + "colab_type": "code", + "id": "wx-hFytjwiLJ" + }, + "outputs": [], + "source": [ + "# code for PR curve, for false values.\n", + "precision, recall, thresholds = metrics.precision_recall_curve(\n", + " label_invert, scores_invert)\n", "print(precision.shape)\n", "print(recall.shape)\n", "\n", @@ -1154,18 +1462,51 @@ "plt.ylabel('Precision')\n", "plt.title('Precision Recall Curve for False')\n", "plt.legend(loc=\"lower right\")\n", - "plt.show()\n", - "\n" + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "HAivzUjcVJgT" + }, + "source": [ + "## **Cleaning up**\n", + "\n", + "To clean up all GCP resources used in this project, you can [delete the GCP\n", + "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "sx_vKniMq9ZX" + }, + "outputs": [], + "source": [ + "# Delete model resource.\n", + "tables_client.delete_model(model_name=model_name)\n", + "\n", + "# Delete dataset resource.\n", + "tables_client.delete_dataset(dataset_name=dataset_name)\n", + "\n", + "# Delete Cloud Storage objects that were created.\n", + "! gsutil -m rm -r gs://$BUCKET_NAME\n", + "\n", + "# If training model is still running, cancel it.\n", + "automl_client.transport._operations_client.cancel_operation(operation_id)" ] } ], "metadata": { - "accelerator": "GPU", "colab": { "collapsed_sections": [], - "name": "colab_C4M.ipynb", - "provenance": [], - "version": "0.3.2" + "name": "purchase_prediction.ipynb", + "provenance": [] }, "kernelspec": { "display_name": "Python 3", @@ -1182,9 +1523,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.5.3" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/tables/automl/notebooks/result_slicing/slicing_eval_results.ipynb b/tables/automl/notebooks/result_slicing/slicing_eval_results.ipynb index 32a0f568b12d..d3fe030c8451 100644 --- a/tables/automl/notebooks/result_slicing/slicing_eval_results.ipynb +++ b/tables/automl/notebooks/result_slicing/slicing_eval_results.ipynb @@ -1,45 +1,514 @@ { "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ur8xi4C7S06n" + }, + "outputs": [], + "source": [ + "# Copyright 2019 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "HosWdaE-KieL" + }, + "source": [ + "# **Slicing AutoML Tables Evaluation Results with BigQuery**\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \n", + " \"Colab Run in Colab\n", + " \n", + " \n", + " \n", + " \"GitHub\n", + " View on GitHub\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "MowcN4adM7eH" + }, + "source": [ + "## **Overview**\n", + "This colab assumes that you've created a dataset with AutoML Tables, and used that dataset to train a classification model. Once the model is done training, you also need to export the results table by using the following instructions. You'll see more detailed setup instructions below.\n", + "\n", + "This colab will walk you through the process of using BigQuery to visualize data slices, showing you one simple way to evaluate your model for bias.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "uLcF5EMyIDWe" + }, + "source": [ + "### **Dataset**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ECb9SlJLnajE" + }, + "source": [ + "\n", + "You'll need to use the AutoML Tables frontend or service to create a model and export its evaluation results to BigQuery. You should find a link on the Evaluate tab to view your evaluation results in BigQuery once you've finished training your model. Then navigate to BigQuery in your GCP console and you'll see your new results table in the list of tables to which your project has access.\n", + "\n", + "For demo purposes, we'll be using the [Default of Credit Card Clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset for analysis.\n", + "\n", + "**Note:** Although the data we use in this demo is public, you'll need to enter your own Google Cloud project ID in the parameter below to authenticate to it." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "kXQXf1W8IKPK" + }, + "source": [ + "### **Objective**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "zndbRXq4ne8K" + }, + "source": [ + "\n", + "This dataset was collected to help compare different methods of predicting credit card default. Using this colab to analyze your own dataset may require a little adaptation.\n", + "The code below will sample if you want it to. Or you can set sample_count to be as large or larger than your dataset to use the whole thing for analysis.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "w4YELJp6O_xw" + }, + "source": [ + "### **Costs**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "74OP8KFwO_gs" + }, + "source": [ + "This tutorial uses billable components of Google Cloud Platform (GCP):\n", + "\n", + "* Cloud AI Platform\n", + "* BigQuery\n", + "\n", + "Learn about [Cloud AI Platform pricing](https://cloud.google.com/ml-engine/docs/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing) and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ze4-nDLfK4pw" + }, + "source": [ + "## **Set up your local development environment**\n", + "\n", + "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", + "all the requirements to run this notebook. If you are using **AI Platform Notebook**, make sure the machine configuration type is **1 vCPU, 3.75 GB RAM** or above and environment as **Python or TensorFlow Enterprise 1.15**. You can skip this step." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "gCuSR8GkAgzl" + }, + "source": [ + "**Otherwise**, make sure your environment meets this notebook's requirements.\n", + "You need the following:\n", + "\n", + "* The Google Cloud SDK\n", + "* Git\n", + "* Python 3\n", + "* virtualenv\n", + "* Jupyter notebook running in a virtual environment with Python 3\n", + "\n", + "The Google Cloud guide to [Setting up a Python development\n", + "environment](https://cloud.google.com/python/setup) and the [Jupyter\n", + "installation guide](https://jupyter.org/install) provide detailed instructions\n", + "for meeting these requirements. The following steps provide a condensed set of\n", + "instructions:\n", + "\n", + "1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)\n", + "\n", + "2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)\n", + "\n", + "3. [Install\n", + "   virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)\n", + "   and create a virtual environment that uses Python 3.\n", + "\n", + "4. Activate that environment and run `pip install jupyter` in a shell to install\n", + "   Jupyter.\n", + "\n", + "5. Run `jupyter notebook` in a shell to launch Jupyter.\n", + "\n", + "6. Open this notebook in the Jupyter Notebook Dashboard." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "BF1j6f9HApxa" + }, + "source": [ + "## **Set up your GCP project**\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager) When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", + "\n", + "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", + "\n", + "3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "N-nqtnSQRISO" + }, + "source": [ + "## **PIP Install Packages and dependencies**\n", + "\n", + "Install additional dependencies not installed in the notebook environment." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "wyy5Lbnzg5fi" + }, + "outputs": [], + "source": [ + "! pip install --upgrade --quiet --user sklearn\n", + "! pip install --upgrade --quiet --user witwidget\n", + "! pip install --upgrade --quiet --user tensorflow==1.15\n", + "! pip install --upgrade --quiet --user tensorflow_model_analysis\n", + "! pip install --upgrade --quiet --user pandas-gbq" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "qjXdLSh9EHu7" + }, + "source": [ + "Note: Try installing using `sudo` if the above commands throw any permission errors. You can **ignore other errors** and continue to the next steps." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "8KN_WEoGTMG_" + }, + "source": [ + "Skip the cell below if you are using Colab.\n", + "\n", + "If you are using **AI Platform Notebooks > JupyterLab**,
install the following packages.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "8UiMePfgTmfe" + }, + "outputs": [], + "source": [ + "! sudo jupyter labextension install wit-widget\n", + "! sudo jupyter labextension install @jupyter-widgets/jupyterlab-manager\n", + "! sudo jupyter labextension install wit-widget@1.3\n", + "! sudo jupyter labextension install jupyter-matplotlib" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "w1os-ocbUIpC" + }, + "source": [ + "Skip the cell below if you are using Colab.\n", + "\n", + "If you are using **AI Platform Notebooks > Classic Notebook** or a **local environment**, install and enable the following dependencies to link WitWidget and TFMA with notebook extensions.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "2GurPHG_UfBP" + }, + "outputs": [], + "source": [ + "! jupyter nbextension enable --py widgetsnbextension\n", + "! jupyter nbextension install --py --symlink tensorflow_model_analysis\n", + "! jupyter nbextension enable --py tensorflow_model_analysis" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "kK5JATKPNf3I" + }, + "source": [ + "**Note:** Try installing using `--user` if the above commands throw any permission errors." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "f-YlNVLTYXXN" + }, + "source": [ + "`Restart` the kernel to allow the newly installed libraries to be imported in Jupyter Notebooks.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "C16j_LPrYbZa" + }, + "outputs": [], + "source": [ + "from IPython.core.display import HTML\n", + "HTML(\"<script>Jupyter.notebook.kernel.restart()</script>\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ekfuBcMyCrfu" + }, + "source": [ + "`Refresh` the browser to enable visualizations while running in Jupyter Notebooks." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "chUbwXRjP2UU" + }, + "source": [ + "## **Set up your GCP Project Id**\n", + "\n", + "Enter your `Project Id` in the cell below. Then run the cell to make sure the\n", + "Cloud SDK uses the right project for all the commands in this notebook.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "oM1iC_MfAts1" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"[your-project-id]\" #@param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "dr--iN2kAylZ" + }, + "source": [ + "## **Authenticate your GCP account**\n", + "\n", + "**If you are using AI Platform Notebooks**, your environment is already\n", + "authenticated. Skip this step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "3yyVCJHFSEKG" + }, + "source": [ + "Otherwise, follow these steps:\n", + "\n", + "1. In the GCP Console, go to the [**Create service account key**\n", + "   page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n", + "\n", + "2. From the **Service account** drop-down list, select **New service account**.\n", + "\n", + "3. In the **Service account name** field, enter a name.\n", + "\n", + "4. From the **Role** drop-down list, select\n", + "   **BigQuery > BigQuery User**.\n", + "\n", + "5.
Click *Create*. A JSON file that contains your key downloads to your\n", + "local environment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Yt6PhVG0UdF1" + }, + "source": [ + "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "q5TeVHKDMOJF" + }, + "outputs": [], + "source": [ + "import sys\n", + "\n", + "# Upload the downloaded JSON file that contains your key.\n", + "if 'google.colab' in sys.modules: \n", + " from google.colab import files\n", + " keyfile_upload = files.upload()\n", + " keyfile = list(keyfile_upload.keys())[0]\n", + " %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n", + " ! gcloud auth activate-service-account --key-file $keyfile" + ] + }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "jt_Hqb95fRz8" + "id": "d1bnPeDVMR5Q" }, "source": [ - "# Slicing AutoML Tables Evaluation Results with BigQuery\n", - "\n", - "This colab assumes that you've created a dataset with AutoML Tables, and used that dataset to train a classification model. Once the model is done training, you also need to export the results table by using the following instructions. You'll see more detailed setup instructions below.\n", - "\n", - "This colab will walk you through the process of using BigQuery to visualize data slices, showing you one simple way to evaluate your model for bias.\n", - "\n", - "## Setup\n", - "\n", - "To use this Colab, copy it to your own Google Drive or open it in the Playground mode. Follow the instructions in the [AutoML Tables Product docs](https://cloud.google.com/automl-tables/docs/) to create a GCP project, enable the API, and create and download a service account private key, and set up required permission. You'll also need to use the AutoML Tables frontend or service to create a model and export its evaluation results to BigQuery. You should find a link on the Evaluate tab to view your evaluation results in BigQuery once you've finished training your model. Then navigate to BigQuery in your GCP console and you'll see your new results table in the list of tables to which your project has access. \n", - "\n", - "For demo purposes, we'll be using the [Default of Credit Card Clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset for analysis. This dataset was collected to help compare different methods of predicting credit card default. Using this colab to analyze your own dataset may require a little adaptation.\n", - "\n", - "The code below will sample if you want it to. Or you can set sample_count to be as large or larger than your dataset to use the whole thing for analysis. 
\n", + "***If you are running the notebook locally***, enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "fsVNKXESYoeQ" + }, + "outputs": [], + "source": [ + "# If you are running this notebook locally, replace the string below with the\n", + "# path to your service account key and run this cell to authenticate your GCP\n", + "# account.\n", "\n", - "Note also that although the data we use in this demo is public, you'll need to enter your own Google Cloud project ID in the parameter below to authenticate to it.\n", - "\n" + "%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account\n", + "! gcloud auth activate-service-account --key-file '/path/to/service/account'" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "XoEqT2Y4DJmf" + }, + "source": [ + "## **Import libraries and define constants**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "tR6KXS3dJ3sx" + }, + "source": [ + "Import relevant packages.\n" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "m2oL8tO-f9rK" + "id": "pRUOFELefqf1" }, "outputs": [], "source": [ "from __future__ import absolute_import\n", "from __future__ import division\n", - "from __future__ import print_function\n", - "\n", - "from google.colab import auth\n", + "from __future__ import print_function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "LdWSxWQWMm1w" + }, + "outputs": [], + "source": [ "import numpy as np\n", "import os\n", "import pandas as pd\n", @@ -48,66 +517,213 @@ "from sklearn.metrics import confusion_matrix\n", "from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score\n", "from sklearn.metrics import precision_recall_curve\n", - "# For facets\n", + "from collections import OrderedDict" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "1GSz7kjjMjVP" + }, + "outputs": [], + "source": [ + "# For facets.\n", "from IPython.core.display import display, HTML\n", "import base64\n", - "!pip install --upgrade tf-nightly witwidget\n", - "import witwidget.notebook.visualization as visualization\n", - "!pip install apache-beam\n", - "!pip install --upgrade tensorflow_model_analysis\n", - "!pip install --upgrade tensorflow\n", - "\n", + "import witwidget.notebook.visualization as visualization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "cA1vWKC3MeQn" + }, + "outputs": [], + "source": [ + "# Tensorflow model analysis\n", + "import apache_beam as beam\n", + "import tempfile\n", + "from google.protobuf import text_format\n", + "from tensorflow_model_analysis import post_export_metrics\n", + "from tensorflow_model_analysis import types\n", + "from tensorflow_model_analysis.api import model_eval_lib\n", + "from tensorflow_model_analysis.evaluators import aggregate\n", + "from tensorflow_model_analysis.extractors import slice_key_extractor\n", + "from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_evaluate_graph\n", + "from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_extractor\n", + "from 
tensorflow_model_analysis.model_agnostic_eval import model_agnostic_predict\n", + "from tensorflow_model_analysis.proto import metrics_for_slice_pb2\n", + "from tensorflow_model_analysis import slicer\n", + "from tensorflow_model_analysis.view.widget_view import render_slicing_metrics" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "TNQ2gDEWMgXC" }, "outputs": [], "source": [ + "# Tensorflow versions\n", "import tensorflow as tf\n", + "print('Tensorflow version: {}'.format(tf.__version__))\n", "import tensorflow_model_analysis as tfma\n", - "print('TFMA version: {}'.format(tfma.version.VERSION_STRING))\n", + "print('TFMA version: {}'.format(tfma.version.VERSION_STRING))" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "mmsqduL8Jhck" }, "source": [ + "Populate the following cell with the necessary constants and run it to initialize them." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "OX05mmN5SNv6" }, "outputs": [], "source": [ + "#@title Constants { vertical-output: true }\n", + "\n", + "TABLE_NAME = 'bigquery-public-data.ml_datasets.credit_card_default' #@param {type:\"string\"}" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "77Km8lS2Kctp" }, "source": [ + "## **Query Dataset**" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "L41KlPaPSROt" }, "outputs": [], "source": [ + "sample_count = 3000 #@param {type:\"integer\"}\n", "\n", "row_count = pd.io.gbq.read_gbq('''\n", " SELECT \n", " COUNT(*) as total\n", - " FROM [%s]''' % (table_name), project_id=project_id, verbose=False).total[0]\n", - "df = pd.io.gbq.read_gbq('''\n", + " FROM `%s`''' % (TABLE_NAME), project_id=PROJECT_ID, verbose=False).total[0]\n", + "nested_df = pd.io.gbq.read_gbq('''\n", " SELECT\n", " *\n", " FROM\n", - " [%s]\n", + " `%s`\n", " WHERE RAND() < %d/%d\n", - "''' % (table_name, sample_count, row_count), project_id=project_id, verbose=False)\n", + " ''' % (TABLE_NAME, sample_count, row_count), \n", + " project_id=PROJECT_ID, verbose=False)\n", + "\n", "print('Full dataset has %d rows' % row_count)\n", - "df.describe()" + "nested_df.describe()" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "608Fe8PRtj5q" + "id": "H0FK-2oiKnCE" }, "source": [ - "##Data Preprocessing\n", - "\n", - "Many of the tools we use to analyze models and data expect to find their inputs in the [tensorflow.Example](https://www.tensorflow.org/tutorials/load_data/tf_records) format. Here, we'll preprocess our data into tf.Examples, and also extract the predicted class from our classifier, which is binary." + "## **Unnest the columns**" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "lqZeO9aGtn2s" + "id": "YJddw6ITNEWj" }, "outputs": [], "source": [ - "unique_id_field = 'ID' #@param\n", - "prediction_field_score = 'predicted_default_payment_next_month_tables_score' #@param\n", - "prediction_field_value = 'predicted_default_payment_next_month_tables_value' #@param\n", + "import json\n", + "\n", + "def unnest_df(nested_df):\n", + " rows_list = []\n", + " for index, row in nested_df.iterrows():\n", + " for i in row[\"predicted_default_payment_next_month\"]:\n", + " row_dict = json.loads(row.to_json())\n", + " row_dict[\"predicted_default_payment_next_month_tables_score\"] = i[\"tables\"][\"score\"]\n", + " row_dict[\"predicted_default_payment_next_month_tables_value\"] = i[\"tables\"][\"value\"]\n", + " rows_list.append(row_dict) \n", + "\n", + " unnested_df = pd.DataFrame(rows_list, columns=list(rows_list[0].keys()))\n", + " unnested_df = unnested_df.drop(\n", + " [\"predicted_default_payment_next_month\"], axis=1)\n", + " return unnested_df\n", + "\n", + "df = unnest_df(nested_df)\n", + "print(\"Unnesting completed\")" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "HzR2cIRMSwBt" }, "source": [ + "## **Data Preprocessing**\n", + "Many of the tools we use to analyze models and data expect to find their inputs in the [tensorflow.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord) format. Here, we'll preprocess our data into tf.Examples, and also extract the predicted class from our classifier, which is binary."
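Before the conversion helpers below, it may help to see what a single tensorflow.Example looks like. A minimal sketch with made-up feature names and values (the notebook's real conversion is the `df_to_examples` function that follows):

```python
import tensorflow as tf

# Each feature is stored in a typed list: int64, float, or bytes.
example = tf.train.Example()
example.features.feature['limit_balance'].int64_list.value.append(20000)
example.features.feature['predicted_class_score'].float_list.value.append(0.83)
example.features.feature['marriage'].bytes_list.value.append(b'1')
print(example)
```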
+ "## **Unnest the columns**" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "lqZeO9aGtn2s" + "id": "YJddw6ITNEWj" }, "outputs": [], "source": [ - "unique_id_field = 'ID' #@param\n", - "prediction_field_score = 'predicted_default_payment_next_month_tables_score' #@param\n", - "prediction_field_value = 'predicted_default_payment_next_month_tables_value' #@param\n", + "from collections import OrderedDict\n", + "import json\n", + "\n", + "def unnest_df(nested_df):\n", + " rows_list = []\n", + " for index, row in nested_df.iterrows():\n", + " for i in row[\"predicted_default_payment_next_month\"]:\n", + " row_dict = OrderedDict()\n", + " row_dict = json.loads(row.to_json())\n", + " row_dict[\"predicted_default_payment_next_month_tables_score\"] = i[\"tables\"][\"score\"]\n", + " row_dict[\"predicted_default_payment_next_month_tables_value\"] = i[\"tables\"][\"value\"]\n", + " rows_list.append(row_dict) \n", + "\n", + " unnested_df = pd.DataFrame(rows_list, columns=list(rows_list[0].keys()))\n", + " unnested_df = unnested_df.drop(\n", + " [\"predicted_default_payment_next_month\"], axis=1)\n", + " return unnested_df\n", "\n", + "df = unnest_df(nested_df)\n", + "print(\"Unnested completed\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "HzR2cIRMSwBt" + }, + "source": [ + "## **Data Preprocessing**\n", + "Many of the tools we use to analyze models and data expect to find their inputs in the [tensorflow.Example](https://www.tensorflow.org/tutorials/load_data/tfrecord) format. Here, we'll preprocess our data into tf. Examples, and also extract the predicted class from our classifier, which is binary." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "3RpRi-eHSoMD" + }, + "outputs": [], + "source": [ + "#@title Columns { vertical-output: true }\n", "\n", + "unique_id_field = 'id' #@param {type: 'string'}\n", + "prediction_field_score = 'predicted_default_payment_next_month_tables_score' #@param\n", + "prediction_field_value = 'predicted_default_payment_next_month_tables_value' #@param" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "_-IImPuzTiG-" + }, + "outputs": [], + "source": [ "def extract_top_class(prediction_tuples):\n", " # values from Tables show up as a CSV of individual json (prediction, confidence) objects.\n", " best_score = 0\n", @@ -116,19 +732,32 @@ " if sco > best_score:\n", " best_score = sco\n", " best_class = val\n", - " return (best_class, best_score)\n", - "\n", + " return (best_class, best_score)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "IMMv_1z2TiKT" + }, + "outputs": [], + "source": [ "def df_to_examples(df, columns=None):\n", " examples = []\n", " if columns == None:\n", " columns = df.columns.values.tolist()\n", " for id in df[unique_id_field].unique():\n", " example = tf.train.Example()\n", - " prediction_tuples = zip(df.loc[df[unique_id_field] == id][prediction_field_value], df.loc[df[unique_id_field] == id][prediction_field_score])\n", + " prediction_tuples = zip(\n", + " df.loc[df[unique_id_field] == id][prediction_field_value], \n", + " df.loc[df[unique_id_field] == id][prediction_field_score])\n", " row = df.loc[df[unique_id_field] == id].iloc[0]\n", " for col in columns:\n", " if col == 
prediction_field_score or col == prediction_field_value:\n", - " # Deal with prediction fields separately\n", + " # Deal with prediction fields separately.\n", " continue\n", " elif df[col].dtype is np.dtype(np.int64):\n", " example.features.feature[col].int64_list.value.append(int(row[col]))\n", @@ -137,23 +766,40 @@ " elif row[col] is None:\n", " continue\n", " elif row[col] == row[col]:\n", - " example.features.feature[col].bytes_list.value.append(row[col].encode('utf-8'))\n", + " example.features.feature[col].bytes_list.value.append(\n", + " row[col].encode('utf-8'))\n", " cla, sco = extract_top_class(prediction_tuples)\n", " example.features.feature['predicted_class'].int64_list.value.append(cla)\n", - " example.features.feature['predicted_class_score'].float_list.value.append(sco)\n", + " example.features.feature['predicted_class_score']\\\n", + " .float_list.value.append(sco)\n", " examples.append(example)\n", - " return examples\n", - "\n", - "# Fix up some types so analysis is consistent. This code is specific to the dataset.\n", - "df = df.astype({\"PAY_5\": float, \"PAY_6\": float})\n", + " return examples" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "hJPfTH-UTngy" + }, + "outputs": [], + "source": [ + "# Fix up some types so analysis is consistent. \n", + "# This code is specific to the dataset.\n", + "df = df.astype({\"pay_5\":float, \"pay_6\":float})\n", "\n", "# Converts a dataframe column into a column of 0's and 1's based on the provided test.\n", "def make_label_column_numeric(df, label_column, test):\n", " df[label_column] = np.where(test(df[label_column]), 1, 0)\n", " \n", "# Convert label types to numeric. This code is specific to the dataset.\n", - "make_label_column_numeric(df, 'predicted_default_payment_next_month_tables_value', lambda val: val == '1')\n", - "make_label_column_numeric(df, 'default_payment_next_month', lambda val: val == '1')\n", + "make_label_column_numeric(df, \n", + " 'predicted_default_payment_next_month_tables_value', \n", + " lambda val: val == '1')\n", + "make_label_column_numeric(df, 'default_payment_next_month', \n", + " lambda val: val == '1')\n", "\n", "examples = df_to_examples(df)\n", "print(\"Preprocessing complete!\")" @@ -163,81 +809,76 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "XwnOX_orVZEs" + "id": "sn7Y9In0TwJe" }, "source": [ - "## What-If Tool\n", - "\n", + "## **What-If Tool**\n", "First, we'll explore the data and predictions using the [What-If Tool](https://pair-code.github.io/what-if-tool/). The What-If tool is a powerful visual interface to explore data, models, and predictions. Because we're reading our results from BigQuery, we aren't able to use the features of the What-If Tool that query the model directly. But we can still learn a lot about this dataset from the exploration that the What-If tool enables.\n", "\n", - "Imagine that you're curious to discover whether there's a discrepancy in the predictive power of your model depending on the marital status of the person whose credit history is being analyzed. You can use the What-If Tool to look at a glance and see the relative sizes of the data samples for each class. In this dataset, the marital statuses are encoded as 1 = married; 2 = single; 3 = divorce; 0=others. You can see using the What-If Tool that there are very few samples for classes other than married or single, which might indicate that performance could be compromised. 
If this lack of representation concerns you, you could consider collecting more data for underrepresented classes, downsampling overrepresented classes, or upweighting underrepresented data types as you train, depending on your use case and data availability.\n" + "Imagine that you're curious to discover whether there's a discrepancy in the predictive power of your model depending on the marital status of the person whose credit history is being analyzed. You can use the What-If Tool to see, at a glance, the relative sizes of the data samples for each class. In this dataset, the marital statuses are encoded as 1 = married; 2 = single; 3 = divorced; 0 = others. You can see using the What-If Tool that there are very few samples for classes other than married or single, which might indicate that performance could be compromised. If this lack of representation concerns you, you could consider collecting more data for underrepresented classes, downsampling overrepresented classes, or upweighting underrepresented data types as you train, depending on your use case and data availability." ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "tjWxGOBkVXQ6" + "id": "FXOrJyh5TzQw" }, "outputs": [], "source": [ + "#@title WitWidget Configuration { vertical-output: false }\n", + "\n", "WitWidget = visualization.WitWidget\n", "WitConfigBuilder = visualization.WitConfigBuilder\n", "\n", "num_datapoints = 2965 #@param {type: \"number\"}\n", "tool_height_in_px = 700 #@param {type: \"number\"}\n", "\n", - "# Setup the tool with the test examples and the trained classifier\n", + "# Setup the tool with the test examples and the trained classifier.\n", "config_builder = WitConfigBuilder(examples[:num_datapoints])\n", - "# Need to call this so we have inference_address and model_name initialized\n", + "# Need to call this so we have inference_address and model_name initialized.\n", "config_builder = config_builder.set_estimator_and_feature_spec('', '')\n", - "config_builder = config_builder.set_compare_estimator_and_feature_spec('', '')\n", - "wv = WitWidget(config_builder, height=tool_height_in_px)" + "config_builder = config_builder.set_compare_estimator_and_feature_spec('', '')" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "Qfmr0cQCGBmu" }, "outputs": [], "source": [ + "WitWidget(config_builder, height=tool_height_in_px)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "YHydLAY991Du" + "id": "n3u5yCG2T8zz" }, "source": [ - "## Tensorflow Model Analysis\n", - "\n", - "Then, let's examine some sliced metrics. This section of the tutorial will use [TFMA](https://github.com/tensorflow/model-analysis) model agnostic analysis capabilities. \n", + "## **TensorFlow Model Analysis**\n", + "Then, let's examine some sliced metrics. This section of the tutorial will use [TFMA](https://github.com/tensorflow/model-analysis) model agnostic analysis capabilities.\n", "\n", - "TFMA generates sliced metrics graphs and confusion matrices. We can use these to dig deeper into the question of how well this model performs on different classes of marital status. The model was built to optimize for AUC ROC metric, and it does fairly well for all of the classes, though there is a small performance gap for the \"divorced\" category. 
But when we look at the AUC-PR metric slices, we can see that the \"divorced\" and \"other\" classes are very poorly served by the model compared to the more common classes. AUC-PR is the metric that measures how well the tradeoff between precision and recall is being made in the model's predictions. If we're concerned about this gap, we could consider retraining to use AUC-PR as the optimization metric and see whether that model does a better job making equitable predictions. " + "TFMA generates sliced metrics graphs and confusion matrices. We can use these to dig deeper into the question of how well this model performs on different classes of marital status. The model was built to optimize for the AUC ROC metric, and it does fairly well for all of the classes, though there is a small performance gap for the \"divorced\" category. But when we look at the AUC-PR metric slices, we can see that the \"divorced\" and \"other\" classes are very poorly served by the model compared to the more common classes. AUC-PR is the metric that measures how well the tradeoff between precision and recall is being made in the model's predictions. If we're concerned about this gap, we could consider retraining to use AUC-PR as the optimization metric and see whether that model does a better job making equitable predictions." ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "ZfU11b0797le" + "id": "WVV0XThVadZM" }, "outputs": [], "source": [ - "import apache_beam as beam\n", - "import tempfile\n", - "\n", - "from collections import OrderedDict\n", - "from google.protobuf import text_format\n", - "from tensorflow_model_analysis import post_export_metrics\n", - "from tensorflow_model_analysis import types\n", - "from tensorflow_model_analysis.api import model_eval_lib\n", - "from tensorflow_model_analysis.evaluators import aggregate\n", - "from tensorflow_model_analysis.extractors import slice_key_extractor\n", - "from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_evaluate_graph\n", - "from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_extractor\n", - "from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_predict\n", - "from tensorflow_model_analysis.proto import metrics_for_slice_pb2\n", - "from tensorflow_model_analysis.slicer import slicer\n", - "from tensorflow_model_analysis.view.widget_view import render_slicing_metrics\n", - "\n", "# To set up model agnostic extraction, need to specify features and labels of\n", "# interest in a feature map.\n", "feature_map = OrderedDict();\n", @@ -247,119 +888,136 @@ " if column == prediction_field_score or column == prediction_field_value:\n", " continue\n", " elif (type == np.dtype(np.float64)):\n", - " feature_map[column] = tf.FixedLenFeature([], tf.float32)\n", + " feature_map[column] = tf.io.FixedLenFeature([], tf.float32)\n", " elif (type == np.dtype(np.object)):\n", - " feature_map[column] = tf.FixedLenFeature([], tf.string)\n", + " feature_map[column] = tf.io.FixedLenFeature([], tf.string)\n", " elif (type == np.dtype(np.int64)):\n", - " feature_map[column] = tf.FixedLenFeature([], tf.int64)\n", + " feature_map[column] = tf.io.FixedLenFeature([], tf.int64)\n", " elif (type == np.dtype(np.bool)):\n", - " feature_map[column] = tf.FixedLenFeature([], tf.bool)\n", + " feature_map[column] = tf.io.FixedLenFeature([], tf.bool)\n", " elif (type == np.dtype(np.datetime64)):\n", - " feature_map[column] = tf.FixedLenFeature([], 
tf.timestamp)\n", - "\n", - "feature_map['predicted_class'] = tf.FixedLenFeature([], tf.int64)\n", - "feature_map['predicted_class_score'] = tf.FixedLenFeature([], tf.float32)\n", + " feature_map[column] = tf.io.FixedLenFeature([], tf.timestamp)\n", "\n", - "serialized_examples = [e.SerializeToString() for e in examples]\n", + "feature_map['predicted_class'] = tf.io.FixedLenFeature([], tf.int64)\n", + "feature_map['predicted_class_score'] = tf.io.FixedLenFeature([], tf.float32)\n", "\n", + "serialized_examples = [e.SerializeToString() for e in examples]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "36eU_bZjf0ci" + }, + "outputs": [], + "source": [ "BASE_DIR = tempfile.gettempdir()\n", - "OUTPUT_DIR = os.path.join(BASE_DIR, 'output')\n", - "\n", - "slice_column = 'MARRIAGE' #@param\n", - "predicted_labels = 'predicted_class' #@param\n", - "actual_labels = 'default_payment_next_month' #@param\n", - "predicted_class_score = 'predicted_class_score' #@param\n", - "\n", - "with beam.Pipeline() as pipeline:\n", - " model_agnostic_config = model_agnostic_predict.ModelAgnosticConfig(\n", - " label_keys=[actual_labels],\n", - " prediction_keys=[predicted_labels],\n", - " feature_spec=feature_map)\n", - " \n", - " extractors = [\n", - " model_agnostic_extractor.ModelAgnosticExtractor(\n", - " model_agnostic_config=model_agnostic_config,\n", - " desired_batch_size=3),\n", - " slice_key_extractor.SliceKeyExtractor([\n", - " slicer.SingleSliceSpec(columns=[slice_column])\n", - " ])\n", - " ]\n", - "\n", - " auc_roc_callback = post_export_metrics.auc(\n", - " labels_key=actual_labels,\n", - " target_prediction_keys=[predicted_labels])\n", - " \n", - " auc_pr_callback = post_export_metrics.auc(\n", - " curve='PR',\n", - " labels_key=actual_labels,\n", - " target_prediction_keys=[predicted_labels])\n", - " \n", - " confusion_matrix_callback = post_export_metrics.confusion_matrix_at_thresholds(\n", - " labels_key=actual_labels,\n", - " target_prediction_keys=[predicted_labels],\n", - " example_weight_key=predicted_class_score,\n", - " thresholds=[0.0, 0.5, 0.8, 1.0])\n", - "\n", - " # Create our model agnostic aggregator.\n", - " eval_shared_model = types.EvalSharedModel(\n", - " construct_fn=model_agnostic_evaluate_graph.make_construct_fn(\n", - " add_metrics_callbacks=[confusion_matrix_callback,\n", - " auc_roc_callback,\n", - " auc_pr_callback,\n", - " post_export_metrics.example_count()],\n", - " fpl_feed_config=model_agnostic_extractor\n", - " .ModelAgnosticGetFPLFeedConfig(model_agnostic_config)))\n", - "\n", - " # Run Model Agnostic Eval.\n", - " _ = (\n", - " pipeline\n", - " | beam.Create(serialized_examples)\n", - " | 'ExtractEvaluateAndWriteResults' >>\n", - " model_eval_lib.ExtractEvaluateAndWriteResults(\n", - " eval_shared_model=eval_shared_model,\n", - " output_path=OUTPUT_DIR,\n", - " extractors=extractors))\n", - " \n", - "\n", - "eval_result = tfma.load_eval_result(output_path=OUTPUT_DIR)\n", - "render_slicing_metrics(eval_result, slicing_column = slice_column)" + "OUTPUT_DIR = os.path.join(BASE_DIR, 'output')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "aMlNa-UQPg-n" + }, + "outputs": [], + "source": [ + "#@title TFMA Inputs { vertical-output: false }\n", + "\n", + "slice_column = 'marital_status' #@param {type: 'string'}\n", + "predicted_labels = 'predicted_class' #@param {type: 'string'}\n", + "actual_labels = 
'default_payment_next_month' #@param {type: 'string'}\n", + "predicted_class_score = 'predicted_class_score' #@param {type: 'string'}" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "1avSDsaVPrwb" }, "outputs": [], "source": [ + "with beam.Pipeline() as pipeline:\n", + " model_agnostic_config = model_agnostic_predict.ModelAgnosticConfig(\n", + " label_keys=[actual_labels],\n", + " prediction_keys=[predicted_labels],\n", + " feature_spec=feature_map)\n", + "\n", + " extractors = [\n", + " model_agnostic_extractor.ModelAgnosticExtractor(\n", + " model_agnostic_config=model_agnostic_config,\n", + " desired_batch_size=3),\n", + " slice_key_extractor.SliceKeyExtractor([\n", + " slicer.SingleSliceSpec(columns=[slice_column])\n", + " ])\n", + " ]\n", + "\n", + " auc_roc_callback = post_export_metrics.auc(\n", + " labels_key=actual_labels,\n", + " target_prediction_keys=[predicted_labels])\n", + "\n", + " auc_pr_callback = post_export_metrics.auc(\n", + " curve='PR',\n", + " labels_key=actual_labels,\n", + " target_prediction_keys=[predicted_labels])\n", + "\n", + " confusion_matrix_callback = post_export_metrics\\\n", + " .confusion_matrix_at_thresholds(\n", + " labels_key=actual_labels,\n", + " target_prediction_keys=[predicted_labels],\n", + " example_weight_key=predicted_class_score,\n", + " thresholds=[0.0, 0.5, 0.8, 1.0])\n", + "\n", + " # Create our model agnostic aggregator.\n", + " eval_shared_model = types.EvalSharedModel(\n", + " construct_fn=model_agnostic_evaluate_graph.make_construct_fn(\n", + " add_metrics_callbacks=[confusion_matrix_callback,\n", + " auc_roc_callback,\n", + " auc_pr_callback,\n", + " post_export_metrics.example_count()],\n", + " config=model_agnostic_config))\n", + "\n", + " # Run Model Agnostic Eval.\n", + " _ = (\n", + " pipeline\n", + " | beam.Create(serialized_examples)\n", + " | 'ExtractEvaluateAndWriteResults' >>\n", + " model_eval_lib.ExtractEvaluateAndWriteResults(\n", + " eval_shared_model=eval_shared_model,\n", + " output_path=OUTPUT_DIR,\n", + " extractors=extractors))\n", + "\n", + "eval_result = tfma.load_eval_result(output_path=OUTPUT_DIR)" ] }, { "cell_type": "code", - "execution_count": 0, + "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", - "id": "mOotC2D5Onqu" + "id": "B0OFjuIbF_jz" }, "outputs": [], - "source": [] + "source": [ + "render_slicing_metrics(eval_result, slicing_column=slice_column)" ] } ], "metadata": { "colab": { "collapsed_sections": [], - "last_runtime": { - "build_target": "//learning/fairness/colabs:ml_fairness_notebook", - "kind": "shared" - }, "name": "slicing_eval_results.ipynb", - "provenance": [ - { - "file_id": "1goi268plF-1AJ77xjdMwIpapBr1ssb-q", - "timestamp": 1551899111384 - }, - { - "file_id": "/piper/depot/google3/cloud/ml/autoflow/colab/slicing_eval_results.ipynb?workspaceId=simonewu:autoflow-1::citc", - "timestamp": 1547767618990 - }, - { - "file_id": "1fjkKgZq5iMevPnfiIpSHSiSiw5XimZ1C", - "timestamp": 1547596565571 - } - ], - "version": "0.3.2" + "provenance": [] }, "kernelspec": { "display_name": "Python 3", @@ -376,9 +1034,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.5.3" } }, "nbformat": 4, - "nbformat_minor": 1 + "nbformat_minor": 4 } diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb 
b/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb index b984679ff95d..9b787feca128 100644 --- a/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb +++ b/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb @@ -3,7 +3,11 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "ur8xi4C7S06n" + }, "outputs": [], "source": [ "# Copyright 2019 Google LLC\n", @@ -12,7 +16,7 @@ "# you may not use this file except in compliance with the License.\n", "# You may obtain a copy of the License at\n", "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", + "# https://www.apache.org/licenses/LICENSE-2.0 \n", "#\n", "# Unless required by applicable law or agreed to in writing, software\n", "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", @@ -23,24 +27,21 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "HosWdaE-KieL" + }, "source": [ - "# Retail Product Stockouts Prediction using AutoML Tables\n", + "# **Retail Product Stockouts Prediction using AutoML Tables**\n", "\n", "\n", " \n", - " \n", "
\n", - " \n", - " \"Google Read on cloud.google.com\n", - " \n", - " \n", - " \n", + " \n", " \"Colab Run in Colab\n", " \n", " \n", - " \n", + " \n", " \"GitHub\n", " View on GitHub\n", " \n", @@ -52,183 +53,424 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "m26YhtBMvVWA" + "id": "tvgnzT1CKxrO" }, "source": [ - "# Overview\n", + "## **Overview**\n", "\n", - "AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This Notebook demonstrates how you can use AutoML Tables to solve a product stockouts problem in the retail industry. This problem is solved using a binary classification approach, which predicts whether a particular product at a certain store will be out-of-stock or not in the next four weeks. Once the solution is built, you can plug this in with your production system and proactively predict stock-outs for your business." + "AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This Notebook demonstrates how you can use AutoML Tables to solve a product stockouts problem in the retail industry. This problem is solved using a binary classification approach, which predicts whether a particular product at a certain store will be out-of-stock or not in the next four weeks. Once the solution is built, you can plug this in with your production system and proactively predict stock-outs for your business.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "EvDhOgYL8V_K" + }, + "source": [ + "### **Dataset**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "m26YhtBMvVWA" + "id": "8pfea4To7XBv" }, "source": [ - "## Objective\n", + "In this solution, you will use two datasets: Training/Evaluation data and Batch Prediction inputs. To access the datasets in BigQuery, you need the following information.\n", "\n", - "### Problem statement\n", + "##### **Training/Evaluation dataset**\n", "\n", - "A stockout, or out-of-stock (OOS) event is an event that causes inventory to be exhausted. While out-of-stocks can occur along the entire supply chain, the most visible kind are retail out-of-stocks in the fast-moving consumer goods industry (e.g., sweets, diapers, fruits). Stockouts are the opposite of overstocks, where too much inventory is retained.\n", + " * `Project ID: product-stockout`\n", + " * `Dataset ID: product_stockout`\n", + " * `Table ID: stockout`\n", + " \n", + "##### **Batch Prediction inputs**\n", "\n", - "### Impact\n", + " * `Project ID: product-stockout`\n", + " * `Dataset ID: product_stockout`\n", + " * `Table ID: batch_prediction_inputs`\n", "\n", - "According to a study by researchers Thomas Gruen and Daniel Corsten, the global average level of out-of-stocks within retail fast-moving consumer goods sector across developed economies was 8.3% in 2002. This means that shoppers would have a 42% chance of fulfilling a ten-item shopping list without encountering a stockout. Despite the initiatives designed to improve the collaboration of retailers and their suppliers, such as Efficient Consumer Response (ECR), and despite the increasing use of new technologies such as radio-frequency identification (RFID) and point-of-sale data analytics, this situation has improved little over the past decades.\n", + "##### **Data Schema**\n", "\n", - "The biggest impacts being\n", - "1. Customer dissatisfaction\n", - "2. 
Loss of revenue\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + "\n", + "\t\n", + "\t\n", + "\t\n", + "\t\n", + "\n", + " \n", + "
Field name Datatype Type Description
Item_NumberSTRINGIdentifierThis is the product/ item identifier
CategorySTRINGIdentifierSeveral items could belong to one category
Vendor_NumberSTRINGIdentifierProduct vendor identifier
Store_NumberSTRINGIdentifierStore identifier
Item_DescriptionSTRINGText FeaturesItem Description
Category_NameSTRINGText FeaturesCategory Name
Vendor_NameSTRINGText FeaturesVendor Name
Store_NameSTRINGText FeaturesStore Name
AddressSTRINGText FeaturesAddress
CitySTRINGCategorical FeaturesCity
Zip_CodeSTRINGCategorical FeaturesZip-code
Store_LocationSTRINGCategorical FeaturesStore Location
County_NumberSTRINGCategorical FeaturesCounty Number
CountySTRINGCategorical FeaturesCounty Name
Weekly Sales QuantityINTEGERTime series data52 columns for weekly sales quantity from week 1 to week 52
Weekly Sales DollarsINTEGERTime series data52 columns for weekly sales dollars from week 1 to week 52
InventoryFLOATNumeric FeatureThis inventory is stocked by the retailer looking at past sales and seasonality of the product to meet demand for future sales.
StockoutINTEGERLabel(1 - Stock-out, 0 - No stock-out) When the demand for four weeks future sales is not met by the inventory in stock we say we see a stock-out.\n", + "
This is because an early warning sign would help the retailer re-stock inventory with a lead time for the stock to be replenished.

\n", + "To use AutoML Tables with BigQuery you do not need to download this dataset. However, if you would like to use AutoML Tables with GCS you may want to download this dataset and upload it into your GCP Project storage bucket. \n", "\n", - "### Machine Learning Solution\n", + "**Instructions to download dataset:**\n", "\n", - "Using machine learning to solve for stock-outs can help with store operations and thus prevent out-of-stock proactively.\n", + "1. Sample Dataset: Download this dataset which contains sales data.\n", "\n", - "There are three big challenges any retailer would face as they try and solve this problem with machine learning:\n", + "\t* [Link to training data](https://console.cloud.google.com/bigquery?folder=&organizationId=&project=product-stockout&p=product-stockout&d=product_stockout&t=stockout&page=table): \n", "\n", - "1. Data silos: Sales data, supply-chain data, inventory data, etc. may all be in silos. Such disjoint datasets could be a challenge to work with as a machine learning model tries to derive insights from all these data points. \n", - "2. Missing Features: Features such as vendor location, weather conditions, etc. could add a lot of value to a machine learning algorithm to learn from. But such features are not always available and when building machine learning solutions we think for collecting features as an iterative approach to improving the machine learning model.\n", - "3. Imbalanced dataset: Datasets for classification problems such as retail stock-out are traditionally very imbalanced with fewer cases for stock-out. Designing machine learning solutions by hand for such problems would be time consuming effort when your team should be focusing on collecting features.\n", + "\t\tDataset URI: \n", + "\t* [Link to data for batch predictions](https://console.cloud.google.com/bigquery?folder=&organizationId=&project=product-stockout&p=product-stockout&d=product_stockout&t=batch_prediction_inputs&page=table): \n", "\n", - "Hence, we recommend using AutoML Tables. With AutoML Tables you only need to work on acquiring all data and features, and AutoML Tables would do the rest. This is a one-click deploy to solving the problem of stock-out with machine learning." + "\t\tDataset URI: \n", + "\n", + "2. Upload this dataset to GCS or BigQuery (optional). \n", + "\n", + "\t* You could select either [GCS](https://cloud.google.com/storage/) or [BigQuery](https://cloud.google.com/bigquery/) as the location of your choice to store the data for this challenge. \n", + "\n", + "\t\t1. Storing data on GCS: [Creating storage buckets, Uploading data to storage buckets](https://cloud.google.com/storage/docs/creating-buckets)\n", + "\t\t2. Storing data on BigQuery: [Create and load data to BigQuery](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui) (optional)\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "m26YhtBMvVWA" + "id": "AD0-cRZ28MxI" }, "source": [ - "## Dataset\n", - "\n", - "In this solution, you will use two datasets: Training/Evaluation data and Batch Prediction inputs. To access the datasets in BigQuery, you need the following information. \n", - "\n", - "Training/Evaluation dataset: \n", + "### **Objective**" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "TAqUXPbG7pV3" + }, + "source": [ + "#### **Problem statement**\n", + "A stockout, or out-of-stock (OOS) event is an event that causes inventory to be exhausted. 
While out-of-stocks can occur along the entire supply chain, the most visible kind are retail out-of-stocks in the fast-moving consumer goods industry (e.g., sweets, diapers, fruits). Stockouts are the opposite of overstocks, where too much inventory is retained.\n", "\n", + "#### **Impact**\n", + "According to a study by researchers Thomas Gruen and Daniel Corsten, the global average level of out-of-stocks within the retail fast-moving consumer goods sector across developed economies was 8.3% in 2002. This means that shoppers would have a 42% chance of fulfilling a ten-item shopping list without encountering a stockout. Despite the initiatives designed to improve the collaboration of retailers and their suppliers, such as Efficient Consumer Response (ECR), and despite the increasing use of new technologies such as radio-frequency identification (RFID) and point-of-sale data analytics, this situation has improved little over the past decades.\n", "\n", + "The biggest impacts are:\n", "\n", + "* Customer dissatisfaction\n", + "* Loss of revenue\n", "\n", "\n", "\n", + "#### **Machine Learning Solution**\n", + "Using machine learning to solve for stock-outs can help with store operations and thus prevent out-of-stocks proactively.\n", "\n", "There are three big challenges any retailer would face as they try and solve this problem with machine learning:\n", "\n", + "1. Data silos: Sales data, supply-chain data, inventory data, etc. may all be in silos. Such disjoint datasets could be a challenge to work with as a machine learning model tries to derive insights from all these data points.\n", + "2. Missing Features: Features such as vendor location, weather conditions, etc. could add a lot of value to a machine learning algorithm to learn from. But such features are not always available, and when building machine learning solutions we treat collecting features as an iterative process for improving the machine learning model.\n", + "3. Imbalanced dataset: Datasets for classification problems such as retail stock-out are traditionally very imbalanced with fewer cases for stock-out. 
Designing machine learning solutions by hand for such problems would be a time-consuming effort when your team should be focusing on collecting features.\n", "\n", "Hence, we recommend using AutoML Tables. With AutoML Tables you only need to work on acquiring all data and features, and AutoML Tables would do the rest. This is a one-click deployment for solving the problem of stock-out with machine learning." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "SLq3FfRa8E8X" }, "source": [ + "### **Costs**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "DzxIfOrB71wl" }, "source": [ + "This tutorial uses billable components of Google Cloud Platform (GCP):\n", "\n", + "* Cloud AI Platform\n", + "* Cloud Storage\n", + "* BigQuery\n", + "* AutoML Tables\n", "\n", + "Learn about [Cloud AI Platform pricing](https://cloud.google.com/ml-engine/docs/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), [BigQuery pricing](https://cloud.google.com/bigquery/pricing), [AutoML Tables pricing](https://cloud.google.com/automl-tables/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ze4-nDLfK4pw" + }, + "source": [ + "## **Set up your local development environment**\n", "\n", - "Instructions to download dataset: \n", + "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", + "all the requirements to run this notebook. If you are using **AI Platform Notebook**, make sure the machine configuration type is **1 vCPU, 3.75 GB RAM** or above. You can skip this step." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "gCuSR8GkAgzl" + }, + "source": [ + "**Otherwise**, make sure your environment meets this notebook's requirements.\n", + "You need the following:\n", "\n", - "Sample Dataset: Download this dataset which contains sales data.\n", + "* The Google Cloud SDK\n", + "* Git\n", + "* Python 3\n", + "* virtualenv\n", + "* Jupyter notebook running in a virtual environment with Python 3\n", "\n", - "1. [Link to training data](https://console.cloud.google.com/bigquery?folder=&organizationId=&project=product-stockout&p=product-stockout&d=product_stockout&t=stockout&page=table): \n", + "The Google Cloud guide to [Setting up a Python development\n", + "environment](https://cloud.google.com/python/setup) and the [Jupyter\n", + "installation guide](https://jupyter.org/install) provide detailed instructions\n", + "for meeting these requirements. The following steps provide a condensed set of\n", + "instructions:\n", "\n", - "Dataset URI: \n", + "1. [Install and initialize the Cloud SDK.](https://cloud.google.com/sdk/docs/)\n", "\n", - "2. [Link to data for batch predictions](https://console.cloud.google.com/bigquery?folder=&organizationId=&project=product-stockout&p=product-stockout&d=product_stockout&t=batch_prediction_inputs&page=table): \n", + "2. [Install Python 3.](https://cloud.google.com/python/setup#installing_python)\n", "\n", - "Dataset URI: \n", + "3. [Install\n", + " virtualenv](https://cloud.google.com/python/setup#installing_and_using_virtualenv)\n", + " and create a virtual environment that uses Python 3.\n", "\n", - "Upload this dataset to GCS or BigQuery (optional). \n", + "4. Activate that environment and run `pip install jupyter` in a shell to install\n", + " Jupyter.\n", "\n", - "You could select either [GCS](https://cloud.google.com/storage/) or [BigQuery](https://cloud.google.com/bigquery/) as the location of your choice to store the data for this challenge. \n", + "5. Run `jupyter notebook` in a shell to launch Jupyter.\n", "\n", - "1. Storing data on GCS: [Creating storage buckets, Uploading data to storage buckets](https://cloud.google.com/storage/docs/creating-buckets)\n", - "2. Storing data on BigQuery: [Create and load data to BigQuery](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui) (optional)" + "6. Open this notebook in the Jupyter Notebook Dashboard." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "b--5FDDwCG9C" + "id": "BF1j6f9HApxa" }, "source": [ - "## 1. Before you begin\n", + "## **Set up your GCP project**\n", "\n", - "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:\n", - "* Create a Google Cloud Platform (GCP) project and local development environment.\n", - "* Enable billing.\n", - "* Enable AutoML API.\n", - "* Enter your project ID in the cell below. 
Then run the cell to make sure the\n", "\n", - "**If you are using Colab or AI Platform Notebooks**, your environment already meets\n", - "all the requirements to run this notebook. You can skip this step from the AutoML Tables documentation\n", + "**The following steps are required, regardless of your notebook environment.**\n", "\n", - "Cloud SDK uses the right project for all the commands in this notebook.\n", + "1. [Select or create a GCP project.](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", "\n", - "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands" + "2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)\n", "\n", + "3. [Enable the AI Platform APIs and Compute Engine APIs.](https://console.cloud.google.com/flows/enableapi?apiid=ml.googleapis.com,compute_component)\n", "\n", + "4. [Enable AutoML API.](https://console.cloud.google.com/apis/library/automl.googleapis.com?q=automl)\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "i7EUnXsZhAGF" }, "source": [ + "## **PIP Install Packages and dependencies**\n", + "\n", + "Install additional dependencies not already installed in the Notebook environment." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "wyy5Lbnzg5fi" + }, "outputs": [], "source": [ - "PROJECT_ID = \"\" # @param {type:\"string\"}\n", - "COMPUTE_REGION = \"us-central1\" # Currently only supported region.\n", - "! gcloud config set project $PROJECT_ID" + "! pip install --upgrade --quiet --user google-cloud-automl\n", + "! pip install matplotlib" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "xZECt1oL429r" + "id": "kK5JATKPNf3I" }, "source": [ - "\n", - "\n", - "---\n", - "\n" + "**Note:** Try installing with `sudo` if the above command throws any permission errors." ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", "id": "f-YlNVLTYXXN" }, "source": [ + "Restart the kernel to allow `automl_v1beta1` to be imported for Jupyter Notebooks.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "C16j_LPrYbZa" }, "outputs": [], "source": [ + "from IPython.core.display import HTML\n", + "HTML(\"\")" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "GWpby48cF6U7" + }, "source": [ - "This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections." + "## **Set up your GCP Project ID**\n", + "\n", + "Enter your `Project ID` in the cell below. Then run the cell to make sure the\n", + "Cloud SDK uses the right project for all the commands in this notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "colab": {}, "colab_type": "code", "id": "oM1iC_MfAts1" }, "outputs": [], "source": [ + "PROJECT_ID = \"[your-project-id]\" #@param {type:\"string\"}\n", + "COMPUTE_REGION = \"us-central1\" # Currently only supported region.\n",
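+ "\n", + "# Restored from the earlier version of this cell: point the Cloud SDK at\n", + "# PROJECT_ID so later shell commands run against the right project, as the\n", + "# markdown above describes.\n", + "! gcloud config set project $PROJECT_ID"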
] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "rstRPH9SyZj_" + "id": "dr--iN2kAylZ" }, "source": [ - "### Authenticate your GCP account\n", + "## **Authenticate your GCP account**\n", "\n", "**If you are using AI Platform Notebooks**, your environment is already\n", "authenticated. Skip this step." @@ -236,12 +478,12 @@ }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "3yyVCJHFSEKG" + }, "source": [ - "**If you are using Colab**, run the cell below and follow the instructions\n", - "when prompted to authenticate your account via oAuth.\n", - "\n", - "**Otherwise**, follow these steps:\n", + "Otherwise, follow these steps:\n", "\n", "1. In the GCP Console, go to the [**Create service account key**\n", " page](https://console.cloud.google.com/apis/credentials/serviceaccountkey).\n", @@ -251,213 +493,352 @@ "3. In the **Service account name** field, enter a name.\n", "\n", "4. From the **Role** drop-down list, select\n", - " **AutoML > AutoML Admin** and\n", - " **Storage > Storage Object Admin**.\n", + " **AutoML > AutoML Admin**,\n", + " **Storage > Storage Object Admin** and **BigQuery > BigQuery Admin**.\n", "\n", "5. Click *Create*. A JSON file that contains your key downloads to your\n", - "local environment.\n", - "\n", - "6. Enter the path to your service account key as the\n", - "`GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell." + "local environment." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Yt6PhVG0UdF1" + }, + "source": [ + "**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$` into these commands." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "q5TeVHKDMOJF" + }, "outputs": [], "source": [ + "# Upload the downloaded JSON file that contains your key.\n", "import sys\n", "\n", - "# If you are running this notebook in Colab, run this cell and follow the\n", - "# instructions to authenticate your GCP account. This provides access to your\n", - "# Cloud Storage bucket and lets you submit training jobs and prediction\n", - "# requests.\n", - "\n", "if 'google.colab' in sys.modules: \n", " from google.colab import files\n", " keyfile_upload = files.upload()\n", " keyfile = list(keyfile_upload.keys())[0]\n", " %env GOOGLE_APPLICATION_CREDENTIALS $keyfile\n", + " ! gcloud auth activate-service-account --key-file $keyfile" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "d1bnPeDVMR5Q" + }, + "source": [ + "***If you are running the notebook locally***, enter the path to your service account key as the `GOOGLE_APPLICATION_CREDENTIALS` variable in the cell below and run the cell" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "fsVNKXESYoeQ" + }, + "outputs": [], + "source": [ "# If you are running this notebook locally, replace the string below with the\n", "# path to your service account key and run this cell to authenticate your GCP\n", "# account.\n", - "else:\n", - " %env GOOGLE_APPLICATION_CREDENTIALS /path/to/service_account.json" + "\n", + "%env GOOGLE_APPLICATION_CREDENTIALS /path/to/service/account\n", + "! 
gcloud auth activate-service-account --key-file '/path/to/service/account'" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "BR0POq2UzE7e" + "id": "zgPO1eR3CYjk" }, "source": [ - "### Install the client library\n", - "Run the following cell to install the client libary using `pip`." + "## **Create a Cloud Storage bucket**\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "When you submit a training job using the Cloud SDK, you upload a Python package\n", + "containing your training code to a Cloud Storage bucket. AI Platform runs\n", + "the code from this package. In this tutorial, AI Platform also saves the\n", + "trained model that results from your job in the same bucket. You can then\n", + "create an AI Platform model version based on this output in order to serve\n", + "online predictions.\n", + "\n", + "Set the name of your Cloud Storage bucket below. It must be unique across all\n", + "Cloud Storage buckets. \n", + "\n", + "You may also change the `REGION` variable, which is used for operations\n", + "throughout the rest of this notebook. Make sure to [choose a region where Cloud\n", + "AI Platform services are\n", + "available](https://cloud.google.com/ml-engine/docs/tensorflow/regions). You may\n", + "not use a Multi-Regional Storage bucket for training with AI Platform." ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "cellView": "both", + "colab": {}, + "colab_type": "code", + "id": "MzGDU7TWdts_" + }, "outputs": [], "source": [ - "%pip install --quiet google-cloud-automl" + "BUCKET_NAME = \"[your-bucket-name]\" #@param {type:\"string\"}" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "-EcIXiGsCePi" + }, "source": [ - "Restart the kernel to allow automl_v1beta1 to be imported for Jupyter Notebooks." + "**Only if your bucket doesn't exist**: Run the following cell to create your Cloud Storage bucket. Make sure Storage > Storage Admin role is enabled" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "NIq7R4HZCfIc" + }, "outputs": [], "source": [ - "from IPython.core.display import HTML\n", - "HTML(\"\")" + "! gsutil mb -p $PROJECT_ID -l $COMPUTE_REGION gs://$BUCKET_NAME" ] }, { "cell_type": "markdown", - "metadata": {}, + "metadata": { + "colab_type": "text", + "id": "ucvCsknMCims" + }, "source": [ - "### Import libraries and define constants\n", - "\n", - "First, import Python libraries required for training,\n", - "The code example below demonstrates importing the AutoML Python API module into a python script. " + "Finally, validate access to your Cloud Storage bucket by examining its contents:" ] }, { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "vhOb7YnwClBb" + }, "outputs": [], "source": [ - "# AutoML library\n", - "from google.cloud import automl\n", - "\n", - "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", - "import matplotlib.pyplot as plt\n", - "\n", - "client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)" + "! 
gsutil ls -al gs://$BUCKET_NAME" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "s3F2xbEJdDvN" + "id": "XoEqT2Y4DJmf" }, "source": [ - "### Test the set up" + "## **Import libraries and define constants**" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "rUlBcZ3OfWcJ" + "id": "Y9Uo3tifg1kx" }, "source": [ - "To test whether your project set up and authentication steps were successful, run the following cell to list your datasets in this project.\n", - "\n", - "If no dataset has previously imported into AutoML Tables, you shall expect an empty return." + "Import relevant packages.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "cellView": "both", "colab": {}, "colab_type": "code", - "id": "sf32nKXIqYje" + "id": "pRUOFELefqf1" }, "outputs": [], "source": [ - "#@title List datasets. { vertical-output: true }\n", - "\n", - "list_datasets = client.list_datasets()\n", - "datasets = { dataset.display_name: dataset.name for dataset in list_datasets }\n", - "datasets" + "from __future__ import absolute_import\n", + "from __future__ import division\n", + "from __future__ import print_function" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "02J-91J6ZMUk" + }, + "outputs": [], + "source": [ + "# AutoML library.\n", + "from google.cloud import automl_v1beta1 as automl\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "t9uE8MvMkOPd" + "id": "WIoaocE_ITKY" }, "source": [ - "You can also print the list of your models by running the following cell.\n", - "\n", - "If no model has previously trained using AutoML Tables, you shall expect an empty return." + "Populate the following cell with the necessary constants and run it to initialize constants." ] }, { "cell_type": "code", "execution_count": null, "metadata": { - "cellView": "both", "colab": {}, "colab_type": "code", - "id": "j4-bYRSWj7xk" + "id": "1e9hznN_IUej" }, "outputs": [], "source": [ - "#@title List models. { vertical-output: true }\n", - "\n", - "list_models = client.list_models()\n", - "models = { model.display_name: model.name for model in list_models }\n", - "models" + "#@title Constants { vertical-output: true }\n", + "\n", + "# A name for the AutoML tables Dataset to create.\n", + "DATASET_DISPLAY_NAME = 'stockout_data' #@param {type: 'string'}\n", + "# The BigQuery Dataset URI to import data from.\n", + "BQ_INPUT_URI = 'bq://product-stockout.product_stockout.stockout' #@param {type: 'string'}\n", + "# A name for the AutoML tables model to create.\n", + "MODEL_DISPLAY_NAME = 'stockout_model' #@param {type: 'string'}\n", + "\n", + "assert all([\n", + " PROJECT_ID,\n", + " COMPUTE_REGION,\n", + " DATASET_DISPLAY_NAME,\n", + " BQ_INPUT_URI,\n", + " MODEL_DISPLAY_NAME,\n", + "])" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "qozQWMnOu48y" + "id": "MLtmkt7GbGlC" }, "source": [ + "Initialize the client for AutoML and AutoML Tables." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "fZiTNuQmcBoN" + }, + "outputs": [], + "source": [ + "# Initialize the clients.\n", + "automl_client = automl.AutoMlClient()\n", + "tables_client = automl.TablesClient(project=PROJECT_ID, region=COMPUTE_REGION)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "xdJykMXDozoP" + }, + "source": [ + "## **Test the set up**\n", "\n", + "To test whether your project set up and authentication steps were successful, run the following cell to list your datasets in this project.\n", "\n", - "---\n", - "\n" + "If no dataset has previously imported into AutoML Tables, you shall expect an empty return." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "_dKylOQTpF58" + }, + "outputs": [], + "source": [ + "# List the datasets.\n", + "list_datasets = tables_client.list_datasets()\n", + "datasets = { dataset.display_name: dataset.name for dataset in list_datasets }\n", + "datasets" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "ODt86YuVDZzm" + "id": "dleTdOMaplSM" }, "source": [ - "## 2. Import training data" + "You can also print the list of your models by running the following cell.\n", + "\n", + "If no model has previously trained using AutoML Tables, you shall expect an empty return.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "tMXP6no1pn9p" + }, + "outputs": [], + "source": [ + "# List the models.\n", + "list_models = tables_client.list_models()\n", + "models = { model.display_name: model.name for model in list_models }\n", + "models" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "XwjZc9Q62Fm5" + "id": "RzzzdXANp858" }, "source": [ - "### Create dataset" + "## **Import training data**\n", + "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "_JfZFGSceyE_" + "id": "5i8PBNWJ3rAv" }, "source": [ + "#### **Create dataset**\n", + "\n", "Select a dataset display name and pass your table source information to create a new dataset." ] }, @@ -467,15 +848,13 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "Z_JErW3cw-0J" + "id": "OXddfTPoqO1Z" }, "outputs": [], "source": [ - "#@title Create dataset { vertical-output: true, output-height: 200 }\n", - "\n", - "dataset_display_name = 'stockout_data' #@param {type: 'string'}\n", - "\n", - "dataset = client.create_dataset(dataset_display_name)\n", + "# Create dataset.\n", + "dataset = tables_client.create_dataset(DATASET_DISPLAY_NAME)\n", + "dataset_name = dataset.name\n", "dataset" ] }, @@ -483,24 +862,25 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "35YZ9dy34VqJ" + "id": "InYuWIf5qQe7" }, "source": [ - "### Import data" + "#### **Import data**\n", + "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "3c0o15gVREAw" + "id": "aNJvoyOAmAOf" }, "source": [ "You can import your data to AutoML Tables from GCS or BigQuery. For this solution, you will import data from a BigQuery Table. The URI for your table is in the format of `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", "\n", - "The BigQuery Table used for demonstration purpose can be accessed as `bq://product-stockout.product_stockout.stockout`. 
\n", + "The BigQuery Table used for demonstration purpose can be accessed as `bq://product-stockout.product_stockout.stockout`.\n", "\n", - "See the table schema and dataset description from the README. " + "See the table schema and dataset description from the README." ] }, { @@ -509,21 +889,22 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "FNVYfpoXJsNB" + "id": "Mzjb3xgLsPb7" }, "outputs": [], "source": [ - "#@title Import data { vertical-output: true }\n", - "\n", - "import_data_operation = client.import_data(\n", + "# Import data.\n", + "import_data_response = tables_client.import_data(\n", " dataset=dataset,\n", - " bigquery_input_uri=bq_input_uri,\n", + " bigquery_input_uri=BQ_INPUT_URI,\n", ")\n", - "print('Dataset import operation: {}'.format(import_data_operation))\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))\n", "\n", "# Synchronous check of operation status. Wait until import is done.\n", - "import_data_operation.result()\n", - "dataset = client.get_dataset(dataset_name=dataset.name)\n", + "print('Dataset import response: {}'.format(import_data_response.result()))\n", + "\n", + "# Verify the status by checking the example_count field.\n", + "dataset = tables_client.get_dataset(dataset_name=dataset_name)\n", "dataset" ] }, @@ -531,10 +912,10 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "_WLvyGIDe9ah" + "id": "gxVzEhiBqfWr" }, "source": [ - "Importing this stockout datasets takes about 10 minutes. \n", + "Importing this stockout datasets takes about 10 minutes.\n", "\n", "If you re-visit this Notebook, uncomment the following cell and run the command to retrieve your dataset. Replace `YOUR_DATASET_NAME` with its actual value obtained in the preceding cells.\n", "\n", @@ -547,31 +928,22 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "P6NkRMyJfAGm" + "id": "fpP1xWscqhJ8" }, "outputs": [], "source": [ "# dataset_name = '' #@param {type: 'string'}\n", - "# dataset = client.get_dataset(dataset_name=dataset_name) " - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "QdxBI4s44ZRI" - }, - "source": [ - "### Review the specs" + "# dataset = tables_client.get_dataset(dataset_name=dataset_name)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "RC0PWKqH4jwr" + "id": "Neewv2bXqkFf" }, "source": [ + "## **Review the specs**\n", "Run the following command to see table specs such as row count." 
] }, @@ -581,21 +953,21 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "v2Vzq_gwXxo-" + "id": "jn5-g-RwquOd" }, "outputs": [], "source": [ - "# List table specs\n", - "list_table_specs_response = client.list_table_specs(dataset=dataset)\n", + "# List table specs.\n", + "list_table_specs_response = tables_client.list_table_specs(dataset=dataset)\n", "table_specs = [s for s in list_table_specs_response]\n", "\n", - "# List column specs\n", - "list_column_specs_response = client.list_column_specs(dataset=dataset)\n", + "# List column specs.\n", + "list_column_specs_response = tables_client.list_column_specs(dataset=dataset)\n", "column_specs = {s.display_name: s for s in list_column_specs_response}\n", "\n", - "# Print Features and data_type:\n", - "\n", - "features = [(key, data_types.TypeCode.Name(value.data_type.type_code)) for key, value in column_specs.items()]\n", + "# Print Features and data_type.\n", + "features = [(key, data_types.TypeCode.Name(value.data_type.type_code))\n", + " for key, value in column_specs.items()]\n", "print('Feature list:\\n')\n", "for feature in features:\n", " print(feature[0],':', feature[1])" @@ -604,9 +976,14 @@ { "cell_type": "code", "execution_count": null, - "metadata": {}, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "63QFqQfxqyCR" + }, "outputs": [], "source": [ + "# Table schema pie chart.\n", "type_counts = {}\n", "for column_spec in column_specs.values():\n", " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", @@ -621,7 +998,7 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "Lqjq4X43v3ON" + "id": "opNreHuMqzJJ" }, "source": [ "In the pie chart above, you see this dataset contains three variable types: `FLOAT64` (treated as `Numeric`), `CATEGORY` (treated as `Categorical`) and `STRING` (treated as `Text`). " @@ -631,46 +1008,29 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "FNykW_YOYt6d" - }, - "source": [ - "___" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "kNRVJqVOL8h3" - }, - "source": [ - "## 3. Update dataset: assign a label column and enable nullable columns" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "VsOPwxN9fOIl" + "id": "avNsksNFrEAa" }, "source": [ - "### Get column specs" + "## **Update dataset: assign a label column and enable nullable columns**\n", + "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "-57gehId9PQ5" + "id": "Dk2jFo274O-z" }, "source": [ - "AutoML Tables automatically detects your data column type. \n", + "#### **Get column specs**\n", + "\n", + "AutoML Tables automatically detects your data column type.\n", "\n", "There are a total of 120 columns in this stockout dataset.\n", "\n", "Run the following command to check the column data type that was automatically detected. If a column contains only numerical values but represents categories, change that column's data type to categorical by updating your schema.\n", "\n", - "In addition, AutoML Tables detects `Stockout` to be categorical that chooses to run a classification model. " + "In addition, AutoML Tables detects `Stockout` to be categorical, so it chooses to run a classification model."
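Rather than eyeballing 120 printed types, you can shortlist likely mis-typed columns programmatically before editing the schema by hand. The sketch below is a hedged heuristic: the `_Number`/`_Code` name patterns are assumptions about this dataset's naming convention, not part of the AutoML API.

```python
# A hedged heuristic sketch: flag FLOAT64 columns whose names look like
# identifiers as candidates for conversion to CATEGORY.
id_like_suffixes = ('_Number', '_Code')  # assumed naming convention
candidates = [
    name for name, spec in column_specs.items()
    if data_types.TypeCode.Name(spec.data_type.type_code) == 'FLOAT64'
    and name.endswith(id_like_suffixes)
]
print('Candidate categorical columns: {}'.format(candidates))
```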
] }, { @@ -679,12 +1039,10 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "jso_JBI9fgy6" + "id": "jvF9_3ierVdu" }, "outputs": [], "source": [ - "#@title Check column data type { vertical-output: true }\n", - "\n", "# Print column data types.\n", "for column in column_specs:\n", " print(column, '-', column_specs[column].data_type)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "iRqdQ7Xiq04x" + "id": "fGamAlLgrXnL" }, "source": [ - "### Update columns: make categorical\n", + "#### **Update columns: make categorical**\n", "\n", "From the column data type, you noticed `Item_Number`, `Category`, `Vendor_Number`, `Store_Number`, `Zip_Code` and `County_Number` have been autodetected as `FLOAT64` (Numerical) instead of `CATEGORY` (Categorical). \n", "\n", @@ -712,44 +1070,37 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "_xePITEYf5po" + "id": "5PhtaixArclw" }, "outputs": [], "source": [ - "# Update dataset\n", - "categorical_column_names = ['Item_Number',\n", - " 'Category',\n", - " 'Vendor_Number',\n", - " 'Store_Number',\n", - " 'Zip_Code',\n", - " 'County_Number']\n", - "is_nullable = [False, \n", - " False,\n", - " False,\n", - " False,\n", - " True,\n", - " True]\n", + "type_code = 'CATEGORY' #@param {type:'string'}\n", + "\n", + "# Update dataset.\n", + "categorical_column_names = ['Item_Number', 'Category', 'Vendor_Number', \n", + " 'Store_Number', 'Zip_Code', 'County_Number']\n", + "\n", + "is_nullable = [False, False, False, False, True, True] \n", "\n", "for i in range(len(categorical_column_names)):\n", " column_name = categorical_column_names[i]\n", " nullable = is_nullable[i]\n", - " client.update_column_spec(\n", + " tables_client.update_column_spec(\n", " dataset=dataset,\n", " column_spec_display_name=column_name,\n", - " type_code='CATEGORY',\n", + " type_code=type_code,\n", " nullable=nullable,\n", - " )\n" + " )" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "nDMH_chybe4w" + "id": "ypBV6myxrjTw" }, "source": [ - "### Update dataset: assign a label\n", - "\n", + "#### **Update dataset: assign a label**\n", "Select the target column and update the dataset." ] }, @@ -759,14 +1110,14 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "hVIruWg0u33t" + "id": "x1X4jv3-rnO4" }, "outputs": [], "source": [ "#@title Update dataset { vertical-output: true }\n", "\n", "target_column_name = 'Stockout' #@param {type: 'string'}\n", - "update_dataset_response = client.set_target_column(\n", + "update_dataset_response = tables_client.set_target_column(\n", " dataset=dataset,\n", " column_spec_display_name=target_column_name,\n", ")\n", @@ -777,35 +1128,55 @@ "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "z23NITLrcxmi" + "id": "qlCneadcrvoi" }, "source": [ - "___" + "## **Creating a model**\n", + "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "FcKgvj1-Tbgj" + "id": "oCJkY5bX4clh" }, "source": [ - "## 4. Creating a model" + "#### **Train a model**\n", + "\n", + "Training the model may take one hour or more. To obtain the results with less training time or budget, you can set [`train_budget_milli_node_hours`](https://cloud.google.com/automl-tables/docs/reference/rest/v1beta1/projects.locations.models), which is the training budget for creating this model, expressed in milli node hours, i.e. 
a value of 1,000 in this field means 1 node hour.\n", + "\n", + "For demonstration purposes, the following command sets the budget to 1 node hour (`train_budget_milli_node_hours=1000`). You can increase that number up to a maximum of 72 node hours (`train_budget_milli_node_hours=72000`) for the best model performance.\n", + "\n", + "Even with a budget of 1 node hour (the minimum possible budget), training a model can take more than the specified node hours.\n", + "\n", + "You can also select the objective to optimize your model training by setting `optimization_objective`. This solution optimizes the model by maximizing the Area Under the Precision-Recall (PR) Curve." ] }, { - "cell_type": "markdown", + "cell_type": "code", + "execution_count": null, "metadata": { - "colab_type": "text", - "id": "Pnlk8vdQlO_k" + "colab": {}, + "colab_type": "code", + "id": "fq5Lvt66r0gK" }, + "outputs": [], "source": [ - "### Train a model\n", - "Training the model may take one hour or more. To obtain the results with less training time or budget, you can set [`train_budget_milli_node_hours`](https://cloud.google.com/automl-tables/docs/reference/rest/v1beta1/projects.locations.models), which is the train budget of creating this model, expressed in milli node hours i.e. 1,000 value in this field means 1 node hour. \n", + "# The number of hours to train the model.\n", + "model_train_hours = 1 #@param {type:'integer'}\n", + "# Set optimization objective to train a model.\n", + "model_optimization_objective = 'MAXIMIZE_AU_PRC' #@param {type:'string'}\n", "\n", - "For demonstration purpose, the following command sets the budget as 1 node hour. You can increate that number up to a maximum of 72 hours ('train_budget_milli_node_hours': 72000) for the best model performance. \n", + "create_model_response = tables_client.create_model(\n", + " MODEL_DISPLAY_NAME,\n", + " dataset=dataset,\n", + " train_budget_milli_node_hours=model_train_hours*1000,\n", + " optimization_objective=model_optimization_objective,\n", + ")\n", + "operation_id = create_model_response.operation.name\n", "\n", - "You can also select the objective to optimize your model training by setting `optimization_objective`. This solution optimizes the model by maximizing the Area Under the Precision-Recall (PR) Curve. \n" + "print('Create model operation: {}'.format(create_model_response.operation))" ] }, { @@ -814,23 +1185,13 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "11izNd6Fu37N" + "id": "7YJy1jh2VXRl" }, "outputs": [], "source": [ - "#@title Create model { vertical-output: true }\n", - "\n", - "model_display_name = 'stockout_model' #@param {type:'string'}\n", - "\n", - "create_model_response = client.create_model(\n", - " model_display_name,\n", - " dataset=dataset,\n", - " train_budget_milli_node_hours=1000,\n", - " optimization_objective='MAXIMIZE_AU_PRC',\n", - ")\n", - "print('Create model operation: {}'.format(create_model_response.operation))\n", "# Wait until model training is done.\n", "model = create_model_response.result()\n", + "model_name = model.name\n", "model" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "neYjToB36q9E" + "id": "Y0U3o4hmr3co" }, "source": [ - "If your Colab times out, use `client.list_models()` to check whether your model has been created. \n", + "If your Colab times out, use `tables_client.list_models()` to check whether your model has been created.\n", "\n", "Then uncomment the following cell and run the command to retrieve your model. Replace `YOUR_MODEL_NAME` with its actual value obtained in the preceding cell.\n",
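Alternatively, a hedged convenience sketch: recover the model by its display name via `tables_client.list_models()` (already used above), instead of pasting the resource name by hand.

```python
# A hedged sketch: recover a trained model by display name after a
# Colab timeout, instead of pasting the resource name by hand.
model = next(
    (m for m in tables_client.list_models()
     if m.display_name == MODEL_DISPLAY_NAME),
    None,
)
assert model is not None, 'No model named {}'.format(MODEL_DISPLAY_NAME)
model_name = model.name
```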
"\n", @@ -854,51 +1215,34 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "QptCwUIK7yhU" + "id": "2bVZsL6Er5XN" }, "outputs": [], "source": [ - "# model_name = '' #@param {type: 'string'}\n", - "# model = client.get_model(model_name=model_name)" + "# model_name = '' #@param {type: 'string'}\n", + "# model = tables_client.get_model(model_name=model_name)" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "1wS1is9IY5nK" - }, - "source": [ - "___" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "colab_type": "text", - "id": "TarOq84-GXch" - }, - "source": [ - "## 5. Batch prediction" + "id": "yCrBEllhr--f" }, "source": [ - "### Initialize prediction" + "## **Batch prediction**\n", + "\n" ] }, { "cell_type": "markdown", "metadata": { "colab_type": "text", - "id": "39bIGjIlau5a" + "id": "Z-RklZCA4j_3" }, "source": [ + "#### **Initialize prediction**\n", + "\n", "Your data source for batch prediction can be GCS or BigQuery. For this solution, you will use a BigQuery Table as the input source. The URI for your table is in the format of `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", "\n", "To write out the predictions, you need to specify a GCS bucket `gs://BUCKET_NAME`.\n", @@ -914,21 +1258,21 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "gkF3bH0qu4DU" + "id": "tgS55lD8sJUi" }, "outputs": [], "source": [ "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n", + "batch_predict_bq_input_uri = 'bq://product-stockout.product_stockout.batch_prediction_inputs' #@param {type:'string'}\n", + "batch_predict_gcs_output_uri_prefix = 'gs://{}'.format(BUCKET_NAME) #@param {type:'string'}\n", "\n", - "batch_predict_bq_input_uri = 'bq://product-stockout.product_stockout.batch_prediction_inputs'\n", - "batch_predict_gcs_output_uri_prefix = 'gs://' #@param {type:'string'}\n", - "\n", - "batch_predict_response = client.batch_predict(\n", - " model=model, \n", - " biqquery_input_uris=batch_predict_bq_input_uri,\n", + "batch_predict_response = tables_client.batch_predict(\n", + " model_name=model_name, \n", + " bigquery_input_uri=batch_predict_bq_input_uri,\n", " gcs_output_uri_prefix=batch_predict_gcs_output_uri_prefix,\n", ")\n", "print('Batch prediction operation: {}'.format(batch_predict_response.operation))\n", + "\n", "# Wait until batch prediction is done.\n", "batch_predict_result = batch_predict_response.result()\n", "batch_predict_response.metadata" ] }, @@ -940,32 +1284,51 @@ "metadata": { "colab": {}, "colab_type": "code", - "id": "kgwbJwS2iLpc" + "id": "JCa218LosND5" }, "outputs": [], "source": [ - "#@title Check prediction results { vertical-output: true }\n", - "\n", - "gcs_output_directory = batch_predict_response.metadata.batch_predict_details.output_info.gcs_output_directory\n", + "# Check prediction results.\n", + "gcs_output_directory = batch_predict_response.metadata.batch_predict_details\\\n", + " .output_info.gcs_output_directory\n", "result_file = gcs_output_directory + 'tables_1.csv'\n", "print('Batch prediction results are stored as: {}'.format(result_file))" ] },
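Before cleaning up, you may want to peek at the predictions themselves. A hedged sketch: it assumes `pandas` is installed together with `gcsfs`, so that `read_csv` can open `gs://` paths directly.

```python
# A hedged sketch: preview the batch prediction output.
# Assumes pandas + gcsfs are installed so read_csv can open gs:// paths.
import pandas as pd

predictions = pd.read_csv(result_file)  # result_file was printed above
print(predictions.head())
```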
"project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial." ] }, { - "cell_type": "markdown", - "metadata": {}, + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "sx_vKniMq9ZX" + }, + "outputs": [], "source": [ - "## 6. Clean up\n", + "# Delete model resource.\n", + "tables_client.delete_model(model_name=model_name)\n", + "\n", + "# Delete dataset resource.\n", + "tables_client.delete_dataset(dataset_name=dataset_name)\n", + "\n", + "# Delete Cloud Storage objects that were created.\n", + "! gsutil -m rm -r gs://$BUCKET_NAME\n", "\n", - "To clean up all GCP resources used in this notebook, you can [delete the GCP\n", - "project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects)." + "# If training model is still running, cancel it.\n", + "automl_client.transport._operations_client.cancel_operation(operation_id)" ] } ], @@ -973,8 +1336,7 @@ "colab": { "collapsed_sections": [], "name": "retail_product_stockout_prediction.ipynb", - "provenance": [], - "version": "0.3.2" + "provenance": [] }, "kernelspec": { "display_name": "Python 3", @@ -991,9 +1353,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.7" + "version": "3.5.3" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 }