Update sagemaker debugger TF actions notebook to use built-in actions (aws#1964)

ndodda-amazon · web-flow · commit e2885b090c74 · 2021-02-03T14:02:38.000-08:00
* Update TF actions notebook to use built-in actions
diff --git a/sagemaker-debugger/tensorflow_action_on_rule/detect_stalled_training_job_and_actions.ipynb b/sagemaker-debugger/tensorflow_action_on_rule/detect_stalled_training_job_and_actions.ipynb
@@ -4,9 +4,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "# Detect Stalled Training and Stop Training Job Using SageMaker Debugger Rule\n",
+    "# Detect Stalled Training and Invoke Actions Using SageMaker Debugger Rule\n",
     " \n",
-    "This notebook shows you how to use the `StalledTrainingRule` built-in rule. This rule can take an action to stop your training job, when the rule detects an inactivity in your training job for a certain time period. This functionality helps you monitor the training job status and reduces redundant resource usage.\n",
+    "This notebook shows you how to use the `StalledTrainingRule` built-in rule. When the rule detects no activity in your training job for a specified time period, it can stop the training job or notify you by email or SMS. This functionality helps you monitor the training job status and reduces wasted resource usage.\n",
     "\n",
     "## How the StalledTrainingRule Built-in Rule Works\n",
     "\n",
@@ -17,6 +17,23 @@
     "The Debugger `StalledTrainingRule` watches tensor updates from your training job. If the rule doesn't find new tensors updated to the default S3 URI for a threshold period of time, it takes an action to trigger the `StopTrainingJob` API operation. The following code cells set up a SageMaker TensorFlow estimator with the Debugger `StalledTrainingRule` to watch the `losses` pre-built tensor collection."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Install custom packages\n",
+    "These packages were built manually with the changes needed to run rules with actions, since the changes have not been released yet. Remember to restart the kernel after installing these packages."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "! pip install -q -U sagemaker"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -55,15 +72,15 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Create a unique training job prefix\n",
-    "A unique prefix must be specified for `StalledTrainingRule` to identify the exact training job name that you want to monitor and stop when the rule triggers the stalled training job issue.\n",
-    "If there are multiple training jobs sharing the same prefix, this rule may react to other training jobs. If the rule cannot find the exact training job name with a provided prefix, it falls back to safe mode and does not stop the training job. The rule evaluation process goes on in parallel while the training jobs are running. If you want to access the rule job logs, you will later find how to get the information at [Get a direct Amazon CloudWatch URL to find the current rule processing job log](#cw-url).\n",
+    "### Create the actions to be used in the rules\n",
     "\n",
-    "The following code cell includes:\n",
-    "* a code line to create a unique `base_job_name_prefix`\n",
-    "* a stalled training job rule configuration object\n",
+    "The following code cells include:\n",
+    "* a code line to create the action objects\n",
+    "* a stalled training job rule configuration object that uses these actions\n",
     "* a SageMaker TensorFlow estimator configuration with the Debugger `rules` parameter to run the built-in rule\n",
     "\n",
+    "Valid action objects are individual actions (`StopTraining`, `Email`, `SMS`) or an `ActionList` with a combination of these.\n",
+    "\n",
     "**Note**: Debugger collects `loss` tensors by default every 500 steps."
    ]
   },
@@ -73,32 +90,46 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# Append current time to your training job name to generate a unique base_job_name_prefix\n",
-    "import time\n",
-    "base_job_name_prefix= 'smdebug-stalled-demo-' + str(int(time.time()))\n",
-    "\n",
+    "training_job_prefix = None # Feel free to customize this if desired."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "stop_training_action = rule_configs.StopTraining(training_job_prefix)  # uses the prefix defined above; None keeps the default\n",
+    "actions = stop_training_action"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
     "# Configure a StalledTrainingRule rule parameter object\n",
     "stalled_training_job_rule = [\n",
     "    Rule.sagemaker(\n",
     "        base_config=rule_configs.stalled_training_rule(),\n",
     "        rule_parameters={\n",
-    "                \"threshold\": \"120\", \n",
-    "                \"stop_training_on_fire\": \"True\",\n",
-    "                \"training_job_name_prefix\": base_job_name_prefix\n",
-    "        }\n",
+    "                \"threshold\": \"60\", \n",
+    "        },\n",
+    "        actions=actions\n",
     "    )\n",
     "]\n",
     "\n",
     "# Configure a SageMaker TensorFlow estimator\n",
     "estimator = TensorFlow(\n",
     "    role=sagemaker.get_execution_role(),\n",
-    "    base_job_name=base_job_name_prefix,\n",
-    "    train_instance_count=1,\n",
-    "    train_instance_type='ml.m5.4xlarge',\n",
+    "    base_job_name=\"stalled-training-test\",\n",
+    "    instance_count=1,\n",
+    "    instance_type='ml.m5.4xlarge',\n",
     "    entry_point='src/simple_stalled_training.py', # This sample script forces the training job to sleep for 10 minutes\n",
     "    framework_version='1.15.0',\n",
     "    py_version='py3',\n",
-    "    train_max_run=3600,\n",
+    "    max_run=3600,\n",
     "    ## Debugger-specific parameter\n",
     "    rules = stalled_training_job_rule\n",
     ")"
@@ -177,6 +208,16 @@
     "        time.sleep(15)"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "description = client.describe_training_job(TrainingJobName=job_name)\n",
+    "print(description)"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},