Commit cdbdfdd

Author: EC2 Default User
Commit message: fixed code bugs
1 parent a6dd418 commit cdbdfdd

5 files changed, +32 -30 lines changed

content/_index.md

Lines changed: 1 addition & 3 deletions
@@ -6,9 +6,7 @@ weight: 1
 
 # Distributed Training Workshop
 
-Welcome to the distributed training workshop with TensorFlow on Amazon SageMaker and Amazon Elastic Kubernetes Service (EKS)
-
-
+### Welcome to the distributed training workshop with TensorFlow on Amazon SageMaker and Amazon Elastic Kubernetes Service (EKS)
 ### At the end of this workshop, you'll be able to:
 
 #### Identify when to consider distributed training

notebooks/part-1-horovod/cifar10-distributed.ipynb

Lines changed: 3 additions & 3 deletions
@@ -186,9 +186,9 @@
 "if not os.path.exists(checkpoint_dir):\n",
 " os.makedirs(checkpoint_dir)\n",
 "\n",
-"train_dir = '../data/train'\n",
-"validation_dir = '../data/validation'\n",
-"eval_dir = '../data/eval'\n",
+"train_dir = '../dataset/train'\n",
+"validation_dir = '../dataset/validation'\n",
+"eval_dir = '../dataset/eval'\n",
 "\n",
 "train_dataset = make_batch(train_dir+'/train.tfrecords', batch_size)\n",
 "val_dataset = make_batch(validation_dir+'/validation.tfrecords', batch_size)\n",

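Review note: the hunk above only renames the base directory from `../data` to `../dataset`, so a stale path would surface later as a TFRecord read error. A minimal sketch of a pre-flight check follows, assuming the layout implied by the diff (`<base>/<split>/<split>.tfrecords`). The helper names `tfrecord_paths` and `missing_files` are hypothetical, and the `eval.tfrecords` filename is an assumption, since the hunk shows only the train and validation files.

```python
import os


def tfrecord_paths(base_dir="../dataset"):
    """Build the expected TFRecord path for each split under base_dir.

    The eval filename is assumed to follow the same pattern as the
    train and validation files shown in the diff.
    """
    splits = ("train", "validation", "eval")
    return {s: os.path.join(base_dir, s, f"{s}.tfrecords") for s in splits}


def missing_files(paths):
    """Return the paths that do not exist on disk."""
    return [p for p in paths.values() if not os.path.exists(p)]


paths = tfrecord_paths()
missing = missing_files(paths)
if missing:
    print("Missing TFRecord files:", missing)
```

Running this before calling `make_batch` reports every missing split at once rather than failing on the first one.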
notebooks/part-2-sagemaker/cifar10-sagemaker-distributed.ipynb

Lines changed: 26 additions & 22 deletions
@@ -19,7 +19,7 @@
 },
 {
 "cell_type": "code",
-"execution_count": 2,
+"execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
@@ -30,8 +30,8 @@
 "\n",
 "sagemaker_session = sagemaker.Session()\n",
 "role = sagemaker.get_execution_role()\n",
-"#bucket_name = 'tfworld2019-<your_bucket_name>'\n",
-"bucket_name = 'tfworld2019'"
+"\n",
+"bucket_name = '<your_bucket_name>'"
 ]
 },
 {
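Review note: the Step 2 cell of this notebook describes `hvd_processes_per_host` as the number of GPUs per instance and gives the example of 2 `ml.p3.8xlarge` instances with 4 processes each. The arithmetic can be sketched as below. The helper `horovod_layout` is hypothetical. The dictionary shape mirrors the MPI settings that the SageMaker Python SDK TensorFlow estimator has accepted, but the keyword it is passed under differs between SDK versions, so treat that part as an assumption and check the SDK version in use.

```python
def horovod_layout(instance_count, processes_per_host):
    """Return (total Horovod workers, MPI settings dict).

    Total workers = instances x processes per instance; with one
    process per GPU this equals the total number of GPUs in the job.
    """
    if instance_count < 1 or processes_per_host < 1:
        raise ValueError("both counts must be at least 1")
    total_workers = instance_count * processes_per_host
    # Assumed shape of the estimator's MPI configuration.
    mpi_settings = {
        "mpi": {"enabled": True, "processes_per_host": processes_per_host}
    }
    return total_workers, mpi_settings


# Values from the Step 2 example: ml.p3.8xlarge has 4 GPUs per instance.
hvd_instance_count = 2
hvd_processes_per_host = 4
total, mpi_settings = horovod_layout(hvd_instance_count, hvd_processes_per_host)
print(f"{total} Horovod workers in total")
```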
@@ -41,7 +41,7 @@
 "**Step 2:** Specify hyperparameters, instance type and number of instances to distribute training to. The `hvd_processes_per_host` corrosponds to number of GPUs per instances. \n",
 "For example, if you choose:\n",
 "```\n",
-"hvd_instance_type = 'ml.p3.8large'\n",
+"hvd_instance_type = 'ml.p3.8xlarge'\n",
 "hvd_instance_count = 2\n",
 "hvd_processes_per_host = 4\n",
 "```\n",
@@ -138,6 +138,15 @@
 " job_name=job_name, wait=False)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"**Note**: in the `estimator_hvd.fit()` function above, change`wait=True` if you want to see the training output in the Jupyter notebook.\n",
+"Advantage of setting `wait=False`, is that you can continue to run cells. \n",
+"Since we're unblocked due to `wait=False` we can now launch tensorboard in the notebook and monitor progress."
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -147,22 +156,9 @@
 },
 {
 "cell_type": "code",
-"execution_count": 3,
+"execution_count": null,
 "metadata": {},
-"outputs": [
-{
-"name": "stdout",
-"output_type": "stream",
-"text": [
-"TensorBoard 1.14.0 at http://ip-172-16-89-111:6006/ (Press CTRL+C to quit)\n",
-"W1028 20:55:37.536751 140564607526656 core_plugin.py:172] Unable to get first event timestamp for run sm-dist-1x1-gpu-instances2019-10-24-10-08-55-297: No event timestamp could be found\n",
-"W1028 20:55:37.777247 140564607526656 core_plugin.py:172] Unable to get first event timestamp for run sm-dist-1x8-gpu-instances2019-10-24-07-43-40-297: No event timestamp could be found\n",
-"W1028 20:55:37.984411 140564607526656 core_plugin.py:172] Unable to get first event timestamp for run sm-dist-2x1-gpu-instances2019-10-28-10-24-06-301: No event timestamp could be found\n",
-"W1028 20:55:38.320934 140564607526656 core_plugin.py:172] Unable to get first event timestamp for run sm-dist-2x1-workers2019-10-28-20-28-23-301: No event timestamp could be found\n",
-"^C\n"
-]
-}
-],
+"outputs": [],
 "source": [
 "!S3_REGION=us-west-2 tensorboard --logdir s3://{bucket_name}/tensorboard_logs/"
 ]
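Review note: with `wait=False` the `fit()` call returns immediately, so the notebook no longer shows when the job finishes. One way to check from a later cell is to poll the job status. The sketch below is an illustration rather than part of this commit. The helper `wait_for_training_job` is hypothetical, and it takes the describe call as an argument so it can be exercised without AWS credentials. In a notebook it would presumably be given `boto3.client("sagemaker").describe_training_job`, whose response carries a `TrainingJobStatus` field.

```python
import time

# Statuses after which a SageMaker training job will not change again.
TERMINAL_STATUSES = {"Completed", "Failed", "Stopped"}


def wait_for_training_job(describe_job, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll describe_job until the job reaches a terminal status.

    describe_job is any callable accepting TrainingJobName=... and
    returning a dict with a "TrainingJobStatus" key.
    """
    while True:
        status = describe_job(TrainingJobName=job_name)["TrainingJobStatus"]
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Assumed usage inside the notebook (requires AWS credentials):
#   import boto3
#   client = boto3.client("sagemaker")
#   print(wait_for_training_job(client.describe_training_job, job_name))
```

Passing the describe call in keeps the polling loop independent of the AWS client, which also makes it easy to test with a stub.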
@@ -171,11 +167,19 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"Open a new browser tan and navigate to the folloiwng link to access TensorBoard:\n",
-"<br> https://tfworld2019.notebook.us-west-2.sagemaker.aws/proxy/6006/\n",
-"<br> Make sure that the name of the notebook instance is correct in the link above.\n",
+"Open a new browser and navigate to the folloiwng link to access TensorBoard:\n",
+"<br> https://***your_notebook_name***.notebook.us-west-2.sagemaker.aws/proxy/6006/\n",
+"<br> <br> \n",
+"**Note:** Make sure to replace `your_notebook_name` with the name of the notebook instance. You can find the name of your notebook instance on the browser URL.\n",
 "<br> Don't forget the slash at the end of the URL 6006/"
 ]
+},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": []
 }
 ],
 "metadata": {

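Review note: the last hunk of this notebook asks readers to hand-edit the TensorBoard proxy URL and to keep the trailing slash. A small helper could build the link from the notebook instance name instead. This is a sketch only: the function name `tensorboard_proxy_url` is hypothetical, and the URL pattern is taken directly from the markdown cell in the diff.

```python
def tensorboard_proxy_url(notebook_name, region="us-west-2", port=6006):
    """Build the notebook-instance proxy URL for a local TensorBoard port.

    The trailing slash is included deliberately; the notebook text
    warns that the link does not work without it.
    """
    return f"https://{notebook_name}.notebook.{region}.sagemaker.aws/proxy/{port}/"


# "my-notebook" is a placeholder for the actual notebook instance name.
print(tensorboard_proxy_url("my-notebook"))
```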
notebooks/part-3-kubernetes/specs/eks_tf_training_job-cpu.yaml

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ spec:
 restartPolicy: Never
 containers:
 - name: eks-tf-dist-job
-image: 453691756499.dkr.ecr.us-west-2.amazonaws.com/tfworld2019:latest
+image: <YOUR_DOCKER_IMAGE>
 env:
 - name: HDF5_USE_FILE_LOCKING
 value: 'FALSE'

notebooks/part-3-kubernetes/specs/eks_tf_training_job-gpu.yaml

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ spec:
 restartPolicy: Never
 containers:
 - name: eks-tf-dist-job
-image: 453691756499.dkr.ecr.us-west-2.amazonaws.com/tfworld2019:latest
+image: <YOUR_DOCKER_IMAGE>
 env:
 - name: HDF5_USE_FILE_LOCKING
 value: 'FALSE'
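
Review note: both job specs now carry the `<YOUR_DOCKER_IMAGE>` placeholder, which must be replaced before the spec is applied. The sketch below shows one way to fill it in, assuming the image lives in Amazon ECR and follows the URI pattern visible in the removed line (`<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>`). The helper names, the account ID `123456789012`, and the repository name `my-training-image` are hypothetical.

```python
def ecr_image_uri(account_id, region, repository, tag="latest"):
    """Compose an ECR image URI from its parts."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def fill_image(spec_text, image_uri, placeholder="<YOUR_DOCKER_IMAGE>"):
    """Replace the image placeholder in a job spec, failing loudly if absent."""
    if placeholder not in spec_text:
        raise ValueError(f"placeholder {placeholder!r} not found in spec")
    return spec_text.replace(placeholder, image_uri)


# Fragment of the spec as it appears in the diff.
spec = "- name: eks-tf-dist-job\n  image: <YOUR_DOCKER_IMAGE>\n"
uri = ecr_image_uri("123456789012", "us-west-2", "my-training-image")
print(fill_image(spec, uri))
```

Raising when the placeholder is missing guards against applying a spec that was already edited by hand or that still points at another account's image.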
