
Working with TensorFlow, GPUs, and containers

In this tutorial, we explore GPUs and containers on OSG, using the popular TensorFlow software package. TensorFlow is a good example here because the software is too complex to bundle up and ship with your job. Containers solve this problem by defining a full OS image that contains not only the complex software package, but its dependencies and environment configuration as well.

https://www.tensorflow.org/ describes TensorFlow as:

TensorFlow is an open source software library for numerical computation using data flow graphs. Nodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. The flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. TensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well.

Defining container images

Defining containers is fully described in the Docker and Singularity Containers section. Here we will just provide an overview of how you could take something like an existing TensorFlow image provided by OSG staff and extend it by adding your own modules. Let's assume you want to use TensorFlow version 2.3. The definition of this image can be found on GitHub: Dockerfile. You don't really need to understand how an image was built in order to use it. As described in the containers documentation, make sure the HTCondor submit file has:

Requirements = HAS_SINGULARITY == TRUE
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow:2.3"

If you want to extend an existing image, you can just inherit from the parent image available on DockerHub here. For example, if you just need some additional Python packages, your new Dockerfile could look like:

FROM opensciencegrid/tensorflow:2.3

RUN python3 -m pip install some_package_name

You can then docker build and docker push it so that your new image is available on DockerHub. Note that OSG does not provide any infrastructure for these steps. You will have to complete them on your own computer or using the DockerHub build infrastructure.
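As a rough sketch of those steps (run on your own computer, not on the OSG login node; the myusername/my-tensorflow name below is a placeholder you would replace with your own DockerHub account and image name):

# Build the image from the directory containing your Dockerfile
docker build -t myusername/my-tensorflow:2.3 .

# Push it to DockerHub so it can later be picked up by the OSG CVMFS sync
docker push myusername/my-tensorflow:2.3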

Adding a container to the OSG CVMFS distribution mechanism

How to add a container image to the OSG CVMFS distribution mechanism is also described in Docker and Singularity Containers, but a quick scan of the cvmfs-singularity-sync repository, and specifically its docker_images.txt file, shows us that the TensorFlow images are listed as:

opensciencegrid/tensorflow:*
opensciencegrid/tensorflow-gpu:*

Those two lines mean that all tags from those two DockerHub repositories are mapped to /cvmfs/singularity.opensciencegrid.org/. On the login node, try running:

ls /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow:2.3/

This is the image in its expanded form - something we can execute with Singularity!

Testing the container on the submit host

First, download the files contained in this tutorial to the login node using the git clone command and cd into the tutorial directory that is created:

git clone https://github.com/OSGConnect/tutorial-tensorflow-containers
cd tutorial-tensorflow-containers

Before submitting jobs to the OSG, it is always a good idea to test your code so that you understand its runtime requirements. The containers can be tested on the OSGConnect submit hosts with singularity shell, which will drop you into a container and let you explore it interactively. To explore the TensorFlow 2.3 image, run:

singularity shell /cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow:2.3/

Note how the command line prompt changes, giving you an indicator that you are inside the image. You can exit at any time by running exit. Another important thing to note is that your $HOME directory is automatically mounted inside the interactive container, allowing you to access your code and test it out. First, start with a simple python3 import test to make sure TensorFlow is available:

$ python3
Python 3.6.9 (default, Jul 17 2020, 12:50:27) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2021-01-15 17:32:33.901607: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-01-15 17:32:33.901735: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
>>>

TensorFlow will warn you that no GPUs were found. This is expected, as there are no GPUs attached to the login nodes, and it is not a problem: TensorFlow also works on regular CPUs (just more slowly).

Exit Python with CTRL+D, and then we can run the TensorFlow test code included in this tutorial:

$ python3 test.py 
2021-01-15 17:37:43.152892: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.1'; dlerror: libcudart.so.10.1: cannot open shared object file: No such file or directory
2021-01-15 17:37:43.153021: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2021-01-15 17:37:44.899967: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2021-01-15 17:37:44.900063: W tensorflow/stream_executor/cuda/cuda_driver.cc:312] failed call to cuInit: UNKNOWN ERROR (303)
2021-01-15 17:37:44.900130: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (login05.osgconnect.net): /proc/driver/nvidia/version does not exist
2021-01-15 17:37:44.900821: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-01-15 17:37:44.912483: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2700000000 Hz
2021-01-15 17:37:44.915548: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4fa0bf0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-01-15 17:37:44.915645: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-01-15 17:37:44.921895: I tensorflow/core/common_runtime/eager/execute.cc:611] Executing op MatMul in device /job:localhost/replica:0/task:0/device:CPU:0
tf.Tensor(
[[22. 28.]
 [49. 64.]], shape=(2, 2), dtype=float32)

We will again see a bunch of warnings about GPUs not being available, but as the /job:localhost/replica:0/task:0/device:CPU:0 line shows, the code ran on one of the CPUs. When testing your own code like this, take note of how much memory, disk, and runtime it requires - this information is needed in the next step.
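For reference, the kind of computation test.py performs is a small matrix multiplication. The following is a minimal sketch of such a test (not necessarily the exact contents of the test.py shipped with this tutorial):

import tensorflow as tf

# Multiply a 2x3 matrix by a 3x2 matrix; the result is the 2x2 tensor shown above
a = tf.constant([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
b = tf.constant([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])

# TensorFlow places the MatMul op on a GPU if one is available, otherwise on a CPU
c = tf.matmul(a, b)
print(c)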

Once you are done with testing, use CTRL+D or run exit to leave the container. Note that you cannot submit jobs from within the container.

Running a CPU job

If TensorFlow can run on GPUs, you might be wondering why we would want to run it on slower CPUs. One reason is that CPUs are plentiful while GPUs are still somewhat scarce. If you have a lot of shorter TensorFlow jobs, they might complete faster on the readily available CPUs than they would waiting in the queue for the faster, less available GPUs. The good news is that TensorFlow code should work in both environments automatically, so if your code runs too slowly on CPUs, moving to GPUs should be easy.
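If you want your script to report which devices it can see, a small check like the following works in TensorFlow 2.x (a sketch you could add to your own code):

import tensorflow as tf

# List the GPUs visible to TensorFlow; an empty list means the code will run on CPUs
gpus = tf.config.list_physical_devices("GPU")
print("GPUs visible to TensorFlow:", gpus if gpus else "none - running on CPU")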

To submit our job, we need a submit file and a job wrapper script. The submit file is a basic OSGConnect flavored HTCondor file, specifying that we want the job to run in a container. cpu-job.submit contains:

universe = vanilla

# Job requirements - ensure we are running on a Singularity enabled
# node and have enough resources to execute our code
# Tensorflow also requires AVX instruction set and a newer host kernel
Requirements = HAS_SINGULARITY == True && HAS_AVX2 == True && OSG_HOST_KERNEL_VERSION >= 31000
request_cpus = 1
request_gpus = 0
request_memory = 1 GB
request_disk = 1 GB

# Container image to run the job in
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow:2.3"

# Executable is the program your job will run. It's often useful
# to create a shell script to "wrap" your actual work.
Executable = job-wrapper.sh
Arguments = 

# Inputs/outputs - in this case we just need our python code.
# If you leave out transfer_output_files, all generated files come back
transfer_input_files = test.py
#transfer_output_files = 

# Error and output are the error and output channels from your job
# that HTCondor returns from the remote host.
Error = $(Cluster).$(Process).error
Output = $(Cluster).$(Process).output

# The LOG file is where HTCondor places information about your
# job's status, success, and resource consumption.
Log = $(Cluster).log

# Send the job to Held state on failure. 
#on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Periodically retry the jobs every 1 hour, up to a maximum of 5 retries.
#periodic_release =  (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 60*60)

# queue is the "start button" - it launches any jobs that have been
# specified thus far.
queue 1

And job-wrapper.sh:

#!/bin/bash

set -e

# set TMPDIR variable
export TMPDIR=$_CONDOR_SCRATCH_DIR

echo
echo "I'm running on" $(hostname -f)
echo "OSG site: $OSG_SITE_NAME"
echo

python3 test.py 2>&1

The job can now be submitted with condor_submit cpu-job.submit. Once the job is done, check the files named after the job id for the outputs.
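A typical submit-and-check workflow might look like this (the cluster ID below is just an example; yours will differ):

$ condor_submit cpu-job.submit

# Monitor the job while it is idle or running
$ condor_q

# Once it completes, inspect the output file named after the job ID
$ cat 1234567.0.output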

Running a GPU job

When moving the job to a GPU, all we have to do is update two lines in the submit file: set request_gpus to 1 and specify a GPU-enabled container image for +SingularityImage. The updated submit file, gpu-job.submit, contains:

universe = vanilla

# Job requirements - ensure we are running on a Singularity enabled
# node and have enough resources to execute our code
# Tensorflow also requires AVX instruction set and a newer host kernel
Requirements = HAS_SINGULARITY == True && HAS_AVX2 == True && OSG_HOST_KERNEL_VERSION >= 31000
request_cpus = 1
request_gpus = 1
request_memory = 1 GB
request_disk = 1 GB

# Container image to run the job in
+SingularityImage = "/cvmfs/singularity.opensciencegrid.org/opensciencegrid/tensorflow-gpu:2.3"

# Executable is the program your job will run. It's often useful
# to create a shell script to "wrap" your actual work.
Executable = job-wrapper.sh
Arguments = 

# Inputs/outputs - in this case we just need our python code.
# If you leave out transfer_output_files, all generated files come back
transfer_input_files = test.py
#transfer_output_files = 

# Error and output are the error and output channels from your job
# that HTCondor returns from the remote host.
Error = $(Cluster).$(Process).error
Output = $(Cluster).$(Process).output

# The LOG file is where HTCondor places information about your
# job's status, success, and resource consumption.
Log = $(Cluster).log

# Send the job to Held state on failure. 
#on_exit_hold = (ExitBySignal == True) || (ExitCode != 0)

# Periodically retry the jobs every 1 hour, up to a maximum of 5 retries.
#periodic_release =  (NumJobStarts < 5) && ((CurrentTime - EnteredCurrentStatus) > 60*60)

# queue is the "start button" - it launches any jobs that have been
# specified thus far.
queue 1

Submit the job with condor_submit gpu-job.submit. Once the job is complete, check the .output file for a line stating that the code ran on a GPU, similar to:

2021-02-02 23:25:19.022467: I tensorflow/core/common_runtime/eager/execute.cc:611] Executing op MatMul in device /job:localhost/replica:0/task:0/device:GPU:0    

The GPU:0 part shows that a GPU was found and used for the computation.
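A quick way to verify this across all of your returned output files is a simple grep (adjust the filename pattern if you changed the Output naming in the submit file):

grep "device:GPU" *.output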
