6. 并行

2015 年 11 月发布的第一个 TensorFlow 软件包，已准备好在具有可用 GPU 的服务器上运行，并同时在其中执行训练操作。 2016 年 2 月，更新添加了分布式和并行化处理的功能。

在这个简短的章节中，我将介绍如何使用 GPU。对于那些想要了解这些设备如何工作的读者，有些参考文献将在上一节中给出。但是，鉴于本书的介绍性，我不会详细介绍分布式版本，但对于那些感兴趣的读者，一些参考将在上一节中给出。

带有 GPU 的执行环境

支持 GPU 的 TensorFlow 软件包需要 CudaToolkit 7.0 和 CUDNN 6.5 V2。对于安装环境，我们建议访问 cuda 安装 [44] 网站，为了不会深入细节，同时信息也是最新的。

在 TensorFlow 中引用这些设备的方法如下：

/cpu:0：引用服务器的 CPU。
/gpu:0：服务器的 GPU（如果只有一个可用）。
/gpu:1：服务器的第二个 GPU，依此类推。

要知道我们的操作和张量分配在哪些设备中，我们需要创建一个sesion，选项log_device_placement为True。我们在下面的例子中看到它：

import tensorflow as tf
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)

sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
printsess.run(c)

当读者在计算机中测试此代码时，应出现类似的输出：

. . .
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -&gt; device: 0, name: Tesla K40c, pci bus id: 0000:08:00.0
. . .
b: /job:localhost/replica:0/task:0/gpu:0
a: /job:localhost/replica:0/task:0/gpu:0
MatMul: /job:localhost/replica:0/task:0/gpu:0
…
[[ 22.28.]
[ 49.64.]]
…

此外，使用操作的结果，它通知我们每个部分的执行位置。

如果我们想要在特定设备中执行特定操作，而不是让系统自动选择设备，我们可以使用变量tf.device来创建设备上下文，因此所有操作都在上下文将分配相同的设备。

如果我们在系统中拥有更多 GPU，则默认情况下将选择具有较低标识符的 GPU。如果我们想要在不同的 GPU 中执行操作，我们必须明确指定它。例如，如果我们希望先前的代码在 GPU#2 中执行，我们可以使用tf.device('/gpu:2')，如下所示：

import tensorflow as tf

with tf.device('/gpu:2'):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
c = tf.matmul(a, b)
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
printsess.run(c)

多个 GPU 的并行

如果我们有更多的 GPU，通常我们希望一起使用它们来并行地解决同样的问题。为此，我们可以构建我们的模型，来在多个 GPU 之间分配工作。我们在下一个例子中看到它：

import tensorflow as tf

c = []
for d in ['/gpu:2', '/gpu:3']:
with tf.device(d):
a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3])
b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2])
c.append(tf.matmul(a, b))
with tf.device('/cpu:0'):
sum = tf.add_n(c)

# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
print sess.run(sum)

正如我们所看到的，代码与前一代码相同，但现在我们有 2 个 GPU，由tf.device指定，它们执行乘法（两个 GPU 在这里都做同样的操作，以便简化示例代码），稍后 CPU 执行加法。假设我们将log_device_placement设置为true，我们可以在输出中看到，操作如何分配给我们的设备 [45]。

. . .
Device mapping:
/job:localhost/replica:0/task:0/gpu:0 -&gt; device: 0, name: Tesla K40c
/job:localhost/replica:0/task:0/gpu:1 -&gt; device: 1, name: Tesla K40c
/job:localhost/replica:0/task:0/gpu:2 -&gt; device: 2, name: Tesla K40c
/job:localhost/replica:0/task:0/gpu:3 -&gt; device: 3, name: Tesla K40c
. . .


. . .

Const_3: /job:localhost/replica:0/task:0/gpu:3
I tensorflow/core/common_runtime/simple_placer.cc:289] Const_3: /job:localhost/replica:0/task:0/gpu:3
Const_2: /job:localhost/replica:0/task:0/gpu:3
I tensorflow/core/common_runtime/simple_placer.cc:289] Const_2: /job:localhost/replica:0/task:0/gpu:3
MatMul_1: /job:localhost/replica:0/task:0/gpu:3
I tensorflow/core/common_runtime/simple_placer.cc:289] MatMul_1: /job:localhost/replica:0/task:0/gpu:3
Const_1: /job:localhost/replica:0/task:0/gpu:2
I tensorflow/core/common_runtime/simple_placer.cc:289] Const_1: /job:localhost/replica:0/task:0/gpu:2
Const: /job:localhost/replica:0/task:0/gpu:2
I tensorflow/core/common_runtime/simple_placer.cc:289] Const: /job:localhost/replica:0/task:0/gpu:2
MatMul: /job:localhost/replica:0/task:0/gpu:2
I tensorflow/core/common_runtime/simple_placer.cc:289] MatMul: /job:localhost/replica:0/task:0/gpu:2
AddN: /job:localhost/replica:0/task:0/cpu:0
I tensorflow/core/common_runtime/simple_placer.cc:289] AddN: /job:localhost/replica:0/task:0/cpu:0
[[44.56.]
[98.128.]]
. . .

GPU 的代码示例

为了总结这一简短的章节，我们提供了一段代码，其灵感来自 DamienAymeric 在 Github [46] 中共享的代码，计算An + Bn，n=10，使用 Python datetime包，将 1 GPU 的执行时间与 2 个 GPU 进行比较。

首先，我们导入所需的库：

import numpy as np
import tensorflow as tf
import datetime

我们使用numpy包创建两个带随机值的矩阵：

A = np.random.rand(1e4, 1e4).astype('float32')
B = np.random.rand(1e4, 1e4).astype('float32')

n = 10

然后，我们创建两个结构来存储结果：

c1 = []
c2 = []

接下来，我们定义matpow()函数，如下所示：

defmatpow(M, n):
    if n &lt; 1: #Abstract cases where n &lt; 1
       return M
    else:
       return tf.matmul(M, matpow(M, n-1))

正如我们所见，要在单个 GPU 中执行代码，我们必须按如下方式指定：

with tf.device('/gpu:0'):
    a = tf.constant(A)
    b = tf.constant(B)
    c1.append(matpow(a, n))
    c1.append(matpow(b, n))

with tf.device('/cpu:0'):
sum = tf.add_n(c1)

t1_1 = datetime.datetime.now()

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
sess.run(sum)
t2_1 = datetime.datetime.now()

对于 2 个 GPU 的情况，代码如下：

with tf.device('/gpu:0'):
    #compute A^n and store result in c2
    a = tf.constant(A)
    c2.append(matpow(a, n))
 
with tf.device('/gpu:1'):
    #compute B^n and store result in c2
    b = tf.constant(B)
    c2.append(matpow(b, n))

with tf.device('/cpu:0'):
    sum = tf.add_n(c2) #Addition of all elements in c2, i.e. A^n + B^n
    t1_2 = datetime.datetime.now()

with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    # Runs the op.
    sess.run(sum)
t2_2 = datetime.datetime.now()

最后，我们打印计算时间的结果：

print "Single GPU computation time: " + str(t2_1-t1_1)
print "Multi GPU computation time: " + str(t2_2-t1_2)

TensorFlow 的分布式版本

正如我之前在本章开头所说，2016 年 2 月，Google 发布了 TensorFlow 的分布式版本，该版本由 gRPC 支持，这是一个用于进程间通信的高性能开源 RPC 框架（TensorFlow 服务使用的相同协议）。

对于它的用法，必须构建二进制文件，因为此时包只提供源代码。鉴于本书的介绍范围，我不会在分布式版本中解释它，但如果读者想要了解它，我建议从 TensorFlow 的分布式版本的官网开始 [47]。

与前面的章节一样，本书中使用的代码可以在本书的 Github [48] 中找到。我希望本章足以说明如何使用 GPU 加速代码。

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

6.md

6.md

6. 并行

带有 GPU 的执行环境

多个 GPU 的并行

GPU 的代码示例

TensorFlow 的分布式版本

Files

6.md

Latest commit

History

6.md

File metadata and controls

6. 并行

带有 GPU 的执行环境

多个 GPU 的并行

GPU 的代码示例

TensorFlow 的分布式版本