[toc]
model finetune
- Model finetuning builds on the strong general semantic ability of a pretrained BERT model: finetuning on training data from the concrete business scenario adjusts the network parameters for that task. This is the classic two-stage (pretrain + finetune) approach. (From "Exploration and Practice of BERT in Meituan Search Core Ranking".)
- Since the pretrained BERT architecture is relatively fixed, the places where an algorithm engineer can make a difference are the model's input and output. First, understand what BERT's pretraining input and output look like: the input fuses token, segment, and position embeddings (added or concatenated), with a [CLS] token at the start and [SEP] tokens marking sentence boundaries; the output is a representation vector at every position. The main finetuning setups are sentence-pair classification, single-sentence classification, question answering (QA), and single-sentence tagging. They differ in whether the input is a single sentence or a pair, and in which outputs are supervised: either the [CLS] representation serves as the classification signal, or the portion of the output delimited by the separators is used for language prediction.
- Finetuning in search: model finetuning is applied to query-doc semantic matching, i.e. the search relevance problem and embedding services. After recall and coarse ranking, BERT is used for fine ranking to return a relevance score, which resembles a sentence-classification task. Finetuning for search has the following characteristics:
    - Mine broadly for finetuning material that yields gains: effective signals include publisher-account embeddings, article summaries, and author names; training techniques include feeding them in directly or after preprocessing. On top of the labeled data, model finetuning can exploit this additional mined data to improve the model.
    - Reshape the model input or output (a hedged sketch follows this list)
        - Model input
            - Simple concatenation of title + summary + username + query
            - Multi-field separation: "title and summary follow a similar relevance distribution with respect to the query, whereas the relevance between username and query is only latent, so user_name gets its own field, separated by [SEP]"
        - Model output
            - Gating mechanism: weight the [CLS] sentence-level output score by scores derived from certain other representation vectors
            - Introduce UE and concatenate it directly with the [CLS] output vector
    - Model input
        - Further processing of the material, bringing in unsupervised learning
            - Before the supervised finetuning, run the TextRank algorithm over the finetuning material; this effectively uses unsupervised learning to raise the quality of the mined data fed into BERT.
            - Truncating the summary, which proved effective in practice
    - How the BERT training task is designed has a large impact on model quality
        - Split finetuning further into two stages: put the lower-quality mined data in the first finetuning stage and the high-quality labeled data in the second, which improves the overall finetuning result.
        - This progressive training trick is fairly common with BERT; the original paper likewise trains on shorter sequences in the first stage.
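A minimal Keras sketch of the input/output modifications above (multi-field input separated by [SEP], a gate on the [CLS]-based score, and concatenating an extra embedding with the [CLS] vector). The encoder here is a stand-in embedding layer, and all names and sizes are illustrative assumptions rather than the original system:

```python
import tensorflow as tf
from tensorflow.keras import layers

hidden_size, ue_dim, max_len = 768, 64, 128    # assumed sizes

# Token ids for "[CLS] query [SEP] title summary [SEP] user_name [SEP]"
token_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
ue = layers.Input(shape=(ue_dim,))             # extra embedding fed in alongside the text

# Stand-in for a pretrained BERT encoder returning (batch, max_len, hidden_size);
# a real implementation would plug the pretrained model in here.
sequence_output = layers.Embedding(30000, hidden_size)(token_ids)

cls_vec = sequence_output[:, 0, :]             # [CLS] representation

# Gate: a score derived from the other position vectors re-weights the [CLS]-based score.
gate = layers.Dense(1, activation='sigmoid')(layers.GlobalAveragePooling1D()(sequence_output))

# Concatenate the extra embedding with the [CLS] vector before the relevance classifier.
features = layers.Concatenate()([cls_vec, ue])
score = layers.Dense(1, activation='sigmoid')(features)
score = layers.Multiply()([score, gate])       # gated relevance score

model = tf.keras.Model(inputs=[token_ids, ue], outputs=score)
```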
from tensorflow.keras.datasets import mnist
# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = mnist.load_data()
import matplotlib.pyplot as plt
image = x_train[0]
plt.imshow(image, cmap='gray')
# Flattening the Image Data
x_train = x_train.reshape(60000, 784)
x_test = x_test.reshape(10000, 784)
# Normalization
x_train = x_train / 255
x_test = x_test / 255
import tensorflow.keras as keras
num_categories = 10
y_train = keras.utils.to_categorical(y_train, num_categories)
y_test = keras.utils.to_categorical(y_test, num_categories)
# instantiating the model
from tensorflow.keras.models import Sequential
model = Sequential()
from tensorflow.keras.layers import Dense
model.add(Dense(units=512, activation='relu', input_shape=(784,)))
model.add(Dense(units = 512, activation='relu'))
model.add(Dense(units = 10, activation='softmax'))
model.summary()
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
history = model.fit(x_train, y_train,
                    epochs=5,
                    verbose=1,
                    validation_data=(x_test, y_test))
One-hot encoding
import numpy as np
import torch

def label2OH(y, D_out):
    # Convert integer labels y (shape [N]) into one-hot rows of width D_out.
    N = y.shape[0]
    OH = np.zeros((N, D_out))
    OH[np.arange(N), y] = 1
    return OH

def OH2label(OH):
    # Convert one-hot rows (numpy array or torch tensor) back to integer labels.
    if torch.is_tensor(OH):
        y = OH.argmax(dim=1)
    else:
        y = OH.argmax(axis=1)
    return y
Image Classification of an American Sign Language Dataset
import pandas as pd
train_df = pd.read_csv("asl_data/sign_mnist_train.csv")
test_df = pd.read_csv("asl_data/sign_mnist_test.csv")
train_df.head()
y_train = train_df['label']
y_test = test_df['label']
del train_df['label']
del test_df['label']
x_train = train_df.values
x_test = test_df.values
import matplotlib.pyplot as plt
plt.figure(figsize=(40,40))
num_images = 20
for i in range(num_images):
    row = x_train[i]
    label = y_train[i]
    image = row.reshape(28,28)
    plt.subplot(1, num_images, i+1)
    plt.title(label, fontdict={'fontsize': 30})
    plt.axis('off')
    plt.imshow(image, cmap='gray')
x_train = x_train / 255
x_test = x_test / 255
import tensorflow.keras as keras
num_classes = 25
y_train = keras.utils.to_categorical(y_train, num_classes)  # one-hot labels for the categorical_crossentropy loss below
y_test = keras.utils.to_categorical(y_test, num_classes)
CNN
x_train = x_train.reshape(-1,28,28,1)
x_test = x_test.reshape(-1,28,28,1)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Conv2D , MaxPool2D , Flatten , Dropout , BatchNormalization
num_classes = 25
model = Sequential()
model.add(Conv2D(75, (3,3), strides=1, padding='same', activation='relu', input_shape=(28,28,1)))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2), strides=2, padding='same'))
model.add(Conv2D(50, (3,3), strides=1, padding='same', activation='relu'))
model.add(Dropout(0.2))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2), strides=2, padding='same'))
model.add(Conv2D(25, (3,3), strides=1, padding='same', activation='relu'))
model.add(BatchNormalization())
model.add(MaxPool2D((2,2), strides=2, padding='same'))
model.add(Flatten())
model.add(Dense(units=512, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(units=num_classes, activation='softmax'))
data augmentation
from tensorflow.keras.preprocessing.image import ImageDataGenerator
datagen = ImageDataGenerator(
    rotation_range=10,       # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,          # randomly zoom image
    width_shift_range=0.1,   # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,    # randomly flip images horizontally
    vertical_flip=False)     # don't randomly flip images vertically
datagen.fit(x_train)
model.compile(loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(datagen.flow(x_train, y_train, batch_size=32),  # default batch_size is 32; set here for clarity
          epochs=20,
          steps_per_epoch=len(x_train)/32,  # run the same number of steps we would without a generator
          validation_data=(x_test, y_test))
model.save('asl_model')
model = keras.models.load_model('asl_model')
from tensorflow.keras.preprocessing import image as image_utils
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import numpy as np

def show_image(image_path):
    image = mpimg.imread(image_path)
    plt.imshow(image)

def predict_letter(file_path):
    show_image(file_path)
    # load_and_scale_image and dictionary (class index -> letter) are assumed
    # to be defined elsewhere in the notebook.
    image = load_and_scale_image(file_path)
    image = image_utils.img_to_array(image)
    image = image.reshape(1,28,28,1)
    image = image/255
    prediction = model.predict(image)
    # convert prediction to letter
    predicted_letter = dictionary[np.argmax(prediction)]
    return predicted_letter
from tensorflow.keras.applications.vgg16 import preprocess_input
image = preprocess_input(image)
from tensorflow.keras.applications.vgg16 import decode_predictions
# `predictions` here is the output of a full (include_top=True) VGG16 model's predict() call
print('Predicted:', decode_predictions(predictions, top=3))
Transfer Learning
from tensorflow import keras
base_model = keras.applications.VGG16(
    weights='imagenet',  # Load weights pre-trained on ImageNet.
    input_shape=(224, 224, 3),
    include_top=False)
base_model.trainable = False
inputs = keras.Input(shape=(224, 224, 3))
# Separately from setting trainable on the model, we set training to False
x = base_model(inputs, training=False)
x = keras.layers.GlobalAveragePooling2D()(x)
# A Dense classifier with a single unit (binary classification)
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)
# Important to use binary crossentropy and binary accuracy as we now have a binary classification problem
model.compile(loss=keras.losses.BinaryCrossentropy(from_logits=True), metrics=[keras.metrics.BinaryAccuracy()])
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# create a data generator
datagen = ImageDataGenerator(
    samplewise_center=True,  # set each sample mean to 0
    rotation_range=10,       # randomly rotate images in the range (degrees, 0 to 180)
    zoom_range=0.1,          # randomly zoom image
    width_shift_range=0.1,   # randomly shift images horizontally (fraction of total width)
    height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
    horizontal_flip=True,    # randomly flip images
    vertical_flip=False)     # we don't expect Bo to be upside-down so we will not flip vertically
# load and iterate training dataset
train_it = datagen.flow_from_directory('presidential_doggy_door/train/',
                                       target_size=(224, 224),
                                       color_mode='rgb',
                                       class_mode='binary',
                                       batch_size=8)
# load and iterate test dataset
test_it = datagen.flow_from_directory('presidential_doggy_door/test/',
                                      target_size=(224, 224),
                                      color_mode='rgb',
                                      class_mode='binary',
                                      batch_size=8)
model.fit(train_it, steps_per_epoch=12, validation_data=test_it, validation_steps=4, epochs=20)
finetune
# Unfreeze the base model
base_model.trainable = True
# It's important to recompile your model after you make any changes
# to the `trainable` attribute of any inner layer, so that your changes
# are taken into account
model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=.00001),  # Very low learning rate
              loss=keras.losses.BinaryCrossentropy(from_logits=True),
              metrics=[keras.metrics.BinaryAccuracy()])
model.fit(train_it, steps_per_epoch=12, validation_data=test_it, validation_steps=4, epochs=10)
headline generator
pretrained word embedding, GPT2, BERT
import os
import pandas as pd
nyt_dir = 'nyt_dataset/articles/'
all_headlines = []
for filename in os.listdir(nyt_dir):
    if 'Articles' in filename:
        # Read in all of the data from the CSV file
        headlines_df = pd.read_csv(nyt_dir + filename)
        # Add all of the headlines to our list
        all_headlines.extend(list(headlines_df.headline.values))
# Remove all headlines with the value of "Unknown"
all_headlines = [h for h in all_headlines if h != "Unknown"]
len(all_headlines)
from tensorflow.keras.preprocessing.text import Tokenizer
# Tokenize the words in our headlines
tokenizer = Tokenizer()
tokenizer.fit_on_texts(all_headlines)
total_words = len(tokenizer.word_index) + 1
print('Total words: ', total_words)
# Print a subset of the word_index dictionary created by Tokenizer
subset_dict = {key: value for key, value in tokenizer.word_index.items()
               if key in ['a','man','a','plan','a','canal','panama']}
print(subset_dict)
tokenizer.texts_to_sequences(['a','man','a','plan','a','canal','panama'])
# Convert data to sequence of tokens
input_sequences = []
for line in all_headlines:
    # Convert our headline into a sequence of tokens
    token_list = tokenizer.texts_to_sequences([line])[0]
    # Create a series of sequences for each headline
    for i in range(1, len(token_list)):
        partial_sequence = token_list[:i+1]
        input_sequences.append(partial_sequence)
print(tokenizer.sequences_to_texts(input_sequences[:5]))
input_sequences[:5]
# padding sequences
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
# Determine max sequence length
max_sequence_len = max([len(x) for x in input_sequences])
# Pad all sequences with zeros at the beginning to make them all max length
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
input_sequences[0]
from tensorflow.keras import utils
# Predictors are every word except the last
predictors = input_sequences[:,:-1]
# Labels are the last word
labels = input_sequences[:,-1]
labels = utils.to_categorical(labels, num_classes=total_words)
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.models import Sequential
# Input is max sequence length - 1, as we've removed the last word for the label
input_len = max_sequence_len - 1
model = Sequential()
# Add input embedding layer
model.add(Embedding(total_words, 10, input_length=input_len))
# Add LSTM layer with 100 units
model.add(LSTM(100))
model.add(Dropout(0.1))
# Add output layer
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')
tf.keras.preprocessing.text.Tokenizer(
num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
split=' ', char_level=False, oov_token=None, document_count=0, **kwargs
)
def predict_next_token(seed_text):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # Sequential.predict_classes was removed in recent Keras versions; take the argmax of predict() instead
    prediction = np.argmax(model.predict(token_list, verbose=0), axis=-1)
    return prediction
prediction = predict_next_token("today in new york")
prediction
tokenizer.sequences_to_texts([prediction])
def generate_headline(seed_text, next_words=1):
    for _ in range(next_words):
        # Predict next token
        prediction = predict_next_token(seed_text)
        # Convert token to word
        next_word = tokenizer.sequences_to_texts([prediction])[0]
        # Add next word to the headline. This headline will be used in the next pass of the loop.
        seed_text += " " + next_word
    # Return headline as title-case
    return seed_text.title()
seed_texts = [
    'washington dc is',
    'today in new york',
    'the school district has',
    'crime has become']
for seed in seed_texts:
    print(generate_headline(seed, next_words=5))
- Unlike (full-batch) gradient descent, stochastic gradient descent does not use the entire dataset; it computes the loss function on a smaller subset of the data, called a batch (whose size is the batch size). This has a profound effect on the behavior of the algorithm: because each batch is drawn randomly from the dataset, the batches differ from one another, and even for the same set of weights they yield different gradients, introducing a certain amount of noise.
- This noise is actually very beneficial, because the minima it leads to have mathematical properties quite different from those reached by plain gradient descent. It matters for multi-GPU training because adding more GPUs to the training process effectively increases the batch size, which reduces this beneficial noise; the sketch below illustrates the effect.
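A minimal numpy sketch of this effect (the data, the linear model, and the batch sizes here are illustrative assumptions, not part of the course material): for fixed weights, the gradient computed on random batches fluctuates more as the batch gets smaller.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.uniform(0, 10, n)
y = 3.0 * x + 2.0 + rng.normal(0, 1, n)    # ground truth: w=3, b=2, plus noise

w, b = 0.0, 0.0                            # fixed weights; we only look at gradient noise

def grad_w(xb, yb):
    # Gradient of the mean-squared-error loss with respect to w.
    return np.mean(2 * (w * xb + b - yb) * xb)

for batch_size in (16, 256, 4096):
    grads = []
    for _ in range(200):
        idx = rng.choice(n, batch_size, replace=False)
        grads.append(grad_w(x[idx], y[idx]))
    print(f"batch_size={batch_size:5d}  grad mean={np.mean(grads):7.2f}  std={np.std(grads):6.2f}")
```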
# This section generates the training dataset as defined by the variables in the section above.
x = np.random.uniform(0, 10, n_samples)
y = np.array([w_gen * (x + np.random.normal(loc=mean_gen, scale=std_gen, size=None)) + b_gen for x in x])
# Create the placeholders for the data to be used.
X = tf.placeholder(tf.float32, name="X")
Y = tf.placeholder(tf.float32, name="Y")
# Create our model variables w (weights; this is intended to map to the slope, w_gen) and b (bias; this maps to the intercept, b_gen).
# For simplicity, we initialize the data to zero.
w = tf.Variable(0.0, name="weights")
b = tf.Variable(0.0, name="bias")
# Define our model. We are implementing a simple linear neuron as per the diagram shown above.
Y_predicted = w * X + b
# Define the loss; mean squared error is assumed here, since the cell that defines it is not included in these notes.
loss = tf.reduce_mean(tf.square(Y - Y_predicted), name="loss")
# Define a gradient descent optimizer
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001).minimize(loss)
# Define the maximum number of times we want to process the entire dataset (the number of epochs).
# In practice we won't run this many because we'll implement an early stopping condition that
# detects when the training process has converged.
max_number_of_epochs = 1000
# We still store information about the optimization process here.
loss_array = []
b_array = []
w_array = []
with tf.Session() as sess:
    # Initialize the necessary variables
    sess.run(tf.global_variables_initializer())
    # Print out the parameters and loss before we do any training
    w_value, b_value, loss_value = sess.run([w, b, loss], feed_dict={X: x, Y: y})
    print("Before training: w = {:4.3f}, b = {:4.3f}, loss = {:7.3f}".format(w_value, b_value, loss_value))
    print("")
    print("Starting training")
    print("")
    # Start the training process
    for i in range(max_number_of_epochs):
        # Use the entire dataset to calculate the gradient and update the parameters
        sess.run(optimizer, feed_dict={X: x, Y: y})
        # Capture the data that we will use in our visualization
        w_value, b_value, loss_value = sess.run([w, b, loss], feed_dict={X: x, Y: y})
        w_array.append(w_value)
        b_array.append(b_value)
        loss_array.append(loss_value)
        # At the end of every few epochs print out the learned weights
        if (i + 1) % 5 == 0:
            print("Epoch = {:2d}: w = {:4.3f}, b = {:4.3f}, loss = {:7.3f}".format(i+1, w_value, b_value, loss_value))
        # Implement your convergence check here, and exit the training loop if
        # you detect that we are converged:
        if FIXME: # TODO
            break
    print("")
    print("Training finished after {} epochs".format(i+1))
    print("")
    print("After training: w = {:4.3f}, b = {:4.3f}, loss = {:7.3f}".format(w_value, b_value, loss_value))
# adjust batch size
batch_size = 32
num_batches_in_epoch = (n_samples + batch_size - 1) // batch_size
Studying the relationship between training speed and batch_size
- Very small or very large batches may not be the best choice for convergence: the noise from very small batches is often too strong for the model to converge fully to the minimum of the loss function, while very large batches tend to make training diverge in its early stages.
- With a large batch size, val_acc and acc stay close together, so the model is less prone to overfitting, but accuracy improves only slowly in the later stages of training; a sweep sketch follows the next item.
- Machine-Learning/GPU_training_batch_size.py
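A minimal sketch of such a sweep, assuming the MNIST data and the Sequential/Dense imports from the Keras example earlier in these notes; the batch sizes are illustrative:

```python
# Train the same architecture with several batch sizes and compare the final metrics.
for batch_size in (32, 256, 2048):
    model = Sequential([
        Dense(512, activation='relu', input_shape=(784,)),
        Dense(512, activation='relu'),
        Dense(10, activation='softmax'),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    history = model.fit(x_train, y_train,
                        batch_size=batch_size,
                        epochs=5,
                        verbose=0,
                        validation_data=(x_test, y_test))
    print(batch_size,
          'acc:', round(history.history['accuracy'][-1], 4),
          'val_acc:', round(history.history['val_accuracy'][-1], 4))
```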
Multi-GPU training
# CPU training
CUDA_VISIBLE_DEVICES= python fashion_mnist.py --epochs 3 --batch-size 512
# GPU training
horovodrun -np $num_gpus python fashion_mnist.py --epochs 3 --batch-size 512
- Horovod is an open-source tool originally developed by Uber to meet its engineering teams' need for faster deep learning model training. It is a framework-agnostic distributed deep learning library supporting multiple frameworks, high-performance algorithms, and high-performance networking (RDMA, GPUDirect), and it is part of a growing ecosystem of distributed training approaches (which also includes Distributed TensorFlow). Uber's solution uses MPI for distributed inter-process communication and the NVIDIA Collective Communications Library (NCCL) to compute averages across distributed processes and nodes in a highly optimized way. The resulting Horovod package delivers on its goal: scaling deep learning training to multiple GPUs and multiple nodes with only minor code changes and intuitive debugging.
Since its initial release in 2017, Horovod has matured significantly, extending its support from TensorFlow to Keras, PyTorch, and Apache MXNet. It has been tested extensively and used in some of the largest deep learning training runs to date, for example supporting exascale deep learning on the Summit system at a scale of more than 27,000 V100 GPUs.
- Supports multiple frameworks:
import horovod.tensorflow as hvd
import horovod.keras as hvd
import horovod.tensorflow.keras as hvd
import horovod.torch as hvd
import horovod.mxnet as hvd
Horovod's relationship with MPI
- Horovod has deep ties to MPI. Programmers familiar with MPI will find distributed model training with Horovod very familiar. For those who are not, it is worth briefly covering a few of the conventions and caveats that Horovod (and MPI) distributed processes require.
- Like MPI, Horovod strictly follows the Single Program, Multiple Data (SPMD) paradigm: the instruction streams of multiple processes are implemented in the same program. Because multiple processes execute the code in parallel, we have to watch out for race conditions and for synchronization between those processes.
- Horovod assigns each process executing the program a unique numeric ID, or rank (a concept borrowed from MPI). The rank can be obtained programmatically; by identifying a process's rank in code we can take steps such as the following (a minimal sketch follows this list):
    - Pin that process to its own dedicated GPU.
    - Use a single rank to broadcast values that all ranks must use consistently.
    - Use a single rank to collect values produced by all ranks and/or compute their mean.
    - Use a single rank to log or write to disk.
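A minimal sketch of the standard Horovod + Keras pattern implied by these steps (`build_model()` and `train_dataset` are placeholders, and the numbers are illustrative; this is an outline, not the course script):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                      # start Horovod; every process gets a rank

# Pin each process to its own GPU, selected by local rank.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = build_model()                           # placeholder: any Keras model
# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across all ranks.
opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

callbacks = [
    # Broadcast initial variables from rank 0 so all workers start identically.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Only rank 0 writes checkpoints to disk.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('checkpoint-{epoch}.h5'))

model.fit(train_dataset,                        # placeholder dataset
          steps_per_epoch=500 // hvd.size(),    # split the work across workers
          callbacks=callbacks,
          epochs=3,
          verbose=1 if hvd.rank() == 0 else 0)  # only rank 0 prints progress
```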
# Several ways to synchronize the initial state
# Method 1
callbacks.append(hvd.callbacks.BroadcastGlobalVariablesCallback(0))
model.fit_generator(train_iter,
                    steps_per_epoch=len(train_iter) // hvd.size(),
                    callbacks=callbacks, ...)
# Method 2
hooks = [hvd.BroadcastGlobalVariablesHook(0)]
with tf.train.MonitoredTrainingSession(hooks=hooks, …) as sess:
# Method 3
bcast_op = hvd.broadcast_global_variables(0)
sess.run(bcast_op)
# Have only one worker keep checkpoints
ckpt_dir = "/tmp/train_logs" if hvd.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=ckpt_dir, …) as sess:
- Ways to partition the data: shuffle first and then shard, with each worker reading its shard in order; or shuffle and have each worker read randomly from the whole dataset (see the sketch below).
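A minimal tf.data sketch of the two partitioning approaches, assuming a `dataset` of examples has already been constructed and Horovod has been initialized (names and numbers are illustrative):

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd   # assumes hvd.init() has already been called

# Option 1: shuffle, then shard; each worker reads only its own 1/size slice.
sharded = (dataset.shuffle(10_000, seed=42)   # same shuffle order on every worker
                  .shard(num_shards=hvd.size(), index=hvd.rank())
                  .batch(64))

# Option 2: no sharding; each worker shuffles the full dataset with its own seed
# and draws random batches from the whole dataset.
randomized = (dataset.shuffle(10_000, seed=hvd.rank())
                     .batch(64))
```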
Running on 4 nodes with 4 GPUs each:
$ mpirun -np 16 -H server1:4,server2:4,server3:4,server4:4 -bind-to none -map-by slot -mca pml ob1 -mca btl openib -mca btl_tcp_if_include eth0 \
-x NCCL_DEBUG=INFO -x NCCL_SOCKET_IFNAME=eth0 -x LD_LIBRARY_PATH -x ...\
python train.py
Algorithmic challenges of distributed SGD
- throughput ~ number of GPUs
    - Large-scale deep learning training is usually benchmarked against the ideal of linear scaling. Horovod and the NCCL library do a good job of sustaining high throughput, but their performance is tightly tied to the hardware in use. The demand for high bandwidth and low latency drove the development of the NVLink interconnect, which the servers used in this course rely on to connect the GPUs within a node. The NVIDIA DGX-2 takes this a step further with NVSwitch, an interconnect fabric that can link up to 16 GPUs at a peak bidirectional bandwidth of 300 GB/s.
- critical batch size ~ gradient noise scale (OpenAI)
- Impact on accuracy: naive approaches (e.g. without data augmentation) reduce accuracy
- Coping strategies
    - Increase the learning rate: One weird trick for parallelizing convolutional neural networks
    - Learning-rate warmup in the early phase: Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour (a warmup sketch follows this list)
    - Batch Normalization
        - BN improves the learning process by minimizing the drift in each layer's input distribution
        - Speeds up learning and reduces the need for Dropout
        - The idea is to normalize the inputs of every layer on each batch of data (which is more involved than simply normalizing the input dataset)
        - Ghost BN
            - Compute the statistics on smaller sub-batches ("ghost batches"), which introduces extra noise
            - Perform batch normalization separately on each GPU
    - Add noise to the gradients
        - Keep the covariance of the weight update constant as the batch size changes
        - Without changing the mean of the weight update
        - $$\hat{g}=\frac{1}{M}\sum^{N}_{n\in B}g_n z_n$$
    - Train at a high learning rate for longer
    - Increase the batch size instead of decaying the learning rate
    - LARS: layer-wise adaptive learning rates
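A minimal sketch of the early learning-rate warmup mentioned above, combined with the scale-the-learning-rate-by-the-number-of-workers rule; the base rate, warmup length, and worker count are illustrative assumptions:

```python
import tensorflow as tf

base_lr = 0.01          # learning rate tuned for a single worker
num_workers = 8         # e.g. hvd.size() in a Horovod job
warmup_epochs = 5
target_lr = base_lr * num_workers

def warmup_schedule(epoch, lr):
    # Ramp linearly from base_lr to target_lr over the first warmup_epochs,
    # then keep the scaled rate (a decay schedule could follow).
    if epoch < warmup_epochs:
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    return target_lr

warmup_cb = tf.keras.callbacks.LearningRateScheduler(warmup_schedule, verbose=1)
# model.fit(..., callbacks=[warmup_cb], epochs=20)
```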