八、作为优化目标的梯度一致性

在上一章中，我们了解了元 SGD 和 Reptile 算法。我们看到了如何使用元 SGD 查找最佳参数，最佳学习率和梯度更新方向。我们还看到了 Reptile 算法的工作原理以及比 MAML 更有效的方法。在本章中，我们将学习如何将梯度一致性用作元学习的优化目标。正如您在 MAML 中所看到的，我们基本上是对各个任务的梯度进行平均，并更新模型参数。在梯度一致性算法中，我们将对梯度进行加权平均以更新模型参数，并且我们将了解如何为梯度添加权重如何帮助我们找到更好的模型参数。在本章中，我们将确切探讨梯度一致性算法的工作原理。我们的梯度一致性算法可以同时插入 MAML 和 Reptile 算法。我们还将从头开始了解如何在 MAML 中实现梯度一致性。

在本章中，我们将学习以下内容：

梯度一致性
权重计算
梯度一致性算法
使用 MAML 构建梯度一致性算法

作为优化的梯度一致性

梯度一致性算法是一种有趣且最近引入的算法，可作为元学习算法的增强功能。在 MAML 和 Reptile 中，我们尝试找到一个更好的模型参数，该参数可在多个相关任务中推广，以便我们可以使用更少的数据点快速学习。如果我们回顾前面几章中学到的知识，就会发现我们随机初始化了模型参数，然后从任务分布p(T)中抽取了一批随机任务T[i]进行了采样。对于每个采样任务T[i]，我们通过计算梯度将损失降到最低，并获得更新的参数θ'[i]，这形成了我们的内部循环：

在为每个采样任务计算出最佳参数之后，我们执行元优化-也就是说，我们通过计算一组新任务中的损失来执行元优化，并通过针对最佳参数θ'[i]计算梯度来最大程度地减少损失，我们在内部循环中获得的，并更新了初始模型参数θ：

前面的方程式实际上是什么？如果仔细研究这个方程，您会注意到我们只是对各个任务的梯度求平均值，并更新我们的模型参数θ，这意味着所有任务在更新我们的模型参数方面均做出同等贡献。

但是，这怎么了？假设我们已经采样了四个任务，并且三个任务在一个方向上具有梯度更新，但是一个任务在一个方向上与其他任务完全不同的梯度更新。由于所有任务的坡度对更新模型参数的贡献均相等，因此这种分歧可能会对更新模型的初始参数产生严重影响。如下图所示，与其他任务相比，从T[1]到T[3]的所有任务在一个方向上具有梯度，但是任务T[4]在完全不同的方向上具有梯度：

那么，我们现在该怎么办？我们如何才能了解哪个任务具有很强的梯度一致性，哪些任务具有很强的分歧性？如果将权重与梯度相关联，是否可以理解其重要性？因此，我们通过将权重乘以每个梯度来重写外部梯度更新方程，如下所示：

好的，我们如何计算这些权重？这些权重与任务梯度的内积和采样批量任务中所有任务的梯度平均值的乘积成正比。但这意味着什么？

它暗示，如果任务的梯度与采样的一批任务中所有任务的平均梯度在同一方向上，则我们可以增加其权重，以便为更新模型参数做出更大的贡献。同样，如果任务的梯度方向与采样的任务批量中所有任务的平均梯度方向大不相同，则我们可以降低其权重，以便在更新模型参数时贡献较小。我们将在下一节中看到如何精确计算这些权重。

我们不仅可以将梯度一致性算法应用于 MAML，还可以应用于 Reptile 算法。因此，我们的 Reptile 更新方程如下：

权重计算

我们已经看到，通过将权重与梯度相关联，我们可以了解哪些任务具有强梯度一致性，哪些任务具有强梯度不一致。

我们知道，这些权重与任务梯度和采样任务批量中所有任务的梯度平均值的内积成正比。我们如何计算这些权重？

权重计算如下：

假设我们抽样了一批任务。然后，对于批量中的每个任务，我们对k个数据点进行采样，计算损失，更新梯度，并找到每个任务的最佳参数θ'[i]。与此同时，我们还将每个任务的梯度更新向量存储在g[i]中。可以计算为g[i] = θ - θ'[i]。

因此，第i个任务的权重是g[i]和g[j]的内积之和除以归一化因子。归一化因子与g[i]和g_avg的内积成正比。

通过查看以下代码，让我们更好地理解如何精确计算这些权重：

for i in range(num_tasks):
    g = theta - theta_[i]

#calculate normalization factor
normalization_factor = 0

for i in range(num_tasks):
     for j in range(num_tasks):
         normalization_factor += np.abs(np.dot(g[i].T, g[j]))

#calcualte weights 
w = np.zeros(num_tasks)

for i in range(num_tasks):
     for j in range(num_tasks):
         w[i] += np.dot(g[i].T, g[j])

     w[i] = w[i] / normalization_factor

算法

现在，让我们看一下梯度一致性的工作原理：

假设我们有一个由参数θ参数化的模型f和任务上的分布p(T)。首先，我们随机初始化模型参数θ。
我们从任务分布T ~ p(T)中采样了一些任务T[i]。假设我们采样了两个任务，然后是T。
内循环：对于任务（T）中的每个任务（T[i]），我们对k个数据点进行采样，并准备训练和测试数据集：

我们使用梯度下降来计算损失并使D_train[i]上的损失最小，并获得最佳参数θ'[i]：

。

与此同时，我们还将梯度更新向量存储为：g[i] = θ - θ'[i]。

因此，对于每个任务，我们对k个数据点进行采样，并最大程度地减少训练集D_train[i]上的损失，并获得最佳参数θ'[i]。当我们采样两个任务时，我们将有两个最佳参数θ'[i]，并且我们将为这两个任务中的每一个都有一个梯度更新向量g = {(θ - θ'[1]), (θ - θ'[2])}。

外循环：现在，在执行元优化之前，我们将按以下方式计算权重：

在计算权重之后，我们现在通过将权重与梯度相关联来执行元优化。通过计算相对于上一步中获得的参数的梯度，并将梯度与权重相乘，我们将D_test[i]中的损失最小化。

如果我们的元学习算法是 MAML，则更新公式如下：

如果我们的元学习算法是 Reptile，则更新方程如下：

对于n次迭代，我们重复步骤 2 至 5。

使用 MAML 构建梯度一致性算法

在上一节中，我们看到了梯度一致性算法的工作原理。我们看到了梯度一致性如何为梯度增加权重，从而说明其重要性。现在，我们将看到如何通过使用 NumPy 从头开始对它们进行编码，从而将梯度一致性算法与 MAML 结合使用。为了更好地理解，我们将考虑一个简单的二分类任务。我们将随机生成输入数据，使用简单的单层神经网络对其进行训练，然后尝试找到最佳参数θ。

现在，我们将逐步详细地了解如何执行此操作。

您也可以在此处以 Jupyter 笔记本的形式查看完整代码。

我们导入所有必要的库：

import numpy as np

生成数据点

现在，我们定义了一个名为sample_points的函数，用于生成输入(x, y)对。它以参数k作为输入，这意味着我们要采样的(x, y)对的数量：

def sample_points(k):
    x = np.random.rand(k,50)
    y = np.random.choice([0, 1], size=k, p=[.5, .5]).reshape([-1,1])
    return x,y

单层神经网络

为了简单起见和更好地理解，我们使用只有一层的神经网络来预测输出：

a = np.matmul(X, theta)
YHat = sigmoid(a)

因此，我们将梯度一致性与 MAML 结合使用，以找到可在各个任务之间通用的最佳参数值theta。这样一来，对于一项新任务，我们可以通过采取较少的梯度步骤，在较短的时间内从几个数据点中学习。

MAML 中的梯度一致性

现在，我们将定义一个名为GradientAgreement_MAML的类，在其中将实现梯度一致性 MAML 算法。在__init__方法中，我们将初始化所有必需的变量。然后，我们将定义 Sigmoid 激活函数。接下来，我们将定义train函数。

让我们一步一步看一下，然后看一下整体代码：

class GradientAgreement_MAML(object):

我们定义__init__方法并初始化所有变量：

    def __init__(self):

        #initialize number of tasks i.e number of tasks we need in each batch of tasks
        self.num_tasks = 2

        #number of samples i.e number of shots -number of data points (k) we need to have in each task
        self.num_samples = 10

        #number of epochs i.e training iterations
        self.epochs = 100

        #hyperparameter for the inner loop (inner gradient update)
        self.alpha = 0.0001

        #hyperparameter for the outer loop (outer gradient update) i.e meta optimization
        self.beta = 0.0001

        #randomly initialize our model parameter theta
        self.theta = np.random.normal(size=self.pol_ord).reshape(self.pol_ord, 1)

现在，我们定义一个名为sigmoid的函数，用于将x转换为多项式形式：

    def sigmoid(self,a):
        return 1.0 / (1 + np.exp(-a))

现在，让我们定义一个称为train的函数进行训练：

    def train(self):

对于周期数，我们执行以下操作：

        for e in range(self.epochs): 

            self.theta_ = []

            #for storing gradient updates
            self.g = []

对于一批任务中的任务i，我们执行以下操作：

            for i in range(self.num_tasks):

我们对k个数据点进行采样，并准备我们的训练集D_train[i]：

                XTrain, YTrain = sample_points(self.num_samples)

我们预测YHat的值：

                a = np.matmul(XTrain, self.theta)

                YHat = self.sigmoid(a)

我们使用梯度下降计算损失并使损失最小化：

                #since we're performing classification, we use cross entropy loss as our loss function
                loss = ((np.matmul(-YTrain.T, np.log(YHat)) - np.matmul((1 -YTrain.T), np.log(1 - YHat)))/self.num_samples)[0][0]

                #minimize the loss by calculating gradients
                gradient = np.matmul(XTrain.T, (YHat - YTrain)) / self.num_samples

                #update the gradients and find the optimal parameter theta' for each of tasks
                self.theta_.append(self.theta - self.alpha*gradient)

我们将梯度更新存储在g和g[i] = θ - θ'中：

                self.g.append(self.theta-self.theta_[i])

现在，我们计算权重：

            normalization_factor = 0

            for i in range(self.num_tasks):
                for j in range(self.num_tasks): 
                    normalization_factor += np.abs(np.dot(self.g[i].T, self.g[j]))

            w = np.zeros(self.num_tasks)

            for i in range(self.num_tasks):

                for j in range(self.num_tasks):
                    w[i] += np.dot(self.g[i].T, self.g[j])

                w[i] = w[i] / normalization_factor

我们初始化加权元梯度：

            weighted_gradient = np.zeros(self.theta.shape)

对于任务数量，我们对k个数据点进行采样，并准备测试集D_test[i]：

            for i in range(self.num_tasks):

               #sample k data points and prepare our test set for meta training
                XTest, YTest = sample_points(10)

我们预测y的值：

                a = np.matmul(XTest, self.theta_[i])

                YPred = self.sigmoid(a)

我们计算元梯度：

                meta_gradient = np.matmul(XTest.T, (YPred - YTest)) / self.num_samples

将权重乘以计算出的元梯度，并使用这个更新θ的值：

                weighted_gradient += np.sum(w[i]*meta_gradient)

            self.theta = self.theta-self.beta*weighted_gradient/self.num_tasks

我们每 10 个周期打印一次损失：

            if e%10==0:
                print "Epoch {}: Loss {}\n".format(e,loss) 
                print 'Updated Model Parameter Theta\n'
                print 'Sampling Next Batch of Tasks \n'
                print '---------------------------------\n'

以下是GradientAgreement_MAML的整个类：

class GradientAgreement_MAML(object):
    def __init__(self):

        #initialize number of tasks i.e number of tasks we need in each batch of tasks
        self.num_tasks = 2

        #number of samples i.e number of shots -number of data points (k) we need to have in each task
        self.num_samples = 10

        #number of epochs i.e training iterations
        self.epochs = 100

        #hyperparameter for the inner loop (inner gradient update)
        self.alpha = 0.0001

        #hyperparameter for the outer loop (outer gradient update) i.e meta optimization
        self.beta = 0.0001

        #randomly initialize our model parameter theta
        self.theta = np.random.normal(size=50).reshape(50, 1)

    #define our sigmoid activation function 
    def sigmoid(self,a):
        return 1.0 / (1 + np.exp(-a))

    #now Let's get to the interesting part i.e training :P
    def train(self):

        #for the number of epochs,
        for e in range(self.epochs): 

            self.theta_ = []

            #for storing gradient updates
            self.g = []

            #for task i in batch of tasks
            for i in range(self.num_tasks):

                #sample k data points and prepare our train set
                XTrain, YTrain = sample_points(self.num_samples)

                a = np.matmul(XTrain, self.theta)

                YHat = self.sigmoid(a)

                #since we're performing classification, we use cross entropy loss as our loss function
                loss = ((np.matmul(-YTrain.T, np.log(YHat)) - np.matmul((1 -YTrain.T), np.log(1 - YHat)))/self.num_samples)[0][0]

                #minimize the loss by calculating gradients
                gradient = np.matmul(XTrain.T, (YHat - YTrain)) / self.num_samples

                #update the gradients and find the optimal parameter theta' for each of tasks
                self.theta_.append(self.theta - self.alpha*gradient)

                #compute the gradient update
                self.g.append(self.theta-self.theta_[i])

           #now we calculate the weights
           #we know that weight is the sum of dot product of g_i and g_j divided by a normalization factor. 

            normalization_factor = 0

            for i in range(self.num_tasks):
                for j in range(self.num_tasks): 
                    normalization_factor += np.abs(np.dot(self.g[i].T, self.g[j]))

            w = np.zeros(self.num_tasks)

            for i in range(self.num_tasks):

                for j in range(self.num_tasks):
                    w[i] += np.dot(self.g[i].T, self.g[j])

                w[i] = w[i] / normalization_factor

            #initialize meta gradients
            weighted_gradient = np.zeros(self.theta.shape)

            for i in range(self.num_tasks):

                #sample k data points and prepare our test set for meta training
                XTest, YTest = sample_points(10)

                #predict the value of y
                a = np.matmul(XTest, self.theta_[i])

                YPred = self.sigmoid(a)

                #compute meta gradients
                meta_gradient = np.matmul(XTest.T, (YPred - YTest)) / self.num_samples

                weighted_gradient += np.sum(w[i]*meta_gradient)

            #update our randomly initialized model parameter theta with the meta gradients
            self.theta = self.theta-self.beta*weighted_gradient/self.num_tasks

            if e%10==0:
                print "Epoch {}: Loss {}\n".format(e,loss) 
                print 'Updated Model Parameter Theta\n'
                print 'Sampling Next Batch of Tasks \n'
                print '---------------------------------\n'

我们为GradientAgreement_MAML类创建一个实例：

model = GradientAgreement_MAML()

然后，我们训练模型：

model.train()

您会看到损失随着时间的推移而减少：

Epoch 0: Loss 5.9436043239

Updated Model Parameter Theta

Sampling Next Batch of Tasks 

---------------------------------

Epoch 10: Loss 3.905350606769

Updated Model Parameter Theta

Sampling Next Batch of Tasks 

---------------------------------

Epoch 20: Loss 2.0736155578

Updated Model Parameter Theta

Sampling Next Batch of Tasks 

---------------------------------

Epoch 30: Loss 1.48478751777

Updated Model Parameter Theta

Sampling Next Batch of Tasks 

---------------------------------

总结

在本章中，我们学习了梯度一致性算法。我们已经看到了梯度一致性算法如何使用加权梯度来找到更好的初始模型参数θ。我们还看到了这些权重如何与任务梯度的内积和采样批量任务中所有任务的梯度平均值的乘积成正比。我们还探讨了如何将梯度一致性算法与 MAML 和 Reptile 算法结合使用。之后，我们看到了如何使用梯度一致性算法在分类任务中找到最佳参数θ'[i]。

在下一章中，我们将了解元学习的最新进展，例如与任务无关的元学习，在概念空间中学习以及元模仿学习。

问题

什么是梯度一致性和分歧？
梯度一致性中 MAML 的更新方程是什么？
梯度一致性中的权重是多少？
权重如何计算？
什么是归一化因子？
我们什么时候增加和减少权重？

进一步阅读

梯度一致性算法论文

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!