17 - 4 - Stochastic Gradient Descent Convergence (12 min).srt
1
00:00:00,493 --> 00:00:03,492
You now know about the stochastic gradient descent algorithm.
现在你已经知道了随机梯度下降算法
(字幕整理:中国海洋大学 黄海广,haiguang2000@qq.com )
2
00:00:03,492 --> 00:00:09,907
But when you're running the algorithm, how do you make sure that it's completely debugged and is converging okay?
但是当你在运行这个算法时 你如何确保调试过程已经完成 并且收敛到合适的位置呢?
3
00:00:09,907 --> 00:00:15,813
Equally important, how do you tune the learning rate alpha with stochastic gradient descent?
还有 同样重要的是 你怎样调整随机梯度下降中学习速率α的值
4
00:00:15,813 --> 00:00:25,950
In this video we'll talk about some techniques for doing these things, for making sure it's converging and for picking the learning rate alpha.
在这段视频中 我们会谈到一些方法来处理这些问题 确保它能收敛 以及选择合适的学习速率α
5
00:00:25,950 --> 00:00:30,600
Back when we were using batch gradient descent, our standard way for making sure that
回到我们之前批量梯度下降的算法 我们确定梯度下降已经收敛的一个标准方法
6
00:00:30,600 --> 00:00:36,493
gradient descent was converging was that we would plot the optimization cost function as a function of the number of iterations.
是画出最优化的代价函数 关于迭代次数的变化
7
00:00:36,493 --> 00:00:44,366
So that was the cost function and we would make sure that this cost function is decreasing on every iteration.
这就是代价函数 我们要保证这个代价函数在每一次迭代中 都是下降的
8
00:00:44,366 --> 00:00:50,438
When the training set sizes were small, we could do that because we could compute the sum pretty efficiently.
当训练集比较小的时候 我们不难完成 因为要计算这个求和是比较方便的
9
00:00:50,438 --> 00:00:57,950
But when you have a massive training set size then you don't want to have to pause your algorithm periodically.
但当你的训练集非常大的时候 你不希望老是定时地暂停算法
10
00:00:57,950 --> 00:01:04,045
You don't want to have to pause stochastic gradient descent periodically in order to compute this cost function
来计算一遍这个求和
11
00:01:04,045 --> 00:01:07,442
since it requires a sum of your entire training set size.
因为这个求和计算需要考虑整个的训练集
12
00:01:07,442 --> 00:01:12,466
And the whole point of stochastic gradient was that you wanted to start to make progress
而随机梯度下降的算法是 你每次只考虑一个样本
13
00:01:12,466 --> 00:01:19,130
after looking at just a single example without needing to occasionally scan through your entire training set
然后就立刻进步一点点 不需要在算法当中 时不时地扫描一遍全部的训练集
14
00:01:19,130 --> 00:01:25,583
right in the middle of the algorithm, just to compute things like the cost function of the entire training set.
来计算整个训练集的代价函数
15
00:01:25,583 --> 00:01:32,472
So for stochastic gradient descent, in order to check the algorithm is converging, here's what we can do instead.
因此 对于随机梯度下降算法 为了检查算法是否收敛 我们可以进行下面的工作
16
00:01:32,472 --> 00:01:36,367
Let's take the definition of the cost that we had previously.
让我们沿用之前定义的cost函数
17
00:01:36,367 --> 00:01:42,647
So the cost of the parameters theta with respect to a single training example is just one half of the square error on that training example.
关于θ的cost函数 等于二分之一倍的训练误差的平方和
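For reference, the single-example cost just described can be written out as follows (a LaTeX rendering added for clarity; h_theta denotes the hypothesis, following the course's usual notation):

```latex
\operatorname{cost}\left(\theta, (x^{(i)}, y^{(i)})\right) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
```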
18
00:01:42,647 --> 00:01:49,754
Then, while stochastic gradient descent is learning, right before we train on a specific example.
然后 在随机梯度下降法学习的时 在我们对某一个样本进行训练前
19
00:01:49,754 --> 00:01:54,601
So, in stochastic gradient descent we're going to look at the examples xi, yi, in order, and
在随机梯度下降中 我们要关注样本(x(i),y(i))
20
00:01:54,601 --> 00:01:57,329
then sort of take a little update with respect to this example.
然后关于这个样本更新一小步 进步一点点
21
00:01:57,329 --> 00:02:04,095
And we go on to the next example, xi plus 1, yi plus 1, and so on, right?
然后再转向下一个样本 (x(i+1,) y(i+1))
22
00:02:04,095 --> 00:02:05,880
That's what stochastic gradient descent does.
随机梯度下降就是这样进行的
23
00:02:05,880 --> 00:02:15,024
So, while the algorithm is looking at the example xi, yi, but before it has updated the parameters theta
在算法扫描到样本(x(i),y(i)) 但在更新参数θ之前
24
00:02:15,024 --> 00:02:20,255
using that example, let's compute the cost of that example.
使用这个样本 我们可以算出这个样本对应的cost函数
25
00:02:20,255 --> 00:02:23,577
Just to say the same thing again, but using slightly different words.
我再换一种方式表达一遍
26
00:02:23,577 --> 00:02:33,294
As stochastic gradient descent is scanning through our training set, right before we have updated theta using a specific training example x(i) comma y(i),
当随机梯度下降法对训练集进行扫描时 在我们使用某个样本(x(i),y(i))来更新θ前
27
00:02:33,294 --> 00:02:38,198
let's compute how well our hypothesis is doing on that training example.
让我们来计算出 这个假设对这个训练样本的表现
28
00:02:38,198 --> 00:02:43,852
And we want to do this before updating theta because if we've just updated theta using example,
我要在更新θ前来完成这一步 原因是如果我们用这个样本更新θ以后
29
00:02:43,852 --> 00:02:49,061
you know, that it might be doing better on that example than what would be representative.
再让它在这个训练样本上预测 其表现就比实际上要更好了
30
00:02:49,061 --> 00:02:57,438
Finally, in order to check for the convergence of stochastic gradient descent, what we can do is every, say, every thousand iterations,
最后 为了检查随机梯度下降的收敛性 我们要做的是 每1000次迭代
31
00:02:57,438 --> 00:03:01,511
we can plot these costs that we've been computing in the previous step.
我们可以画出前一步中计算出的cost函数
32
00:03:01,511 --> 00:03:07,450
We can plot those costs average over, say, the last thousand examples processed by the algorithm.
我们把这些cost函数画出来 并对算法处理的最后1000个样本的cost值求平均值
33
00:03:07,450 --> 00:03:12,714
And if you do this, it kind of gives you a running estimate of how well the algorithm is doing
如果你这样做的话 它会很有效地帮你估计出
34
00:03:12,714 --> 00:03:17,049
on, you know, the last 1000 training examples that your algorithm has seen.
你的算法在最后1000个样本上的表现
35
00:03:17,049 --> 00:03:23,974
So, in contrast to computing Jtrain periodically, which needed to scan through the entire training set.
所以 我们不需要时不时地计算Jtrain 那样的话需要所有的训练样本
36
00:03:23,974 --> 00:03:27,973
With this other procedure, well, as part of stochastic gradient descent,
随机梯度下降法的这个步骤
37
00:03:27,973 --> 00:03:32,965
it doesn't cost much to compute these costs as well right before updating the parameter theta.
只需要在每次更新θ之前进行 也并不需要太大的计算量
38
00:03:32,965 --> 00:03:40,276
And all we're doing is every thousand iterations or so, we just average the last 1,000 costs that we computed and plot that.
要做的就是 每1000次迭代运算中 我们对最后1000个样本的cost值求平均然后画出来
39
00:03:40,276 --> 00:03:47,537
And by looking at those plots, this will allow us to check if stochastic gradient descent is converging.
通过观察这些画出来的图 我们就能检查出随机梯度下降是否在收敛
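To make the monitoring procedure concrete, here is a minimal Python sketch under the assumption of a linear-regression hypothesis; the function name sgd_with_cost_monitoring, the window parameter, and the shuffling step are illustrative choices rather than anything prescribed in the lecture:

```python
import numpy as np

def sgd_with_cost_monitoring(X, y, theta, alpha=0.01, window=1000):
    """Stochastic gradient descent (linear regression) that records the
    single-example cost right before each parameter update, then averages
    the last `window` costs to produce one point for the convergence plot."""
    recent_costs = []       # costs accumulated since the last plot point
    averaged_costs = []     # one averaged cost per `window` examples processed
    for i in np.random.permutation(len(y)):           # scan the (shuffled) training set
        pred = X[i] @ theta                            # h_theta(x_i)
        recent_costs.append(0.5 * (pred - y[i]) ** 2)  # cost computed before updating theta
        theta = theta - alpha * (pred - y[i]) * X[i]   # one stochastic gradient step
        if len(recent_costs) == window:
            averaged_costs.append(float(np.mean(recent_costs)))
            recent_costs = []
    return theta, averaged_costs

# Plotting averaged_costs (one point per 1,000 examples by default) gives the
# running estimate of how well the algorithm is doing, as described above.
```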
40
00:03:47,537 --> 00:03:51,708
So here are a few examples of what these plots might look like.
这是几幅画出来的图的例子
41
00:03:51,708 --> 00:03:55,519
Suppose you have plotted the cost average over the last thousand examples,
假如你已经画出了最后1000组样本的cost函数的平均值
42
00:03:55,519 --> 00:04:01,073
because these are averaged over just a thousand examples, they are going to be a little bit noisy and so,
由于它们都只是最后1000组样本的平均值 因此它们看起来会比较异常(noisy)
43
00:04:01,073 --> 00:04:03,873
it may not decrease on every single iteration.
因此cost的值不会在每一个迭代中都下降
44
00:04:03,873 --> 00:04:07,828
Then you might get a figure that looks like this. So the plot is noisy
假如你得到一种这样的图像 看起来是有噪声的
45
00:04:07,828 --> 00:04:11,721
because it's averaged over, you know, just a small subset, say a thousand training examples.
因为它是在一小部分样本 比如1000组样本中求的平均值
46
00:04:11,721 --> 00:04:17,283
If you get a figure that looks like this, you know that would be a pretty decent run with the algorithm,
如果你得到像这样的图 那么你应该判断这个算法是在下降的
47
00:04:17,283 --> 00:04:24,195
maybe, where it looks like the cost has gone down and then this plateau that looks kind of flattened out, you know, starting from around that point.
看起来代价值在下降 然后从大概这个点开始变得平缓
48
00:04:24,195 --> 00:04:29,603
If this is what your cost looks like, then maybe your learning algorithm has converged.
这就是代价函数的大致走向 这基本说明你的学习算法已经收敛了
49
00:04:29,603 --> 00:04:34,252
If you want to try using a smaller learning rate, something you might see is that
如果你想试试更小的学习速率 那么你很有可能看到的是
50
00:04:34,252 --> 00:04:39,229
the algorithm may initially learn more slowly so the cost goes down more slowly.
算法的学习变得更慢了 因此代价函数的下降也变慢了
51
00:04:39,229 --> 00:04:47,585
But then eventually, with a smaller learning rate, it is actually possible for the algorithm to end up at a maybe very slightly better solution.
但由于你使用了更小的学习速率 你很有可能会让算法收敛到一个可能更好的解
52
00:04:47,585 --> 00:04:53,426
So the red line may represent the behavior of stochastic gradient descent using a slower, smaller learning rate.
红色的曲线代表随机梯度下降使用一个更小的学习速率
53
00:04:53,426 --> 00:05:00,594
And the reason this is the case is because, you remember, stochastic gradient descent doesn't just converge to the global minimum,
出现这种情况是因为 别忘了 随机梯度下降不是直接收敛到全局最小值
54
00:05:00,594 --> 00:05:05,068
instead, what it does is the parameters will oscillate a bit around the global minimum.
而是在全局最小值附近反复振荡
55
00:05:05,068 --> 00:05:09,231
And so by using a smaller learning rate, you'll end up with smaller oscillations.
所以使用一个更小的学习速率 最终的振荡就会更小
56
00:05:09,231 --> 00:05:12,896
And sometimes this little difference will be negligible
有时候这一点小的区别可以忽略
57
00:05:12,896 --> 00:05:19,686
and sometimes with a smaller learning rate you can get a slightly better value for the parameters.
但有时候一点小的区别 你就会得到更好一点的参数
58
00:05:19,686 --> 00:05:22,269
Here are some other things that might happen.
接下来再看几种其他的情况
59
00:05:22,269 --> 00:05:27,986
Let's say you run stochastic gradient descent and you average over a thousand examples when plotting these costs.
假如你还是运行随机梯度下降 然后对1000组样本取cost函数的平均值 并且画出图像
60
00:05:27,986 --> 00:05:32,369
So, you know, here might be the result of another one of these plots.
那么这是另一种可能的图形
61
00:05:32,369 --> 00:05:34,353
Then again, it kind of looks like it's converged.
看起来这样还是已经收敛了
62
00:05:34,353 --> 00:05:42,119
If you were to take this number, a thousand, and increase it to averaging over 5,000 examples.
如果你把这个数 1000 提高到5000组样本
63
00:05:42,119 --> 00:05:47,913
Then it's possible that you might get a smoother curve that looks more like this.
那么可能你会得到一条更平滑的曲线
64
00:05:47,913 --> 00:05:56,547
And by averaging over, say 5,000 examples instead of 1,000, you might be able to get a smoother curve like this.
通过在5000个样本中求平均值 你会得到比刚才1000组样本更平滑的曲线
65
00:05:56,547 --> 00:06:00,248
And so that's the effect of increasing the number of examples you average over.
这是你增大平均的训练样本数的情形
66
00:06:00,248 --> 00:06:06,229
The disadvantage of making this too big of course is that now you get one data point only every 5,000 examples.
当然它的缺点就是 现在你的一个数据点都是5000组样本的平均结果
67
00:06:06,229 --> 00:06:12,001
And so the feedback you get on how well your learning algorithm is doing is, sort of, maybe more delayed
因此你所得到的关于学习算法表现的反馈 就显得有一些“延迟”
68
00:06:12,001 --> 00:06:17,681
because you get one data point on your plot only every 5,000 examples rather than every 1,000 examples.
因为你的每一个数据点是从5000个训练样本中得到的 而不是1000个样本
69
00:06:17,681 --> 00:06:23,911
Along a similar vein, sometimes you may run gradient descent and end up with a plot that looks like this.
沿着相似的脉络 有时候你运行梯度下降 可能也会得到这样的图像
70
00:06:23,911 --> 00:06:32,079
And with a plot that looks like this, you know, it looks like the cost just is not decreasing at all.
如果出现这种情况 你要知道 可能你的代价函数就没有在减小了
71
00:06:32,079 --> 00:06:34,023
It looks like the algorithm is just not learning.
也就是说 算法没有很好地学习
72
00:06:34,023 --> 00:06:39,261
It just looks like a flat curve here, and the cost is just not decreasing.
因为这看起来一直比较平坦 代价项并没有下降
73
00:06:39,261 --> 00:06:46,260
But again if you were to increase this to averaging over a larger number of examples
但同样地 如果你对这种情况时 也用更大量的样本进行平均
74
00:06:46,260 --> 00:06:49,729
it is possible that you see something like this red line
你很可能会观察到红线所示的情况
75
00:06:49,729 --> 00:06:55,127
it looks like the cost actually is decreasing, it's just that the blue line was averaged over too few examples,
能看得出 实际上代价函数是在下降的 只不过蓝线用来平均的样本数量太小了
76
00:06:55,127 --> 00:07:01,374
the blue line was too noisy so you couldn't see the actual trend in the cost actually decreasing
并且蓝线太嘈杂 你看不出来一个确切的趋势代价是不是在下降
77
00:07:01,374 --> 00:07:06,688
and possibly averaging over 5,000 examples instead of 1,000 may help.
所以可能用5000组样本来平均 比用1000组样本来平均 更能看出趋势
78
00:07:06,688 --> 00:07:12,358
Of course, even if we average over a larger number of examples, as we've done here over 5,000 examples,
当然 即使是使用一个较大的样本数量 比如我们用5000个样本来平均
79
00:07:12,358 --> 00:07:16,998
I'm just using a different color, it is also possible that you see a learning curve that ends up looking like this.
我用另一种颜色来表示 即使如此 你还是可能会发现 这条学习曲线是这样的
80
00:07:16,998 --> 00:07:21,197
That it's still flat even when you average over a larger number of examples.
它还是比较平坦 即使你用更多的训练样本
81
00:07:21,197 --> 00:07:25,908
And if you get that, then that's maybe just a firmer verification that
如果是这样的话 那可能就更肯定地说明
82
00:07:25,908 --> 00:07:29,287
unfortunately the algorithm just isn't learning much for whatever reason.
不知道出于什么原因 算法确实没怎么学习好
83
00:07:29,287 --> 00:07:34,969
And you need to either change the learning rate or change the features or change something else about the algorithm.
那么你就需要调整学习速率 或者改变特征量 或者改变其他的什么
84
00:07:34,969 --> 00:07:39,235
Finally, one last thing that you might see would be if you were to plot these curves
最后一种你可能会遇到的情况是 如果你画出曲线
85
00:07:39,235 --> 00:07:43,273
and you see a curve that looks like this, where it actually looks like it's increasing.
你会发现曲线是这样的 实际上是在上升
86
00:07:43,273 --> 00:07:48,066
And if that's the case then this is a sign that the algorithm is diverging.
这是一个很明显的信号 告诉你算法正在发散
87
00:07:48,066 --> 00:07:53,965
And what you really should do is use a smaller value of the learning rate alpha.
那么你要做的事 就是用一个更小一点的学习速率值
88
00:07:53,965 --> 00:07:58,143
So hopefully this gives you a sense of the range of phenomena you might see
好的 希望通过这几幅图 你能了解到
89
00:07:58,143 --> 00:08:02,946
when you plot these costs averaged over some range of examples, as well as
当你画出cost函数在某个范围的训练样本中求平均值时 各种可能出现的现象
90
00:08:02,946 --> 00:08:07,765
suggests the sorts of things you might try to do in response to seeing different plots.
也告诉你 在遇到不同的情况时 应该采取怎样的措施
91
00:08:07,765 --> 00:08:15,070
So if the plot looks too noisy, or if it wiggles up and down too much, then try increasing the number of examples
所以如果曲线看起来噪声较大 或者老是上下振动
92
00:08:15,070 --> 00:08:18,734
you're averaging over so you can see the overall trend in the plot better.
那就试试增大你要平均的样本数量 这样应该就能得到比较好的变化趋势
93
00:08:18,734 --> 00:08:25,836
And if you see that the errors are actually increasing, the costs are actually increasing, try using a smaller value of alpha.
如果你发现代价值在上升 那么就换一个小一点的α值
94
00:08:25,836 --> 00:08:31,649
Finally, it's worth examining the issue of the learning rate just a little bit more.
最后还需要再说一下关于学习速率的问题
95
00:08:31,649 --> 00:08:38,922
We saw that when we run stochastic gradient descent, the algorithm will start here and sort of meander towards the minimum
我们已经知道 当运行随机梯度下降时 算法会从某个点开始 然后曲折地逼近最小值
96
00:08:38,922 --> 00:08:43,494
And then it won't really converge, and instead it'll wander around the minimum forever.
但它不会真的收敛 而是一直在最小值附近徘徊
97
00:08:43,494 --> 00:08:50,225
And so you end up with a parameter value that is hopefully close to the global minimum, but that won't be exactly at the global minimum.
因此你最终得到的参数 实际上只是满足接近全局最小值 而不是真正的全局最小值
98
00:08:50,225 --> 00:08:57,991
In most typical implementations of stochastic gradient descent, the learning rate alpha is typically held constant.
在大多数随机梯度下降法的典型应用中 学习速率α一般是保持不变的
99
00:08:57,991 --> 00:09:02,022
And so what you end up with is exactly a picture like this.
因此你最终得到的结果一般来说是这个样子的
100
00:09:02,022 --> 00:09:06,523
If you want stochastic gradient descent to actually converge to the global minimum,
如果你想让随机梯度下降确实收敛到全局最小值
101
00:09:06,523 --> 00:09:11,825
there's one thing which you can do which is you can slowly decrease the learning rate alpha over time.
你可以随时间的变化减小学习速率α的值
102
00:09:11,825 --> 00:09:22,240
So, a pretty typical way of doing that would be to set alpha equals some constant 1 divided by iteration number plus constant 2.
所以 一种典型的方法来设置α的值 是让α等于某个常数1 除以 迭代次数加某个常数2
103
00:09:22,240 --> 00:09:28,169
So, iteration number is the number of iterations you've run of stochastic gradient descent,
迭代次数指的是你运行随机梯度下降的迭代次数
104
00:09:28,169 --> 00:09:29,519
so it's really the number of training examples you've seen
就是你算过的训练样本的数量
105
00:09:29,519 --> 00:09:34,103
And const 1 and const 2 are additional parameters of the algorithm
常数1和常数2是两个额外的参数
106
00:09:34,103 --> 00:09:38,160
that you might have to play with a bit in order to get good performance.
你需要选择一下 才能得到较好的表现
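As a rough sketch of the decay schedule just mentioned (the helper name decayed_alpha and the default values of const1 and const2 are illustrative assumptions that would need tuning):

```python
def decayed_alpha(iteration_number, const1=1.0, const2=1000.0):
    """alpha = const1 / (iteration_number + const2): as more training
    examples are processed, the learning rate slowly shrinks toward zero."""
    return const1 / (iteration_number + const2)

# Hypothetical use inside the SGD loop: alpha = decayed_alpha(t) before each update.
```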
107
00:09:38,160 --> 00:09:43,004
One of the reasons people tend not to do this is because you end up needing to spend time
但很多人不愿意用这个办法的原因是 你最后会把问题落实到
108
00:09:43,004 --> 00:09:48,122
playing with these 2 extra parameters, constant 1 and constant 2, and so this makes the algorithm more finicky.
把时间花在确定常数1和常数2上 这让算法显得更苛刻
109
00:09:48,122 --> 00:09:52,113
You know, it's just more parameters you have to fiddle with in order to make the algorithm work well.
也就是说 为了让算法更好 你要调整更多的参数
110
00:09:52,113 --> 00:09:57,246
But if you manage to tune the parameters well, then the picture you can get is that
但如果你能调整得到比较好的参数的话 你会得到的图形是
111
00:09:57,246 --> 00:10:02,834
the algorithm will actually meander around towards the minimum, but as it gets closer
你的算法会在最小值附近振荡 但当它越来越靠近最小值的时候
112
00:10:02,834 --> 00:10:07,024
because you're decreasing the learning rate the meanderings will get smaller and smaller
由于你减小了学习速率 因此这个振荡也会越来越小
113
00:10:07,024 --> 00:10:12,729
until it pretty much gets to the global minimum. I hope this makes sense, right?
直到落到几乎靠近全局最小的地方 我想这么说能听懂吧?
114
00:10:12,729 --> 00:10:21,608
And the reason this formula makes sense is because as the algorithm runs, the iteration number becomes large, so alpha will slowly become small,
这个公式起作用的原因是 随着算法的运行 迭代次数会越来越大 因此学习速率α会慢慢变小
115
00:10:21,608 --> 00:10:27,506
and so you take smaller and smaller steps until it hopefully converges to the global minimum.
因此你的每一步就会越来越小 直到最终收敛到全局最小值
116
00:10:27,506 --> 00:10:33,484
So if you do slowly decrease alpha to zero, you can end up with a slightly better hypothesis.
所以 如果你慢慢减小α的值到0 你会最后得到一个更好一点的假设
117
00:10:33,484 --> 00:10:40,078
But because of the extra work needed to fiddle with the constants and because frankly usually we're pretty happy
但由于确定这两个常数需要更多的工作量 并且我们通常也对
118
00:10:40,078 --> 00:10:43,892
with any parameter value that is, you know, pretty close to the global minimum.
能够很接近全局最小值的参数 已经很满意了
119
00:10:43,892 --> 00:10:50,863
Typically this process of decreasing alpha slowly is usually not done and keeping the learning rate alpha constant
因此我们很少采用逐渐减小α的值的方法 在随机梯度下降中
120
00:10:50,863 --> 00:10:56,983
is the more common application of stochastic gradient descent although you will see people use either version.
你看到更多的还是让α的值为常数 虽然两种做法的人都有
121
00:10:56,983 --> 00:11:03,595
To summarize, in this video we talked about a way of approximately monitoring
总结一下 这段视频中 我们介绍了一种方法
122
00:11:03,595 --> 00:11:08,256
how stochastic gradient descent is doing in terms of optimizing the cost function.
近似地监测出随机梯度下降算法在最优化代价函数中的表现
123
00:11:08,256 --> 00:11:17,043
And this is a method that does not require scanning over the entire training set periodically to compute the cost function on the entire training set.
这种方法不需要定时地扫描整个训练集 来算出整个样本集的代价函数
124
00:11:17,043 --> 00:11:20,693
But instead it looks at say only the last thousand examples or so.
而是只需要对最后1000个 或者多少个样本 进行一个平均
125
00:11:20,693 --> 00:11:27,592
And you can use this method both to make sure the stochastic gradient descent is okay and is converging
应用这种方法 你既可以保证随机梯度下降法正在正常运转和收敛
126
00:11:27,592 --> 00:11:31,468
or to use it to tune the learning rate alpha.
也可以用它来调整学习速率α的大小