1
00:00:00,000 --> 00:00:04,934
We've previously defined the cost function
J. In this video I want to tell you about
2
00:00:04,934 --> 00:00:09,634
an algorithm called gradient descent for
minimizing the cost function J. It turns
3
00:00:09,634 --> 00:00:14,275
out gradient descent is a more general
algorithm and is used not only in linear
4
00:00:14,275 --> 00:00:18,916
regression. It's actually used all over
the place in machine learning. And later
5
00:00:18,916 --> 00:00:23,791
in the class we'll use gradient descent to
minimize other functions as well, not just
6
00:00:23,791 --> 00:00:27,845
the cost function J, for linear regression.
So in this video, I'm going to
7
00:00:27,845 --> 00:00:32,558
talk about gradient descent for minimizing
some arbitrary function J. And then in
8
00:00:32,558 --> 00:00:37,406
later videos, we'll take this algorithm
and apply it specifically to the cost
9
00:00:37,406 --> 00:00:43,332
function J that we have defined for linear
regression. So here's the problem
10
00:00:43,332 --> 00:00:48,112
setup. We're going to see that we have
some function J of (theta0, theta1).
11
00:00:48,112 --> 00:00:52,773
Maybe it's a cost function from linear
regression. Maybe it's some other function
12
00:00:52,773 --> 00:00:56,801
we want to minimize. And we want
to come up with an algorithm for
13
00:00:56,801 --> 00:01:01,174
minimizing J as a function of
(theta0, theta1). Just as an aside,
14
00:01:01,174 --> 00:01:05,793
it turns out that gradient descent
actually applies to more general
15
00:01:05,793 --> 00:01:10,994
functions. So imagine if you have
a function that's a function of
16
00:01:10,994 --> 00:01:16,194
J of (theta0, theta1, theta2, up to
some theta n). And you want to
17
00:01:16,405 --> 00:01:21,795
minimize over (theta0 up to theta n)
of this J of (theta0 up to theta n).
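(In standard notation, the general problem just described is

    \min_{\theta_0, \ldots, \theta_n} J(\theta_0, \theta_1, \ldots, \theta_n)

and the rest of this video specializes to the two-parameter case, \min_{\theta_0, \theta_1} J(\theta_0, \theta_1).)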
18
00:01:21,795 --> 00:01:26,580
It turns out gradient descent
is an algorithm for solving
19
00:01:26,580 --> 00:01:31,368
this more general problem, but for the
sake of brevity, for the sake of
20
00:01:31,368 --> 00:01:35,935
succinctness of notation, I'm just
going to present only two parameters
21
00:01:36,113 --> 00:01:41,097
throughout the rest of this video. Here's
the idea for gradient descent. What we're
22
00:01:41,097 --> 00:01:45,882
going to do is we're going to start off
with some initial guesses for theta0 and
23
00:01:45,882 --> 00:01:50,788
theta1. Doesn't really matter what they
are, but a common choice would be if we
24
00:01:50,788 --> 00:01:55,452
set theta0 to 0, and
set theta1 to 0. Just initialize
25
00:01:55,452 --> 00:02:00,322
them to 0. What we're going to do in
gradient descent is we'll keep changing
26
00:02:00,322 --> 00:02:05,258
theta0 and theta1 a little bit to
try to reduce J of (theta0, theta1)
27
00:02:05,258 --> 00:02:10,571
until hopefully we wind up at a minimum or
maybe a local minimum. So, let's see
28
00:02:10,796 --> 00:02:16,106
pictures of what gradient descent
does. Let's say I try to minimize this
29
00:02:16,106 --> 00:02:20,849
function. So notice the axes. This is,
(theta0, theta1) on the horizontal
30
00:02:20,849 --> 00:02:25,774
axes, and J is a vertical axis. And so the
height of the surface shows J, and we
31
00:02:25,774 --> 00:02:30,582
want to minimize this function. So, we're
going to start off with (theta0, theta1)
32
00:02:30,582 --> 00:02:35,375
at some point. So imagine picking some
value for (theta0, theta1), and that
33
00:02:35,375 --> 00:02:39,934
corresponds to starting at some point on
the surface of this function. Okay? So
34
00:02:39,934 --> 00:02:44,201
whatever value of (theta0, theta1)
gives you some point here. I did
35
00:02:44,201 --> 00:02:48,819
initialize them to 0, but
sometimes you initialize it to other
36
00:02:48,819 --> 00:02:53,942
values as well. Now. I want us to imagine
that this figure shows a hill. Imagine
37
00:02:53,942 --> 00:02:59,178
this is like a landscape of some grassy
park with two hills like so.
38
00:02:59,178 --> 00:03:04,618
And I want you to imagine that you are
physically standing at that point on the
39
00:03:04,618 --> 00:03:09,990
hill on this little red hill in your park.
In gradient descent, what we're
40
00:03:09,990 --> 00:03:15,770
going to do is spin 360 degrees around and
just look all around us and ask, "If I were
41
00:03:15,770 --> 00:03:20,423
to take a little baby step in some
direction, and I want to go downhill as
42
00:03:20,423 --> 00:03:25,320
quickly as possible, what direction do I
take that little baby step in if I want to
43
00:03:25,320 --> 00:03:29,686
go down, if I sort of want to physically
walk down this hill as rapidly as
44
00:03:29,686 --> 00:03:34,465
possible?" Turns out that if we're standing
at that point on the hill, you look all
45
00:03:34,465 --> 00:03:39,185
around, you find that the best direction
to take a little step downhill
46
00:03:39,185 --> 00:03:44,035
is roughly that direction. Okay. And now
you're at this new point on your hill.
47
00:03:44,035 --> 00:03:49,430
You're going to, again, look all around, and then
say, "What direction should I step in order
48
00:03:49,430 --> 00:03:54,695
to take a little baby step downhill?" And
if you do that and take another step, you
49
00:03:54,695 --> 00:03:59,700
take a step in that direction, and then
you keep going. You know, from this new
50
00:03:59,700 --> 00:04:04,835
point, you look around, decide what
direction will take you downhill most
51
00:04:04,835 --> 00:04:09,775
quickly, take another step, another step,
and so on, until you converge to this,
52
00:04:09,970 --> 00:04:15,059
local minimum down here. Gradient descent
has an interesting property. This first
53
00:04:15,059 --> 00:04:19,682
time we ran gradient descent, we were
starting at this point over here, right?
54
00:04:19,682 --> 00:04:24,183
Started at that point over here. Now
imagine, we initialize gradient
55
00:04:24,183 --> 00:04:29,232
descent just a couple steps to the right.
Imagine we initialized gradient descent with
56
00:04:29,232 --> 00:04:34,159
that point on the upper right. If you were
to repeat this process, and start at that
57
00:04:34,159 --> 00:04:39,207
point, and look all around. Take a little
step in the direction of steepest descent.
58
00:04:39,207 --> 00:04:44,772
You would do that. Then look around, take
another step, and so on. And if you start
59
00:04:44,772 --> 00:04:50,570
it just a couple steps to the right, the
gradient descent will have taken you to
60
00:04:50,570 --> 00:04:56,236
this second local optimum over on the
right. So if you had started at this first
61
00:04:56,236 --> 00:05:01,602
point, you would have wound up at this
local optimum. But if you started just a
62
00:05:01,602 --> 00:05:06,762
little bit, a slightly different location,
you would have wound up at a very
63
00:05:06,762 --> 00:05:12,197
different local optimum. And this is a
property of gradient descent that we'll
64
00:05:12,197 --> 00:05:17,425
say a little bit more about later. So,
that's the intuition in pictures. Let's
65
00:05:17,425 --> 00:05:22,929
look at the math. This is the definition of
the gradient descent algorithm. We're
66
00:05:22,929 --> 00:05:28,240
going to just repeatedly do this until
convergence. We're going to update my
67
00:05:28,240 --> 00:05:33,543
parameter theta j by, you know, taking
theta j and subtracting from it alpha
68
00:05:33,543 --> 00:05:39,129
times this term over here. So, let's
see. There are a lot of details in this
69
00:05:39,129 --> 00:05:45,030
equation, so let me unpack some of it.
First, this notation here, colon equals.
70
00:05:45,030 --> 00:05:51,643
We're going to use := to denote
assignment, so it's the assignment
71
00:05:51,643 --> 00:05:57,790
operator. So concretely, if I
write A:=B, what this means in
72
00:05:57,790 --> 00:06:02,878
a computer, this means take the
value in B and use it to overwrite
73
00:06:02,878 --> 00:06:08,517
whatever the value of A. So this means we
will set A to be equal to the value of B.
74
00:06:08,517 --> 00:06:13,674
Okay, it's assignment. I can
also do A:=A+1. This means
75
00:06:13,674 --> 00:06:18,969
take A and increase its value by one.
Whereas in contrast, if I use the equals
76
00:06:18,969 --> 00:06:26,067
sign and I write A=B, then this is a
truth assertion. So if I write A=B,
77
00:06:26,067 --> 00:06:31,006
then I'm asserting that the value of A
equals to the value of B. So the left hand
78
00:06:31,006 --> 00:06:36,331
side, that's a computer operation, where
you set the value of A to be a value. The
79
00:06:36,331 --> 00:06:41,399
right hand side, this is asserting, I'm
just making a claim that the values of A
80
00:06:41,399 --> 00:06:46,274
and B are the same. And so, whereas I can
write A:=A+1, that means increment A by
81
00:06:46,274 --> 00:06:50,764
1. Hopefully, I won't ever write A=A+1.
Because that's just wrong.
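As a rough analogy in Python (not the notation on the slide: Python writes assignment as = and the truth-valued comparison as ==, but the distinction being made is the same), a minimal sketch:

```python
a, b = 0, 5

a = b          # assignment: overwrite a with the value of b (the slide's "a := b")
assert a == 5

a = a + 1      # assignment again: take a and increase its value by one ("a := a + 1")
assert a == 6

print(a == b)  # comparison: a truth assertion that never changes a or b; prints False
```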
82
00:06:50,764 --> 00:06:55,704
A and A+1 can never be equal to
the same values. So that's the first
83
00:06:55,704 --> 00:07:05,733
part of the definition. This alpha
here is a number that is called the
84
00:07:05,733 --> 00:07:12,360
learning rate. And what alpha does is, it
basically controls how big a step we take
85
00:07:12,360 --> 00:07:17,113
downhill with gradient descent. So if
alpha is very large, then that corresponds
86
00:07:17,113 --> 00:07:21,925
to a very aggressive gradient descent
procedure, where we're trying to take huge
87
00:07:21,925 --> 00:07:26,322
steps downhill. And if alpha is very
small, then we're taking little, little
88
00:07:26,322 --> 00:07:31,194
baby steps downhill. And, I'll come
back and say more about this later.
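As a rough illustration of what the size of alpha does, here is a minimal one-dimensional sketch in Python, using the stand-in function J(theta) = theta^2 with derivative 2*theta rather than the lecture's cost function:

```python
def descend(alpha, theta=1.0, steps=10):
    """Gradient descent on J(theta) = theta**2, whose derivative is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta   # theta := theta - alpha * dJ/dtheta
    return theta

print(descend(alpha=0.01))  # ~0.82: baby steps, still far from the minimum at 0
print(descend(alpha=0.1))   # ~0.11: moderate steps, closing in on 0
print(descend(alpha=1.5))   # 1024.0: overly aggressive steps overshoot and diverge
```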
89
00:07:31,194 --> 00:07:35,660
About how to set alpha and so on.
And finally, this term here. That's the
90
00:07:35,660 --> 00:07:40,582
derivative term, I don't want to talk
about it right now, but I will derive this
91
00:07:40,582 --> 00:07:45,564
derivative term and tell you exactly what
this is based on. And some of you
92
00:07:45,564 --> 00:07:50,547
will be more familiar with calculus than
others, but even if you aren't familiar
93
00:07:50,547 --> 00:07:55,469
with calculus don't worry about it, I'll
tell you what you need to know about this
94
00:07:55,469 --> 00:08:00,580
term here. Now there's one more subtlety
about gradient descent which is, in
95
00:08:00,580 --> 00:08:05,837
gradient descent, we're going to
update theta0 and theta1. So
96
00:08:05,837 --> 00:08:10,699
this update takes place where j=0, and
j=1. So you're going to update theta0,
97
00:08:10,699 --> 00:08:15,955
and update theta1. And the subtlety of
how you implement gradient descent is,
98
00:08:15,955 --> 00:08:21,562
for this expression, for this
update equation, you want to
99
00:08:21,562 --> 00:08:31,384
simultaneously update theta0 and
theta1. What I mean by that is
100
00:08:31,384 --> 00:08:36,432
that in this equation,
we're going to update
101
00:08:36,432 --> 00:08:40,975
theta0:=theta0 - something, and update
theta1:=theta1 - something.
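(Spelled out, the "something" being subtracted is alpha times the partial-derivative term, so the update being described is, in the usual notation, with the derivative term itself derived in the next video:

    repeat until convergence {
        \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)    for j = 0 and j = 1, updated simultaneously
    }
)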
102
00:08:40,975 --> 00:08:45,834
And the way to implement this is,
you should compute the right hand
103
00:08:45,834 --> 00:08:52,677
side. Compute that thing for both theta0
and theta1, and then simultaneously at
104
00:08:52,677 --> 00:08:57,469
the same time update theta0 and
theta1. So let me say what I mean
105
00:08:57,469 --> 00:09:02,024
by that. This is a correct implementation
of gradient descent meaning simultaneous
106
00:09:02,024 --> 00:09:06,461
updates. I'm going to set temp0 equals
that, set temp1 equals that. So basically
107
00:09:06,461 --> 00:09:11,430
compute the right hand sides. And then having
computed the right hand sides and stored
108
00:09:11,430 --> 00:09:15,926
them together in temp0 and temp1,
I'm going to update theta0 and theta1
109
00:09:15,926 --> 00:09:20,245
simultaneously, because that's the
correct implementation. In contrast,
110
00:09:20,245 --> 00:09:25,533
here's an incorrect implementation that
does not do a simultaneous update. So in
111
00:09:25,533 --> 00:09:31,666
this incorrect implementation, we compute
temp0, and then we update theta0
112
00:09:31,666 --> 00:09:36,644
and then we compute temp1. Then we
update theta1. And the difference between
113
00:09:36,644 --> 00:09:41,877
the right hand side and the left hand side
implementations is that if we look down
114
00:09:41,877 --> 00:09:46,791
here, you look at this step, if by this
time you've already updated theta0
115
00:09:46,791 --> 00:09:51,897
then you would be using the new
value of theta0 to compute this
116
00:09:51,897 --> 00:09:57,340
derivative term and so this gives you a
different value of temp1 than the left
117
00:09:57,340 --> 00:10:01,565
hand side, because you've now
plugged in the new value of theta0
118
00:10:01,565 --> 00:10:05,852
into this equation. And so this right
hand side is not a correct implementation
119
00:10:05,852 --> 00:10:09,916
of gradient descent. So I don't
want to say why you need to do the
120
00:10:09,916 --> 00:10:14,617
simultaneous updates, it turns
out that the way gradient descent
121
00:10:14,617 --> 00:10:18,735
is usually implemented, we'll say more
about it later, it actually turns out to
122
00:10:18,735 --> 00:10:22,496
be more natural to implement the
simultaneous update. And when people talk
123
00:10:22,496 --> 00:10:26,665
about gradient descent, they always mean
simultaneous update. If you implement the
124
00:10:26,665 --> 00:10:30,630
non-simultaneous update, it turns out
it will probably work anyway, but this
125
00:10:30,630 --> 00:10:34,747
algorithm on the right is not what people
refer to as gradient descent and
126
00:10:34,747 --> 00:10:38,356
this is some other algorithm with
different properties. And for various
127
00:10:38,356 --> 00:10:42,220
reasons, this can behave in
slightly stranger ways. And
128
00:10:42,220 --> 00:10:46,626
what you should do is to really
implement the simultaneous update of
129
00:10:46,626 --> 00:10:52,313
gradient descent. So, that's the outline of the
gradient descent algorithm. In the next video,
130
00:10:52,313 --> 00:10:56,998
we're going to go into the details of the
derivative term, which I wrote out but
131
00:10:56,998 --> 00:11:01,799
didn't really define. And if you've taken
a calculus class before and if you're
132
00:11:01,799 --> 00:11:06,367
familiar with partial derivatives and
derivatives, it turns out that's exactly
133
00:11:06,367 --> 00:11:11,425
what that derivative term is. But in case
you aren't familiar with calculus, don't
134
00:11:11,425 --> 00:11:15,680
worry about it. The next video will give
you all the intuitions and will tell you
135
00:11:15,680 --> 00:11:19,883
everything you need to know to compute
that derivative term, even if you haven't
136
00:11:19,883 --> 00:11:24,296
seen calculus, or even if you haven't seen
partial derivatives before. And with
137
00:11:24,296 --> 00:11:28,288
that, with the next video, hopefully,
we'll be able to give all the intuitions
138
00:11:28,288 --> 00:11:30,180
you need to apply gradient descent.
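Putting the pieces of this video together, here is a minimal Python sketch of the outer loop, assuming the partial derivatives of J are supplied as functions (for linear regression they are derived in the following videos). The function names, the convergence test, and the example cost function are illustrative assumptions, not from the lecture; theta0 and theta1 default to 0, the common initialization mentioned above, while the usage example starts elsewhere so the loop has some work to do.

```python
def gradient_descent(dJ_dtheta0, dJ_dtheta1, theta0=0.0, theta1=0.0,
                     alpha=0.1, tol=1e-9, max_iters=10_000):
    """Repeat the simultaneous update until theta0 and theta1 stop changing."""
    for _ in range(max_iters):
        # Correct (simultaneous) update: compute BOTH right-hand sides first...
        temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
        temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
        if abs(temp0 - theta0) < tol and abs(temp1 - theta1) < tol:
            break
        # ...and only then overwrite theta0 and theta1 together.
        # (Assigning theta0 before computing temp1 would feed the new theta0 into
        #  the derivative term -- the "incorrect" version from the video, which is
        #  not what people mean by gradient descent.)
        theta0, theta1 = temp0, temp1
    return theta0, theta1


# Illustrative usage on the bowl-shaped stand-in J(theta0, theta1) = theta0^2 + theta1^2,
# whose partial derivatives are 2*theta0 and 2*theta1 (not the linear-regression cost):
t0, t1 = gradient_descent(dJ_dtheta0=lambda a, b: 2 * a,
                          dJ_dtheta1=lambda a, b: 2 * b,
                          theta0=1.0, theta1=-2.0)
print(t0, t1)  # both end up very close to 0, the minimum
```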