"""
Deep Learning for Natural Language Processing with Pytorch
**********************************************************
**Author**: `Robert Guthrie <https://github.com/rguthrie3/DeepLearningForNLPInPytorch>`_
This tutorial will walk you through the key ideas of deep learning
programming using Pytorch. Many of the concepts (such as the computation
graph abstraction and autograd) are not unique to Pytorch and are
relevant to any deep learning tool kit out there.
I am writing this tutorial to focus specifically on NLP for people who
have never written code in any deep learning framework (e.g., TensorFlow,
Theano, Keras, Dynet). It assumes working knowledge of core NLP
problems: part-of-speech tagging, language modeling, etc. It also
assumes familiarity with neural networks at the level of an intro AI
class (such as one from the Russell and Norvig book). Usually, these
courses cover the basic backpropagation algorithm on feed-forward neural
networks, and make the point that they are chains of compositions of
linearities and non-linearities. This tutorial aims to get you started
writing deep learning code, given you have this prerequisite knowledge.
Note this is about *models*, not data. For all of the models, I just
create a few test examples with small dimensionality so you can see how
the weights change as it trains. If you have some real data you want to
try, you should be able to rip out any of the models from this notebook
and use them on it.
"""
import torch
import torch.autograd as autograd
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
torch.manual_seed(1)
######################################################################
# 1. Introduction to Torch's tensor library
# =========================================
#
######################################################################
# All of deep learning is computations on tensors, which are
# generalizations of a matrix that can be indexed in more than 2
# dimensions. We will see exactly what this means in-depth later. First,
# let's look at what we can do with tensors.
#
######################################################################
# Creating Tensors
# ~~~~~~~~~~~~~~~~
#
# Tensors can be created from Python lists with the torch.Tensor()
# function.
#
# Create a torch.Tensor object with the given data. It is a 1D vector
V_data = [1., 2., 3.]
V = torch.Tensor(V_data)
print(V)
# Creates a matrix
M_data = [[1., 2., 3.], [4., 5., 6]]
M = torch.Tensor(M_data)
print(M)
# Create a 3D tensor of size 2x2x2.
T_data = [[[1., 2.], [3., 4.]],
          [[5., 6.], [7., 8.]]]
T = torch.Tensor(T_data)
print(T)
######################################################################
# What is a 3D tensor anyway? Think about it like this. If you have a
# vector, indexing into the vector gives you a scalar. If you have a
# matrix, indexing into the matrix gives you a vector. If you have a 3D
# tensor, then indexing into the tensor gives you a matrix!
#
# A note on terminology:
# when I say "tensor" in this tutorial, it refers
# to any torch.Tensor object. Vectors and matrices are special cases of
# torch.Tensors, with dimension 1 and 2 respectively. When I am
# talking about 3D tensors, I will explicitly use the term "3D tensor".
#
# Index into V and get a scalar
print(V[0])
# Index into M and get a vector
print(M[0])
# Index into T and get a matrix
print(T[0])
######################################################################
# You can also create tensors of other datatypes. The default, as you can
# see, is Float. To create a tensor of integer types, try
# torch.LongTensor(). Check the documentation for more data types, but
# Float and Long will be the most common.
#
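# For example, here is a quick sketch of building an integer (Long) tensor
# from a Python list -- note that its entries print without decimal points:
int_tensor = torch.LongTensor([[1, 2, 3], [4, 5, 6]])
print(int_tensor)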
######################################################################
# You can create a tensor with random data and the supplied dimensionality
# with torch.randn()
#
x = torch.randn((3, 4, 5))
print(x)
######################################################################
# Operations with Tensors
# ~~~~~~~~~~~~~~~~~~~~~~~
#
# You can operate on tensors in the ways you would expect.
x = torch.Tensor([1., 2., 3.])
y = torch.Tensor([4., 5., 6.])
z = x + y
print(z)
######################################################################
# See `the documentation <http://pytorch.org/docs/torch.html>`__ for a
# complete list of the massive number of operations available to you. They
# expand beyond just mathematical operations.
#
# One helpful operation that we will make use of later is concatenation.
#
# By default, it concatenates along the first axis (concatenates rows)
x_1 = torch.randn(2, 5)
y_1 = torch.randn(3, 5)
z_1 = torch.cat([x_1, y_1])
print(z_1)
# Concatenate columns:
x_2 = torch.randn(2, 3)
y_2 = torch.randn(2, 5)
# second arg specifies which axis to concat along
z_2 = torch.cat([x_2, y_2], 1)
print(z_2)
# If your tensors are not compatible, torch will complain. Uncomment to see the error
# torch.cat([x_1, x_2])
######################################################################
# Reshaping Tensors
# ~~~~~~~~~~~~~~~~~
#
# Use the .view() method to reshape a tensor. This method receives heavy
# use, because many neural network components expect their inputs to have
# a certain shape. Often you will need to reshape before passing your data
# to the component.
#
x = torch.randn(2, 3, 4)
print(x)
print(x.view(2, 12)) # Reshape to 2 rows, 12 columns
# Same as above. If one of the dimensions is -1, its size can be inferred
print(x.view(2, -1))
######################################################################
# 2. Computation Graphs and Automatic Differentiation
# ===================================================
#
######################################################################
# The concept of a computation graph is essential to efficient deep
# learning programming, because it allows you to not have to write the
# back propagation gradients yourself. A computation graph is simply a
# specification of how your data is combined to give you the output. Since
# the graph totally specifies what parameters were involved with which
# operations, it contains enough information to compute derivatives. This
# probably sounds vague, so let's see what is going on using the
# fundamental class of Pytorch: autograd.Variable.
#
# First, think from a programmer's perspective. What is stored in the
# torch.Tensor objects we were creating above? Obviously the data and the
# shape, and maybe a few other things. But when we added two tensors
# together, we got an output tensor. All this output tensor knows is its
# data and shape. It has no idea that it was the sum of two other tensors
# (it could have been read in from a file, it could be the result of some
# other operation, etc.)
#
# The Variable class keeps track of how it was created. Let's see it in
# action.
#
# Variables wrap tensor objects
x = autograd.Variable(torch.Tensor([1., 2., 3]), requires_grad=True)
# You can access the data with the .data attribute
print(x.data)
# You can also do all the same operations you did with tensors with Variables.
y = autograd.Variable(torch.Tensor([4., 5., 6]), requires_grad=True)
z = x + y
print(z.data)
# BUT z knows something extra.
print(z.creator)
######################################################################
# So Variables know what created them. z knows that it wasn't read in from
# a file, it wasn't the result of a multiplication or exponential or
# whatever. And if you keep following z.creator, you will find yourself at
# x and y.
#
# But how does that help us compute a gradient?
#
# Let's sum up all the entries in z
s = z.sum()
print(s)
print(s.creator)
######################################################################
# So now, what is the derivative of this sum with respect to the first
# component of x? In math, we want
#
# .. math::
#
#    \frac{\partial s}{\partial x_0}
#
#
#
# Well, s knows that it was created as a sum of the tensor z. z knows
# that it was the sum x + y. So
#
# .. math:: s = \overbrace{x_0 + y_0}^\text{$z_0$} + \overbrace{x_1 + y_1}^\text{$z_1$} + \overbrace{x_2 + y_2}^\text{$z_2$}
#
# And so s contains enough information to determine that the derivative
# we want is 1!
#
# Of course this glosses over the challenge of how to actually compute
# that derivative. The point here is that s is carrying along enough
# information that it is possible to compute it. In reality, the
# developers of Pytorch program the sum() and + operations to know how to
# compute their gradients, and run the back propagation algorithm. An
# in-depth discussion of that algorithm is beyond the scope of this
# tutorial.
#
######################################################################
# Let's have Pytorch compute the gradient, and see that we were right:
# (note if you run this block multiple times, the gradient will increment.
# That is because Pytorch *accumulates* the gradient into the .grad
# property, since for many models this is very convenient.)
#
# calling .backward() on any variable will run backprop, starting from it.
s.backward()
print(x.grad)
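# A small sketch of the accumulation behavior described above: build the same
# graph again and call backward a second time, and the new gradient is *added*
# to x.grad rather than replacing it.
s2 = (x + y).sum()
s2.backward()
print(x.grad)  # each entry is now 2.0 instead of 1.0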
######################################################################
# Understanding what is going on in the block below is crucial for being a
# successful programmer in deep learning.
#
x = torch.randn((2, 2))
y = torch.randn((2, 2))
z = x + y # These are Tensor types, and backprop would not be possible
var_x = autograd.Variable(x)
var_y = autograd.Variable(y)
# var_z contains enough information to compute gradients, as we saw above
var_z = var_x + var_y
print(var_z.creator)
var_z_data = var_z.data # Get the wrapped Tensor object out of var_z...
# Re-wrap the tensor in a new variable
new_var_z = autograd.Variable(var_z_data)
# ... does new_var_z have information to backprop to x and y?
# NO!
print(new_var_z.creator)
# And how could it? We yanked the tensor out of var_z (that is
# what var_z.data is). This tensor doesn't know anything about
# how it was computed. We pass it into new_var_z, and this is all the
# information new_var_z gets. If var_z_data doesn't know how it was
# computed, there's no way new_var_z will.
# In essence, we have broken the variable away from its past history.
######################################################################
# Here is the basic, extremely important rule for computing with
# autograd.Variables (note this is more general than Pytorch. There is an
# equivalent object in every major deep learning toolkit):
#
# **If you want the error from your loss function to backpropagate to a
# component of your network, you MUST NOT break the Variable chain from
# that component to your loss Variable. If you do, the loss will have no
# idea your component exists, and its parameters can't be updated.**
#
# I say this in bold, because this error can creep up on you in very
# subtle ways (I will show some such ways below), and it will not cause
# your code to crash or complain, so you must be careful.
#
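# As a tiny illustrative sketch of how this can bite you (the names here are
# made up for the example): suppose we pass an input through a Linear layer,
# but then accidentally re-wrap the output's .data before calling backward.
lin_example = nn.Linear(3, 1)
inp = autograd.Variable(torch.randn(1, 3))
out = lin_example(inp)
broken = autograd.Variable(out.data, requires_grad=True)  # chain broken here!
broken.sum().backward()           # backprop stops at `broken`
print(lin_example.weight.grad)    # None -- the loss never saw lin_example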
######################################################################
# 3. Deep Learning Building Blocks: Affine maps, non-linearities and objectives
# =============================================================================
#
######################################################################
# Deep learning consists of composing linearities with non-linearities in
# clever ways. The introduction of non-linearities allows for powerful
# models. In this section, we will play with these core components, make
# up an objective function, and see how the model is trained.
#
######################################################################
# Affine Maps
# ~~~~~~~~~~~
#
# One of the core workhorses of deep learning is the affine map, which is
# a function :math:`f(x)` where
#
# .. math:: f(x) = Ax + b
#
# for a matrix :math:`A` and vectors :math:`x, b`. The parameters to be
# learned here are :math:`A` and :math:`b`. Often, :math:`b` is referred to
# as the *bias* term.
#
######################################################################
# Pytorch and most other deep learning frameworks do things a little
# differently from traditional linear algebra: they map the rows of the
# input instead of the columns. That is, the :math:`i`'th row of the
# output below is the mapping of the :math:`i`'th row of the input under
# :math:`A`, plus the bias term. Look at the example below.
#
lin = nn.Linear(5, 3) # maps from R^5 to R^3, parameters A, b
# data is 2x5. A maps from 5 to 3... can we map "data" under A?
data = autograd.Variable(torch.randn(2, 5))
print(lin(data)) # yes
######################################################################
# Non-Linearities
# ~~~~~~~~~~~~~~~
#
# First, note the following fact, which will explain why we need
# non-linearities in the first place. Suppose we have two affine maps
# :math:`f(x) = Ax + b` and :math:`g(x) = Cx + d`. What is
# :math:`f(g(x))`?
#
# .. math:: f(g(x)) = A(Cx + d) + b = ACx + (Ad + b)
#
# :math:`AC` is a matrix and :math:`Ad + b` is a vector, so we see that
# composing affine maps gives you an affine map.
#
# From this, you can see that if your neural network were just a long
# chain of affine compositions, it would add no more power to your model
# than a single affine map.
#
# If we introduce non-linearities in between the affine layers, this is no
# longer the case, and we can build much more powerful models.
#
# There are a few core non-linearities.
# :math:`\tanh(x), \sigma(x), \text{ReLU}(x)` are the most common. You are
# probably wondering: "why these functions? I can think of plenty of other
# non-linearities." The reason for this is that they have gradients that
# are easy to compute, and computing gradients is essential for learning.
# For example
#
# .. math:: \frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x))
#
# A quick note: although you may have learned some neural networks in your
# intro to AI class where :math:`\sigma(x)` was the default non-linearity,
# typically people shy away from it in practice. This is because the
# gradient *vanishes* very quickly as the absolute value of the argument
# grows. Small gradients mean it is hard to learn. Most people default to
# tanh or ReLU.
#
# In pytorch, most non-linearities are in torch.nn.functional (we have it imported as F)
# Note that non-linearites typically don't have parameters like affine maps do.
# That is, they don't have weights that are updated during training.
data = autograd.Variable(torch.randn(2, 2))
print(data)
print(F.relu(data))
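# A quick numeric check of the claim above that composing two affine maps
# yields another affine map (a sketch; f_map and g_map are throwaway layers).
# nn.Linear computes x A^T + b on row vectors, so the composition f(g(x)) is a
# single affine map with weight A_f A_g and bias b_g A_f^T + b_f.
f_map = nn.Linear(3, 3)
g_map = nn.Linear(3, 3)
x_check = autograd.Variable(torch.randn(1, 3))
A_combined = f_map.weight.mm(g_map.weight)
b_combined = f_map.bias.view(1, -1) + g_map.bias.view(1, -1).mm(f_map.weight.t())
print(f_map(g_map(x_check)))                    # composition of the two maps
print(x_check.mm(A_combined.t()) + b_combined)  # one equivalent affine map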
######################################################################
# Softmax and Probabilities
# ~~~~~~~~~~~~~~~~~~~~~~~~~
#
# The function :math:`\text{Softmax}(x)` is also just a non-linearity, but
# it is special in that it usually is the last operation done in a
# network. This is because it takes in a vector of real numbers and
# returns a probability distribution. Its definition is as follows. Let
# :math:`x` be a vector of real numbers (positive, negative, whatever,
# there are no constraints). Then the i'th component of
# :math:`\text{Softmax}(x)` is
#
# .. math:: \frac{\exp(x_i)}{\sum_j \exp(x_j)}
#
# It should be clear that the output is a probability distribution: each
# element is non-negative and the sum over all components is 1.
#
# You could also think of it as just applying an element-wise
# exponentiation operator to the input to make everything non-negative and
# then dividing by the normalization constant.
#
# Softmax is also in torch.nn.functional
data = autograd.Variable(torch.randn(5))
print(data)
print(F.softmax(data))
print(F.softmax(data).sum()) # Sums to 1 because it is a distribution!
print(F.log_softmax(data)) # theres also log_softmax
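# A quick check of the "exponentiate, then normalize" description above
# (a sketch; the normalizer is pulled out as a plain Python number for simplicity)
normalizer = data.exp().sum().data[0]
print(data.exp() / normalizer)  # matches F.softmax(data)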
######################################################################
# Objective Functions
# ~~~~~~~~~~~~~~~~~~~
#
# The objective function is the function that your network is being
# trained to minimize (in which case it is often called a *loss function*
# or *cost function*). Training proceeds by first choosing a training
# instance, running it through your neural network, and then computing the
# loss of the output. The parameters of the model are then updated by
# taking the derivative of the loss function. Intuitively, if your model
# is completely confident in its answer, and its answer is wrong, your
# loss will be high. If it is very confident in its answer, and its answer
# is correct, the loss will be low.
#
# The idea behind minimizing the loss function on your training examples
# is that your network will hopefully generalize well and have small loss
# on unseen examples in your dev set, test set, or in production. An
# example loss function is the *negative log likelihood loss*, which is a
# very common objective for multi-class classification. For supervised
# multi-class classification, this means training the network to minimize
# the negative log probability of the correct output (or equivalently,
# maximize the log probability of the correct output).
#
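# A minimal sketch of the negative log likelihood idea: the loss is just the
# negative log probability that the model assigned to the correct class.
# (The tensors below are made up purely for illustration.)
log_probs_example = F.log_softmax(autograd.Variable(torch.randn(1, 3)))
target_example = autograd.Variable(torch.LongTensor([1]))  # say class 1 is correct
print(nn.NLLLoss()(log_probs_example, target_example))
print(-log_probs_example[0][1])  # the same number, picked out by hand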
######################################################################
# 4. Optimization and Training
# ============================
#
######################################################################
# So we can compute a loss function for an instance. What do we do
# with that? We saw earlier that autograd.Variables know how to compute
# gradients with respect to the things that were used to compute them. Well,
# since our loss is an autograd.Variable, we can compute gradients with
# respect to all of the parameters used to compute it! Then we can perform
# standard gradient updates. Let :math:`\theta` be our parameters,
# :math:`L(\theta)` the loss function, and :math:`\eta` a positive
# learning rate. Then:
#
# .. math:: \theta^{(t+1)} = \theta^{(t)} - \eta \nabla_\theta L(\theta)
#
# There is a huge collection of algorithms and active research in
# attempting to do something more than just this vanilla gradient update.
# Many attempt to vary the learning rate based on what is happening at
# train time. You don't need to worry about what specifically these
# algorithms are doing unless you are really interested. Torch provides
# many in the torch.optim package, and they are all completely
# transparent: using the simplest gradient update looks the same in your code
# as using the more complicated algorithms. Trying different update algorithms and different
# parameters for the update algorithms (like different initial learning
# rates) is important in optimizing your network's performance. Often,
# just replacing vanilla SGD with an optimizer like Adam or RMSProp will
# boost performance noticeably.
#
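# A tiny sketch of the update rule above on a single throwaway parameter
# (theta and loss_example are illustrative and not used anywhere else):
theta = autograd.Variable(torch.randn(2), requires_grad=True)
sgd = optim.SGD([theta], lr=0.1)
loss_example = (theta * theta).sum()  # a made-up loss L(theta)
loss_example.backward()
sgd.step()                            # theta <- theta - 0.1 * grad
print(theta)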
######################################################################
# 5. Creating Network Components in Pytorch
# =========================================
#
# Before we move on to our focus on NLP, let's do an annotated example of
# building a network in Pytorch using only affine maps and
# non-linearities. We will also see how to compute a loss function, using
# Pytorch's built in negative log likelihood, and update parameters by
# backpropagation.
#
# All network components should inherit from nn.Module and override the
# forward() method. That is about it, as far as the boilerplate is
# concerned. Inheriting from nn.Module provides functionality to your
# component. For example, it keeps track of the component's trainable
# parameters, lets you swap it between CPU and GPU with the .cuda() or
# .cpu() methods, etc.
#
# Let's write an annotated example of a network that takes in a sparse
# bag-of-words representation and outputs a probability distribution over
# two labels: "English" and "Spanish". This model is just logistic
# regression.
#
######################################################################
# Example: Logistic Regression Bag-of-Words classifier
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Our model will map a sparse BOW representation to log probabilities over
# labels. We assign each word in the vocab an index. For example, say our
# entire vocab is two words "hello" and "world", with indices 0 and 1
# respectively. The BoW vector for the sentence "hello hello hello hello"
# is
#
# .. math:: \left[ 4, 0 \right]
#
# For "hello world world hello", it is
#
# .. math:: \left[ 2, 2 \right]
#
# etc. In general, it is
#
# .. math:: \left[ \text{Count}(\text{hello}), \text{Count}(\text{world}) \right]
#
# Denote this BOW vector as :math:`x`. The output of our network is:
#
# .. math:: \log \text{Softmax}(Ax + b)
#
# That is, we pass the input through an affine map and then do log
# softmax.
#
data = [("me gusta comer en la cafeteria".split(), "SPANISH"),
("Give it to me".split(), "ENGLISH"),
("No creo que sea una buena idea".split(), "SPANISH"),
("No it is not a good idea to get lost at sea".split(), "ENGLISH")]
test_data = [("Yo creo que si".split(), "SPANISH"),
("it is lost on me".split(), "ENGLISH")]
# word_to_ix maps each word in the vocab to a unique integer, which will be its
# index into the Bag of words vector
word_to_ix = {}
for sent, _ in data + test_data:
    for word in sent:
        if word not in word_to_ix:
            word_to_ix[word] = len(word_to_ix)
print(word_to_ix)
VOCAB_SIZE = len(word_to_ix)
NUM_LABELS = 2
class BoWClassifier(nn.Module):  # inheriting from nn.Module!

    def __init__(self, num_labels, vocab_size):
        # calls the init function of nn.Module. Don't get confused by syntax,
        # just always do it in an nn.Module
        super(BoWClassifier, self).__init__()

        # Define the parameters that you will need. In this case, we need A and b,
        # the parameters of the affine mapping.
        # Torch defines nn.Linear(), which provides the affine map.
        # Make sure you understand why the input dimension is vocab_size
        # and the output is num_labels!
        self.linear = nn.Linear(vocab_size, num_labels)

        # NOTE! The non-linearity log softmax does not have parameters! So we don't need
        # to worry about that here

    def forward(self, bow_vec):
        # Pass the input through the linear layer,
        # then pass that through log_softmax.
        # Many non-linearities and other functions are in torch.nn.functional
        return F.log_softmax(self.linear(bow_vec))
def make_bow_vector(sentence, word_to_ix):
    vec = torch.zeros(len(word_to_ix))
    for word in sentence:
        vec[word_to_ix[word]] += 1
    return vec.view(1, -1)


def make_target(label, label_to_ix):
    return torch.LongTensor([label_to_ix[label]])
model = BoWClassifier(NUM_LABELS, VOCAB_SIZE)
# The model knows its parameters. The first output below is A, the second is b.
# Whenever you assign a component to a class variable in the __init__ function
# of a module, as was done with the line
# self.linear = nn.Linear(...)
# then, through some Python magic from the Pytorch devs, your module
# (in this case, BoWClassifier) will store knowledge of the nn.Linear's parameters.
for param in model.parameters():
    print(param)
# To run the model, pass in a BoW vector, but wrapped in an autograd.Variable
sample = data[0]
bow_vector = make_bow_vector(sample[0], word_to_ix)
log_probs = model(autograd.Variable(bow_vector))
print(log_probs)
######################################################################
# Which of the above values corresponds to the log probability of ENGLISH,
# and which to SPANISH? We never defined it, but we need to if we want to
# train the thing.
#
label_to_ix = {"SPANISH": 0, "ENGLISH": 1}
######################################################################
# So lets train! To do this, we pass instances through to get log
# probabilities, compute a loss function, compute the gradient of the loss
# function, and then update the parameters with a gradient step. Loss
# functions are provided by Torch in the nn package. nn.NLLLoss() is the
# negative log likelihood loss we want. Torch also defines optimization
# functions in torch.optim. Here, we will just use SGD.
#
# Note that the *input* to NLLLoss is a vector of log probabilities, and a
# target label. It doesn't compute the log probabilities for us. This is
# why the last layer of our network is log softmax. The loss function
# nn.CrossEntropyLoss() is the same as NLLLoss(), except it does the log
# softmax for you.
#
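# A quick sketch of the NLLLoss / CrossEntropyLoss relationship described above
# (illustrative tensors only): CrossEntropyLoss on raw scores gives the same
# value as NLLLoss on log-softmaxed scores.
scores_example = autograd.Variable(torch.randn(1, 2))
target_example = autograd.Variable(torch.LongTensor([0]))
print(nn.CrossEntropyLoss()(scores_example, target_example))
print(nn.NLLLoss()(F.log_softmax(scores_example), target_example))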
# Run on test data before we train, just to see a before-and-after
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)
# Print the matrix column corresponding to "creo"
print(next(model.parameters())[:, word_to_ix["creo"]])
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
# Usually you want to pass over the training data several times.
# 100 is much bigger than on a real data set, but real datasets have more than
# two instances. Usually, somewhere between 5 and 30 epochs is reasonable.
for epoch in range(100):
    for instance, label in data:
        # Step 1. Remember that Pytorch accumulates gradients.
        # We need to clear them out before each instance
        model.zero_grad()

        # Step 2. Make our BOW vector and also we must wrap the target in a
        # Variable as an integer. For example, if the target is SPANISH, then
        # we wrap the integer 0. The loss function then knows that the 0th
        # element of the log probabilities is the log probability
        # corresponding to SPANISH
        bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
        target = autograd.Variable(make_target(label, label_to_ix))

        # Step 3. Run our forward pass.
        log_probs = model(bow_vec)

        # Step 4. Compute the loss, gradients, and update the parameters by
        # calling optimizer.step()
        loss = loss_function(log_probs, target)
        loss.backward()
        optimizer.step()
for instance, label in test_data:
    bow_vec = autograd.Variable(make_bow_vector(instance, word_to_ix))
    log_probs = model(bow_vec)
    print(log_probs)
# Index corresponding to Spanish goes up, English goes down!
print(next(model.parameters())[:, word_to_ix["creo"]])
######################################################################
# We got the right answer! You can see that the log probability for
# Spanish is much higher in the first example, and the log probability for
# English is much higher in the second for the test data, as it should be.
#
# Now you see how to make a Pytorch component, pass some data through it
# and do gradient updates. We are ready to dig deeper into what deep NLP
# has to offer.
#
######################################################################
# 6. Word Embeddings: Encoding Lexical Semantics
# ==============================================
#
######################################################################
# Word embeddings are dense vectors of real numbers, one per word in your
# vocabulary. In NLP, it is almost always the case that your features are
# words! But how should you represent a word in a computer? You could
# store its ASCII character representation, but that only tells you what
# the word *is*, it doesn't say much about what it *means* (you might be
# able to derive its part of speech from its affixes, or properties from
# its capitalization, but not much). Even more, in what sense could you
# combine these representations? We often want dense outputs from our
# neural networks, where the inputs are :math:`|V|` dimensional, where
# :math:`V` is our vocabulary, but often the outputs are only a few
# dimensional (if we are only predicting a handful of labels, for
# instance). How do we get from a massive dimensional space to a smaller
# dimensional space?
#
# How about instead of ASCII representations, we use a one-hot encoding?
# That is, we represent the word :math:`w` by
#
# .. math:: \overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}
#
# where the 1 is in a location unique to :math:`w`. Any other word will
# have a 1 in some other location, and a 0 everywhere else.
#
# There is an enormous drawback to this representation, besides just how
# huge it is. It basically treats all words as independent entities with
# no relation to each other. What we really want is some notion of
# *similarity* between words. Why? Let's see an example.
#
######################################################################
# Suppose we are building a language model. Suppose we have seen the
# sentences
#
# * The mathematician ran to the store.
# * The physicist ran to the store.
# * The mathematician solved the open problem.
#
# in our training data. Now suppose we get a new sentence never before
# seen in our training data:
#
# * The physicist solved the open problem.
#
# Our language model might do OK on this sentence, but wouldn't it be much
# better if we could use the following two facts:
#
# * We have seen mathematician and physicist in the same role in a sentence. Somehow they
#   have a semantic relation.
# * We have seen mathematician in the same role in this new unseen sentence
#   as we are now seeing physicist.
#
# and then infer that physicist is actually a good fit in the new unseen
# sentence? This is what we mean by a notion of similarity: we mean
# *semantic similarity*, not simply having similar orthographic
# representations. It is a technique to combat the sparsity of linguistic
# data, by connecting the dots between what we have seen and what we
# haven't. This example of course relies on a fundamental linguistic
# assumption: that words appearing in similar contexts are related to each
# other semantically. This is called the `distributional
# hypothesis <https://en.wikipedia.org/wiki/Distributional_semantics>`__.
#
######################################################################
# Getting Dense Word Embeddings
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# How can we solve this problem? That is, how could we actually encode
# semantic similarity in words? Maybe we think up some semantic
# attributes. For example, we see that both mathematicians and physicists
# can run, so maybe we give these words a high score for the "is able to
# run" semantic attribute. Think of some other attributes, and imagine
# what you might score some common words on those attributes.
#
# If each attribute is a dimension, then we might give each word a vector,
# like this:
#
# .. math::
#
#    q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
#    \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]
#
# .. math::
#
#    q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
#    \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]
#
# Then we can get a measure of similarity between these words by doing:
#
# .. math:: \text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}
#
# Although it is more common to normalize by the lengths:
#
# .. math::
#
#    \text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
#    {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)
#
# Where :math:`\phi` is the angle between the two vectors. That way,
# extremely similar words (words whose embeddings point in the same
# direction) will have similarity 1. Extremely dissimilar words should
# have similarity -1.
#
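# A small sketch of the similarity computation above, using the hand-crafted
# 3-dimensional vectors from the example (the numbers are illustrative only):
q_math = torch.Tensor([2.3, 9.4, -5.5])
q_phys = torch.Tensor([2.5, 9.1, 6.4])
dot = torch.dot(q_phys, q_math)
cosine = dot / (q_phys.norm() * q_math.norm())
print(dot, cosine)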
######################################################################
# You can think of the sparse one-hot vectors from the beginning of this
# section as a special case of these new vectors we have defined, where
# each word basically has similarity 0, and we gave each word some unique
# semantic attribute. These new vectors are *dense*, which is to say their
# entries are (typically) non-zero.
#
# But these new vectors are a big pain: you could think of thousands of
# different semantic attributes that might be relevant to determining
# similarity, and how on earth would you set the values of the different
# attributes? Central to the idea of deep learning is that the neural
# network learns representations of the features, rather than requiring
# the programmer to design them herself. So why not just let the word
# embeddings be parameters in our model, and then be updated during
# training? This is exactly what we will do. We will have some *latent
# semantic attributes* that the network can, in principle, learn. Note
# that the word embeddings will probably not be interpretable. That is,
# although with our hand-crafted vectors above we can see that
# mathematicians and physicists are similar in that they both like coffee,
# if we allow a neural network to learn the embeddings and see that both
# mathematicians and physicists have a large value in the second
# dimension, it is not clear what that means. They are similar in some
# latent semantic dimension, but this probably has no interpretation to
# us.
#
######################################################################
# In summary, **word embeddings are a representation of the *semantics* of
# a word, efficiently encoding semantic information that might be relevant
# to the task at hand**. You can embed other things too: part of speech
# tags, parse trees, anything! The idea of feature embeddings is central
# to the field.
#
######################################################################
# Word Embeddings in Pytorch
# ~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Before we get to a worked example and an exercise, a few quick notes
# about how to use embeddings in Pytorch and in deep learning programming
# in general. Similar to how we defined a unique index for each word when
# making one-hot vectors, we also need to define an index for each word
# when using embeddings. These will be keys into a lookup table. That is,
# embeddings are stored as a :math:`|V| \times D` matrix, where :math:`D`
# is the dimensionality of the embeddings, such that the word assigned
# index :math:`i` has its embedding stored in the :math:`i`'th row of the
# matrix. In all of my code, the mapping from words to indices is a
# dictionary named word\_to\_ix.
#
# The module that allows you to use embeddings is torch.nn.Embedding,
# which takes two arguments: the vocabulary size, and the dimensionality
# of the embeddings.
#
# To index into this table, you must use torch.LongTensor (since the
# indices are integers, not floats).
#
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5) # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.LongTensor([word_to_ix["hello"]])
hello_embed = embeds(autograd.Variable(lookup_tensor))
print(hello_embed)
######################################################################
# An Example: N-Gram Language Modeling
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
# Recall that in an n-gram language model, given a sequence of words
# :math:`w`, we want to compute
#
# .. math:: P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )
#
# Where :math:`w_i` is the ith word of the sequence.
#
# In this example, we will compute the loss function on some training
# examples and update the parameters with backpropagation.
#
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples. Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])
vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out)
        return log_probs
losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)
for epoch in range(10):
    total_loss = torch.Tensor([0])
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e., turn the words
        # into integer indices and wrap them in variables)
        context_idxs = [word_to_ix[w] for w in context]
        context_var = autograd.Variable(torch.LongTensor(context_idxs))

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a new instance,
        # you need to zero out the gradients from the old instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_var)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a variable)
        loss = loss_function(log_probs, autograd.Variable(
            torch.LongTensor([word_to_ix[target]])))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        total_loss += loss.data
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!
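# Now that the model is trained, we can read a learned embedding straight out
# of its embedding table (a usage sketch; "beauty" is just one word that appears
# in the sonnet above):
beauty_lookup = autograd.Variable(torch.LongTensor([word_to_ix["beauty"]]))
print(model.embeddings(beauty_lookup))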
######################################################################
# Exercise: Computing Word Embeddings: Continuous Bag-of-Words
# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#
######################################################################
# The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
# learning. It is a model that tries to predict words given the context of
# a few words before and a few words after the target word. This is
# distinct from language modeling, since CBOW is not sequential and does
# not have to be probabilistic. Typically, CBOW is used to quickly train
# word embeddings, and these embeddings are used to initialize the
# embeddings of some more complicated model. Usually, this is referred to
# as *pretraining embeddings*. It almost always helps performance a couple
# of percent.
#
# The CBOW model is as follows. Given a target word :math:`w_i` and an
# :math:`N` context window on each side, :math:`w_{i-1}, \dots, w_{i-N}`
# and :math:`w_{i+1}, \dots, w_{i+N}`, referring to all context words
# collectively as :math:`C`, CBOW tries to minimize
#
# .. math:: -\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)_{w_i}
#
# where :math:`q_w` is the embedding for word :math:`w`.
#
# Implement this model in Pytorch by filling in the class below. Some
# tips:
#
# * Think about which parameters you need to define.
# * Make sure you know what shape each operation expects. Use .view() if you need to