Ch11 Predicting Movie Reviews - error in back propagation code #50

Open
harpreetmann24 opened this issue Sep 20, 2020 · 9 comments

@harpreetmann24

There seems to be a small mistake in the Predicting Movie Reviews code. Here is the code:

        x,y = (input_dataset[i],target_dataset[i])
        layer_1 = sigmoid(np.sum(weights_0_1[x],axis=0)) #embed + sigmoid
        layer_2 = sigmoid(np.dot(layer_1,weights_1_2)) # linear + softmax
          
        layer_2_delta = layer_2 - y # compare pred with truth
        layer_1_delta = layer_2_delta.dot(weights_1_2.T) #backprop

        weights_0_1[x] -= layer_1_delta * alpha
        weights_1_2 -= np.outer(layer_1,layer_2_delta) * alpha

Error
In the forward pass, the code applies the sigmoid activation function to layer_1.
Therefore, when we calculate layer_1_delta, should we not multiply by the derivative of sigmoid?
My understanding is that either we should not apply the sigmoid function to layer_1 at all, or, if we do apply it, then in backprop we should multiply by its derivative.
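For reference, here is a minimal sketch (not from the book) of the same update with the derivative term included; the variables are the ones from the chapter 11 code above, and the only change is the extra layer_1 * (1 - layer_1) factor, since sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)):

    x, y = (input_dataset[i], target_dataset[i])
    layer_1 = sigmoid(np.sum(weights_0_1[x], axis=0))        # embed + sigmoid
    layer_2 = sigmoid(np.dot(layer_1, weights_1_2))          # linear + sigmoid

    layer_2_delta = layer_2 - y                              # compare pred with truth
    # chain rule: scale the backpropagated error by sigmoid'(layer_1) = layer_1 * (1 - layer_1)
    layer_1_delta = layer_2_delta.dot(weights_1_2.T) * layer_1 * (1 - layer_1)

    weights_0_1[x] -= layer_1_delta * alpha
    weights_1_2 -= np.outer(layer_1, layer_2_delta) * alpha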

@Unilmalas

Unilmalas commented Mar 21, 2021

I have noticed this too and tried the following code for the weight updates:

    def sigmoidd(x):
        s = sigmoid(x)
        return s * (1 - s)

    dw12 = alpha * np.outer(layer_2_delta, sigmoidd(np.dot(layer_1, weights_1_2)))
    weights_1_2 -= dw12
    weights_0_1[x] -= np.dot(dw12, weights_1_2.T) * sigmoidd(np.sum(weights_0_1[x], axis=0))  # weight updates

This converges much slower, but the similarity comparisons seem to be a better fit.

EDIT: I think I know what has been done: sigmoid(x) = 1/2 + x/4 - x^3/48 +- ..., so sigmoid'(x) = 1/4 - 3x^2/48 +- ...
Dropping the second and higher terms of this Taylor series leads to the given weight updates (the constants 1/4 and 1/16 are dropped as they do not matter, and the x in the W_0_1 update is also dropped since it is already accounted for by the [x] indexing of the one-hot encoding).

W_1_2-=alpha * L2delta * sigmoidd(L1 * W_1_2) * L1 -> alpha * L2delta * L1

W_0_1-=alpha * L1delta * sigmoidd(L1 * W_1_2) * sigmoidd(W_0_1 * x) * x -> alpha * L1delta # x dropped since W_0_1[x]
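A quick numeric check of that approximation (my own sketch, not from the book): near x = 0 the sigmoid derivative is essentially the constant 1/4, so dropping it mostly just rescales alpha.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_deriv(x):
        s = sigmoid(x)
        return s * (1 - s)              # exact derivative

    xs = np.linspace(-0.5, 0.5, 5)
    print(sigmoid_deriv(xs))            # all values close to 0.25
    print(0.25 - xs**2 / 16)            # first two Taylor terms: 1/4 - x^2/16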

@JimChengLin

I have the same question. I am really a newbie in deep learning, but here is my thought. The most important thing about the backprop algorithm is giving the previous layer up or down pressure based on the delta. The backprop algorithm can still work without the derivative term; I am just not sure how big the impact is.

My guess is that the author tried the correct version with the derivative, but soon realized this example worked better without it. The same explanation applies to chapter 9: the derivative of softmax is not 1/(batch_size * layer_2.shape[0]).
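A tiny sketch of that "up or down pressure" point (my own illustration, not from the book): since sigmoid'(z) = s * (1 - s) is always positive, dropping it changes only the size of each weight update, never its sign, so the direction of the pressure on the previous layer is preserved.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    rng = np.random.default_rng(0)
    error = rng.standard_normal(5)             # some backpropagated error signal
    pre_activation = rng.standard_normal(5)    # hypothetical pre-activation values

    s = sigmoid(pre_activation)
    exact = error * s * (1 - s)                # delta with the derivative term
    approx = error                             # delta without it, as in the book's code

    print(np.sign(exact) == np.sign(approx))   # True everywhere: same direction, different scale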

@JimChengLin

> I have noticed this too and tried the following code for the weight updates:
>
>     def sigmoidd(x):
>         s = sigmoid(x)
>         return s * (1 - s)
>
>     dw12 = alpha * np.outer(layer_2_delta, sigmoidd(np.dot(layer_1, weights_1_2)))
>     weights_1_2 -= dw12
>     weights_0_1[x] -= np.dot(dw12, weights_1_2.T) * sigmoidd(np.sum(weights_0_1[x], axis=0))  # weight updates
>
> This converges much slower, but the similarity comparisons seem to be a better fit.
>
> EDIT: I think I know what has been done: sigmoid(x) = 1/2 + x/4 - x^3/48 +- ..., so sigmoid'(x) = 1/4 - 3x^2/48 +- ... Dropping the second and higher terms of this Taylor series leads to the given weight updates (the constants 1/4 and 1/16 are dropped as they do not matter, and the x in the W_0_1 update is also dropped since it is already accounted for by the [x] indexing of the one-hot encoding).
>
> W_1_2-=alpha * L2delta * sigmoidd(L1 * W_1_2) * L1 -> alpha * L2delta * L1
>
> W_0_1-=alpha * L1delta * sigmoidd(L1 * W_1_2) * sigmoidd(W_0_1 * x) * x -> alpha * L1delta # x dropped since W_0_1[x]

Do we really need to consider the Taylor series? Wouldn't it make things more complicated?

@Unilmalas

For me it just explained how the author came up with the approximation used. The full Taylor series is complicated, true, but it is a common approach to use just the first few terms and go from there. It works if the functions are reasonably well-behaved.

@JimChengLin

> For me it just explained how the author came up with the approximation used. The full Taylor series is complicated, true, but it is a common approach to use just the first few terms and go from there. It works if the functions are reasonably well-behaved.

Nice insight! May I ask whether you have encountered any problems in chapter 9? How did the author come up with the 1/(batch_size * layer_2.shape[0]) term?

@Unilmalas

I am not sure about that either. The derivative is stated on page 173, and given that, the code makes sense. I was not able to reproduce the derivative myself, however, for the vectors filled with 0s and 1s.

@JimChengLin

> I am not sure about that either. The derivative is stated on page 173, and given that, the code makes sense. I was not able to reproduce the derivative myself, however, for the vectors filled with 0s and 1s.

How could temp = (output - true) and output = temp/len(true) become 1/(batch_size * layer_2.shape[0])? BTW, what is the term true?
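For concreteness, a rough sketch of the two expressions being compared (this is my reading of the thread, not the book's exact chapter 9 code; softmax, n_classes, and the shapes are assumptions):

    import numpy as np

    def softmax(x):
        temp = np.exp(x)
        return temp / np.sum(temp, axis=-1, keepdims=True)

    batch_size, n_classes = 8, 10                                            # assumed shapes
    layer_2 = softmax(np.random.randn(batch_size, n_classes))
    true = np.eye(n_classes)[np.random.randint(0, n_classes, batch_size)]    # one-hot labels

    # per-example delta as in the pseudocode: temp = (output - true); delta = temp / len(true)
    per_example = (layer_2 - true) / n_classes
    # delta with the divisor the thread is asking about
    batch_version = (layer_2 - true) / (batch_size * layer_2.shape[0])
    # note: with the layout assumed here, layer_2.shape[0] == batch_size,
    # so that divisor is batch_size squared rather than batch_size * len(true)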

@JimChengLin

The book is great besides those confusing code fragments. I am mainly an infrastructure engineer, so I just skip anything I cannot understand and focus on what I can figure out. It's kind of nice to talk to someone else who also got stuck on parts of the book.

@Unilmalas

I like it too, overall; despite the bothersome issues I gave it a good Amazon review. true is the truth, I assume, i.e. the value we are training against. Poor name choice, agreed; it is practically a reserved word in Python. Well, this is the part I don't get either. I can see where the length comes from (the exponentials in the softmax for the 0 entries are e^0 = 1, but the exponents for the 1s would be e, so I am really not sure there). layer_2.shape[0] also gives the length along axis 0, so that part is fine. Honestly, I did not go into all the details of the MNIST example; I had that solved before. I have largely mastered DL, and for issues I can't solve in reasonable time I find my own solution (it's been a while since I finished this book; I am now brushing up on functional thinking).
