Some comments on the "Bonus" section of this article from AI Summer.
Edit: The article has been updated with my demonstration (link).
- "When $\hat{y}^{(i)} = 1$" and "When $\hat{y}^{(i)} = 0$" are inverted.
- I have no idea why the authors replace $y^{(i)}$ with $\sigma(\theta^\intercal x)$ or $1-\sigma(\theta^\intercal x)$ when the class changes. If properly trained, the model weights should "push" the sigmoid to output 0 or 1 depending on the input $x$.
- The proposed demonstration does not actually prove anything (a quick numeric check of the two expressions is sketched right after this list):
  - When $y^{(i)} = 0$, negative class: $$MSE = \frac{1}{m}\sum_{i}^{m}{{\lVert -\hat{y}^{(i)} \rVert}^2} = \frac{1}{m}\sum_{i}^{m}{{\lVert \sigma(\theta^\intercal x) \rVert}^2}$$
  - When $y^{(i)} = 1$, positive class: $$MSE = \frac{1}{m}\sum_{i}^{m}{{\lVert 1-\hat{y}^{(i)} \rVert}^2} = \frac{1}{m}\sum_{i}^{m}{{\lVert 1-\sigma(\theta^\intercal x) \rVert}^2}$$
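As a side note, here is a minimal NumPy sketch (the logits are made-up values, not taken from the article) that just evaluates the two per-class expressions above. It only measures how far $\sigma(\theta^\intercal x)$ is from the label; it says nothing about the gradients, which is what the demonstration below looks at.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up logits theta^T x for a small batch (illustrative values only).
logits = np.array([-4.0, -1.0, 0.5, 3.0])
y_hat = sigmoid(logits)

# If every label is y = 0, MSE reduces to mean(sigmoid(theta^T x)^2).
mse_negative_class = np.mean(y_hat ** 2)

# If every label is y = 1, MSE reduces to mean((1 - sigmoid(theta^T x))^2).
mse_positive_class = np.mean((1.0 - y_hat) ** 2)

print(f"MSE, all labels 0: {mse_negative_class:.4f}")
print(f"MSE, all labels 1: {mse_positive_class:.4f}")
```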
Let's assume that we have a simple neural network with weights $\theta$ that, for a single training example $x$ with label $y$, outputs $\hat{y} = \sigma(\theta^\intercal x)$.

The chain rule gives us the gradient of the loss with respect to the weights:
$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial \hat{y}}\frac{\partial \hat{y}}{\partial \theta} = \frac{\partial L}{\partial \hat{y}}\,\sigma(\theta^\intercal x)\left(1-\sigma(\theta^\intercal x)\right)x$$

MSE loss is expressed as follows:
$$MSE = {\lVert y-\hat{y} \rVert}^2 = {\lVert y-\sigma(\theta^\intercal x) \rVert}^2$$

Thus, the gradient with respect to $\theta$ is:
$$\frac{\partial MSE}{\partial \theta} = 2\left(\sigma(\theta^\intercal x)-y\right)\sigma(\theta^\intercal x)\left(1-\sigma(\theta^\intercal x)\right)x$$

We can see that the gradient is proportional to $\sigma(\theta^\intercal x)\left(1-\sigma(\theta^\intercal x)\right)$, which goes to 0 whenever the sigmoid saturates, regardless of whether the prediction is correct. A network that is confidently wrong (e.g. $y = 0$ but $\sigma(\theta^\intercal x) \approx 1$) receives a vanishing gradient and barely learns.
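To make this concrete, here is a minimal NumPy sketch of the analytic gradient above for a single example with $y = 0$; the input $x = 1$ and the weight values are made up for illustration. As $\theta$ grows, the prediction becomes confidently wrong, yet the gradient shrinks toward zero.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single training example: true label y = 0, made-up scalar input x = 1.
x, y = 1.0, 0.0

# Larger theta => the sigmoid saturates toward 1, i.e. a confidently WRONG prediction.
for theta in [1.0, 3.0, 6.0, 10.0]:
    y_hat = sigmoid(theta * x)
    # dMSE/dtheta = 2 * (y_hat - y) * sigmoid'(theta * x) * x
    grad_mse = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat) * x
    print(f"theta={theta:5.1f}  y_hat={y_hat:.5f}  dMSE/dtheta={grad_mse:.6f}")
```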
When we try with a BCE loss:
$$BCE = -\left(y\log\sigma(\theta^\intercal x) + (1-y)\log\left(1-\sigma(\theta^\intercal x)\right)\right)$$
the $\sigma(\theta^\intercal x)\left(1-\sigma(\theta^\intercal x)\right)$ term cancels out in the chain rule and the gradient is simply:
$$\frac{\partial BCE}{\partial \theta} = \left(\sigma(\theta^\intercal x)-y\right)x$$

For $y = 0$, the gradient reduces to $\sigma(\theta^\intercal x)\,x$. If the network is right and predicted the negative class, $\sigma(\theta^\intercal x) \approx 0$ and the gradient vanishes, as it should; if the network is wrong, $\sigma(\theta^\intercal x) \approx 1$ and the gradient stays large.

For $y = 1$, the gradient reduces to $\left(\sigma(\theta^\intercal x)-1\right)x$. If the network is right, $\sigma(\theta^\intercal x) \approx 1$ and the gradient vanishes; if the network is wrong, the gradient again stays large.

Unlike MSE, the BCE gradient only vanishes when the prediction is correct, which is exactly the behavior we want for a classifier.
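For completeness, a minimal sketch comparing the two analytic gradients in the same made-up, confidently wrong setup ($y = 0$, $x = 1$): the MSE gradient collapses as the sigmoid saturates, while the BCE gradient stays close to $x$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same confidently wrong setup: true label y = 0, made-up scalar input x = 1.
x, y = 1.0, 0.0

for theta in [1.0, 3.0, 6.0, 10.0]:
    y_hat = sigmoid(theta * x)
    grad_mse = 2.0 * (y_hat - y) * y_hat * (1.0 - y_hat) * x  # shrinks as the sigmoid saturates
    grad_bce = (y_hat - y) * x                                # stays close to x when wrong
    print(f"theta={theta:5.1f}  y_hat={y_hat:.5f}  "
          f"dMSE/dtheta={grad_mse:.6f}  dBCE/dtheta={grad_bce:.6f}")
```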