fix silly typos
karpathy committed Mar 16, 2022
1 parent 08dc797 commit 13030ab
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions _posts/2022-03-14-lecun1989.markdown
@@ -25,7 +25,7 @@ The Yann LeCun et al. (1989) paper [Backpropagation Applied to Handwritten Zip C

**Implementation**. I tried to follow the paper as close as possible and re-implemented everything in PyTorch in this [karpathy/lecun1989-repro](https://github.com/karpathy/lecun1989-repro) github repo. The original network was implemented in Lisp using the Bottou and LeCun 1988 [backpropagation simulator SN](https://leon.bottou.org/papers/bottou-lecun-88) (later named Lush). The paper is in french so I can't super read it, but from the syntax it looks like you can specify neural nets using higher-level API similar to what you'd do in something like PyTorch today. As a quick note on software design, modern libraries have adopted a design that splits into 3 components: 1) a fast (C/CUDA) general Tensor library that implements basic mathematical operations over multi-dimensional tensors, and 2) an autograd engine that tracks the forward compute graph and can generate operations for the backward pass, and 3) a scriptable (Python) deep-learning-aware, high-level API of common deep learning operations, layers, architectures, optimizers, loss functions, etc.
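
As a rough illustration of that three-component split (not code from the repro; the shapes and modules below are made up for illustration), here is a minimal PyTorch sketch that touches each layer of the stack in turn:

```python
import torch
import torch.nn as nn

# (1) Tensor library: fast math over multi-dimensional arrays.
x = torch.randn(8, 16)                      # a batch of 8 random 16-dim inputs
w = torch.randn(16, 10, requires_grad=True)
y = x @ w                                   # plain matrix multiply

# (2) Autograd engine: the forward graph above was recorded, so the
#     backward pass can be generated automatically.
loss = (y ** 2).mean()
loss.backward()                             # fills in w.grad = d(loss)/d(w)

# (3) High-level, deep-learning-aware API: layers, optimizers, losses, etc.
model = nn.Sequential(nn.Linear(16, 32), nn.Tanh(), nn.Linear(32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.03)
out = model(x)                              # same Tensor/autograd machinery underneath
```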

-**Training**. During the course of training we have to make 23 passes over the training set of 7291 examples, for a total of 167,693 presentations of (example, label) to the neural network. The original network trained for 3 days on a [SUN-4/260](https://en.wikipedia.org/wiki/Sun-4) workstation. I ran my implementation on my MacBook Air (M1) CPU, which crunched through it in about 90 seconds (~**3000X naive speedup**). My conda is setup to use the native amd64 builds, rather than Rosetta emulation. The speedup may have been more dramatic if PyTorch had support for the full capability of the M1 (including the GPU and the NPU), but this seems to still be in development. I also tried naively running the code on an A100 GPU, but the training was actually *slower*, most likely because the network is so tiny (4 layer convnet with up to 12 channels, total of 9760 params, 64K MACs, 1K activations), and the SGD uses only a single example a time. That said, if one really wanted to crush this problem with modern hardware (A100) and software infrastructure (CUDA, PyTorch), we'd need to trade per-example SGD for full-batch training to maximize GPU utilization and most likely achieve another ~100X speedup of training latency.
+**Training**. During the course of training we have to make 23 passes over the training set of 7291 examples, for a total of 167,693 presentations of (example, label) to the neural network. The original network trained for 3 days on a [SUN-4/260](https://en.wikipedia.org/wiki/Sun-4) workstation. I ran my implementation on my MacBook Air (M1) CPU, which crunched through it in about 90 seconds (~**3000X naive speedup**). My conda is setup to use the native arm64 builds, rather than Rosetta emulation. The speedup may have been more dramatic if PyTorch had support for the full capability of the M1 (including the GPU and the NPU), but this seems to still be in development. I also tried naively running the code on an A100 GPU, but the training was actually *slower*, most likely because the network is so tiny (4 layer convnet with up to 12 channels, total of 9760 params, 64K MACs, 1K activations), and the SGD uses only a single example at a time. That said, if one really wanted to crush this problem with modern hardware (A100) and software infrastructure (CUDA, PyTorch), we'd need to trade per-example SGD for full-batch training to maximize GPU utilization and most likely achieve another ~100X speedup of training latency.
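
To make the per-example vs. full-batch contrast in the paragraph above concrete, here is a hedged sketch of the two loop structures; the model, data, and loss below are placeholders, not the repro's 1989 convnet or its actual objective:

```python
import torch
import torch.nn.functional as F

# Placeholders standing in for the real network and the 7291-example training set.
model = torch.nn.Linear(16 * 16, 10)
X = torch.randn(7291, 16 * 16)
Y = torch.randint(0, 10, (7291,))
opt = torch.optim.SGD(model.parameters(), lr=0.03)

# Per-example SGD, as in the paper: one tiny forward/backward per digit.
# Fine on a CPU, but far too little work per step to keep an A100 busy.
for i in range(X.size(0)):
    loss = F.cross_entropy(model(X[i:i+1]), Y[i:i+1])
    opt.zero_grad(); loss.backward(); opt.step()

# Full-batch alternative: one forward/backward over all 7291 examples per step,
# the kind of restructuring that would actually saturate a modern GPU.
loss = F.cross_entropy(model(X), Y)
opt.zero_grad(); loss.backward(); opt.step()
```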

**Reproducing 1989 performance**. The original paper reports the following results:

@@ -41,7 +41,7 @@ eval: split train. loss 4.073383e-03. error 0.62%. misses: 45
eval: split test . loss 2.838382e-02. error 4.09%. misses: 82
```

-So I am reproducing the numbers *roughly*, but not exactly. Sadly, an exact reproduction is most likely not possible because the original dataset has, I believe, been lost to time. Instead, I had to simulate it using the larger MNIST dataset (hah never thought I'd say that) by taking its 28x28 digits, scaling them down to 16x16 pixels with bilinear interpolation, and randomly without replacement drawing the correct number of training and test set examples from it. But I am sure there are other culprits at play. For example, the paper is a bit too abstract in its description of the weight initialization scheme, and I suspect that there are some formatting errors in the pdf file that, for example, erase dots ".", making "2.5" look like like "2 5", and potentially (I think?) erasing square roots. E.g. we're told that the weight init is drawn from uniform "2 4 / F" where F is the fan-in, but I am guessing this surely (?) means "2.4 / sqrt(F)", where the sqrt helps preserve the standard deviation of outputs. The specific sparse connectivity structure between the H1 and H2 layers of the net are also brushed over, the paper just says it is "chosen according to a scheme that will not be dicussed here", so I had to make some some sensible guessses here with an overlapping block sparse structure. The paper also claims to use tanh non-linearity, but I am worried this may have actually been the "normalized tanh" that maps ntanh(1) = 1, and potentially with an added scaled-down skip connection, which was trendy at the time to ensure there is at least a bit of gradient in the flat tails of the tanh. Lastly, the paper uses a "special version of Newton's algorithm that uses a positive, diagonal approximation of Hessian", but I only used SGD because it is signficiantly simpler and, according to the paper, "this algorithm is not believed to bring a tremendous increase in learning speed".
+So I am reproducing the numbers *roughly*, but not exactly. Sadly, an exact reproduction is most likely not possible because the original dataset has, I believe, been lost to time. Instead, I had to simulate it using the larger MNIST dataset (hah never thought I'd say that) by taking its 28x28 digits, scaling them down to 16x16 pixels with bilinear interpolation, and randomly without replacement drawing the correct number of training and test set examples from it. But I am sure there are other culprits at play. For example, the paper is a bit too abstract in its description of the weight initialization scheme, and I suspect that there are some formatting errors in the pdf file that, for example, erase dots ".", making "2.5" look like like "2 5", and potentially (I think?) erasing square roots. E.g. we're told that the weight init is drawn from uniform "2 4 / F" where F is the fan-in, but I am guessing this surely (?) means "2.4 / sqrt(F)", where the sqrt helps preserve the standard deviation of outputs. The specific sparse connectivity structure between the H1 and H2 layers of the net are also brushed over, the paper just says it is "chosen according to a scheme that will not be discussed here", so I had to make some some sensible guesses here with an overlapping block sparse structure. The paper also claims to use tanh non-linearity, but I am worried this may have actually been the "normalized tanh" that maps ntanh(1) = 1, and potentially with an added scaled-down skip connection, which was trendy at the time to ensure there is at least a bit of gradient in the flat tails of the tanh. Lastly, the paper uses a "special version of Newton's algorithm that uses a positive, diagonal approximation of Hessian", but I only used SGD because it is significantly simpler and, according to the paper, "this algorithm is not believed to bring a tremendous increase in learning speed".
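
For what it's worth, the guessed "2.4 / sqrt(F)" initialization described above is straightforward to write down; the helper and layer below are illustrative placeholders (not the actual repro code), assuming a uniform draw in [-2.4/sqrt(fan_in), 2.4/sqrt(fan_in)]:

```python
import math
import torch
import torch.nn as nn

def init_uniform_fanin_(weight: torch.Tensor, scale: float = 2.4) -> None:
    # Guessed reading of the paper: U(-scale/sqrt(F), scale/sqrt(F)) with F = fan-in,
    # so that the standard deviation of each unit's output stays roughly constant.
    fan_in = weight.shape[1] * (weight[0, 0].numel() if weight.dim() > 2 else 1)
    bound = scale / math.sqrt(fan_in)
    with torch.no_grad():
        weight.uniform_(-bound, bound)

# Example: a conv layer roughly in the spirit of the 1989 net (shapes illustrative).
conv = nn.Conv2d(1, 12, kernel_size=5)      # fan-in = 1 * 5 * 5 = 25
init_uniform_fanin_(conv.weight)
```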

**Cheating with time travel**. Around this point came my favorite part. We are living here 33 years in the future and deep learning is a highly active area of research. How much can we improve on the original result using our modern understanding and 33 years of R&D? My original result was:

@@ -116,7 +116,7 @@ In summary, simply scaling up the dataset in 1989 would have been an effective w
- A state of the art classifier that took 3 days to train on a workstation now trains in 90 seconds on my fanless laptop (3,000X naive speedup), and further ~100X gains are very likely possible by switching to full-batch optimization and utilizing a GPU.
- I was, in fact, able to tune the model, augmentation, loss function, and the optimization based on modern R&D innovations to cut down the error rate by 60%, while keeping the dataset and the test-time latency of the model unchanged.
- Modest gains were attainable just by scaling up the dataset alone.
-- Further signficant gains would likely have to come from a larger model, which would require more compute, and additional R&D to help stabilize the training at increasing scales. In particular, if I was transported to 1989, I would have ultimately become upper-bounded in my ability to further improve the system without a bigger computer.
+- Further significant gains would likely have to come from a larger model, which would require more compute, and additional R&D to help stabilize the training at increasing scales. In particular, if I was transported to 1989, I would have ultimately become upper-bounded in my ability to further improve the system without a bigger computer.

Suppose that the lessons of this exercise remain invariant in time. What does that imply about deep learning of 2022? What would a time traveler from 2055 think about the performance of current networks?

