This repo, inspired by the influential reading list Ilya Sutskever recommended to John Carmack in 2020, curates and reproduces the essential AI and deep learning papers that shaped the field. Dubbed the "ilya30u30," the collection comprises 27 papers and resources that provide a comprehensive foundation and advanced insights into neural networks, generative models, optimization, and more. Each entry is studied and reproduced, with detailed notes and code implementations for researchers, students, and practitioners alike. Through reproduction, I aim to understand the implementation intricacies, validate the findings, and explore potential improvements. By sharing this work, I hope to provide useful insights and foster further exploration in the deep learning community.
The collection is organized to guide readers from foundational theories to advanced topics. Each entry includes a brief summary and a link to the original paper or resource. The suggested reading order below can help you navigate the materials effectively.
First, use the Reading Order section to understand the knowledge dependencies between the 27 papers. Then, explore the Reproduction Research section to delve deeper into the detailed studies, including reproductions and summaries of each paper.
- CS231n Convolutional Neural Networks for Visual Recognition
- A comprehensive course from Stanford University that covers the basics of CNNs, architectures, and training techniques—a great starting point for learning visual recognition.
- The Unreasonable Effectiveness of Recurrent Neural Networks
- Emphasizes the powerful capabilities of RNNs in handling sequential data, further solidifying your understanding of RNNs.
- Understanding LSTM Networks
- Introduces LSTMs and their advantages in handling long-term dependencies, forming a foundation for understanding RNNs.
- Recurrent Neural Network Regularization
- Learn how to optimize RNNs and LSTMs by reducing overfitting through regularization techniques.
- ImageNet Classification with Deep Convolutional Neural Networks
- Understand the basics of Convolutional Neural Networks (CNNs) and their applications in image recognition.
- Deep Residual Learning for Image Recognition
- Learn how ResNet addresses training challenges in deep networks and enhances accuracy.
- Multi-Scale Context Aggregation by Dilated Convolutions
- Delve into using dilated convolutions in semantic segmentation tasks to achieve multi-scale context aggregation.
- Neural Machine Translation by Jointly Learning to Align and Translate
- Study the foundational models and alignment mechanisms in neural machine translation.
- Attention Is All You Need and The Annotated Transformer
- Read the seminal Transformer paper, then deeply understand its implementation details and code through the annotated, line-by-line walkthrough.
- Pointer Networks
- Learn about a new network structure that addresses variable-length output sequences, suitable for tasks like sorting and combinatorial optimization.
- Scaling Laws for Neural Language Models
- Explore factors affecting the performance of large language models and their scaling laws.
- GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism
- Learn how to scale neural network capacity through micro-batch pipeline parallelism and understand parallel training techniques for large models.
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
- Understand how to build and optimize end-to-end speech recognition systems, especially for handling different languages.
- Neural Message Passing for Quantum Chemistry
- Study the application of graph neural networks in quantum chemistry, a key to understanding supervised learning on graph data structures.
- A Simple Neural Network Module for Relational Reasoning
- Learn how to enhance existing neural networks with relational reasoning modules for unstructured input.
- Relational Recurrent Neural Networks
- Explore new memory modules that perform relational reasoning in sequential data, combining memory and reasoning.
- Variational Lossy Autoencoder
- Explore the combination of autoregressive models and variational autoencoders to achieve complex generative tasks.
- The First Law of Complexodynamics
- Understand the evolution of complexity in physical systems and its relation to Kolmogorov complexity.
- Kolmogorov Complexity and Algorithmic Randomness
- Learn Kolmogorov complexity theory and its applications in algorithmic randomness, laying the foundation for unsupervised learning.
- Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton
- Delve into the quantitative measurement of complexity in closed systems and understand the trend of complexity over time.
- Neural Turing Machines
- Combine neural networks with the concept of Turing machines to expand your understanding of neural network architectures.
- Identity Mappings in Deep Residual Networks
- Further study the internal mechanisms of ResNet and the importance of identity mappings in information propagation.
- Order Matters: Sequence to Sequence for Sets
- Explore how to handle cases where inputs and outputs are not necessarily ordered sequences.
- Keeping Neural Networks Simple by Minimizing the Description Length of the Weights
- Learn how the Minimum Description Length principle applies to simplifying neural networks and balancing weight information content.
- A Tutorial Introduction to the Minimum Description Length Principle
- Deepen your understanding of the MDL principle and its applications in model selection and data compression.
- Machine Super Intelligence
- Explore the concepts and research progress in machine superintelligence, understanding the definition of intelligence and its quantifiable standards.
Note: The reproduced code and detailed notes for each paper are currently a work in progress. These sections will be updated in the `main` branch once finalized. Until then, you can find the basic information about each paper and follow our progress through ongoing updates.
Field | Name | Summary | Note | Reproduction |
---|---|---|---|---|
Foundational Theory and Neural Networks | CS231n Convolutional Neural Networks for Visual Recognition | A comprehensive course from Stanford University that covers the basics of CNNs, architectures, and training techniques—a great starting point for learning visual recognition. | note | code |
Foundational Theory and Neural Networks | The Unreasonable Effectiveness of Recurrent Neural Networks | Emphasizes the powerful capabilities of RNNs in handling sequential data, further solidifying your understanding of RNNs. | note | code |
Foundational Theory and Neural Networks | Understanding LSTM Networks | Introduces LSTMs and their advantages in handling long-term dependencies, forming a foundation for understanding RNNs. | note | code |
Foundational Theory and Neural Networks | Recurrent Neural Network Regularization | Learn how to optimize RNNs and LSTMs by reducing overfitting through regularization techniques. | note | code |
Foundational Theory and Neural Networks | ImageNet Classification with Deep Convolutional Neural Networks | Understand the basics of Convolutional Neural Networks (CNNs) and their applications in image recognition. | note | code |
Foundational Theory and Neural Networks | Deep Residual Learning for Image Recognition | Learn how ResNet addresses training challenges in deep networks and enhances accuracy. | note | code |
Foundational Theory and Neural Networks | Multi-Scale Context Aggregation by Dilated Convolutions | Delve into using dilated convolutions in semantic segmentation tasks to achieve multi-scale context aggregation. | note | code |
Machine Translation and NLP | Neural Machine Translation by Jointly Learning to Align and Translate | Study the foundational models and alignment mechanisms in neural machine translation. | note | code |
Machine Translation and NLP | Attention Is All You Need | No explanation needed—the seminal Transformer paper that is a must-read. | note ✅ | code |
Machine Translation and NLP | The Annotated Transformer | Deeply understand the Transformer model's implementation details and code through practical explanations. | note | code |
Machine Translation and NLP | Pointer Networks | Learn about a new network structure that addresses variable-length output sequences, suitable for tasks like sorting and combinatorial optimization. | note | code |
Machine Translation and NLP | Scaling Laws for Neural Language Models | Explore factors affecting the performance of large language models and their scaling laws. | note | code |
Deep Learning and Optimization | GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism | Learn how to scale neural network capacity through micro-batch pipeline parallelism and understand parallel training techniques for large models. | note | code |
Deep Learning and Optimization | Deep Speech 2: End-to-End Speech Recognition in English and Mandarin | Understand how to build and optimize end-to-end speech recognition systems, especially for handling different languages. | note | code |
Graph Structures and Relational Reasoning | Neural Message Passing for Quantum Chemistry | Study the application of graph neural networks in quantum chemistry, a key to understanding supervised learning on graph data structures. | note | code |
Graph Structures and Relational Reasoning | A Simple Neural Network Module for Relational Reasoning | Learn how to enhance existing neural networks with relational reasoning modules for unstructured input. | note | code |
Graph Structures and Relational Reasoning | Relational Recurrent Neural Networks | Explore new memory modules that perform relational reasoning in sequential data, combining memory and reasoning. | note | code |
Generative Models and Complexity | Variational Lossy Autoencoder | Explore the combination of autoregressive models and variational autoencoders to achieve complex generative tasks. | note | code |
Generative Models and Complexity | The First Law of Complexodynamics | Understand the evolution of complexity in physical systems and its relation to Kolmogorov complexity. | note | code |
Generative Models and Complexity | Kolmogorov Complexity and Algorithmic Randomness | Learn Kolmogorov complexity theory and its applications in algorithmic randomness, laying the foundation for unsupervised learning. | note | code |
Generative Models and Complexity | Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton | Delve into the quantitative measurement of complexity in closed systems and understand the trend of complexity over time. | note | code |
Advanced Topics | Neural Turing Machines | Combine neural networks with the concept of Turing machines to expand your understanding of neural network architectures. | note | code |
Advanced Topics | Identity Mappings in Deep Residual Networks | Further study the internal mechanisms of ResNet and the importance of identity mappings in information propagation. | note | code |
Advanced Topics | Order Matters: Sequence to Sequence for Sets | Explore how to handle cases where inputs and outputs are not necessarily ordered sequences. | note | code |
Advanced Topics | Keeping Neural Networks Simple by Minimizing the Description Length of the Weights | Learn how the Minimum Description Length principle applies to simplifying neural networks and balancing weight information content. | note | code |
Advanced Topics | A Tutorial Introduction to the Minimum Description Length Principle | Deepen your understanding of the MDL principle and its applications in model selection and data compression. | note | code |
Advanced Topics | Machine Super Intelligence | Explore the concepts and research progress in machine superintelligence, understanding the definition of intelligence and its quantifiable standards. | note | code |
No explanation needed—the seminal Transformer paper that is a must-read.
Link: https://arxiv.org/pdf/1706.03762
This article is a 2018 blog post from the Harvard NLP group, led by Alexander Rush (now an associate professor at Cornell). It provides a line-by-line explanation of the Transformer and includes a complete Python implementation, helping readers understand the theory while deepening their knowledge through practice.
Article: https://nlp.seas.harvard.edu/2018/04/03/attention.html
Code: https://github.com/harvardnlp/annotated-transformer/
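To make the core idea concrete, here is a minimal PyTorch sketch of scaled dot-product attention as described in the paper; the tensor shapes and the helper name `scaled_dot_product_attention` are illustrative choices, not code taken from the annotated implementation.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)                   # attention distribution
    return weights @ v                                    # weighted sum of values

# toy usage: batch of 2, 5 tokens, model width 16
q = k = v = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([2, 5, 16])
```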
This is an article titled "The First Law of Complexodynamics" by Scott Aaronson, discussing why the "complexity" or "interestingness" of physical systems seems to increase over time, reach a maximum, and then decrease, while entropy increases monotonically. Aaronson attempts to explain this phenomenon using Kolmogorov complexity and related concepts, pointing out several challenges and possible solutions in this field.
Article: https://scottaaronson.blog/?p=762
This article, written by Andrej Karpathy in 2015, emphasizes the effectiveness of Recurrent Neural Networks (RNNs). It explores the powerful capabilities of RNNs in handling sequential data.
Link: https://karpathy.github.io/2015/05/21/rnn-effectiveness/
Written in 2015 by Christopher Olah, this article introduces Long Short-Term Memory (LSTM) networks, a special kind of RNN capable of handling long-term dependencies. LSTMs have achieved great success in fields like speech recognition, language modeling, translation, and image captioning.
Link: https://colah.github.io/posts/2015-08-Understanding-LSTMs/
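As a companion to the post, the sketch below implements a single LSTM step with the standard forget, input, and output gates; the class name, sizes, and the single fused linear layer are my own simplifications, assuming the usual gate equations.

```python
import torch
import torch.nn as nn

class MinimalLSTMCell(nn.Module):
    """One LSTM step: forget, input, and output gates plus a candidate cell state."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.linear = nn.Linear(input_size + hidden_size, 4 * hidden_size)

    def forward(self, x, h, c):
        z = self.linear(torch.cat([x, h], dim=-1))
        i, f, g, o = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)                  # candidate cell state
        c_new = f * c + i * g              # keep part of the old memory, write new content
        h_new = o * torch.tanh(c_new)      # expose a filtered view of the cell state
        return h_new, c_new

cell = MinimalLSTMCell(8, 16)
x, h, c = torch.randn(4, 8), torch.zeros(4, 16), torch.zeros(4, 16)
h, c = cell(x, h, c)
```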
Authored by Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals in 2014, this paper proposes a simple regularization technique for Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) units. The paper demonstrates how to correctly apply dropout to LSTM networks, significantly reducing overfitting across various tasks, including language modeling, speech recognition, image caption generation, and machine translation.
Link: https://arxiv.org/pdf/1409.2329.pdf
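The paper's central recipe is to apply dropout only to the non-recurrent connections (between stacked layers), leaving the hidden-to-hidden recurrence untouched. A rough PyTorch equivalent is the `dropout` argument of `nn.LSTM`, sketched below with illustrative sizes.

```python
import torch
import torch.nn as nn

# Dropout is applied to the outputs passed between stacked LSTM layers
# (non-recurrent connections); the hidden-to-hidden recurrence stays intact,
# which is the regularization scheme the paper advocates.
lstm = nn.LSTM(input_size=128, hidden_size=256, num_layers=2,
               dropout=0.5, batch_first=True)

x = torch.randn(32, 20, 128)         # (batch, time, features)
output, (h_n, c_n) = lstm(x)
print(output.shape)                  # torch.Size([32, 20, 256])
```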
This paper discusses how the generalization ability of supervised neural networks improves when the weights contain less information than the output vectors of the training cases. By penalizing the information content of the weights during learning, the weights are kept simple. This can be achieved by adding Gaussian noise to control the information content of the weights.
Link: https://www.cs.toronto.edu/~hinton/absps/colt93.pdf
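One way to make the idea tangible is to perturb the weights with Gaussian noise during the forward pass, which limits how much information they can reliably carry. The layer below is a hedged sketch with an assumed noise level, not the paper's exact training procedure.

```python
import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    """Linear layer whose weights are perturbed with Gaussian noise during training,
    a rough stand-in for the paper's penalty on weight information content."""
    def __init__(self, in_features, out_features, weight_noise_std=0.05):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.weight_noise_std = weight_noise_std

    def forward(self, x):
        if self.training and self.weight_noise_std > 0:
            noisy_w = self.linear.weight + torch.randn_like(self.linear.weight) * self.weight_noise_std
            return nn.functional.linear(x, noisy_w, self.linear.bias)
        return self.linear(x)

layer = NoisyLinear(10, 4)
y = layer(torch.randn(3, 10))
```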
This paper introduces a new neural network architecture designed to learn the conditional probability of output sequences composed of discrete tokens that represent positions in an input sequence. Existing models like sequence-to-sequence and Neural Turing Machines struggle with problems where the target output dictionary size depends on the input length, such as sorting variable-length sequences and various combinatorial optimization problems.
Link: https://arxiv.org/pdf/1506.03134
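A minimal sketch of the pointer mechanism, assuming additive attention: the attention distribution over the encoder states is itself the output, so the "vocabulary" grows with the input length. Module and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    """Produce a distribution over input positions (the 'pointer')."""
    def __init__(self, hidden_size):
        super().__init__()
        self.w_enc = nn.Linear(hidden_size, hidden_size, bias=False)
        self.w_dec = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, encoder_states, decoder_state):
        # encoder_states: (batch, n_inputs, hidden); decoder_state: (batch, hidden)
        scores = self.v(torch.tanh(self.w_enc(encoder_states)
                                   + self.w_dec(decoder_state).unsqueeze(1))).squeeze(-1)
        return torch.softmax(scores, dim=-1)    # one probability per input element

ptr = PointerAttention(64)
probs = ptr(torch.randn(2, 7, 64), torch.randn(2, 64))
print(probs.shape)  # torch.Size([2, 7])
```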
Authored by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, this groundbreaking paper introduced AlexNet, revolutionizing image recognition and kickstarting the deep learning revolution. They trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into 1000 different classes.
Link: https://proceedings.neurips.cc/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf
The paper explores how the order in which data is organized affects the learning of underlying patterns. The authors investigate an extension of the sequence-to-sequence (seq2seq) framework to go beyond sequences and handle input sets in a principled way. They also propose a loss function to address the lack of structure in output sets by exploring different data sequences during training.
Link: https://arxiv.org/pdf/1511.06391
This paper introduces GPipe, a model parallelism library that allows scaling the capacity of large neural networks through micro-batch pipeline parallelism. The authors demonstrate its application in image classification and multilingual neural machine translation tasks.
Link: https://arxiv.org/pdf/1811.06965
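The sketch below illustrates only the micro-batching half of the idea (splitting a mini-batch and accumulating gradients); the pipeline schedule that overlaps micro-batches across devices, which is the heart of GPipe, is omitted. The model, sizes, and optimizer are placeholders.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(batch_x, batch_y, num_micro_batches=4):
    """Split a mini-batch into micro-batches and accumulate gradients:
    the data-splitting part of GPipe (the device pipeline itself is omitted)."""
    optimizer.zero_grad()
    for mx, my in zip(batch_x.chunk(num_micro_batches), batch_y.chunk(num_micro_batches)):
        loss = loss_fn(model(mx), my) / num_micro_batches
        loss.backward()                  # gradients accumulate across micro-batches
    optimizer.step()

train_step(torch.randn(64, 32), torch.randint(0, 10, (64,)))
```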
Authored by Kaiming He et al., this 2016 CVPR Best Paper describes the deep residual learning framework, which significantly reduces the difficulty of training very deep neural networks and improves accuracy.
Link: https://arxiv.org/pdf/1512.03385
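A minimal residual block in PyTorch, assuming the common 3x3 convolution and batch-norm layout; the channel count and the missing downsampling path are simplifications.

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """y = F(x) + x: the block only has to learn a residual on top of the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)       # skip connection: add the input back

block = BasicResidualBlock(16)
y = block(torch.randn(1, 16, 8, 8))
```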
The authors develop a novel convolutional network module tailored for dense prediction tasks like semantic segmentation. This module uses dilated convolutions to effectively aggregate multi-scale contextual information without reducing image resolution.
Link: https://arxiv.org/pdf/1511.07122
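A small sketch of the idea: stacking 3x3 convolutions with exponentially growing dilation rates expands the receptive field without any pooling or striding. The channel count and the specific dilation schedule (1, 2, 4, 8) are illustrative.

```python
import torch
import torch.nn as nn

# Padding equal to the dilation rate keeps the spatial resolution unchanged
# for a 3x3 kernel, so context grows without downsampling the feature map.
layers = []
channels = 16
for dilation in (1, 2, 4, 8):
    layers += [nn.Conv2d(channels, channels, kernel_size=3,
                         padding=dilation, dilation=dilation),
               nn.ReLU()]
context_module = nn.Sequential(*layers)

x = torch.randn(1, channels, 64, 64)
print(context_module(x).shape)   # torch.Size([1, 16, 64, 64]) -- resolution preserved
```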
The paper summarizes and organizes existing neural network models for graph-structured data that the authors believe are most promising. It proposes a general framework for supervised learning on graphs called Message Passing Neural Networks (MPNNs).
Link: https://arxiv.org/pdf/1704.01212
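A stripped-down message-passing step under simplifying assumptions (dense 0/1 adjacency, no edge features): messages from neighbors are summed, then a GRU-style update refreshes each node state. Names and sizes are mine, not the paper's exact MPNN variants.

```python
import torch
import torch.nn as nn

class SimpleMessagePassing(nn.Module):
    """One message-passing step: aggregate messages from neighbors, then update nodes."""
    def __init__(self, hidden):
        super().__init__()
        self.message_fn = nn.Linear(hidden, hidden)   # M: builds messages from neighbor states
        self.update_fn = nn.GRUCell(hidden, hidden)   # U: updates each node with its aggregated message

    def forward(self, node_states, adjacency):
        # node_states: (num_nodes, hidden); adjacency: (num_nodes, num_nodes) 0/1 matrix
        messages = adjacency @ self.message_fn(node_states)   # sum over neighbors
        return self.update_fn(messages, node_states)

mp = SimpleMessagePassing(32)
h = torch.randn(5, 32)
adj = (torch.rand(5, 5) > 0.5).float()
h = mp(h, adj)
```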
Published in 2014 by Bahdanau et al., this paper is one of the pioneering works in neural machine translation. It introduces a novel model architecture and training method that allows neural networks to automatically search for parts of a source sentence that are relevant to predicting a target word.
Link: https://arxiv.org/pdf/1409.0473
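A hedged sketch of the additive (Bahdanau-style) alignment: each source state is scored against the current decoder state, the scores are normalized into alignment weights, and the weighted sum forms the context vector. The function and parameter names are illustrative.

```python
import torch
import torch.nn as nn

def additive_attention_context(decoder_state, encoder_states, W_s, W_h, v):
    """Score each source position against the decoder state, normalize to
    alignment weights, and return the weighted context vector."""
    # decoder_state: (batch, hidden); encoder_states: (batch, src_len, hidden)
    scores = v(torch.tanh(W_s(decoder_state).unsqueeze(1) + W_h(encoder_states))).squeeze(-1)
    weights = torch.softmax(scores, dim=-1)                   # alignment over source words
    context = (weights.unsqueeze(-1) * encoder_states).sum(dim=1)
    return context, weights

hidden = 64
W_s, W_h, v = nn.Linear(hidden, hidden), nn.Linear(hidden, hidden), nn.Linear(hidden, 1)
context, weights = additive_attention_context(torch.randn(2, hidden),
                                              torch.randn(2, 9, hidden), W_s, W_h, v)
print(context.shape, weights.shape)   # torch.Size([2, 64]) torch.Size([2, 9])
```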
This paper, also by Kaiming He et al., further analyzes the propagation mechanisms behind residual blocks. The authors propose a pre-activation residual unit that simplifies the training process and improves model generalization.
Link: https://arxiv.org/pdf/1603.05027
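The reordering the paper proposes, sketched under the same simplifications as the earlier residual block: batch norm and ReLU move before each convolution, and nothing is applied after the addition, so the skip path stays a pure identity.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """'Pre-activation' ordering (BN -> ReLU -> conv): signals and gradients
    propagate directly between blocks through the untouched identity path."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + x                   # no activation after the addition

block = PreActResidualBlock(16)
y = block(torch.randn(1, 16, 8, 8))
```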
To explore relational reasoning further and test whether this capability can be easily added to existing systems, DeepMind researchers developed a simple, plug-and-play module called the Relation Network (RN). This module can be inserted into existing neural network architectures to equip them with the ability to reason about relationships between entities.
Link: https://arxiv.org/pdf/1706.01427
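A compact sketch of the Relation Network under assumed sizes: g scores every ordered pair of objects, the pairwise outputs are summed, and f maps the aggregate to the prediction. In the paper the object representations come from a CNN or LSTM; here they are just random tensors.

```python
import torch
import torch.nn as nn

class RelationNetwork(nn.Module):
    """RN(O) = f( sum over all object pairs of g(o_i, o_j) )."""
    def __init__(self, object_dim, hidden, out_dim):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(2 * object_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden), nn.ReLU())
        self.f = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                               nn.Linear(hidden, out_dim))

    def forward(self, objects):
        # objects: (batch, n_objects, object_dim)
        b, n, d = objects.shape
        o_i = objects.unsqueeze(2).expand(b, n, n, d)   # all ordered pairs (o_i, o_j)
        o_j = objects.unsqueeze(1).expand(b, n, n, d)
        pair_features = self.g(torch.cat([o_i, o_j], dim=-1)).sum(dim=(1, 2))
        return self.f(pair_features)

rn = RelationNetwork(object_dim=8, hidden=32, out_dim=10)
out = rn(torch.randn(4, 6, 8))   # e.g. 6 objects per example
```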
This paper successfully combines autoregressive models with Variational Autoencoders (VAEs) to achieve generative tasks. It addresses the issue where VAEs tend to ignore some latent representations during training and introduces the Variational Lossy Autoencoder (VLAE).
Link: https://arxiv.org/pdf/1611.02731
This paper from DeepMind and University College London introduces the Relational Memory Core (RMC), capable of performing relational reasoning in sequential information. It achieves state-of-the-art performance on the WikiText-103, Project Gutenberg, and GigaWord datasets.
Link: https://arxiv.org/pdf/1806.01822
This paper attempts to measure the pattern where the "complexity" or "interestingness" of closed systems increases over time, reaches a maximum, and then decreases, unlike entropy, which increases monotonically. The authors use a simple two-dimensional cellular automaton model to simulate the mixing of two liquids ("coffee" and "cream") and propose "structural complexity" as an approximate measure of Kolmogorov complexity.
Link: https://arxiv.org/pdf/1405.6903
Further Reading: Beauty and Structural Complexity (in Chinese)
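A toy NumPy sketch in the spirit of the paper: a binary "cream/coffee" grid is mixed by random adjacent swaps, and "apparent complexity" is approximated by the compressed size of a coarse-grained snapshot. The grid size, swap dynamics, coarse-graining block, and the use of zlib as the compressor are all my assumptions, not the paper's exact automaton or complexity measure.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
N, BLOCK = 64, 8
grid = np.zeros((N, N), dtype=np.uint8)
grid[: N // 2] = 1                        # "cream" in the top half, "coffee" below

NEIGHBORS = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def step(grid, n_swaps=2000):
    """Mimic diffusion by swapping randomly chosen adjacent cells."""
    for _ in range(n_swaps):
        x, y = rng.integers(N), rng.integers(N)
        dx, dy = NEIGHBORS[rng.integers(4)]
        x2, y2 = (x + dx) % N, (y + dy) % N
        grid[x, y], grid[x2, y2] = grid[x2, y2], grid[x, y]

def apparent_complexity(grid):
    """Compressed size of a coarse-grained snapshot: a crude stand-in for the
    paper's structural complexity, intended to be small for the ordered start
    and the fully mixed end, and larger for the tendril-filled states between."""
    coarse = grid.reshape(N // BLOCK, BLOCK, N // BLOCK, BLOCK).mean(axis=(1, 3))
    return len(zlib.compress((coarse * 255).astype(np.uint8).tobytes()))

for t in range(10):
    print(t, apparent_complexity(grid))
    step(grid)
```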
Neural Turing Machines (NTMs) combine neural networks with the concept of a Turing machine. The paper extends the capabilities of neural networks by coupling them to external memory resources, with which they can interact through attention mechanisms.
Link: https://arxiv.org/pdf/1410.5401
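A sketch of one ingredient, content-based addressing: a key emitted by the controller is compared to every memory row by cosine similarity, sharpened by a scalar, and normalized into read weights. The memory size, key width, and sharpening value are illustrative.

```python
import torch
import torch.nn.functional as F

def content_addressing(memory, key, beta):
    """NTM-style content-based read weighting: compare a key against every memory
    row with cosine similarity, sharpen with beta, and normalize with softmax."""
    # memory: (rows, width); key: (width,); beta: sharpening scalar > 0
    similarity = F.cosine_similarity(memory, key.unsqueeze(0), dim=-1)
    weights = F.softmax(beta * similarity, dim=-1)
    return weights, weights @ memory       # read vector is a weighted sum of rows

memory = torch.randn(128, 20)              # 128 memory slots of width 20
key = torch.randn(20)
weights, read_vector = content_addressing(memory, key, beta=5.0)
print(weights.shape, read_vector.shape)    # torch.Size([128]) torch.Size([20])
```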
Published by Baidu Research's Silicon Valley AI Lab, the authors demonstrate an end-to-end deep learning approach that can recognize English and Mandarin speech. They replace hand-engineered components with neural networks, handling various speech scenarios, including noisy environments and different accents.
Link: https://arxiv.org/pdf/1512.02595.pdf
In this classic OpenAI paper, the authors explore the factors that affect language model performance in terms of cross-entropy loss. They find that loss scales as a power law with model size, dataset size, and training compute, and that these factors can be traded off against one another.
Link: https://arxiv.org/pdf/2001.08361
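As a worked illustration of one of the paper's functional forms, L(N) = (N_c / N)^{alpha_N}, the snippet below fits a power law in log-log space to synthetic points; the data and constants are made up purely to show the fitting step.

```python
import numpy as np

# Synthetic (model size, loss) points for illustration only -- not the paper's data.
N = np.array([1e6, 3e6, 1e7, 3e7, 1e8, 3e8])
L = 2.0 * (1e8 / N) ** 0.076 + np.random.default_rng(0).normal(0, 0.005, N.size)

# L(N) = (N_c / N)^alpha_N is linear in log-log space:
# log L = alpha_N * (log N_c - log N), so a straight-line fit recovers the exponent.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
alpha_N = -slope
N_c = np.exp(intercept / alpha_N)
print(f"alpha_N = {alpha_N:.3f}, N_c = {N_c:.3g}")
```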
This paper provides a tutorial introduction to the Minimum Description Length (MDL) principle, a method for model selection and data compression.
Link: https://arxiv.org/pdf/math/0406077
Authored by DeepMind co-founder and chief scientist Shane Legg, this 2008 doctoral thesis is considered one of the earliest academic works to systematically explore machine superintelligence and Artificial General Intelligence (AGI), laying the foundation for subsequent research in the field.
Link: https://www.vetta.org/documents/Machine_Super_Intelligence.pdf
Published by the American Mathematical Society, this book by A. Shen, V. A. Uspenskii, and N. K. Vereshchagin introduces Kolmogorov complexity theory and its applications in algorithmic randomness, providing a theoretical foundation for understanding computational complexity and randomness.
Link: https://www.lirmm.fr/~ashen/kolmbook-eng-scan.pdf
CS231n is a machine learning course at Stanford University, focusing on using Convolutional Neural Networks for visual recognition. It comprehensively covers CNN architectures, training techniques, and the latest research findings.
Link: https://cs231n.github.io/
Ref: Exclusive Q&A: John Carmack’s ‘Different Path’ to Artificial General Intelligence
"So I asked Ilya Sutskever, OpenAI’s chief scientist, for a reading list. He gave me a list of like 40 research papers and said, ‘If you really learn all of these, you’ll know 90% of what matters today.’ And I did. I plowed through all those things and it all started sorting out in my head."
Ref: https://x.com/ID_AA_Carmack/status/1622673143469858816
I rather expected @ilyasut to have made a public post by now after all the discussion of the AI reading list he gave me. A canonical list of references from a leading figure would be appreciated by many. I would be curious myself about what he would add from the last three years.