extra_reading.txt
The Marginal Value of Adaptive Gradient Methods in Machine Learning
https://arxiv.org/abs/1705.08292
Asynchronous Stochastic Gradient Descent with Delay Compensation for Distributed Deep Learning
https://arxiv.org/abs/1609.08326
Asynchronous Stochastic Gradient Descent with Variance Reduction for Non-Convex Optimization
https://arxiv.org/abs/1604.03584
Large Scale Distributed Deep Networks
https://static.googleusercontent.com/media/research.google.com/en//archive/large_deep_networks_nips2012.pdf
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Sergey Ioffe, Christian Szegedy
https://arxiv.org/abs/1502.03167
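A minimal NumPy sketch of the batch-norm forward pass from the Ioffe & Szegedy paper (training-time statistics only; function and variable names are illustrative, not from the paper's code):

```python
import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    # Normalize each feature over the mini-batch, then apply the
    # learned scale (gamma) and shift (beta) parameters.
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta
```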
Xavier (Glorot) Normal Initializer
http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf
He Normal Initializer
http://arxiv.org/abs/1502.01852
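A sketch of the two initializers above, assuming a dense weight matrix of shape (fan_in, fan_out); the function names are mine, not from either paper:

```python
import numpy as np

def glorot_normal(fan_in, fan_out):
    # Glorot/Xavier: variance 2 / (fan_in + fan_out),
    # derived for tanh/sigmoid-style activations.
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * std

def he_normal(fan_in, fan_out):
    # He et al.: variance 2 / fan_in, derived for ReLU units.
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std
```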
For understanding Nesterov Momentum:
Advances in Optimizing Recurrent Networks by Yoshua Bengio et al., Section 3.5
http://arxiv.org/pdf/1212.0901v2.pdf
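The Section 3.5 formulation of Nesterov momentum can be sketched as follows (a single update step; lr and mu values are illustrative defaults):

```python
import numpy as np

def nesterov_step(w, v, grad_fn, lr=0.01, mu=0.9):
    # Nesterov momentum: evaluate the gradient at the "lookahead"
    # point w + mu*v, then update the velocity and the parameters.
    g = grad_fn(w + mu * v)
    v = mu * v - lr * g
    return w + v, v
```

On a simple quadratic, iterating this step drives the parameters toward the minimum; the lookahead gradient is what distinguishes it from classical momentum.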
Dropout: A Simple Way to Prevent Neural Networks from Overfitting
https://www.cs.toronto.edu/~hinton/absps/JMLRdropout.pdf
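A sketch of the technique from the dropout paper, in the common "inverted dropout" form (scaling at train time rather than test time, which is a popular variant rather than the paper's exact presentation):

```python
import numpy as np

def dropout_forward(X, p_keep=0.8, train=True):
    # Inverted dropout: randomly zero units at train time and rescale
    # the survivors by 1/p_keep so the expected activation matches
    # test time, where the layer is the identity.
    if not train:
        return X
    mask = (np.random.rand(*X.shape) < p_keep) / p_keep
    return X * mask
```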